HDFS - create a new file
TL;DR
When you create a file using FileSystem#create, a new thread is spawned to handle the actual data transfer. Remember to call FSDataOutputStream#flush and FSDataOutputStream#sync.
Overview
Creating new files on HDFS using the hadoop-hdfs classes is quite simple. Call FileSystem#get with a proper configuration object to obtain an instance of DistributedFileSystem. Then call its #create method to get an FSDataOutputStream instance and use it to write your data. See the snippet below:
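import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Point the client at the namenode ("fs.default.name" is the older name of "fs.defaultFS").
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://localhost:54310");
FileSystem fs = FileSystem.get(conf);
// Registers the file with the namenode and starts the background streamer.
FSDataOutputStream stream = fs.create(new Path("/file.txt"));
stream.write("test".getBytes());
stream.flush();
stream.sync(); // deprecated in newer Hadoop releases in favor of hflush()
stream.close();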
But how does the data get from your machine to HDFS if the #write method does not block the current thread for the data transfer?
In detail
Let's take a closer look at how it works. DistributedFileSystem#create first expands all relative paths into absolute ones. Then it resolves any symlinks in those paths into absolute paths.
After that it calls DFSClient#create, which forwards the call to DFSOutputStream#newStreamForCreate. This particular method is responsible for the magic.
It calls the namenode via RPC to create a file entry in the file system. At this moment, the created file is visible and readable for all HDFS users!
Next, still in the body of #newStreamForCreate, a new DFSOutputStream is created. Its constructor spawns a new DataStreamer (a subclass of Thread), which is immediately started.
DFSClient returns the DFSOutputStream instance.
DistributedFileSystem wraps the DFSOutputStream in an HdfsDataOutputStream before returning it. The streamer runs as a separate thread once DataStreamer#start has been called. That's how your data is streamed to HDFS.
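The hand-off described above can be pictured with a toy model. The sketch below is not Hadoop code: ToyOutputStream and everything inside it are hypothetical names, the queue is unbounded, and an in-memory list stands in for the datanode pipeline. It only shows the general pattern in which #write enqueues data and returns while a background thread does the sending.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Toy model only: NOT Hadoop code. All names here are hypothetical. */
class ToyOutputStream implements AutoCloseable {
    // Sentinel packet that tells the streamer thread to stop.
    private static final byte[] END = new byte[0];

    private final BlockingQueue<byte[]> queue = new LinkedBlockingQueue<>();
    // Stands in for the datanode pipeline: records what was "sent".
    private final List<byte[]> sent = new ArrayList<>();
    private final Thread streamer;

    ToyOutputStream() {
        // The constructor creates and starts the background thread.
        streamer = new Thread(this::drain, "toy-data-streamer");
        streamer.setDaemon(true);
        streamer.start();
    }

    // Body of the background thread: take packets until the sentinel arrives.
    private void drain() {
        try {
            while (true) {
                byte[] packet = queue.take();
                if (packet == END) {
                    return;
                }
                synchronized (sent) {
                    sent.add(packet);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Enqueues a copy of the data and returns without waiting for the transfer. */
    void write(byte[] data) {
        queue.add(data.clone());
    }

    /** Signals end of stream and waits for the streamer thread to finish. */
    @Override
    public void close() {
        queue.add(END);
        try {
            streamer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Total number of bytes the background thread has processed so far. */
    long bytesSent() {
        synchronized (sent) {
            long total = 0;
            for (byte[] packet : sent) {
                total += packet.length;
            }
            return total;
        }
    }

    public static void main(String[] args) {
        ToyOutputStream out = new ToyOutputStream();
        out.write("test".getBytes(StandardCharsets.UTF_8)); // returns at once
        out.close();                                        // waits for the streamer
        System.out.println(out.bytesSent());                // prints 4
    }
}
```

In this model the caller only pays for the enqueue; waiting happens in #close, which is also why closing (or explicitly flushing) the real stream matters.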
Additional notes
If you pass a Progressable instance to DistributedFileSystem#create, its #progress method is called by DFSOutputStream in the data-uploading loop. However, there is no direct way to find out how much data has already been sent.
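One possible workaround is to count bytes on the client side before handing them to the stream. The sketch below uses only JDK classes; CountingOutputStream is a hypothetical helper, not part of Hadoop, and a ByteArrayOutputStream stands in for the stream returned by #create. Note the limitation: it counts bytes accepted by the wrapped stream, not bytes that have actually reached HDFS.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Hadoop: counts bytes handed to the wrapped stream. */
class CountingOutputStream extends FilterOutputStream {
    private long count;

    CountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        // Delegate the whole range at once so each byte is counted exactly once.
        out.write(b, off, len);
        count += len;
    }

    /** Number of bytes passed to the wrapped stream so far. */
    long getCount() {
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in target; in real use this would be the stream returned by #create.
        try (CountingOutputStream counting =
                 new CountingOutputStream(new ByteArrayOutputStream())) {
            counting.write("test".getBytes(StandardCharsets.UTF_8));
            System.out.println(counting.getCount()); // prints 4
        }
    }
}
```

Combined with a Progressable callback, the counter gives a rough client-side progress figure, under the assumption that bytes accepted by the stream are a good enough proxy for bytes sent.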
Legend
org.apache.hadoop.fs.FileSystem
org.apache.hadoop.hdfs.DistributedFileSystem
org.apache.hadoop.hdfs.DFSOutputStream
org.apache.hadoop.hdfs.DFSOutputStream.DataStreamer
org.apache.hadoop.hdfs.client.HdfsDataOutputStream
org.apache.hadoop.hdfs.DFSClient
org.apache.hadoop.fs.FSDataOutputStream
org.apache.hadoop.util.Daemon
org.apache.hadoop.util.Progressable