When you create a file using FileSystem#create, a new thread is spawned to handle the real data transfer.
Creating new files on HDFS using the hadoop-hdfs classes is quite simple. You call FileSystem#get with a proper Configuration object to obtain an instance of DistributedFileSystem, then call its #create method to get an FSDataOutputStream instance and use it to write your data. See the snippet below:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream stream = fs.create(new Path("/file.txt"));
stream.write("some data".getBytes());
stream.close(); // flushes buffered data and completes the file
But how does the data get from your machine to HDFS if the #write method does not block the current thread for the transfer? Let's take a closer look at how it works.
DistributedFileSystem#create first expands all relative paths into absolute ones, then resolves any symlinks in those paths.
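The qualification step can be illustrated with plain JDK code. This is only an analogy, not Hadoop's implementation: java.net.URI stands in for Hadoop's Path#makeQualified, and the namenode address and user directory are made-up examples.

```java
import java.net.URI;

public class QualifySketch {
    public static void main(String[] args) {
        // Made-up working directory; a real client gets it from the FileSystem.
        URI workingDir = URI.create("hdfs://namenode:8020/user/alice/");
        // Resolving a relative name against it yields an absolute URI.
        URI qualified = workingDir.resolve("file.txt");
        System.out.println(qualified); // hdfs://namenode:8020/user/alice/file.txt
    }
}
```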
After that it calls DFSClient#create, which forwards the call to DFSOutputStream#newStreamForCreate. This particular method is responsible for the magic.
It calls the namenode via RPC to create the file entry in the file system. From this moment on, the newly created (still empty) file is visible and readable to all HDFS users!
Next, still in the body of #newStreamForCreate, a new DFSOutputStream is created. Its constructor sets up a new DataStreamer (a subclass of Thread). #newStreamForCreate then calls DataStreamer#start, so the streamer immediately begins running as a background thread, and the stream is wrapped into an HdfsDataOutputStream before being returned. That's how your data is streamed to HDFS.
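The threading model above can be sketched with plain JDK primitives. This is a simplified mock, not Hadoop code: the queue, the packet format, and the "send" step are stand-ins for what DataStreamer really does, but the shape is the same: #write only enqueues data, while the background thread performs the transfer.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class StreamerSketch {
    // Packets queued by the writer, drained by the streamer thread.
    static final BlockingQueue<byte[]> dataQueue = new LinkedBlockingQueue<>();
    static volatile int bytesSent = 0;

    public static void main(String[] args) throws Exception {
        Thread streamer = new Thread(() -> {          // stands in for DataStreamer
            try {
                while (true) {
                    byte[] packet = dataQueue.take(); // blocks until a packet arrives
                    if (packet.length == 0) break;    // empty packet marks end of stream
                    bytesSent += packet.length;       // "send" the packet downstream
                }
            } catch (InterruptedException ignored) { }
        });
        streamer.start();                             // like DataStreamer#start

        dataQueue.put("hello".getBytes());            // the "write" returns immediately
        dataQueue.put("world".getBytes());
        dataQueue.put(new byte[0]);                   // signal end of stream, like #close
        streamer.join();
        System.out.println(bytesSent);                // prints 10
    }
}
```

The writer never waits for the network: it returns as soon as the packet is queued, which is why #write does not block on data transfer.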
If you pass a Progressable instance to DistributedFileSystem#create, its #progress method will be called by DFSOutputStream in the data-uploading loop. However, there is no direct way to find out how much data has already been sent.
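A minimal sketch of that callback contract, under stated assumptions: the interface mirrors org.apache.hadoop.util.Progressable, but the upload loop and chunk size here are hypothetical, made up for illustration. It shows why the callback alone cannot report byte counts: it carries no arguments.

```java
public class ProgressSketch {
    // Mirrors org.apache.hadoop.util.Progressable: a bare "still alive" signal.
    interface Progressable { void progress(); }

    static int ticks = 0;

    // Hypothetical upload loop: invokes progress() once per chunk transferred.
    static void upload(byte[] data, int chunkSize, Progressable p) {
        for (int off = 0; off < data.length; off += chunkSize) {
            // ... transfer data[off .. off+chunkSize) here ...
            p.progress(); // no byte count is passed to the callback
        }
    }

    public static void main(String[] args) {
        upload(new byte[1000], 256, () -> ticks++);
        System.out.println(ticks); // 4 chunks, so 4 callbacks
    }
}
```

If you need actual progress numbers, you have to track how much you have written on your side of the stream; the callback only tells you that the upload is still making progress.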