Google Distributed File System

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. In a large cluster, thousands of servers both host directly attached storage and execute user application tasks. By distributing storage and computation across many servers, the resource can grow with demand while remaining economical at every size. We describe the architecture of HDFS and report on experience using HDFS to manage 25 petabytes of enterprise data at Yahoo!.

Hadoop tutorial - The Hadoop Distributed File System

The Hadoop Distributed File System, Robert Chansler, Hairong Kuang, Sanjay Radia, Konstantin Shvachko, and Suresh Srinivas
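
Before turning to the merge example below, here is a minimal sketch of reading a file through Hadoop's FileSystem API, the same abstraction the MergeFiles program uses. The class name ReadHdfsFile and the path /user/demo/sample.txt are placeholders for illustration, not part of the tutorial:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadHdfsFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();       // picks up cluster settings from the classpath
        FileSystem fs = FileSystem.get(conf);           // handle to the configured HDFS cluster
        Path path = new Path("/user/demo/sample.txt");  // placeholder path
        FSDataInputStream in = fs.open(path);           // opens a stream over the file's blocks
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        String line;
        while ((line = reader.readLine()) != null) {    // print the file line by line
            System.out.println(line);
        }
        reader.close();
    }
}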

The following code sample is based on Hadoop in Action, Chuck Lam, 2010, section 3.1.2. It merges all the files in a local directory into a single file on HDFS:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MergeFiles {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);        // handle to HDFS
        FileSystem local = FileSystem.getLocal(conf);  // handle to the local file system

        Path inputDir = new Path(args[0]);  // local directory holding the input files
        Path hdfsFile = new Path(args[1]);  // destination file on HDFS

        try {
            FileStatus[] inputFiles = local.listStatus(inputDir);
            OutputStream out = hdfs.create(hdfsFile);  // create the merged output file on HDFS

            for (int i = 0; i < inputFiles.length; i++) {
                System.out.println(inputFiles[i].getPath().getName());
                InputStream in = local.open(inputFiles[i].getPath());
                byte[] buffer = new byte[256];
                int bytesRead = 0;
                // copy this local file into the HDFS output, 256 bytes at a time
                while ((bytesRead = in.read(buffer)) > 0) {
                    out.write(buffer, 0, bytesRead);
                }
                in.close();
            }
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
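
To try the program, compile it against the Hadoop core jar, package the class into a jar, and run it through the hadoop launcher, passing the local input directory and the HDFS output file as arguments (the relative paths assume the working directory sits next to a Hadoop 0.20.2 installation):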

>javac -cp ../hadoop-0.20.2-core.jar MergeFiles.java
>jar cvf mergefiles.jar MergeFiles.class 
>../bin/hadoop jar mergefiles.jar MergeFiles in out
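
Assuming the job ran against a live HDFS instance, the merged result can be inspected with the standard file-system shell; the path out here matches the second argument above:

>../bin/hadoop fs -ls out
>../bin/hadoop fs -cat out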