Hadoop – HDFS put merge

Today I am showing a small tool that merges local files on the fly and puts the result into your HDFS.

The default filesystem operations that ship with every Hadoop release only cover the opposite direction: the getmerge command fetches multiple files from your HDFS and writes them into one merged local file. You can see an example here:

./hadoop fs -getmerge /user/hadoop/hdfs-testfiles/ /tmp/local-merged.log

If you want to do it the other way round, you must first merge the files on your local filesystem and then put the merged file into your HDFS. Usually you are dealing with many large files (e.g. Apache log files). If you merge them first, you need twice the hard disk space (the single files plus the merged file), a possibly long time for the merge itself, and then more time for the put into HDFS. To save time and avoid the extra I/O, you can merge the files on the fly with a little tool that streams each local file directly into a single HDFS output file.
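The streaming idea can be tried without a cluster. The following is a minimal sketch using only the Java standard library, not the tool itself: the class name StreamingMergeSketch and the method mergeInto are hypothetical, and the plain OutputStream stands in for the stream you would get from HDFS. It also sorts the input files by path so the merge order is predictable, which is an addition of this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for illustration; not part of Hadoop or of PutMerge.
final class StreamingMergeSketch {

    /**
     * Copies every regular file directly inside dir (sorted by path) into out.
     * No merged temporary file is created; returns the number of bytes written.
     */
    static long mergeInto(Path dir, OutputStream out) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.filter(Files::isRegularFile)   // skip subdirectories
                           .sorted()
                           .collect(Collectors.toList());
        }
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        for (Path file : files) {
            try (InputStream in = Files.newInputStream(file)) {
                int n;
                while ((n = in.read(buffer)) != -1) {      // -1 signals end of file
                    out.write(buffer, 0, n);
                    total += n;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: StreamingMergeSketch <dir>");
            System.exit(1);
        }
        // An in-memory sink is used here only to keep the demo self-contained.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long bytes = mergeInto(Paths.get(args[0]), sink);
        System.out.println("merged " + bytes + " bytes");
    }
}
```

Because the copy loop only holds one buffer in memory at a time, the extra disk space for a merged local copy is never needed; only the destination stream changes when the target is HDFS.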

The tool only needs two arguments:
- the directory you want to merge and put to HDFS
- the target path and filename


To run the tool, your Hadoop environment must be up and running. Then type in the following command:

./hadoop jar ../PutMerge.jar org.fahlke.hadoop.io.PutMerge /opt/logfiles/ hdfs-merged.log

The full source is listed below:

package org.fahlke.hadoop.io;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutMerge {
	public static void main(String[] args) throws IOException {
		if (args.length != 2) {
			System.err.println("Usage: PutMerge <dir> <outfile>");
			System.exit(1);
		}

		Configuration conf = new Configuration();
		FileSystem hdfs = FileSystem.get(conf);       // default filesystem from the Hadoop configuration
		FileSystem local = FileSystem.getLocal(conf); // local filesystem that holds the input files
		int filesProcessed = 0;

		Path inputDir = new Path(args[0]);
		Path hdfsFile = new Path(args[1]);

		FileStatus[] inputFiles = local.listStatus(inputDir);

		// try-with-resources closes the streams even if a copy fails halfway
		try (FSDataOutputStream out = hdfs.create(hdfsFile)) {
			byte[] buffer = new byte[64 * 1024];
			for (FileStatus inputFile : inputFiles) {
				// isDir() is deprecated in newer Hadoop releases in favour of isDirectory()
				if (inputFile.isDir()) {
					continue; // only plain files are merged, subdirectories are skipped
				}
				System.out.println("\tnow processing <" + inputFile.getPath().getName() + ">");
				try (FSDataInputStream in = local.open(inputFile.getPath())) {
					int bytesRead;
					// read() returns -1 at end of file
					while ((bytesRead = in.read(buffer)) != -1) {
						out.write(buffer, 0, bytesRead);
					}
				}
				filesProcessed++;
			}
		} catch (IOException ioe) {
			ioe.printStackTrace();
			System.exit(1); // report the failure to the calling shell
		}
		System.out.println("\nSuccessfully merged " + filesProcessed + " local files and written to <" + hdfsFile.getName() + "> in HDFS.");
	}
}

6 thoughts on “Hadoop – HDFS put merge”

  1. daigoumee says:

    It’s really a nice and helpful piece of information. I’m glad that you shared this helpful info with us. Please keep us informed like this. Thanks for sharing.

  2. krovere says:

    Hi. I’ve read your post well.
    But, I can’t find “PutMerge.jar”.
    May I get Your source files?
    Thank you.

  3. I will not spread any binaries here. You have to build your own jar-file.

    Just configure your Eclipse environment to build Hadoop applications and create a new project with the code shown here.

    Alex

  4. Lacey Gearn says:

    Hi mate! Where can I read more resources on this topic?

  5. admin says:

    I recommend subscribing to the Hadoop mailing list for users.
    More resources will be here and here.
    Or try to google for “hadoop tutorial” and search for books online.

  6. Clay B. says:

    Why not just use the cat(1) command? I often use:
    cat big_directory_of_files/* | hadoop fs -put - hdfs_file
