Monthly Archives: October 2013

Java HTTPClient

Just a quick post with an example of using Apache Commons HttpClient (the classic Java HTTP client library) to make requests to remote web servers. I did not find much in the way of succinct examples, so here is one:

import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.*;
import org.apache.commons.httpclient.params.HttpMethodParams;

import java.io.*;

public class MyClient {

	public MyClient() {
	}

	private String url;
	
	public String getUrl() {
		return url;
	}

	public void setUrl(String url) {
		this.url = url;
	}

	public byte[] grok() {
		// Create an instance of HttpClient.
		HttpClient client = new HttpClient();

		// Create a method instance.
		GetMethod method = new GetMethod(url);

		// Provide a custom retry handler if necessary
		method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
				new DefaultHttpMethodRetryHandler(3, false));

		try {
			// Execute the method.
			int statusCode = client.executeMethod(method);

			if (statusCode != HttpStatus.SC_OK) {
				System.err.println("Method failed: " + method.getStatusLine());
			}

			// Read the response body.
			byte[] responseBody = method.getResponseBody();

			// Deal with the response.
			// Use caution: ensure correct character encoding and is not binary
			// data
			return responseBody;

		} catch (HttpException e) {
			System.err.println("Fatal protocol violation: " + e.getMessage());
			e.printStackTrace();
		} catch (IOException e) {
			System.err.println("Fatal transport error: " + e.getMessage());
			e.printStackTrace();
		} finally {
			// Release the connection.
			method.releaseConnection();
		}
		return null;
	}
}
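
To exercise it, here is a minimal usage sketch; the URL and the assumption of a UTF-8 text response are mine, not part of the example above:

import java.nio.charset.StandardCharsets;

public class MyClientExample {

	public static void main(String[] args) {
		MyClient client = new MyClient();
		client.setUrl("http://example.com/");

		byte[] body = client.grok();
		if (body != null) {
			// Assumes the response is UTF-8 text rather than binary data
			System.out.println(new String(body, StandardCharsets.UTF_8));
		}
	}
}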

Done.

Why Apache Oozie rocks

First off, what is Oozie?

From the official website at http://oozie.apache.org/ we get:

Apache Oozie Workflow Scheduler for Hadoop

Overview

Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.

Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).

Oozie is a scalable, reliable and extensible system.

How does that make life easier, and you more attractive to potential mates? Well, I’ll tell you how.

  • It makes chaining together more complex Hadoop jobs really easy
  • It allows your code to be way more maintainable, since you can simply chain together a bunch of much simpler jars to create more complex workflows
  • It allows you to fork jobs, so that you can get more than one thing done at once.
  • It runs all your code natively as MapReduce jobs within Hadoop, making the best use of your infrastructure
  • Defining actions is dead easy. You can use a web based Oozie editor where you drag and drop your tasks together, or you can simply hack up an easy-to-use XML document to define your workflow(s).
  • It has additional actions built in to do cool things like execute shell scripts, run Java-based actions, use the filesystem (create and delete directories, files, etc.), email you when done, and many more. All built in, all free (no extra work)
  • It allows you to schedule jobs. This is way cooler than it sounds.

An example of an Oozie workflow may be something like:

  1. Copy data file from FTP server
  2. Fork
  3. Parse data file (and, in the parallel branch, load relevant data into HBase)
  4. Join
  5. Run MapReduce on the data
  6. Process the data further and pass it to another JAR to populate website data
  7. Email you to tell you the job is done

If that doesn’t seem cool, then I don’t know what does!
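
To give a sense of how little glue code is involved, here is a minimal sketch that uses the Oozie Java client API to submit a workflow like the one above and poll it until it finishes. The Oozie server URL, the workflow application path in HDFS, and the jobTracker/nameNode property values are placeholders you would replace with your own.

import java.util.Properties;

import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class WorkflowSubmitter {

	public static void main(String[] args) throws Exception {
		// Point the client at your Oozie server (placeholder URL)
		OozieClient oozie = new OozieClient("http://localhost:11000/oozie");

		// Build the job configuration; APP_PATH is the HDFS directory holding workflow.xml
		Properties conf = oozie.createConfiguration();
		conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/paul/workflows/my-workflow");
		conf.setProperty("jobTracker", "jobtracker:8021");
		conf.setProperty("nameNode", "hdfs://namenode:8020");

		// Submit and start the workflow, then poll until it is no longer running
		String jobId = oozie.run(conf);
		System.out.println("Submitted workflow job: " + jobId);

		while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
			Thread.sleep(10 * 1000);
		}
		System.out.println("Workflow finished with status: " + oozie.getJobInfo(jobId).getStatus());
	}
}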


Tech4Africa 2013

I delivered my talk at Tech4Africa (http://www.tech4africa.com) yesterday, on Geospatial MongoDB.
It went quite well, I think, with lots of folks coming to chat to me about it from many different spheres.

If you would like a copy of my talk, as well as the example code and SA data, it is available from https://github.com/paulscott56/tech4africa2013

You are more than welcome to download, remix and distribute the material with attribution, under a Creative Commons Attribution-ShareAlike license.

I do hope that everyone in the room found my talk at least somewhat useful and enlightening!

Thanks to Tech4Africa for giving me the opportunity to speak!

Downloading files via HDFS and the Java API

Last post covered uploading files, so I thought it would be useful to do a quick download client as well. Again, we are using DFSClient with a BufferedInputStream and a BufferedOutputStream to do the work. I read the file in 1024-byte chunks via the byte array, but for larger files you may want to increase that.

Enough jabbering, to the code!

public void downloadFile() {
		try {
			Configuration conf = new Configuration();
			conf.set("fs.defaultFS", this.hdfsUrl);
			DFSClient client = new DFSClient(new URI(this.hdfsUrl), conf);
			OutputStream out = null;
			InputStream in = null;
			try {
				if (client.exists(sourceFilename)) {
					in = new BufferedInputStream(client.open(sourceFilename));
					out = new BufferedOutputStream(new FileOutputStream(
							destinationFilename, false));

					byte[] buffer = new byte[1024];

					int len = 0;
					while ((len = in.read(buffer)) > 0) {
						out.write(buffer, 0, len);
					}
				}
				else {
					System.out.println("File does not exist!");
				}
			} finally {
				if (in != null) {
					in.close();
				}
				if (out != null) {
					out.close();
				}
				if (client != null) {
					client.close();
				}
			}
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

I use simple getters and setters to set the source and destination filenames, and have set hdfsUrl to my NameNode URI on the correct port.
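
As a quick usage sketch, assuming the method above lives in a class called HDFSDownload (a hypothetical name, mirroring the HDFSUpload class in the upload post below) with the same getters and setters, and with the file paths as placeholders:

public class HDFSDownloadExample {

	public static void main(String[] args) {
		// HDFSDownload is assumed to wrap downloadFile() with hdfsUrl,
		// sourceFilename and destinationFilename fields, as described above
		HDFSDownload downloader = new HDFSDownload();
		downloader.setSourceFilename("/user/paul/data/input.csv"); // path in HDFS
		downloader.setDestinationFilename("/tmp/input.csv");       // local path
		downloader.downloadFile();
	}
}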

Uploading files through Java API to HDFS – the easy way

So you want to upload files to a remotely hosted HDFS cluster? Well, if you look at the sparse documentation and examples, this may seem harder than it is. Cut to the chase and look at the code below, and you will be laughing in less than a minute!

The code:

package za.co.paulscott.hdfstest;

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSClient;

public class HDFSUpload {

	private static String hdfsUrl = "<your hdfs NameNode endpoint>";

	private String sourceFilename;
	private String destinationFilename;

	public String getSourceFilename() {
		return sourceFilename;
	}
	public void setSourceFilename(String sourceFilename) {
		this.sourceFilename = sourceFilename;
	}
	public String getDestinationFilename() {
		return destinationFilename;
	}

	public void setDestinationFilename(String destinationFilename) {
		this.destinationFilename = destinationFilename;
	}

	public void uploadFile()
			throws IOException, URISyntaxException {
		// Point the client at the NameNode
		Configuration conf = new Configuration();
		conf.set("fs.defaultFS", this.hdfsUrl);
		DFSClient client = new DFSClient(new URI(this.hdfsUrl), conf);
		OutputStream out = null;
		InputStream in = null;
		try {
			// Do not overwrite a file that is already in HDFS
			if (client.exists(destinationFilename)) {
				System.out.println("File already exists in hdfs: " + destinationFilename);
				return;
			}
			// Stream the local file into HDFS in 1024-byte chunks
			out = new BufferedOutputStream(client.create(destinationFilename, false));
			in = new BufferedInputStream(new FileInputStream(sourceFilename));
			byte[] buffer = new byte[1024];

			int len = 0;
			while ((len = in.read(buffer)) > 0) {
				out.write(buffer, 0, len);
			}
		} finally {
			if (client != null) {
				client.close();
			}
			if (in != null) {
				in.close();
			}
			if (out != null) {
				out.close();
			}
		}
	}
}

Done. The hardest part will be finding the NameNode URL and port. You can get this from the fs.defaultFS property in your core-site.xml, or from dfs.namenode.rpc-address in hdfs-site.xml.
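
For completeness, a quick usage sketch (the local and HDFS paths are placeholders, and hdfsUrl in the class above must already point at your NameNode):

public class HDFSUploadExample {

	public static void main(String[] args) throws Exception {
		HDFSUpload uploader = new HDFSUpload();
		uploader.setSourceFilename("/tmp/input.csv");                  // local file
		uploader.setDestinationFilename("/user/paul/data/input.csv"); // target path in HDFS
		uploader.uploadFile();
	}
}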