
An introduction to Apache Spark

What is Apache Spark?

Apache Spark is a fast and general engine for large-scale data processing.

Related documents

You can find the latest Spark documentation, including a programming
guide, on the project webpage.

Spark needs to be downloaded and installed on your local machine.
Spark requires Scala 2.10, and the project is built using Simple Build Tool (SBT).
If SBT is already installed, the system version of sbt will be used; otherwise
it will be downloaded automatically. To build Spark and its example programs, run:

./sbt/sbt assembly

Once you’ve built Spark, the easiest way to start using it is the shell:

./bin/spark-shell

Or, for the Python API, the Python shell (`./bin/pyspark`).

Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> <params>`. For example:

./bin/run-example org.apache.spark.examples.SparkLR local[2]

will run the Logistic Regression example locally on 2 CPUs.

Each of the example programs prints usage help if no params are given.

All of the Spark samples take a `<master>` parameter that is the cluster URL
to connect to. This can be a mesos:// or spark:// URL, or “local” to run
locally with one thread, or “local[N]” to run locally with N threads.


Testing first requires building Spark. Once Spark is built, tests
can be run using:

`./sbt/sbt test`

Hadoop versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
You can change the version by setting the `SPARK_HADOOP_VERSION` environment
variable when building Spark.

For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
versions without YARN, use:

# Apache Hadoop 1.2.1
$ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly

# Cloudera CDH 4.2.0 with MapReduce v1
$ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly

For Apache Hadoop 2.2.X, 2.1.X, 2.0.X, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
with YARN, also set `SPARK_YARN=true`:

# Apache Hadoop 2.0.5-alpha
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

# Cloudera CDH 4.2.0 with MapReduce v2
$ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly

# Apache Hadoop 2.2.X and newer
$ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly

When developing a Spark application, specify the Hadoop version by adding the
“hadoop-client” artifact to your project’s dependencies. For example, if you’re
using Hadoop 1.2.1 and build your application using SBT, add this entry to your
SBT build file:

"org.apache.hadoop" % "hadoop-client" % "1.2.1"

If your project is built with Maven, add this to your POM file’s `<dependencies>` section:
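Mirroring the SBT line above, the Maven entry would be:

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>1.2.1</version>
</dependency>
```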


  • Spark could be very well suited for more in-depth data mining from social streams like Twitter/Facebook.
  • Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
  • Write applications quickly in Java, Scala or Python. Spark offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala and Python shells.
  • Combine SQL, streaming, and complex analytics. Spark powers a stack of high-level tools including Shark for SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these frameworks seamlessly in the same application.
  • Spark can run on Hadoop 2’s YARN cluster manager, and can read any existing Hadoop data. If you have a Hadoop 2 cluster, you can run Spark without any installation needed. Otherwise, Spark is easy to run standalone or on EC2 or Mesos. It can read from HDFS, HBase, Cassandra, and any Hadoop data source.


Once Spark is built, open an interactive Scala shell with

./bin/spark-shell

You can then start working with the engine.

We will do a quick analysis of an apache2 log file (access.log)

// Load the file up for analysis
val textFile = sc.textFile("/var/log/apache2/access.log")
// Count the number of lines in the file
// Display the first line of the file
// Display the lines containing PHP
val linesWithPHP = textFile.filter(line => line.contains("PHP"))
// Count the lines with PHP
val countPHP = textFile.filter(line => line.contains("PHP")).count()
// Do the classic MapReduce WordCount example
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)

Apps should be run as either Maven-packaged Java apps or Scala apps. Please refer to the documentation for a HOWTO.


  • Overall a good product, but some Hadoop expertise is required for successful set-up and operation
  • A mid-level or senior-level developer is required
  • Some Scala language expertise is advantageous

Android Volley – HTTP async Swiss Army Knife

This serves as a post to help you get started with Android Volley. Volley is used for all sorts of HTTP requests, and supports a whole bang of cool features that will make your life way easier.

It is a relatively simple API to implement, and allows request queuing, which comes in very useful.

The code below is a simple example to make a request to this blog and get a JSON response back, parse it and display it in a simple TextView widget on the device.


import org.json.JSONObject;

import android.os.Bundle;
import android.util.Log;
import android.view.LayoutInflater;
import android.view.Menu;
import android.view.MenuItem;
import android.view.View;
import android.view.ViewGroup;
import android.widget.TextView;

import com.android.volley.Request;
import com.android.volley.RequestQueue;
import com.android.volley.Response;
import com.android.volley.VolleyError;
import com.android.volley.toolbox.JsonObjectRequest;
import com.android.volley.toolbox.Volley;

public class MainActivity extends ActionBarActivity {

    private TextView txtDisplay;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        if (savedInstanceState == null) {
                    .add(, new PlaceholderFragment())
        }

        // is the id of a TextView defined in the layout
        txtDisplay = (TextView) findViewById(;

        RequestQueue queue = Volley.newRequestQueue(this);
        String url = "";

        JsonObjectRequest jsObjRequest = new JsonObjectRequest(Request.Method.GET, url, null,
                new Response.Listener<JSONObject>() {
                    @Override
                    public void onResponse(JSONObject response) {
                        String txt = response.toString();
                        Log.i("volleytest", txt);
                        txtDisplay.setText(txt);
                    }
                }, new Response.ErrorListener() {
                    @Override
                    public void onErrorResponse(VolleyError error) {
                        Log.e("volleytest", error.toString());
                    }
                });

        // Don't forget to add the request to the queue, or nothing will happen!
        queue.add(jsObjRequest);
    }

    @Override
    public boolean onCreateOptionsMenu(Menu menu) {
        // Inflate the menu; this adds items to the action bar if it is present.
        getMenuInflater().inflate(, menu);
        return true;
    }

    @Override
    public boolean onOptionsItemSelected(MenuItem item) {
        // Handle action bar item clicks here. The action bar will
        // automatically handle clicks on the Home/Up button, so long
        // as you specify a parent activity in AndroidManifest.xml.
        int id = item.getItemId();
        if (id == {
            return true;
        }
        return super.onOptionsItemSelected(item);
    }

    /**
     * A placeholder fragment containing a simple view.
     */
    public static class PlaceholderFragment extends Fragment {

        public PlaceholderFragment() {
        }

        @Override
        public View onCreateView(LayoutInflater inflater, ViewGroup container,
                Bundle savedInstanceState) {
            View rootView = inflater.inflate(R.layout.fragment_main, container, false);
            return rootView;
        }
    }
}

Spring data Neo4j using Java based config

There are very few examples in the wild on using Spring’s Java config methods to configure and start a Neo4j Graph Database service to use in your application. This post will serve as a primer to get you started on your Neo4j application and hopefully save you some bootstrap time as well!

The first thing that we need to do is make sure that you have a running Neo4j server up and ready for action, as well as a new Spring project that you can start with.

In your project POM XML file, you need to add in a few dependencies to work with Neo4j. In this example, I have used Neo4j-1.9.5-community (Spring Data Neo4j for Neo4j-2.x was not available at the time of writing). I have used Spring Framework 3.2.3.RELEASE as the Spring version, and spring-data-neo4j 2.3.2.RELEASE.

		<!-- Spring and Transactions -->

		<!-- Logging with SLF4J & LogBack  // clipped... -->
                <!-- JavaConfig needs this library -->

		<!-- Test Artifacts // clipped -->

NOTE: I have clipped some of the less relevant bits for testing and standard Spring dependencies, but if you would like a full POM example, please just let me know!

The next big thing is that you now need to define your graphDatabaseService as a bean that you can then use via the @Autowired annotation in the rest of your code.

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.aspectj.EnableSpringConfigured;
import org.springframework.transaction.annotation.EnableTransactionManagement;

@Configuration
@EnableSpringConfigured
@EnableTransactionManagement
public class AppConfig extends Neo4jConfiguration {

	@Bean
	public GraphDatabaseService graphDatabaseService() {
		// if you want to use Neo4j as a REST service
		//return new SpringRestGraphDatabase("http://localhost:7474/db/data/");
		// use Neo4j as Odin intended (as an embedded service)
		GraphDatabaseService service = new GraphDatabaseFactory().newEmbeddedDatabase("/tmp/graphdb");
		return service;
	}
}
Great! You are just about done! Now create a simple entity with the @NodeEntity annotation and save some data to it! You now have a working graph application!
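As a starting point, a minimal node entity might look something like this (the class and field names here are just illustrative):

```java
import;
import;

@NodeEntity
public class Person {

    @GraphId
    private Long nodeId;

    private String name;

    public String getName() { return name; }
    public void setName(String name) { = name; }
}
```

With the entity defined, a Spring Data repository interface (or the Neo4jTemplate) can be used to save and load Person nodes.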

This is really easy once you know how. Sometimes, though, getting to know how is the hard part!

If you enjoyed this post, or found it useful, please leave a comment and I will start a new series on Neo4j in Java/Spring!


I needed to connect, with authentication, to a SAMBA shared drive to download and process data files for one of my applications. Thinking VFS2 would do the job, I had a go at it. Results were good, so it is now documented here:

private static void getFiles() throws IOException {

	NtlmPasswordAuthentication auth = new NtlmPasswordAuthentication(
			prop.getProperty("smbDomain"), prop.getProperty("smbUser"),
			prop.getProperty("smbPassword"));

	StaticUserAuthenticator authS = new StaticUserAuthenticator(
			prop.getProperty("smbDomain"), prop.getProperty("smbUser"),
			prop.getProperty("smbPassword"));

	FileSystemOptions opts = new FileSystemOptions();
	SmbFile smbFile = new SmbFile(prop.getProperty("smbURL"), auth);
	FileSystemManager fs = VFS.getManager();
	String[] files = smbFile.list();
	for (String file : files) {
		SmbFile remFile = new SmbFile(prop.getProperty("smbURL") + file, auth);
		SmbFileInputStream smbfos = new SmbFileInputStream(remFile);
		OutputStream out = new FileOutputStream(file);
		byte[] b = new byte[8192];
		int n;
		while ((n = > 0) {
			out.write(b, 0, n);
		}
		out.close();
		smbfos.close();
	}
}
As you can see from the above, this simply copies the files to local, preserving the filename, which is exactly what I want. If you need to write files, it is very similar, just use the SmbFileOutputStream instead!
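For completeness, the write direction might look something like this (an untested sketch using the same jCIFS classes; the file name is just a placeholder):

```java
// Sketch: copy a local file up to the share, reusing the auth object from above
SmbFile remFile = new SmbFile(prop.getProperty("smbURL") + "upload.txt", auth);
SmbFileOutputStream smbOut = new SmbFileOutputStream(remFile);
InputStream in = new FileInputStream("upload.txt");
byte[] b = new byte[8192];
int n;
while ((n = > 0) {
    smbOut.write(b, 0, n);
}
in.close();
smbOut.close();
```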

Nice! VFS2 comes through once again!


I decided to have a crack at writing a BSON based data store. I know, there are many around already, but I really wanted to see how BSON worked from a lower level, and, of course, if I could actually pull something off.

The code that I came up with today is hosted online. It is nothing really much (yet), but I hope to be able to add more to it soon.

Basically, the way that it works is that you supply a key as a String, and a value as a document. The document itself can be just about anything, including JSON documents. The documents then get serialized to BSON and stored (for now) in an in-memory HashMap. I realise that this is not the best approach (nor the fastest), but I have limited the document sizes to a maximum of 200KB in order to make it as efficient, and as useful, as possible in the few hours that I have dedicated to it so far.

The code includes a “local” store, which you can simply include in your Java projects and use with the dead simple API, as well as a “server” mode, which spawns a socket connection (Threaded) on the specified port. You can then telnet to the server and use the server API to put and get documents by key index.


user@host:~$ telnet localhost 11256
Connected to localhost.
Escape character is '^]'.
put/1/Hello World!
Put 1 Hello World!
get/1
Getting key 1
Hello World!

Very simple!

The next steps will be to:

  • add persistent storage mechanism
  • make it faster
  • optimise the code a bit
  • write some tests
  • probably rewrite it in C
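To make the mechanism concrete, here is a minimal sketch of the idea described above (the class and method names are illustrative, not the project's actual API): a keyed in-memory store holding raw bytes, with the 200KB cap enforced on put. In the real project the value is a document serialized to BSON first.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of a keyed, size-capped, in-memory document store.
// Values are plain bytes here to keep the example self-contained;
// a real store would serialize documents to BSON before insertion.
public class TinyDocStore {

    private static final int MAX_DOC_SIZE = 200 * 1024; // 200KB cap per document

    private final Map<String, byte[]> store = new HashMap<String, byte[]>();

    public void put(String key, byte[] document) {
        if (document.length > MAX_DOC_SIZE) {
            throw new IllegalArgumentException("Document exceeds 200KB limit");
        }
        store.put(key, document);
    }

    public byte[] get(String key) {
        return store.get(key);
    }
}
```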

Java HTTPClient

Just a quick post with an example of using the Apache Commons HttpClient to make requests to remote web servers. I did not find much in the way of succinct examples, so here is one:

import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.*;
import org.apache.commons.httpclient.params.HttpMethodParams;

import;

public class MyClient {

	public MyClient() {
	}

	private String url;

	public String getUrl() {
		return url;
	}

	public void setUrl(String url) {
		this.url = url;
	}

	public byte[] grok() {
		// Create an instance of HttpClient.
		HttpClient client = new HttpClient();

		// Create a method instance.
		GetMethod method = new GetMethod(url);

		// Provide a custom retry handler if necessary
		method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
				new DefaultHttpMethodRetryHandler(3, false));

		try {
			// Execute the method.
			int statusCode = client.executeMethod(method);

			if (statusCode != HttpStatus.SC_OK) {
				System.err.println("Method failed: " + method.getStatusLine());
			}

			// Read the response body.
			byte[] responseBody = method.getResponseBody();

			// Deal with the response.
			// Use caution: ensure correct character encoding and that the
			// response is not binary data
			return responseBody;

		} catch (HttpException e) {
			System.err.println("Fatal protocol violation: " + e.getMessage());
		} catch (IOException e) {
			System.err.println("Fatal transport error: " + e.getMessage());
		} finally {
			// Release the connection.
			method.releaseConnection();
		}
		return null;
	}
}

Why Apache Oozie rocks

First off, what is Oozie?

From the official website we get:

Apache Oozie Workflow Scheduler for Hadoop


Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.

Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).

Oozie is a scalable, reliable and extensible system.

How does that make life easier, and you more attractive to potential mates? Well, I’ll tell you how.

  • It makes chaining together more complex Hadoop jobs really easy
  • It allows your code to be way more maintainable, since you can simply chain together a bunch of much simpler jars to create more complex workflows
  • It allows you to fork jobs, so that you can get more than one thing done at once.
  • It runs all your code natively as MapReduce jobs within Hadoop, so making best use of your infrastructure
  • Defining actions is dead easy. You can use a web based Oozie editor where you drag and drop your tasks together, or you can simply hack up an easy to use XML document to define your workflow(s).
  • It has additional actions built in to do cool things like execute shell scripts, do Java based actions, use the filesystem (create and delete dirs, files etc), email you when done, and many more. All done, all free (no extra work)
  • It allows you to schedule jobs. This is way cooler than it sounds.

An example of an Oozie workflow may be something like:

  1. Copy data file from ftp server
  2. Fork
  3. In parallel: parse the data file, and load the relevant data to HBase
  4. Join
  5. Run MapReduce on the data
  6. Process the data further and pass it to another Jar to populate website data
  7. Email you to tell you the job is done

If that isn’t cool, then I don’t know what is!
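The steps above could be sketched as an Oozie workflow definition like the following (a simplified, illustrative skeleton: the action names are placeholders, and several action bodies are omitted for brevity):

```xml
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="copy-data"/>

    <action name="copy-data">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec></exec>
        </shell>
        <ok to="forking"/>
        <error to="fail"/>
    </action>

    <fork name="forking">
        <path start="parse-file"/>
        <path start="load-hbase"/>
    </fork>

    <!-- parse-file and load-hbase actions omitted for brevity -->

    <join name="joining" to="run-mapreduce"/>

    <!-- run-mapreduce, process-data and email actions omitted -->

    <kill name="fail">
        <message>Workflow failed!</message>
    </kill>
    <end name="end"/>
</workflow-app>
```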

An introduction to Apache Mesos

What is Apache Mesos you ask? Well, from their web site, it is

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.

What exactly does that mean? Well, to me, it means that I can deploy applications and certain frameworks into a cluster of resources (VMs) and it will look after my resource allocation for me. For example, if you have a particularly computationally intensive task, like a Hadoop Map/Reduce job that may require additional resources, it can allocate more to it temporarily so that your job finishes in the most efficient way possible.

Installing Mesos is not too difficult, but the docs are a little sparse. I set up a vanilla Ubuntu-12.04 LTS VM on my localhost to experiment. I will assume the same for you, so you may not need all the packages etc. but please do bear with me.

Disclaimers aside, let’s get cracking on the installation!

The first step will be to download the Apache Mesos tarball distribution. I used wget on my VM to grab it from a local mirror, but you should check which mirror to use for your location first.

Next up, you will need to prepare your machine to compile and host Mesos. On Ubuntu, you need the following packages:

apt-get update && apt-get install python-dev libunwind7-dev libcppunit-dev openjdk-7-jdk autoconf autopoint libltdl-dev libtool autotools-dev make gawk g++ curl libcurl4-openssl-dev

Once that is complete, unpack your Mesos tarball with:

tar xzvf mesos-0.13.0.tar.gz

Change directory to the newly created mesos directory and run the configure script:

cd mesos-0.13.0
./configure

All the configure options should check out, but if not, make sure that you have all the relevant packages installed beforehand!
Next, in the same directory, compile the code:

make
Depending on how many resources you gave your VM, this could take a while, so go get some milk and cookies…

Once the make is done, you should check everything with

make check

This will run a bunch of unit tests and checks that will ensure that there are no surprises later.

After you have done this, you can also set up a small Mesos cluster and run a job on it as follows:
In your Mesos directory, use

bin/
to start the master server. Make a note of the IP and port that the master is running on, so that you can use the web based UI later on!

Open up a browser and point it to http://localhost:5050 (or http://yourIP:5050).

Go back to your VM’s terminal and type

bin/ --master=

and refresh your browser. You should now notice that a slave has been added to your cluster!

Run the C++ test framework (a sample that just runs five tasks on the cluster) using

src/test-framework --master=localhost:5050 

It should successfully exit after running five tasks.
You can also try the example Python or Java frameworks with commands like the following:


If all of that is running OK, you have successfully completed the Mesos setup. Congratulations!

Follow @ApacheMesos on Twitter, as well as @DaveLester, for more information and goodness!

Cloudera Hadoop and HBase example code

Earlier, I posted about connecting to Hadoop via a Java based client. I decided to try out Cloudera’s offering, where they provide a manager app as well as an easy way to set up Hadoop, in both an Enterprise version (which includes support) and a free version.

I downloaded the free version of the Cloudera Manager, and quickly set up a 4 node Hadoop cluster using their tools. I must say, that as far as easy to use goes, they have done an awesome job!

Once everything was up and running, I wanted to create a Java based remote client to talk to my shiny new cluster. This was pretty simple, once I had figured out how to use the Cloudera Maven repositories and which versions and combinations of packages to use.

I will save you the trouble and post the results here.

The versions in use are the latest CDH4 releases:

hadoop version
Hadoop 2.0.0-cdh4.4.0
Subversion file:///var/lib/jenkins/workspace/generic-package-ubuntu64-12-04/CDH4.4.0-Packaging-Hadoop-2013-09-03_18-48-35/hadoop-2.0.0+1475-1.cdh4.4.0.p0.23~precise/src/hadoop-common-project/hadoop-common -r c0eba6cd38c984557e96a16ccd7356b7de835e79
Compiled by jenkins on Tue Sep  3 19:33:54 PDT 2013
From source with checksum ac7e170aa709b3ace13dc5f775487180
This command was run using /usr/lib/hadoop/hadoop-common-2.0.0-cdh4.4.0.jar

With this information, we now know which versions of the packages to use from the Cloudera Maven repository.
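Matching the CDH 4.4.0 version string above, the dependency entries would look something like this (the HBase artifact is only needed if you use HBase, and the exact HBase version string should be checked against your cluster):

```xml
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.0.0-cdh4.4.0</version>
</dependency>
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase</artifactId>
    <version>0.94.6-cdh4.4.0</version>
</dependency>
```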


I also make sure to add the Cloudera Maven repository in my pom.xml file:
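The repository entry for Cloudera's public Maven repository looks like this:

```xml
<repository>
    <id>cloudera</id>
    <url></url>
</repository>
```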


That is pretty much the hard part. If you don’t need HBase, then leave it off; the “hadoop-client” artifact should do most of what you want.