
Introducing Apache Whirr

I was a little disturbed to see that so few people had even heard of Apache Whirr, so I decided to write this as a bit of an introduction.

Apache Whirr is, at its core, a set of libraries for managing your cloud installations and setups. It takes all the pain out of deploying clusters and applications to any of the major cloud providers, including Rackspace Cloud Servers and Amazon Elastic Compute Cloud (EC2).

The trick is that it provides a common API across all of these platforms in a way that almost anyone can use. You may be thinking “Apache = Java”, but there are SDKs and APIs in a few languages, including Java, C++ and Python. Whirr started out as a set of Bash scripts to manage Hadoop clusters, but quickly became a much bigger project, which we now know as Whirr.

To get started with Whirr, you will need to download it from a local mirror. http://www.apache.org/dyn/closer.cgi/whirr/ should get you there. You could also grab the source at https://cwiki.apache.org/confluence/display/WHIRR/How+To+Contribute#HowToContribute-Gettingthesourcecode and then build it in Eclipse as per the instructions. I would suggest grabbing whirr-0.8.2 (about 26MB).

You will also need Java 6 (or later, I use openjdk-7-jdk), an SSH client, and an account with either Rackspace or Amazon EC2.

I usually put stuff like this in my /opt/ directory.
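Assuming the downloaded tarball is sitting in your current directory, unpacking it there might look something like this:

sudo tar xzf whirr-0.8.2.tar.gz -C /opt/
cd /opt/whirr-0.8.2

Once the archive is extracted and the dependencies are met, you can check that everything is working with: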

bin/whirr version

which should print out

Apache Whirr 0.8.2
jclouds 1.5.8

The next step is to set up your credentials. First off, copy the sample credentials file to your home directory, and then modify it to suit you.

mkdir -p ~/.whirr/
cp /opt/whirr-0.8.2/conf/credentials.sample ~/.whirr/credentials

I prefer using Rackspace (OK Rackspace, you may now send me gifts), so my config looks something like this:

PROVIDER=cloudservers-us
IDENTITY=yourUsername
CREDENTIAL=someLongApiKey
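
If you are an Amazon person instead, the EC2 equivalent (with your own keys, obviously) would look something like:

PROVIDER=aws-ec2
IDENTITY=yourAccessKeyId
CREDENTIAL=yourSecretAccessKey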

Now to define what you want to deploy. The canonical examples are a Hadoop cluster and a Mahout cluster, so here we will start with a Hadoop cluster and let you figure out the rest!
In your home directory, create a properties file. It doesn’t really matter too much what you call it, but we will call it hadoop.properties.

As you will have seen from the sample credentials file, properties files override the base config, so you can actually do quite a lot in userland there. Let’s set up our Hadoop cluster now:

whirr.cluster-name=testhadoopcluster 
whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,1 hadoop-datanode+hadoop-tasktracker 
whirr.provider=cloudservers-us
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub
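
The whirr.instance-templates line deserves a closer look: it is a comma-separated list of “count role+role” groups. If you wanted three datanodes instead of one, for example, you would just bump the count:

whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode,3 hadoop-datanode+hadoop-tasktracker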

You now need to generate an SSH keypair with:

ssh-keygen -t rsa -P ''

Note: You should use only RSA SSH keys, since DSA keys are not accepted yet.
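
If you would rather not hand Whirr your everyday keypair, you can generate one just for this (the file name here is only a suggestion) and point the two key-file properties above at it instead:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa_whirr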

OK, so now comes the fun part – setting up our Hadoop cluster!

/opt/whirr-0.8.2/bin# ./whirr launch-cluster --config /home/paul/hadoop.properties
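
(As an aside, any whirr.* property can also be passed as a command line option by dropping the whirr. prefix, e.g. --cluster-name anothercluster to override whirr.cluster-name.)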

You should start seeing output almost immediately, looking something like this:

Bootstrapping cluster
Configuring template for bootstrap-hadoop-datanode_hadoop-tasktracker
Starting 1 node(s) with roles [hadoop-datanode, hadoop-tasktracker]
Configuring template for bootstrap-hadoop-jobtracker_hadoop-namenode
Starting 1 node(s) with roles [hadoop-jobtracker, hadoop-namenode]
Starting 1 node(s) with roles [hadoop-datanode, hadoop-tasktracker]
Starting 1 node(s) with roles [hadoop-jobtracker, hadoop-namenode]

If something goes wrong, you will get something along the lines of:

Unable to start the cluster. Terminating all nodes.
Finished running destroy phase scripts on all cluster instances
Destroying testhadoopcluster cluster
Cluster testhadoopcluster destroyed

in which case you will need to review all your settings and try again… Hint: this error usually indicates a connectivity issue. Whirr is unable to connect over SSH to the machines, assumes the bootstrap process failed, and tries to start new ones.
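
If that hint is not enough, Whirr writes a whirr.log file in the directory you ran it from, and that log is the first place to look:

tail -n 50 whirr.log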

For security reasons, traffic from the network your client is running on is proxied through the master node of the cluster using an SSH tunnel (a SOCKS proxy on port 6666).
A script to launch the proxy is created when you launch the cluster, and may be found under ~/.whirr/ in a directory named after your cluster. Run it as follows (in a new terminal window):

. ~/.whirr/testhadoopcluster/hadoop-proxy.sh

You will also need to configure your browser to use the proxy in order to view the pages served by your cluster. When you want to stop the proxy, just Ctrl-C it.
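
If you want to sanity-check the tunnel before fiddling with browser settings, curl can talk to a SOCKS proxy directly (the hostname below is a placeholder for your namenode’s address, and 50070 is the default namenode web UI port):

curl --socks5-hostname localhost:6666 http://<namenode-host>:50070/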

You can now run a MapReduce job on your shiny new cluster.
After you launch a cluster, a hadoop-site.xml file is created in the ~/.whirr/testhadoopcluster directory. You can use this to connect to the cluster by setting the HADOOP_CONF_DIR environment variable. (It is also possible to set the configuration file to use by passing it as a -conf option to Hadoop tools):

export HADOOP_CONF_DIR=~/.whirr/testhadoopcluster

You should now be able to browse HDFS:

hadoop fs -ls /
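
Alternatively, you can skip the environment variable and pass the configuration file with the -conf option mentioned above:

hadoop fs -conf ~/.whirr/testhadoopcluster/hadoop-site.xml -ls /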

Note that the version of Hadoop installed locally should match the version installed on the cluster. You should also make sure that the HADOOP_HOME environment variable is set.
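That boils down to something like this (the install path is illustrative; adjust it to wherever your local Hadoop lives):

export HADOOP_HOME=/usr/local/hadoop
hadoop version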

Here’s how you can run a MapReduce job:

hadoop fs -mkdir input 
hadoop fs -put $HADOOP_HOME/LICENSE.txt input 
hadoop jar $HADOOP_HOME/hadoop-*examples*.jar wordcount input output 
hadoop fs -cat output/part-* | head
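
Incidentally, if you ever lose track of which clusters you have running, Whirr also has a list-cluster command that reads the same config:

bin/whirr list-cluster --config hadoop.properties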

Once you are done, you can simply destroy your cluster with:

bin/whirr destroy-cluster --config hadoop.properties

A word of warning: this will destroy ALL data on your cluster!

Once your cluster is destroyed, don’t forget to kill your proxy too…

That is about it for an intro to Apache Whirr. Very easy to use, and very powerful!