
An introduction to Apache Spark

What is Apache Spark?

Apache Spark is a fast and general engine for large-scale data processing.

Related documents

You can find the latest Spark documentation, including a programming
guide, on the Apache Spark project webpage.


Spark needs to be downloaded and installed on your local machine.
Spark requires Scala 2.10. The project is built using Simple Build Tool (SBT). If SBT is
already installed on your system, the build will use that version; otherwise it will
attempt to download SBT automatically. To build Spark and its example programs, run:

./sbt/sbt assembly

Once you've built Spark, the easiest way to start using it is the Scala shell:

./bin/spark-shell

Or, for the Python API, the Python shell (`./bin/pyspark`).

Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> <params>`. For example:

./bin/run-example org.apache.spark.examples.SparkLR local[2]

will run the Logistic Regression example locally on 2 CPUs.

Each of the example programs prints usage help if no params are given.

All of the Spark samples take a `<master>` parameter that is the cluster URL
to connect to. This can be a mesos:// or spark:// URL, or “local” to run
locally with one thread, or “local[N]” to run locally with N threads.
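
In a standalone program, that same master URL is what you pass when creating the SparkContext. Here is a minimal sketch; the object name and the toy computation are made up for illustration:

import org.apache.spark.SparkContext

object SimpleApp {
  def main(args: Array[String]) {
    // "local[2]" could equally be a spark://host:7077 or mesos://host:5050 URL
    val sc = new SparkContext("local[2]", "Simple App")
    val data = sc.parallelize(1 to 1000)
    println("Sum: " + data.reduce(_ + _))
    sc.stop()
  }
}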


Testing first requires building Spark. Once Spark is built, tests
can be run using:

`./sbt/sbt test`

Hadoop versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
You can change the version by setting the `SPARK_HADOOP_VERSION` environment
when building Spark.

For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
versions without YARN, use:

# Apache Hadoop 1.2.1
$ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly

# Cloudera CDH 4.2.0 with MapReduce v1
$ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly

For Apache Hadoop 2.2.X, 2.1.X, 2.0.X, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
with YARN, also set `SPARK_YARN=true`:

# Apache Hadoop 2.0.5-alpha
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

# Cloudera CDH 4.2.0 with MapReduce v2
$ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly

# Apache Hadoop 2.2.X and newer
$ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly

When developing a Spark application, specify the Hadoop version by adding the
"hadoop-client" artifact to your project's dependencies. For example, if you're
using Hadoop 1.2.1 and build your application with SBT, add this entry to your
SBT build file's library dependencies:

"org.apache.hadoop" % "hadoop-client" % "1.2.1"
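
For context, a complete (if minimal) SBT build definition might look something like the sketch below; the project name, Scala version, and Spark version are illustrative, so substitute whatever matches your setup:

name := "my-spark-app"

version := "0.1"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "0.9.1",
  "org.apache.hadoop" % "hadoop-client" % "1.2.1"
)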

If your project is built with Maven, add this to your POM file's `<dependencies>` section:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>1.2.1</version>
</dependency>

Spark could also be very well suited to more in-depth data mining of social streams like Twitter and Facebook.
Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Write applications quickly in Java, Scala or Python.
Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala and Python shells.
Combine SQL, streaming, and complex analytics.
Spark powers a stack of high-level tools including Shark for SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these frameworks seamlessly in the same application.
Spark can run on Hadoop 2's YARN cluster manager, and can read any existing Hadoop data.
If you have a Hadoop 2 cluster, you can run Spark on it without any separate installation. Otherwise, Spark is easy to run standalone, on EC2, or on Mesos. It can read from HDFS, HBase, Cassandra, and any other Hadoop data source.
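
As a quick illustration of that last point, once you have a SparkContext (the sc value in the shell example below), reading from HDFS is the same one-liner as reading a local file; the namenode host, port, and path here are made up:

// Adjust the host, port and path to match your own HDFS cluster
val hdfsFile = sc.textFile("hdfs://namenode:8020/logs/access.log")
hdfsFile.count()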


Once Spark is built, open an interactive Scala shell with

./bin/spark-shell

You can then start working with the engine.

We will do a quick analysis of an Apache2 log file (access.log).

// Load the file up for analysis
val textFile = sc.textFile("/var/log/apache2/access.log")
// Count the number of lines in the file
textFile.count()
// Display the first line of the file
textFile.first()
// Keep only the lines containing PHP
val linesWithPHP = textFile.filter(line => line.contains("PHP"))
// Count the lines with PHP
linesWithPHP.count()
// Do the classic MapReduce WordCount example
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
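
Note that transformations like reduceByKey are lazy, so nothing is actually read from disk until an action is called. A few follow-up actions you might try on the result (the output will of course depend on your own log file):

// Materialise the word counts as a local array
wordCounts.collect()
// Or just peek at a handful of results
wordCounts.take(10)
// Save the full result back out as text files
wordCounts.saveAsTextFile("/tmp/access-log-wordcounts")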

Apps should be run as either Maven-packaged Java apps or Scala apps; please refer to the Spark documentation for a HOWTO.


Overall a good product, but some Hadoop expertise is required for successful setup and operation.
A mid-level or senior-level developer is required.
Some Scala language expertise is advantageous.

An introduction to Apache Mesos

What is Apache Mesos, you ask? Well, according to the project's web site, it is:

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, MPI, Hypertable, Spark, and other applications on a dynamically shared pool of nodes.

What exactly does that mean? Well, to me, it means that I can deploy applications and certain frameworks into a cluster of resources (VMs) and it will look after resource allocation for me. For example, if you have a particularly computationally intensive task, like a Hadoop Map/Reduce job that requires additional resources, Mesos can allocate more to it temporarily so that your job finishes in the most efficient way possible.

Installing Mesos is not too difficult, but the docs are a little sparse. I set up a vanilla Ubuntu 12.04 LTS VM on my localhost to experiment. I will assume the same for you, so you may not need all of the packages listed below, but please do bear with me.

Disclaimers aside, let’s get cracking on the installation!

The first step is to download the Apache Mesos tarball distribution. I used wget on my VM to grab it from a local mirror; you should pick a suitable mirror from the Apache Mesos downloads page.

Next up, you will need to prepare your machine to compile and host Mesos. On Ubuntu, you need the following packages:

apt-get update && apt-get install python-dev libunwind7-dev libcppunit-dev openjdk-7-jdk autoconf autopoint libltdl-dev libtool autotools-dev make gawk g++ curl libcurl4-openssl-dev

Once that is complete, unpack your Mesos tarball with:

tar xzvf mesos-0.13.0.tar.gz

Change directory to the newly created mesos directory and run the configure script:

cd mesos-0.13.0
./configure

All the configure options should check out, but if not, make sure that you have all the relevant packages installed beforehand!
Next, in the same directory, compile the code:

make

Depending on how many resources you gave your VM, this could take a while, so go get some milk and cookies…

Once the make is done, you should check everything with

make check

This will run a bunch of unit tests and checks that will ensure that there are no surprises later.

After you have done this, you can also set up a small Mesos cluster and run a job on it as follows:
In your Mesos directory, use

bin/mesos-master.sh

to start the master server. Make a note of the IP and port that the master is running on, so that you can use the web-based UI later on!

Open up a browser and point it to http://yourIP:5050 (or http://localhost:5050 if you are working directly on the VM).

Go back to your VM's terminal and type

bin/mesos-slave.sh --master=localhost:5050

and refresh your browser. You should now notice that a slave has been added to your cluster!

Run the C++ test framework (a sample that just runs five tasks on the cluster) using

src/test-framework --master=localhost:5050 

It should successfully exit after running five tasks.
You can also try the example Python or Java frameworks, with commands like the following:

src/examples/python/test-framework --master=localhost:5050
src/examples/java/test-framework --master=localhost:5050

If all of that is running OK, you have successfully completed the Mesos setup. Congratulations!

Follow @ApacheMesos on Twitter, as well as @DaveLester, for more information and goodness!