What is Apache Spark?
Apache Spark is a fast and general engine for large-scale data processing.
Related documents
http://spark.apache.org
You can find the latest Spark documentation, including a programming
guide, on the project webpage at http://spark.apache.org/documentation.html.
Setup
Spark needs to be downloaded and installed on your local machine
Spark requires Scala 2.10. The project is built using Simple Build Tool (SBT),
which can be obtained (http://www.scala-sbt.org). If SBT is installed we
will use the system version of sbt otherwise we will attempt to download it
automatically. To build Spark and its example programs, run:
./sbt/sbt assembly
Once you’ve built Spark, the easiest way to start using it is the shell:
./bin/spark-shell
Or, for the Python API, the Python shell (`./bin/pyspark`).
Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> <params>`. For example:
./bin/run-example org.apache.spark.examples.SparkLR local[2]
will run the Logistic Regression example locally on 2 CPUs.
Each of the example programs prints usage help if no params are given.
All of the Spark samples take a `<master>` parameter that is the cluster URL
to connect to. This can be a mesos:// or spark:// URL, or “local” to run
locally with one thread, or “local[N]” to run locally with N threads.
Tests
Testing first requires building Spark. Once Spark is built, tests
can be run using:
`./sbt/sbt test`
Hadoop versions
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
You can change the version by setting the `SPARK_HADOOP_VERSION` environment
when building Spark.
For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
versions without YARN, use:
# Apache Hadoop 1.2.1
$ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
# Cloudera CDH 4.2.0 with MapReduce v1
$ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly
For Apache Hadoop 2.2.X, 2.1.X, 2.0.X, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
with YARN, also set `SPARK_YARN=true`:
# Apache Hadoop 2.0.5-alpha
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
# Cloudera CDH 4.2.0 with MapReduce v2
$ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly
# Apache Hadoop 2.2.X and newer
$ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly
When developing a Spark application, specify the Hadoop version by adding the
“hadoop-client” artifact to your project’s dependencies. For example, if you’re
using Hadoop 1.2.1 and build your application using SBT, add this entry to
`libraryDependencies`:
“org.apache.hadoop” % “hadoop-client” % “1.2.1″
If your project is built with Maven, add this to your POM file’s `<dependencies>` section:
<dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-client</artifactId> <version>1.2.1</version> </dependency>
Spark could be very well suited for more in depth data mining from social streams like Twitter/Facebook
Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Write applications quickly in Java, Scala or Python.
Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala and Python shells.
Combine SQL, streaming, and complex analytics.
Spark powers a stack of high-level tools including Shark for SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these frameworks seamlessly in the same application.
Spark can run on Hadoop 2′s YARN cluster manager, and can read any existing Hadoop data.
If you have a Hadoop 2 cluster, you can run Spark without any installation needed. Otherwise, Spark is easy to run standalone or on EC2 or Mesos. It can read from HDFS, HBase, Cassandra, and any Hadoop data source.
Examples
Once Spark is built, open an interactive Scala shell with
bin/spark-shell
You can then start working with the engine
We will do a quick analysis of an apache2 log file (access.log)
// Load the file up for analysis val textFile = sc.textFile("/var/log/apache2/access.log") // Count the number of lines in the file textFile.count() // Display the first line of the file textFile.first() // Display the Number of lines containing PHP val linesWithPHP = textFile.filter(line => line.contains("PHP")) // Count the lines with PHP val linesWithPHP = textFile.filter(line => line.contains("PHP")).count() // Do the classic MapReduce WordCount example val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b) wordCounts.collect()
Apps should be run as either Maven packaged Java apps, or Scala apps. Please refer to documentation for a HOWTO
Conclusion
Overall a good product, but some Hadoop expertise is required for successful set up and working
Mid level or senior level developer required
Some Scala language expertise is advantageous