Spark Streaming is an extension to Spark that adds support for continuous stream processing. It is in active development at UC Berkeley's AMPLab alongside the rest of the Spark project. The group recently gave a presentation at AMP Camp 2012, and the video gives a good overview. If you'd like to follow along with your own copy of the slides, you can obtain them here.
The full paper for Spark Streaming can also be obtained from this link for more detailed information.
Spark Streaming Motivation
As mentioned on the overview page, most Internet-scale systems have real-time data requirements alongside their growing batch processing needs. Spark Streaming is designed to provide near-real-time processing, with approximately one second of latency, to such programs. Common applications with these requirements include website statistics/analytics, intrusion detection systems, and spam filters.
Spark Streaming seeks to support these queries while maintaining fault tolerance similar to that of batch systems, which can recover from both outright failures and stragglers. The developers additionally sought to maximize cost-effectiveness at scale and provide a simple programming model. Finally, they recognized the importance of supporting integration with batch and ad hoc query systems.
What is Spark?
To understand Spark Streaming, it is important to have a good understanding of Spark itself. This section includes a basic architecture overview but leaves out many of the finer details of how Spark functions. The best places for more information are the Spark Project site and the Spark paper, which won the best paper award at USENIX NSDI 2012. The Spark site even includes detailed documentation for setting up your own cluster on EC2, in standalone mode, or on your own hardware.
Spark is an open source cluster computing system developed in the UC Berkeley AMP Lab. The system aims to provide fast computation, programs that are quick to write, and highly interactive queries. Spark significantly outperforms Hadoop MapReduce for certain problem classes and provides a simple Ruby-like interpreter interface. See Figure 1 (right) for examples.
Spark beats Hadoop by providing primitives for in-memory cluster computing, thereby avoiding the I/O bottleneck between the individual jobs of an iterative MapReduce workflow that repeatedly performs computations on the same working set. Spark makes use of the Scala language, which allows distributed datasets to be manipulated like local collections and provides fast development and testing through its interactive interpreter (similar to Python or Ruby).
Spark was also developed to support interactive data mining, in addition to its obvious utility for iterative algorithms. The system is also applicable to many general computing problems. The Spark site includes example programs for File Search, In-memory Search, and Estimating Pi, in addition to the two examples in Figure 1. Beyond these specifics, Spark is well suited to most dataset transformation operations and computations.
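To give a feel for what "manipulating a dataset like a local collection" means, here is a minimal pure-Python sketch of a Monte Carlo Pi estimate written as a map followed by a reduce. It is not the example program from the Spark site and uses no Spark API; the names `inside_unit_circle`, `samples`, and `pi_estimate` are made up for this illustration, and the fixed seed is only there to make the run repeatable.

```python
import random
from functools import reduce

rng = random.Random(42)  # fixed seed (assumption) so the sketch is repeatable


def inside_unit_circle(_: int) -> int:
    # Draw a random point in the unit square; return 1 if it falls inside the circle.
    x, y = rng.random(), rng.random()
    return 1 if x * x + y * y <= 1.0 else 0


samples = 100_000
# Map each sample index to 0 or 1, then reduce by summing. On a cluster the
# same two steps would run over a distributed dataset instead of a local range.
hits = reduce(lambda a, b: a + b, map(inside_unit_circle, range(samples)))
pi_estimate = 4.0 * hits / samples
```

The point of the sketch is the shape of the program: the same map and reduce steps describe the computation whether the collection is local or spread across a cluster.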
Additionally, Spark is built to run on top of the Apache Mesos cluster manager. This allows Spark to operate on a cluster side-by-side with Hadoop, Message Passing Interface (MPI), Hypertable, and other applications. An organization can therefore develop hybrid workflows that benefit from both dataflow models, without the cost, management, and interoperability concerns that would arise from using two independent clusters.
As “they” often say, there is no free lunch! How does Spark provide its speed and avoid disk I/O while retaining the attractive fault tolerance, locality, and scalability properties of MapReduce? The answer from the Spark team is a memory abstraction they call a Resilient Distributed Dataset (RDD).
Existing distributed memory abstractions such as key-value stores and databases allow fine-grained updates to mutable state. This forces the cluster to replicate data or log updates across machines to maintain fault tolerance. Both of these approaches incur substantial overhead for a data-intensive workload. RDDs instead have a restricted interface that only allows coarse-grained updates that apply the same operation to many data items (such as map, filter, or join).
This allows Spark to provide fault-tolerance through logs that simply record the transformations used to build a dataset instead of all the data. This high-level log is called a lineage and the figure above shows a code snippet of the lineage being utilized. Since parallel applications, by their very nature, typically apply the same operations to a large portion of a dataset, the coarse-grained restriction is not as limiting as it might seem. In fact, the Spark paper showed that RDDs can efficiently express programming models from numerous separate frameworks including MapReduce, DryadLINQ, SQL, Pregel, and HaLoop.
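The lineage idea can be illustrated with a toy model. The following is a minimal Python sketch, not Spark's implementation: the `ToyRDD` class and all of its methods are hypothetical names invented for this illustration. Each dataset records only its parent and the coarse-grained operation that derives it, so a lost in-memory copy can be rebuilt by replaying those operations.

```python
from typing import Callable, List, Optional


class ToyRDD:
    """Hypothetical stand-in for an RDD: keeps lineage, caches data in memory."""

    def __init__(self, parent: Optional["ToyRDD"], op: Callable[[List], List], name: str):
        self.parent = parent  # dataset this one was derived from (None for a source)
        self.op = op          # coarse-grained operation applied to the whole parent
        self.name = name
        self._cache: Optional[List] = None  # in-memory copy; may be lost on failure

    @staticmethod
    def from_source(load: Callable[[], List], name: str = "source") -> "ToyRDD":
        # A source reloads its data from stable storage (simulated by `load`).
        return ToyRDD(None, lambda _: load(), name)

    def map(self, f: Callable) -> "ToyRDD":
        return ToyRDD(self, lambda rows: [f(r) for r in rows], "map")

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(self, lambda rows: [r for r in rows if pred(r)], "filter")

    def collect(self) -> List:
        # Recompute from lineage only when the cached copy is missing.
        if self._cache is None:
            parent_rows = self.parent.collect() if self.parent else []
            self._cache = self.op(parent_rows)
        return self._cache

    def lose_data(self) -> None:
        # Simulate a node failure that wipes the in-memory data.
        self._cache = None

    def lineage(self) -> List[str]:
        chain = self.parent.lineage() if self.parent else []
        return chain + [self.name]


log_lines = ["ERROR disk full", "INFO started", "ERROR timeout"]
errors = ToyRDD.from_source(lambda: list(log_lines)).filter(lambda l: l.startswith("ERROR"))
messages = errors.map(lambda l: l.split(" ", 1)[1])

first = messages.collect()
messages.lose_data()          # the data is gone, the lineage is not
rebuilt = messages.collect()  # recomputed by replaying the recorded map step
```

Note that the log here is just three short entries (source, filter, map) regardless of how many records the dataset holds, which is why this approach is cheaper than logging every update.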
Spark also provides additional fault tolerance by allowing a user to specify a persistence condition on any transformation, which causes its result to be written to disk immediately. Data locality is maintained by allowing users to control data partitioning based on a key in each record. (One example use of this partitioning scheme is to ensure that two datasets that will be joined are hashed in the same way.) Spark maintains scalability beyond memory limitations for scan-based operations by storing partitions that grow too large on disk.
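The join example in parentheses can be sketched in a few lines of plain Python. This is an illustration under assumptions, not Spark's partitioner: `partition_of`, `hash_partition`, and `local_join` are hypothetical helpers, and the CRC32 hash is simply a convenient deterministic choice. The idea it shows is that when both datasets are hashed the same way, partition i of one dataset only ever needs partition i of the other.

```python
import zlib
from collections import defaultdict
from typing import List, Tuple

Record = Tuple[str, str]  # (key, value)


def partition_of(key: str, num_partitions: int) -> int:
    # Deterministic hash so every dataset maps a given key to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hash_partition(records: List[Record], num_partitions: int) -> List[List[Record]]:
    parts: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[partition_of(key, num_partitions)].append((key, value))
    return parts


def local_join(left_part: List[Record], right_part: List[Record]):
    # Join two partitions that are known to hold the same set of keys.
    by_key = defaultdict(list)
    for key, value in right_part:
        by_key[key].append(value)
    return [(key, (lv, rv)) for key, lv in left_part for rv in by_key.get(key, [])]


visits = [("alice", "/home"), ("bob", "/cart"), ("alice", "/checkout")]
profiles = [("alice", "US"), ("bob", "DE")]

n = 4
visit_parts = hash_partition(visits, n)
profile_parts = hash_partition(profiles, n)

# Both sides use the same partitioner, so partition i only joins with partition i
# and no records need to move between partitions.
joined = [row for i in range(n) for row in local_join(visit_parts[i], profile_parts[i])]
```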
As stated previously, Spark is primarily designed to support batch transformations. Any system that needs to make asynchronous fine-grained updates to shared state (such as a datastore for a web application) should use a more traditional system such as a database, RAMCloud, or Piccolo. This figure gives an example of Spark's batch transformations and how the various RDDs and operations are grouped into stages.
Spark Streaming Architecture
Most traditional streaming systems use an update-and-pass methodology that handles a single record at a time. These systems can handle fault tolerance by replication (fast, 2x hardware cost) or by upstream backup with buffered records (slow, ~1x hardware cost), but neither approach scales to hundreds of nodes.
Observations from Batch Systems
With these limitations in mind, the Spark Streaming team looked to batch systems for inspiration on how to improve scaling and noted that:
- batch systems always divide processing into deterministic sets that can be easily replayed
- failed/slow tasks are re-run in parallel on other nodes
- smaller batch sizes have much lower latency
This led to the idea of using the same recovery mechanisms at a much smaller timescale. Spark became an ideal platform to build on because of its in-memory storage and optimization for iterative transformations on working sets.
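The observations above can be made concrete with a small sketch. This is a toy model in plain Python under stated assumptions, not Spark Streaming code: `discretize` and `process_batch` are hypothetical names, and the one-second interval is chosen only to echo the latency target mentioned earlier. It shows a stream being cut into small deterministic batches, each of which can be replayed on its own.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, payload)


def discretize(events: Iterable[Event], interval: float) -> Dict[int, List[str]]:
    """Group events into batches keyed by interval index (timestamp // interval)."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for ts, payload in events:
        batches[int(ts // interval)].append(payload)
    return dict(batches)


def process_batch(batch: List[str]) -> int:
    # A deterministic batch job: the same input always gives the same output,
    # so a failed or slow batch can simply be re-run on another node.
    return sum(1 for word in batch if word == "error")


events = [(0.2, "ok"), (0.7, "error"), (1.1, "error"), (1.9, "error"), (2.5, "ok")]
batches = discretize(events, interval=1.0)
results = {i: process_batch(b) for i, b in batches.items()}
replayed = process_batch(batches[1])  # re-running interval 1 reproduces its result
```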
Spark Streaming provides two types of operators to let users build streaming programs:
- Transformation operators, which produce a new DStream from one or more parent streams. These can be either stateless (i.e., act independently on each interval) or stateful (share data across intervals).
- Output operators, which let the program write data to external systems (e.g., save each RDD to HDFS).
Spark Streaming supports the same stateless transformations available in typical batch frameworks, including map, reduce, groupBy, and join (all operators supported in Spark are extended to this module).
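The difference between stateless and stateful transformations can be sketched by modeling a stream as a list of per-interval batches. This is an assumption made for illustration in plain Python, not the DStream API, and `stateless_counts` and `stateful_counts` are hypothetical names.

```python
from collections import Counter
from typing import Dict, List

# A stream modeled as a list of per-interval batches (one toy "RDD" per interval).
stream: List[List[str]] = [["a", "b"], ["a"], ["b", "b", "c"]]


def stateless_counts(batches: List[List[str]]) -> List[Dict[str, int]]:
    # Each interval is processed independently of every other interval.
    return [dict(Counter(batch)) for batch in batches]


def stateful_counts(batches: List[List[str]]) -> List[Dict[str, int]]:
    # A running total is carried from one interval to the next.
    state: Counter = Counter()
    out = []
    for batch in batches:
        state.update(batch)
        out.append(dict(state))  # snapshot of the state after this interval
    return out


per_interval = stateless_counts(stream)
running = stateful_counts(stream)
```

In the stateless version any interval could be recomputed from its own batch alone, while the stateful version needs the state produced by the previous interval.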
Basic real-time streaming is a useful tool in itself, but the ability to look at aggregates within a given window of time is also very important for most business analytics. To support windowing and aggregation, the Spark Streaming developers have created two evaluation models.
This figure compares the naive variant of reduceByWindow (a) with the incremental variant for invertible functions (b). Both versions compute a per-interval count only once, but the second avoids re-summing each window. Boxes denote RDDs, while arrows show the operations used to compute the window.
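The two variants described for the figure can be sketched over a list of per-interval counts. This is a minimal plain-Python illustration, not the reduceByWindow implementation; the function names and the sample counts are hypothetical. The incremental version relies on the reduce function being invertible, which here means addition paired with subtraction.

```python
from typing import List


def naive_window_sums(counts: List[int], window: int) -> List[int]:
    # Re-sum every interval in the window each time the window slides.
    return [sum(counts[max(0, t - window + 1): t + 1]) for t in range(len(counts))]


def incremental_window_sums(counts: List[int], window: int) -> List[int]:
    # Add the newest interval and subtract the one that just left the window.
    # This only works because addition has an inverse (subtraction).
    out, total = [], 0
    for t, c in enumerate(counts):
        total += c
        if t >= window:
            total -= counts[t - window]
        out.append(total)
    return out


per_interval_counts = [3, 1, 4, 1, 5, 9]
naive = naive_window_sums(per_interval_counts, window=3)
incremental = incremental_window_sums(per_interval_counts, window=3)
```

Both functions return the same sums, but the naive one does work proportional to the window length on every slide, while the incremental one does a constant amount.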
Spark Streaming also contains an operator for time-skewed joins, which lets users join a stream against its own RDDs from some time in the past to compute trends, such as how current page view counts compare to those from five minutes ago.
For output, Spark Streaming allows all RDDs to be written to a supported storage system using the save operator. Alternatively, the foreach operator can be used to process each RDD with a custom user code snippet.
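A time-skewed join can be pictured as pairing each interval with an earlier interval of the same stream. The sketch below is plain Python under assumptions, not the operator itself: `skewed_join` is a hypothetical helper, each dict stands in for one interval's RDD, and a lag of two intervals stands in for "five minutes ago".

```python
from typing import Dict, List, Tuple

# Page-view counts per interval; each dict stands in for one interval's RDD.
views: List[Dict[str, int]] = [
    {"/home": 10, "/cart": 2},
    {"/home": 12, "/cart": 4},
    {"/home": 30, "/cart": 3},
]


def skewed_join(stream: List[Dict[str, int]], lag: int) -> List[Dict[str, Tuple[int, int]]]:
    # Pair each interval with the interval `lag` steps earlier, joined on key.
    out = []
    for t in range(lag, len(stream)):
        now, past = stream[t], stream[t - lag]
        out.append({k: (now[k], past[k]) for k in now if k in past})
    return out


# Trend per page: current count divided by the count `lag` intervals ago.
trends = [
    {page: cur / old for page, (cur, old) in joined.items()}
    for joined in skewed_join(views, lag=2)
]
```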
Integration with Batch Processing
Since Spark Streaming is built on the same fundamental data structure (RDDs) as Spark, streaming data can be integrated with other batch processing for analysis, and users will be able to develop algorithms that are potentially portable across their entire analytics stack.
The main shortcoming in this regard is that Spark can't currently share RDDs across process boundaries. As of Tathagata's presentation in September, no solution exists other than writing these RDDs to storage. He did mention that a tool to enable this is in development, but gave no timeframe for a release.
Spark Streaming Performance
Since RDDs are kept in memory, Spark Streaming has impressive performance on the benchmarks included in the paper and presentation. In the video from AMP Camp, Tathagata Das states that they achieved 4 GB/s (40M records/s) on a 100-node cluster with less than 1 second of latency, in addition to the results listed below from smaller cluster sizes.
The ability to recreate lost data in parallel from other RDDs in memory using the lineage also gives impressive performance in the presence of failures. Average interval processing times only spike for WordCount for the next thirty seconds and remain near real-time in that interval.
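As a quick sanity check on the quoted figures, the arithmetic below derives the implied record size and per-node rate. It is only a back-of-the-envelope calculation and assumes decimal gigabytes (1 GB = 10^9 bytes), which the presentation figures do not specify.

```python
GB = 10**9  # assumption: decimal gigabytes

throughput_bytes_per_s = 4 * GB         # 4 GB/s, as quoted
throughput_records_per_s = 40_000_000   # 40M records/s, as quoted
nodes = 100

bytes_per_record = throughput_bytes_per_s / throughput_records_per_s
records_per_s_per_node = throughput_records_per_s / nodes
```

Under that assumption the numbers imply records of roughly 100 bytes and about 400,000 records/s per node, which agrees with the per-node figure given for Spark Streaming in the comparison table.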
Comparison To Other Systems
|System|Throughput|
|---|---|
|Storm (Twitter)|10,000 records/s/node|
|Spark Streaming|400,000 records/s/node|
|Apache S4|7,000 records/s/node|
|Other Commercial Systems|100,000 records/s/node|
The main reason cited by Tathagata for Spark's performance gain over Storm is the aggregation of small records that occurs through the mechanics of RDDs. These results are meant as general comparison metrics; individual programs will have varying performance, as shown in the Grep and WordCount benchmarks above.
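To put the table in relative terms, the short calculation below divides the Spark Streaming figure by each of the others. The numbers are copied from the table; the resulting ratios are simple arithmetic on those general comparison metrics and not additional benchmark results.

```python
# Per-node throughput figures copied from the comparison table (records/s/node).
throughput = {
    "Storm (Twitter)": 10_000,
    "Spark Streaming": 400_000,
    "Apache S4": 7_000,
    "Other Commercial Systems": 100_000,
}

baseline = throughput["Spark Streaming"]
# How many times higher the Spark Streaming figure is than each other system's.
speedup = {
    name: baseline / rate
    for name, rate in throughput.items()
    if name != "Spark Streaming"
}
```

On these figures Spark Streaming's quoted rate is 40 times Storm's, about 57 times that of S4, and 4 times that of the other commercial systems.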
Where Can I Get It?
The Spark Streaming source code is not publicly available, but its release is expected in the near future. The Spark community has an active Google group, so this thread will be a good place to catch the release announcement. The main Spark Project site will also likely have a release statement. No specific date has been set, but based on developer comments in the thread and the fact that Spark 0.6.0 is out on schedule, Spark Streaming might be available as early as mid-November.
Since Spark Streaming has yet to be released, it's difficult to find much about the project anywhere other than the main website, but I did come across a critical but fair review from Maarten Ectors at Telruptive.com. The article talks about Spark in general, but highlights some practical challenges that may limit user adoption if not addressed.
He first highlights Spark's use of Scala as the primary API language: "Scala and has a very straight forward syntax to run applications from the command line or via compiled code. The possibilities to run iterative operations over large datasets or very compute intensive operations in parallel, make it ideal for big data analytics and distributed machine learning."
From there he highlights the challenges involved in actually integrating Spark, Mesos, and Hadoop on a single cluster. Since Mesos is also in early development and just recently became an Apache Incubator project, the code and documentation are in a state of flux. He points out that Spark only requires HDFS and that a full Hadoop installation should not be made necessary unless it is desired.
He also points to the fact that Storm can essentially be installed on EC2 with one click and that complex installation requirements will always deter users. Thankfully, it seems the Spark team took this criticism to heart (or realized the problem on their own). As I mentioned at the top of the page, the freshly released Spark 0.6.0 includes robust documentation including an EC2 deployment guide and an automated EC2 installation script.
He also makes an interesting point about adding an integration SDK and points out that strong integration with a system like Cassandra could yield impressive scaling and performance due to improved write speed/throughput, especially for small RDDs. HBase would likely be a good option for similar extensions.
Below is a list of all the references I used in writing this system survey, in order of their perceived usefulness: