Data-Intensive Systems:

Real-time Stream Processing

Scaling Real-time Applications

The development of batch processing systems such as Google MapReduce and Apache Hadoop has made efficient data processing possible at Internet ("Big Data") scale. Since 2008, Hadoop has set several sorting benchmark records, seen widespread industry adoption, and prompted the rise of enterprise-level support companies such as Cloudera and Hortonworks. Hadoop MapReduce alone is not suitable for all workflows, but the ecosystem surrounding it includes tools that extend its capabilities (and those of the underlying file system, HDFS) in numerous ways.

One such extension category is real-time stream processing. Batch processing systems scale impressively, but the MapReduce framework achieves its efficiency by amortizing I/O costs over large blocks of data processed in parallel across machines, and it is optimized for jobs whose execution time ranges from minutes to days. That design is a poor fit for applications that need results shortly after the data arrives.
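As a rough illustration of the amortization argument above, the sketch below models a job that pays a fixed startup and I/O overhead once per batch. The function names and all numbers are hypothetical and chosen only to show the trade-off; they are not measurements of Hadoop or any other system.

```python
def per_record_cost(fixed_overhead_s, per_record_s, batch_size):
    """Average seconds spent per record when one fixed overhead is shared by a batch."""
    return fixed_overhead_s / batch_size + per_record_s


def batch_latency(fixed_overhead_s, per_record_s, batch_size):
    """Seconds until the whole batch has been processed."""
    return fixed_overhead_s + per_record_s * batch_size


if __name__ == "__main__":
    # Hypothetical values: 8 s of fixed overhead per job, 0.5 ms of work per record.
    overhead, work = 8.0, 0.0005
    for size in (1, 1_000, 1_000_000):
        print(
            f"batch={size:>9}  "
            f"cost/record={per_record_cost(overhead, work, size):.6f}s  "
            f"latency={batch_latency(overhead, work, size):.1f}s"
        )
```

Under these assumptions, larger batches drive the average cost per record down toward the pure processing cost, while the time before any result is available keeps growing. That tension is what the stream processing systems surveyed here try to resolve.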

The purpose of this site is to provide a detailed survey of three real-time stream processing systems currently in development:

System: Description
Spark Streaming (dstreams): This system is built on top of Spark and uses its Resilient Distributed Dataset (RDD) memory abstraction to deliver real-time performance.
Facebook Insights (Puma): Facebook's system uses HBase to process streaming analytics data for social plugins on more than 100,000 external sites.
Hadoop Online Prototype (HOP): HOP attempts to provide real-time processing support directly through MapReduce by modifying the structure of Hadoop to enable pipelining without significant losses in reliability or recoverability.

Each of the links to the left points to a sub-page that aggregates and analyzes publicly available information about these systems, including links to the main project pages, video presentations, and code examples/instructions when available.