Data-Intensive Systems:

Real-time Stream Processing

Facebook Insights Tech Talk Video

This video provides a great overview of Facebook Insights. Unfortunately, the slides were not made available after the talk, but this related presentation is available. I highly recommend watching the presentation and then reading the SIGMOD'11 paper.

Real-time Hadoop at Facebook

Social plug-ins are an extremely popular Facebook extension and appear on over 100,000 sites. One of the driving design principles of Facebook is that better content means a better site with a more engaging user experience. To this end, Facebook provides a tool called Facebook Insights to content developers in the form of an interactive web portal that presents business analytics related to their plug-ins.

The original analytics model for this relied entirely on Hive and Hadoop and could only provide results within 48 hours (at a 75% SLA). Batch processing could handle the scale, but definitely not in real time. Facebook reports that the raw input for the Insights system consists of approximately 20 billion events a day. The data can be highly skewed by site and includes numerous event types: impressions, likes, news feed impressions, and news feed clicks, among others. These event types are then used as input for more than a hundred different metrics, as sketched below. To clarify, Facebook uses the term impression for the total number of times an item is rendered in a browser (this includes content users actually read as well as items they may simply scroll past).
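To make the event-to-metric relationship concrete, here is a minimal sketch of how a single raw event can fan out into several per-domain counters. This is my own illustration, not Facebook's code, and the event and metric names are hypothetical:

    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;
    import java.util.HashMap;
    import java.util.Map;

    // Toy aggregator: every raw event bumps the all-time counter for its type
    // plus an hourly rollup, per domain.  The real pipeline derives 100+ metrics.
    public class MetricFanOut {
        private static final DateTimeFormatter HOUR =
                DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH").withZone(ZoneOffset.UTC);

        private final Map<String, Long> counters = new HashMap<>();

        public void record(String domain, String eventType, Instant when) {
            bump(domain + ":" + eventType);                           // e.g. "example.com:impression"
            bump(domain + ":" + eventType + "@" + HOUR.format(when)); // hourly rollup (illustrative)
        }

        private void bump(String key) {
            counters.merge(key, 1L, Long::sum);
        }
    }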

Being able to monitor historical trends does provide some business value when analyzing marketing campaigns or other major strategies. However, most data on social networks is consumed in near real time. Facebook recognized that providing results to developers closer to real time could help generate more engaging content by allowing more dynamic interaction with users. The finer-grained details of a real-time system also provide much more information than a delayed 48-hour aggregate. Consequently, the primary goal of Facebook's redesigned system was to provide analytics results to developers with only 30 seconds of latency.

To achieve this goal, the Insights team considered several approaches. The first idea was to use simple MySQL counters that would update based on raw data access. This model is used elsewhere in Facebook, but at the scale of Insights it led to extensive lock contention on counter writes and was very easy to overload; it simply ran into the same problems that occur in most parallel database architectures. The next idea was to use in-memory counters, which eliminated the scaling issues but replaced them with fault-tolerance problems. Inaccurate metrics have little use, so this idea was also abandoned. As mentioned previously, they first implemented a MapReduce/Hive-based solution, which did provide high availability and accuracy with no I/O lock contention issues. However, processing dependencies in the MapReduce framework made it impossible to deliver results in real time and made it difficult for Facebook to offer a specific SLA to developers.

In the end, Facebook settled on a solution that relies on HBase to deliver results. The details of HBase are discussed in a later section of this article. In summary, Facebook chose HBase over their own similar product, Cassandra, based on product maturity, high availability, and HBase's faster write rate (the main bottleneck to be solved for Insights).

HBase Architecture

Before diving into Insights itself, I felt it would be helpful to review HBase. HBase was developed by Apache to run on top of HDFS and work in conjunction with Hadoop. HDFS was designed primarily to support sequential reading and writing of large files or large portions of files. Consequently, the HDFS API does not provide fast individual record lookups in files. However, there are many cases where such lookups are quite useful within the context of a MapReduce job. HBase was therefore developed as an extension to HDFS to support fast individual and group/table operations.
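As a concrete illustration, a single-record lookup that would require scanning a file in raw HDFS becomes a keyed Get in HBase. This is a minimal sketch using the standard HBase Java client, not code from the paper, and the table and column names are my own:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PointLookup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("insights"))) {   // hypothetical table
                Get get = new Get(Bytes.toBytes("example.com"));                 // row key = domain
                Result r = table.get(get);                                       // single-record lookup
                long likes = Bytes.toLong(r.getValue(Bytes.toBytes("counts"),    // column family
                                                     Bytes.toBytes("likes")));   // column qualifier
                System.out.println("likes = " + likes);
            }
        }
    }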

HBase is broadly categorized as a "NoSQL" database and is based on Google's BigTable system. Some key features of the BigTable model are column group/family oriented storage, tables that are dynamically split across nodes into tablets as they grow, and a single master that coordinates tablet servers and replication. Although "Base" is part of the name, the developers prefer to describe HBase as a datastore. The main reason for the distinction is that HBase doesn't include many of the features that would typically be found in a traditional RDBMS. For instance, there are no typed columns or secondary indices.
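For example, a table is declared only in terms of its column families; individual columns are untyped byte arrays created on the fly. A minimal sketch using the older HTableDescriptor-based admin API (table and family names are my own):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CreateTable {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Admin admin = conn.getAdmin()) {
                // Only column families are declared up front; columns inside a
                // family are untyped and can differ from row to row.
                HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("insights"));
                desc.addFamily(new HColumnDescriptor("counts"));   // hypothetical family
                admin.createTable(desc);
            }
        }
    }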

However, HBase has gained impressive support for linear and modular scaling by breaking away from the RDBMS/ACID model. As with HDFS and Hadoop, HBase is designed to run on commodity hardware, and when a cluster doubles its number of RegionServers, it doubles in both storage and processing capacity. In a traditional database, such linear scaling only holds up to the maximum capacity of a single machine (and typically involves specialized hardware and storage devices).

The key features of HBase are:

The most important caveat for using HBase is that it only reaches true efficiency at large scale. The rough estimate provided in the architecture overview is that HBase is meant for tables with hundreds of millions or billions of rows. Smaller tables can likely be managed more efficiently by a traditional database. Additionally, even with large tables, HBase is only efficient in a cluster environment where parallel execution and I/O can occur. HDFS (which HBase is built on top of) is not efficient for clusters with fewer than five DataNodes plus a NameNode.
[Figure: HBase Architecture]
The above diagram illustrates how HBase actually stores its data. The client on the host machine interacts with ZooKeeper, a separate Apache project that HBase and other Apache projects use for coordination. ZooKeeper provides the client with the host name for the root region, which in turn allows the client to locate and interact with the server hosting the .META. table. The .META. table is used to determine which RegionServers hold the keys the client is looking for. This information is cached by the client, allowing it to interact directly with RegionServers from then on. The HMaster is the node responsible for assigning and managing this region-to-RegionServer mapping.
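In the client API, this lookup chain is hidden behind the connection: the client is configured only with the ZooKeeper quorum, and region locations are resolved and cached on its behalf. A minimal sketch (the host and table names are placeholders of my own):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Table;

    public class ClientSetup {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            // The client only needs to know where ZooKeeper is; ZooKeeper points it
            // at the catalog region, and region locations are then cached locally.
            conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("insights"))) {
                // Reads and writes issued through 'table' go directly to the
                // RegionServer that owns the relevant row's region.
            }
        }
    }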

Each region contains a write-ahead log, which supports HBase's consistency guarantees. Reads and updates from a client interact with Store objects (which correspond to HBase's column family groupings). These stores are further broken into actual HBase files (HFiles), which are then written to HDFS and replicated across multiple DataNodes.
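A single write ties these pieces together: the Put names a column family (and therefore a Store), is recorded in the region's write-ahead log before being acknowledged, and eventually reaches HFiles on HDFS when the store's in-memory buffer is flushed. A minimal sketch, again with hypothetical table and column names:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DurableWrite {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("insights"))) {
                Put put = new Put(Bytes.toBytes("example.com"));      // row key
                put.addColumn(Bytes.toBytes("counts"),                // column family -> Store
                              Bytes.toBytes("likes"),
                              Bytes.toBytes(42L));
                put.setDurability(Durability.SYNC_WAL);               // sync the write-ahead log
                table.put(put);
            }
        }
    }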

Key HBase Features

Here is a list of the specific HBase features that the Facebook team found most relevant to their needs:

Facebook Insights (Dataflow)

The blogging site highscalability.com has provided an excellent description of the dataflow architecture of Facebook Insights, which I present below with some additional comments:

Initial Action

Scribe

Ptail

Puma

Rendering Engine

Facebook Insights

Caching

MapReduce

Facebook Messaging

In addition to the Insights system, Facebook also decided to utilize HBase to support another real-time application: their new Messaging system, which combines existing Facebook messages with e-mail, chat, and SMS. This system has volume requirements similar to Insights, with millions of messages and billions of instant messages being written every day. One of Facebook's design choices was to keep all messages until explicitly deleted by users. The main consequence of this is that old messages that are rarely looked at still need to be available with low latency at any time, which makes standard archiving of the data infeasible. HBase's high write throughput, low-latency reads, persistent storage, and dynamic columns make it an ideal storage system for this application.
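To illustrate how dynamic columns fit this workload, here is a minimal schema sketch of my own (not Facebook's actual message schema): one row per user, with one column per message whose qualifier encodes the timestamp, so an entire mailbox, old messages included, is readable with a single keyed request:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MailboxSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("messages"))) {   // hypothetical table
                byte[] user = Bytes.toBytes("user:12345");
                byte[] fam  = Bytes.toBytes("msg");

                // Write: a new column is created per message -- no schema change needed.
                Put put = new Put(user);
                put.addColumn(fam, Bytes.toBytes("2012-05-01T10:03:07Z"),        // qualifier = timestamp
                              Bytes.toBytes("Hi! Lunch today?"));
                table.put(put);

                // Read: one Get returns the whole mailbox, however old the messages are.
                Result mailbox = table.get(new Get(user).addFamily(fam));
                mailbox.getFamilyMap(fam).forEach((q, v) ->
                        System.out.println(Bytes.toString(q) + " -> " + Bytes.toString(v)));
            }
        }
    }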

Performance Data & Source Code

Unlike Cassandra, Facebook's Puma program has not been made into an open source project, and the paper only includes general performance trends. Since this code and data are much more closely tied to Facebook's operations, there is a low likelihood that they will be made publicly available in the near future, especially now that Facebook is a publicly traded company. It would be interesting to see how the analytics system would fare if HBase were swapped out for Spark or another alternative.

The Insights team does mention that they were able to meet their 30-second window with an SLA above 99% and that they've been able to handle more than 10,000 writes per node per second.
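As a rough back-of-envelope check (my own, not a figure from the paper): 20 billion events per day works out to roughly 230,000 events per second on average, so at 10,000 writes per node per second the steady-state load alone accounts for on the order of 20 to 25 nodes of write capacity, before allowing for traffic peaks or the fan-out of each event into multiple metric increments.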

References

  1. Facebook Insights Tech Talk: video
  2. Realtime Apache Hadoop SIGMOD'11 paper: pdf
  3. Realtime Apache Hadoop at Facebook SIGMOD'11 presentation: pdf
  4. Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day: html
  5. HBase: html
  6. Cassandra: html
  7. BigTable: html
  8. HBase Architecture 101 – Storage: html
  9. Analyzing New Facebook Insights – Display and Exported Data: html
  10. Facebook Engineering Blog – Building Realtime Insights: html
  11. Realtime Hadoop usage at Facebook: The Complete Story: html