Facebook Insights Tech Talk Video
This video provides a great overview of Facebook Insights. Unfortunately, the slides were not made available after the talk, but a related presentation is available. I highly recommend watching the talk and then reading the SIGMOD'11 paper.
Real-time Hadoop at Facebook
Social plug-ins are an extremely popular Facebook extension and appear on over 100,000 sites. One of the driving design principles of Facebook is that better content means a better site with a more engaging user experience. To this end, Facebook provides a tool called Facebook Insights to content developers in the form of an interactive web portal that presents business analytics related to their plug-ins.
The original analytics model for this relied entirely on Hive and Hadoop and was only able to provide results within 48 hours (at a 75% SLA). Batch processing could handle the scale, but definitely not in real time. Facebook reports that the raw input for the Insights system consists of approximately 20 billion events a day. The data can be highly skewed by site and includes numerous event types, including impressions, likes, news feed impressions, and news feed clicks. These event types are then used as input for more than a hundred different metrics. To clarify, Facebook uses the term impression to represent the total number of times an item is rendered in a browser (this includes content users read and items they may simply scroll past).
Being able to monitor historical trends does provide some business value when analyzing marketing campaigns or other major strategies. However, most data on social networks is consumed in near real time. Facebook recognized that providing results to developers closer to real time could help generate more engaging content by allowing more dynamic interaction with users. The finer-grained details of a real-time system also provide much more information than a delayed 48-hour aggregate. Consequently, the primary goal of Facebook's re-designed system was to provide analytics results to developers with only 30 seconds of latency.
To achieve this goal, the Insights team considered several approaches. The first idea was to use simple MySQL DB counters that would update based on raw data access. This model is used elsewhere in Facebook, but the scale of Insights led to extensive lock contention around counter writes and made the counters very easy to overload. It faced the same problems that occur in most parallel database architectures. The next idea was to utilize in-memory counters, which eliminated the scale issues but replaced them with fault-tolerance problems. Inaccurate metrics have little use, so this idea was also abandoned. As mentioned previously, they implemented a MapReduce/Hive-based solution at first, which did provide high availability and accuracy without the I/O and lock contention issues. However, processing dependencies in the MapReduce framework made it impossible to deliver results in real time and made it difficult for Facebook to provide a specific SLA to developers.
In the end, Facebook settled on a solution that relies on HBase to deliver results. The details of HBase are discussed in a later section of this article. In summary, Facebook chose HBase over their own similar product, Cassandra, based on product maturity, high availability, and HBase's faster write rate (write throughput being the main bottleneck to be solved for Insights).
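To make the counter workload concrete, here is a minimal sketch of how a single metric update might be applied with HBase's atomic increment API, using the stock Java client. The table and column names are hypothetical and are not taken from Facebook's actual schema.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class InsightsCounterSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             // "insights_counters" and the column names below are hypothetical.
             Table table = connection.getTable(TableName.valueOf("insights_counters"))) {

            // One row per (domain, URL); each metric is a counter column.
            byte[] rowKey = Bytes.toBytes("example.com/some-article");

            Increment inc = new Increment(rowKey);
            inc.addColumn(Bytes.toBytes("metrics"), Bytes.toBytes("impressions"), 1L);
            inc.addColumn(Bytes.toBytes("metrics"), Bytes.toBytes("likes"), 1L);

            // The increments are applied atomically on the RegionServer,
            // so no client-side read-modify-write or locking is needed.
            table.increment(inc);
        }
    }
}
```

Because the increment is applied server-side, concurrent writers never perform their own read-modify-write cycle, which is exactly what made the MySQL counter approach buckle under contention.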
HBase Architecture
Before diving into Insights itself, I felt it would be helpful to review HBase. HBase was developed by Apache to run on top of HDFS and work in conjunction with Hadoop. HDFS was designed primarily to support sequential reading and writing of large files or large portions of files. Consequently, the HDFS API doesn't provide fast individual record lookups in files. However, there are many cases where such lookups are quite useful within the context of a MapReduce job. HBase was therefore developed as an extension to HDFS to support fast individual record and group/table operations.
HBase is broadly categorized as a "NoSQL" database and is based on Google's BigTable system. Some key features of the BigTable model are column group/family oriented storage, tables that are dynamically split across nodes into tablets as they grow, and a single master that coordinates tablet servers and replication. Although "Base" is part of the name, the developers prefer to describe HBase as a datastore rather than a database. The main reason for the distinction is that HBase doesn't include many of the features that would typically be found in a traditional RDBMS. For instance, there are no typed columns or secondary indices.
However, HBase has gained impressive support for linear and modular scaling by breaking away from RDBMS/ACID models. As with HDFS and Hadoop, HBase is designed to run on commodity hardware, and when a cluster expands by doubling its number of RegionServers, it doubles in both storage and processing capacity. In a traditional database, by contrast, such scaling only holds up to the maximum capacity of a single machine (and typically requires specialized hardware and storage devices).
The key features of HBase are:
- Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore. This makes it very suitable for tasks such as high-speed counter aggregation.
- Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
- Automatic RegionServer failover
- Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.
- MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.
- Java Client API: HBase supports an easy to use Java API for programmatic access.
- Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.
- Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high volume query optimization.
- Operational Management: HBase provides built-in web pages for operational insight, as well as JMX metrics.
The most important caveat for using HBase is that it only reaches true efficiency at large scale. The rough estimate provided in the architecture overview is that HBase is meant for tables with hundreds of millions or billions of rows. Any smaller tables can likely be managed more efficiently by a traditional database. Additionally, even with large tables, HBase is only efficient in a cluster environment where parallel execution and I/O can occur. HDFS (which HBase is built on top of) is not efficient for clusters with fewer than five DataNodes plus a NameNode.

The above diagram illustrates how HBase actually stores its data. The client on the host machine interacts with ZooKeeper, a separate Apache project that HBase and other Apache projects use for coordination. ZooKeeper provides the client with the host name for the root region, which in turn allows the client to locate and interact with the server hosting the .META. table. The .META. table is used to determine which RegionServers hold the keys the client is looking for. This information is cached by the client, allowing it to interact directly with RegionServers from then on. The HMaster is the node responsible for assigning and managing this Region-to-RegionServer mapping.
Each RegionServer maintains a write-ahead log, which supports HBase's consistency guarantees. Reads and updates from a client interact with Store objects (which correspond to HBase's column family groupings). These stores are further broken into store files (HFiles), which are written to HDFS and replicated across multiple DataNodes.
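From the client's point of view, all of this routing is transparent: the application supplies the ZooKeeper quorum and works against a table handle, while region lookup and the caching of region locations happen inside the client library. The following is a minimal sketch using the stock HBase Java client; the host names, table name, and column names are placeholders.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws IOException {
        // The client only needs the ZooKeeper quorum; locating .META. and the
        // owning RegionServer is handled (and cached) by the library.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("demo_table"))) {

            // Write one cell: row key, column family "cf", qualifier "col".
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
            table.put(put);

            // Read it back from whichever RegionServer owns the row's region.
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```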
Key HBase Features
Here is a list of the specific HBase features that the Facebook team found most relevant to their needs:
- Unlike in a relational database, you don't create mappings between tables.
- You don't create indexes. The only index you have is the primary row key.
- From the row key you can have millions of sparse columns of storage. It's very flexible: you don't have to specify a schema up front. You define column families, to which you can add keys at any time.
- Additionally, different TTLs can be set per column family (the TTL value ties into HBase's versioning system), which allows flexible granularity of data retention over time (see the sketch after this list).
- A key feature for scalability and reliability is the Write Ahead Log (WAL), which records the operations that are supposed to occur.
- Based on the key, data is sharded to a region server.
- Written to WAL first.
- Data is put into memory. At some point in time or if enough data has been accumulated the data is flushed to disk.
- If the machine goes down you can recreate the data from the WAL. So there's no permanent data loss.
- Using a combination of the log and in-memory storage they can handle an extremely high rate of IO reliably.
- HBase handles failure detection and automatically routes across failures.
- Currently HBase resharding is done manually:
- Automatic hot spot detection and resharding is on the roadmap for HBase, but it's not there yet.
- Every Tuesday someone looks at the keys and decides what changes to make in the sharding plan.
- Facebook uses MD5 hashing to create keys; this allows them to optimize for scanning and sharding and simplifies the process of weekly manual re-sharding.
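As a rough illustration of the last two points, the sketch below creates a table with two column families carrying different TTLs and builds an MD5-prefixed row key. It uses the standard HBase Java admin and client APIs, but the table name, family names, TTL values, and key layout are hypothetical rather than Facebook's actual schema.

```java
import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class SchemaSketch {
    public static void main(String[] args) throws IOException, NoSuchAlgorithmException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            // Two column families with different TTLs: hourly counters expire
            // after a week, daily rollups after a year (values are invented).
            TableDescriptorBuilder table =
                TableDescriptorBuilder.newBuilder(TableName.valueOf("metrics_sketch"));
            table.setColumnFamily(ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("hourly"))
                .setTimeToLive(7 * 24 * 3600)
                .build());
            table.setColumnFamily(ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("daily"))
                .setTimeToLive(365 * 24 * 3600)
                .build());
            admin.createTable(table.build());
        }

        // Prefixing the natural key with its MD5 hash spreads lexicographically
        // adjacent URLs across regions, which keeps hot sites from piling onto
        // one RegionServer and makes manual re-sharding by key range predictable.
        String url = "example.com/some-article";
        byte[] digest = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(url));
        byte[] rowKey = Bytes.add(digest, Bytes.toBytes(url));
        System.out.println(Bytes.toStringBinary(rowKey));
    }
}
```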
Facebook Insights (Dataflow)
The blog highscalability.com has provided an excellent description of the dataflow architecture of Facebook Insights, which I present below with some additional comments:
Initial Action
- User clicks Like on a web page (or loads a page generating an impression).
- Fires AJAX request to Facebook.
Scribe
- Request is written to a log file using Scribe.
- Scribe handles issues like file roll over.
- Scribe is built on the same HDFS file store Hadoop is built on.
- Write extremely lean log lines. The more compact the log lines, the more can be stored in memory.
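As a toy illustration of what a "lean" log line might look like, the snippet below serializes an event as a short tab-delimited record. The field layout and the numeric event-type codes are entirely hypothetical; the article does not describe Facebook's actual log format.

```java
public class LeanLogLineSketch {
    public static void main(String[] args) {
        // Hypothetical compact record: epoch seconds, a numeric event-type code,
        // and the URL -- nothing else. Shorter lines mean more events fit in
        // memory downstream.
        long epochSeconds = System.currentTimeMillis() / 1000L;
        int eventType = 1; // e.g. 1 = plugin impression (invented code)
        String url = "example.com/some-article";

        String line = epochSeconds + "\t" + eventType + "\t" + url;
        System.out.println(line);
    }
}
```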
Ptail
- Data is read from the log files using Ptail. Ptail is an internal tool built to aggregate data from multiple Scribe stores. It tails the log files and pulls data out.
- Ptail data is separated out into three streams so they can eventually be sent to their own clusters in different datacenters.
- Plugin impressions
- News feed impressions
- Actions (plugin + news feed)
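Ptail is an internal Facebook tool, so the following is only a hypothetical sketch of the demultiplexing idea: read tailed log lines and route them into the three streams by event type. The line format and type names are assumptions carried over from the earlier log-line sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StreamDemuxSketch {
    public static void main(String[] args) throws IOException {
        // One bucket per downstream stream; in the real pipeline each stream
        // would be shipped to its own cluster rather than held in memory.
        Map<String, List<String>> streams = new HashMap<>();
        streams.put("plugin_impressions", new ArrayList<>());
        streams.put("newsfeed_impressions", new ArrayList<>());
        streams.put("actions", new ArrayList<>());

        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Assume tab-separated lines of the form: timestamp \t type \t url
                String[] fields = line.split("\t", 3);
                if (fields.length < 3) {
                    continue; // skip malformed lines
                }
                switch (fields[1]) {
                    case "plugin_impression":
                        streams.get("plugin_impressions").add(line);
                        break;
                    case "newsfeed_impression":
                        streams.get("newsfeed_impressions").add(line);
                        break;
                    default:
                        streams.get("actions").add(line);
                }
            }
        }
        streams.forEach((name, lines) ->
                System.out.println(name + ": " + lines.size() + " lines"));
    }
}
```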
Puma
- Batch data to lessen the impact of hot keys. Even though HBase can handle a lot of writes per second, they still want to batch data. A hot article will generate a lot of impressions and news feed impressions, which causes huge data skews and, in turn, IO issues. The more batching the better.
- Batch for 1.5 seconds on average. Would like to batch longer but they have so many URLs that they run out of memory when creating a hashtable.
- Wait for the last flush to complete before starting a new batch to avoid lock contention issues.
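Puma is likewise an internal Facebook system, so here is only a hypothetical sketch of the micro-batching idea it describes: accumulate per-URL counts in memory for a short window, then flush each key as a single HBase increment before starting the next window. The queue, table, and column names are invented for illustration.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Increment;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MicroBatchSketch {
    private static final long BATCH_WINDOW_MS = 1500; // ~1.5 s, as described above

    public static void main(String[] args) throws IOException, InterruptedException {
        BlockingQueue<String> events = new LinkedBlockingQueue<>(); // fed by the tailer
        try (Connection connection =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("insights_counters"))) {
            while (true) {
                // Collapse a hot URL's many events into one counter delta per window.
                Map<String, Long> counts = new HashMap<>();
                long deadline = System.currentTimeMillis() + BATCH_WINDOW_MS;
                while (System.currentTimeMillis() < deadline) {
                    String url = events.poll(50, TimeUnit.MILLISECONDS);
                    if (url != null) {
                        counts.merge(url, 1L, Long::sum);
                    }
                }
                // Flush the whole batch synchronously, then start the next window;
                // this mirrors "wait for the last flush to complete".
                for (Map.Entry<String, Long> entry : counts.entrySet()) {
                    Increment inc = new Increment(Bytes.toBytes(entry.getKey()));
                    inc.addColumn(Bytes.toBytes("metrics"), Bytes.toBytes("impressions"),
                            entry.getValue());
                    table.increment(inc);
                }
            }
        }
    }
}
```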
Rendering Engine
- UI Renders Data - Frontends are all written in PHP.
- The backend is written in Java and Thrift is used as the messaging format so PHP programs can query Java services.
- The below graph provides an example of how this information is presented to the user:

Caching
- Caching solutions are used to make the web pages display more quickly.
- Performance varies by statistic. A counter can come back quickly; finding the top URLs in a domain can take longer. Responses range from 0.5 to a few seconds.
- The more data is cached, and the longer it is cached, the less real-time it is.
- Set different caching TTLs in memcache.
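The article doesn't say which memcache client or key scheme is used, so the sketch below simply illustrates per-statistic TTLs with the open-source spymemcached client; the keys, TTL values, and host are illustrative assumptions.

```java
import java.io.IOException;
import java.net.InetSocketAddress;

import net.spy.memcached.MemcachedClient;

public class CacheTtlSketch {
    public static void main(String[] args) throws IOException {
        MemcachedClient cache = new MemcachedClient(new InetSocketAddress("localhost", 11211));

        // A cheap counter can tolerate a short TTL and stay close to real time...
        cache.set("counter:example.com:impressions", 10, 12345L);

        // ...while an expensive "top URLs in domain" result is cached longer,
        // trading freshness for render latency.
        cache.set("top_urls:example.com", 300, "url1,url2,url3");

        Object cached = cache.get("counter:example.com:impressions");
        System.out.println(cached);
        cache.shutdown();
    }
}
```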
MapReduce
- The data is then sent to MapReduce servers so it can be queried via Hive.
- This also serves as a backup plan as the data can be recovered from Hive.
- Raw logs are removed after a period of time.
Facebook Messaging
In addition to the Insights system, Facebook also decided to utilize HBase to support another real-time application: their new Messaging system, which combined existing Facebook messages with e-mail, chat, and SMS. This system had volume requirements similar to Insights, with millions of messages and billions of instant messages being written every day. One of Facebook's design choices was to keep all messages until explicitly deleted by users. The main consequence of this is that old messages that are rarely looked at still need to be available with low latency at any time, which makes standard archiving of the data infeasible. HBase's high write throughput, low-latency reads, persistent storage, and dynamic columns make it an ideal storage system for this application.
Performance Data & Source Code
Unlike Cassandra, Facebook's Puma program has not been released as an open source project, and the paper only includes general performance trends. I expect that, since this code and data are much more closely tied to Facebook's operations, there is a low likelihood that they will be made publicly available in the near future, especially now that Facebook is a publicly traded company. It would be interesting to see how the analytics system would fare if Spark or another alternative were swapped in for HBase.
The Insights team does mention that they were able to meet their 30-second window with an SLA above 99% and that they've been able to handle more than 10,000 writes per node per second.
References
- Facebook Insights Tech Talk: video
- Realtime Apache Hadoop SIGMOD'11 Paper: pdf
- Realtime Apache Hadoop at Facebook SIGMOD'11 Presentation: pdf
- Facebook's New Realtime Analytics System: HBase to Process 20 Billion Events Per Day: html
- HBase: html
- Cassandra: html
- BigTable: html
- HBase Architecture 101 – Storage: html
- Analyzing New Facebook Insights – Display and Exported Data: html
- Facebook Engineering Blog - Building Realtime Insights: html
- Realtime Hadoop usage at Facebook: The Complete Story: html
