The ability to perform timely and scalable analytical processing of large datasets is now a critical ingredient for the success of most organizations. We are using Hadoop as the backbone that composes multiple computational and storage engines into a massively parallel data processor for analytical workloads. Many of Hadoop's core components export interfaces that enable extensive customization or the plugging in of new implementations tailored to specific needs. These components include the job scheduler, data storage formats, utilities for run-time extraction of keys and values from input data, schemes that control how a large job is partitioned into smaller tasks, policies for data block placement, and system instrumentation that collects monitoring data. Unfortunately, Hadoop's power and flexibility come at a high cost: the system relies heavily on the user or system administrator to optimize MapReduce jobs, and the system as a whole, at many levels.
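To give a sense of the tuning burden this places on users, the fragment below sketches a handful of the many job-level configuration knobs a Hadoop user can, and often must, set by hand. The parameter names follow Hadoop 2.x conventions; the values shown are purely illustrative, not recommendations.

```xml
<!-- Illustrative mapred-site.xml fragment: a few of the many -->
<!-- configuration parameters a Hadoop user is expected to tune. -->
<configuration>
  <property>
    <name>mapreduce.task.io.sort.mb</name>
    <value>100</value>
    <!-- map-side sort buffer size; too small causes extra spills to disk -->
  </property>
  <property>
    <name>mapreduce.job.reduces</name>
    <value>8</value>
    <!-- degree of reduce parallelism; workload- and cluster-dependent -->
  </property>
  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
    <!-- trades CPU time for reduced shuffle network and disk I/O -->
  </property>
</configuration>
```

Each such knob interacts with the others and with properties of the data and cluster, which is why setting them well requires substantial expertise.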
The Adaptive MapReduce (AMR) project is addressing these challenges using a combination of techniques from cost-based database query optimization, robust and adaptive query processing, static and dynamic program analysis, dynamic data sampling and run-time profiling, and statistical machine learning applied to streams of system instrumentation data.
As part of the AMR project, we introduce Starfish, a self-tuning system for big data analytics. Starfish builds on Hadoop while adapting to user needs and system workloads to provide good performance automatically, without any need for users to understand and manipulate the many tuning knobs in Hadoop. The novelty in Starfish's approach comes from how it focuses simultaneously on different workload granularities—overall workload, workflows, and jobs (procedural and declarative)—as well as across various decision points—provisioning, optimization, scheduling, and data layout.
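As a concrete but highly simplified illustration of the cost-based tuning idea behind Starfish, the sketch below enumerates a small space of knob settings, scores each with an analytical cost model, and returns the cheapest configuration. This is not Starfish's actual optimizer; the cost model, its coefficients, and the knob names are invented for illustration only.

```python
from itertools import product

def estimated_cost(sort_mb, num_reducers):
    """Hypothetical cost model: estimated job time (seconds) as a
    function of two Hadoop-style tuning knobs. Coefficients are
    illustrative, not measured from any real cluster."""
    spill_penalty = 5000.0 / sort_mb     # small sort buffers cause more spills
    skew_penalty = 200.0 / num_reducers  # too few reducers limits parallelism
    startup_cost = 2.0 * num_reducers    # each reducer adds scheduling overhead
    return spill_penalty + skew_penalty + startup_cost

def tune(sort_mb_choices, reducer_choices):
    """Exhaustively search the knob space and return the cheapest setting."""
    best_sort_mb, best_reducers = min(
        product(sort_mb_choices, reducer_choices),
        key=lambda knobs: estimated_cost(*knobs))
    return {"io.sort.mb": best_sort_mb, "num.reducers": best_reducers}

best = tune([50, 100, 200], [4, 8, 16])
# With this toy model, a large sort buffer and moderate reduce
# parallelism minimize the estimated cost.
```

A real optimizer replaces the hand-written cost model with one learned from profiling data and prunes the search space rather than enumerating it, but the structure of the decision (predict cost, then pick the cheapest configuration) is the same.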
The Adaptive MapReduce project is closely related to the .eX project, in which we are developing a domain-specific language for declaratively specifying the information needed for a system administration task, along with a run-time system that plans and conducts experiments to collect the needed data automatically and efficiently.
Current Project Members
On Tuning Hadoop Parameters
On Elastic Storage
On Partitioned Join Processing
Relevant Past Work
On Efficient Cost Modeling
On Robust and Adaptive Query Processing