Research Projects
Starfish: A Self-tuning System for Big Data Analytics
The ability to perform timely and scalable analytical processing of large datasets is now a critical ingredient for the success of most organizations. The Hadoop software stack (which consists of an extensible MapReduce execution engine, pluggable distributed storage engines, and a range of procedural to declarative interfaces) is a popular choice for big data analytics. Unfortunately, Hadoop's power and flexibility come at a high cost: the system relies heavily on the user or system administrator to optimize MapReduce jobs and the overall system at various levels. We introduce Starfish, a self-tuning system for big data analytics.
Starfish builds on Hadoop while adapting to user needs and system workloads to provide good performance automatically, without requiring users to understand and manipulate Hadoop's many tuning knobs. Starfish uses a combination of techniques from cost-based database query optimization, robust and adaptive query processing, static and dynamic program analysis, dynamic data sampling and run-time profiling, and statistical machine learning applied to streams of system instrumentation data. The novelty of Starfish's approach lies in focusing simultaneously on different workload granularities (the overall workload, workflows, and jobs, both procedural and declarative) as well as across various decision points (provisioning, optimization, scheduling, and data layout).
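To make the setting concrete, here is a minimal word-count job written as black-box map and reduce functions and run on a toy in-memory engine. This sketch is purely illustrative and is not part of Starfish or Hadoop; it only mimics the map/shuffle/reduce structure that a real MapReduce engine executes at scale.

```python
from collections import defaultdict

# Black-box user functions, as a MapReduce system sees them: the engine
# knows nothing about their internals, only their observed behavior.
def map_fn(_, line):
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    yield word, sum(counts)

def run_job(records, map_fn, reduce_fn):
    """Tiny in-memory stand-in for a MapReduce execution engine."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):   # map phase
            groups[k].append(v)           # shuffle: group values by key
    out = {}
    for k, vs in groups.items():          # reduce phase
        for rk, rv in reduce_fn(k, vs):
            out[rk] = rv
    return out

counts = run_job(enumerate(["big data analytics", "big data"]),
                 map_fn, reduce_fn)
print(counts)  # {'big': 2, 'data': 2, 'analytics': 1}
```

Because the engine treats `map_fn` and `reduce_fn` as opaque, any tuning decision must be driven by observing the job's execution, which is exactly the gap Starfish's profiling and cost estimation aim to fill.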
Cost-based Optimization of MapReduce Programs
MapReduce systems lack a feature that has been key to the historical success of database systems, namely, cost-based optimization. A major challenge here is that, to the MapReduce system, a program consists of black-box map and reduce functions written in some programming language like Java, C++, Python, or Ruby. Starfish includes, to our knowledge, the first Cost-based Optimizer for simple to arbitrarily complex MapReduce programs. The Optimizer enumerates and searches the high-dimensional space of job configuration settings to find the optimal settings for executing a MapReduce job. Starfish also includes a Profiler to collect detailed statistical information from unmodified MapReduce programs, and a What-if Engine for fine-grained cost estimation.
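The enumerate-and-search loop at the heart of such an optimizer can be sketched as follows. The knob values and the analytic cost formula below are hypothetical stand-ins for illustration; Starfish's actual What-if Engine derives its predictions from collected job profiles, not from a hand-written model, and its search strategies avoid exhaustive enumeration.

```python
import itertools

# Hypothetical configuration space: a few Hadoop-style knobs with
# small illustrative value lists (a real space is far larger).
CONFIG_SPACE = {
    "mapred.reduce.tasks":        [4, 16, 64],
    "io.sort.mb":                 [100, 200],
    "mapred.compress.map.output": [False, True],
}

def whatif_cost(conf, input_gb=100.0):
    """Toy stand-in for a What-if Engine: predict job running time
    (seconds) for a configuration from a crude analytic model."""
    map_time = input_gb * (40.0 if conf["mapred.compress.map.output"] else 50.0)
    map_time *= 100.0 / conf["io.sort.mb"]        # bigger sort buffer, fewer spills
    shuffle = input_gb * 2.0 * conf["mapred.reduce.tasks"] ** 0.5
    reduce_time = input_gb * 300.0 / conf["mapred.reduce.tasks"]
    return map_time + shuffle + reduce_time

def optimize(space):
    """Exhaustively enumerate the (small) space, asking a what-if
    question per candidate and keeping the cheapest configuration."""
    keys = list(space)
    best_conf, best_cost = None, float("inf")
    for values in itertools.product(*space.values()):
        conf = dict(zip(keys, values))
        cost = whatif_cost(conf)
        if cost < best_cost:
            best_conf, best_cost = conf, cost
    return best_conf, best_cost

best_conf, best_cost = optimize(CONFIG_SPACE)
print(best_conf, round(best_cost))
```

The key design point this sketch captures is the separation of concerns: the search procedure only ever calls the cost estimator, so better profiles or models improve tuning without changing the search.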
Automatic Cluster Sizing on the Cloud
Infrastructure-as-a-Service cloud platforms now enable any (nonexpert) user to provision and configure a cluster of any size on the cloud to run her data-processing jobs. Hence, users regularly face complex cluster sizing problems that involve choosing the cluster size, the type of resources to use in the cluster, and the job configurations that best meet the performance needs of the workload. Starfish includes the Elastisizer, a system component to which users can express cluster sizing problems as queries in a declarative fashion. The Elastisizer provides reliable answers to these queries through an automated technique that combines job profiling, estimation using black-box and white-box models, and simulation.
Software Release
The source code is distributed under Duke University's Software License Agreement for academic and research (non-commercial) purposes. Users can employ Starfish to:
- get a deep understanding of a MapReduce program's behavior during execution,
- ask hypothetical questions on how the program's behavior will change when parameter settings, cluster resources, or data properties change, and
- ultimately optimize the program.
For more information about Starfish and to download the latest release, visit the official Starfish webpage.
Publications
- H. Lim, H. Herodotou, and S. Babu. Stubby: A Transformation-based Optimizer for MapReduce Workflows. In Proc. of the 38th Intl. Conf. on Very Large Data Bases (VLDB '12), August 2012.
- H. Herodotou, F. Dong, and S. Babu. No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics. In Proc. of the ACM Symposium on Cloud Computing 2011 (SOCC '11), October 2011.
- H. Herodotou and S. Babu. Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs. In Proc. of the 37th Intl. Conf. on Very Large Data Bases (VLDB '11), August 2011.
- H. Herodotou, F. Dong, and S. Babu. MapReduce Programming and Cost-based Optimization? Crossing this Chasm with Starfish. Demonstration at the 37th Intl. Conf. on Very Large Data Bases (VLDB '11), August 2011.
- H. Herodotou. Hadoop Performance Models. Technical Report CS-2011-05, Duke University, February 2011.
- H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu. Starfish: A Self-tuning System for Big Data Analytics. In Proc. of the Fifth Biennial Conf. on Innovative Data Systems Research (CIDR '11), January 2011.