Research Project

Automatic Optimization and Provisioning for Statistical Data Analysis in the Cloud

Speaker: Botong Huang
bhuang at
Date: Thursday, May 10, 2012
Time: 11:00am - 12:30pm
Location: D344 LSRC, Duke
Alvin Lebeck


Cloud computing provides interesting new options for performing statistical analysis tasks on big data. Cloud service providers offer a variety of machine types and cluster sizes in an on-demand, pay-as-you-go fashion. The MapReduce system, with its scalability, elasticity, and fault tolerance, serves as a natural massively data-parallel computing platform on these cloud-based clusters. In this project, we build an end-to-end system for large-scale statistical computing based on Hadoop MapReduce, with improved storage and execution engines and automatic optimization and provisioning for the cloud setting.

By introducing special support for big matrices at both the storage and execution levels, we run statistical workloads much faster, reducing the number of MapReduce jobs required per workflow to about half that of previous work. Furthermore, we automatically choose a desirable cluster configuration for carrying out a given workload.
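To illustrate the idea of block-level matrix support, here is a minimal in-memory sketch (an assumption for exposition, not the project's actual implementation): a big matrix is stored as square blocks keyed by (row, col), so that an operation like a matrix product can be computed in a single map-reduce-style pass over blocks, rather than materializing intermediates across several jobs.

```python
from collections import defaultdict

def to_blocks(M, bs):
    """Split a dense square matrix (list of lists) into bs x bs blocks
    keyed by (block_row, block_col)."""
    n = len(M)
    blocks = {}
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            blocks[(i // bs, j // bs)] = [row[j:j + bs] for row in M[i:i + bs]]
    return blocks

def block_matmul(A_blocks, B_blocks, nb):
    """One map-reduce-style pass over blocks computing C = A * B,
    where nb is the number of blocks per dimension."""
    # "map" phase: emit each partial block product keyed by its output block
    partials = defaultdict(list)
    for k in range(nb):
        for i in range(nb):
            for j in range(nb):
                a, b = A_blocks[(i, k)], B_blocks[(k, j)]
                prod = [[sum(a[r][t] * b[t][c] for t in range(len(b)))
                         for c in range(len(b[0]))] for r in range(len(a))]
                partials[(i, j)].append(prod)
    # "reduce" phase: sum the partial products for each output block
    C_blocks = {}
    for key, mats in partials.items():
        acc = mats[0]
        for m in mats[1:]:
            acc = [[x + y for x, y in zip(ra, rm)] for ra, rm in zip(acc, m)]
        C_blocks[key] = acc
    return C_blocks
```

The keying scheme is what matters: because every partial product for output block (i, j) carries the same key, one shuffle suffices to bring them together, which is the kind of saving that shrinks a multi-job workflow.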

Jobs involving different statistical operations (CPU-, network-, or I/O-intensive) favor different machine types in terms of time and cost. The best job execution plan also changes as the cluster changes, so a careful joint choice of execution plan and cluster can yield significant savings. Because manual plan selection is hard, we built a cost-based optimizer that solves these optimization problems systematically and automatically provides the user with the best execution plan and cluster provisioning plan. With this optimizer, we can answer interesting questions like: “what is the minimum budget required to execute the workflow by a given deadline?”
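The deadline question above can be sketched as a small search over cluster configurations. The following is a toy model only: the machine types, prices, speed factors, and runtime formula are illustrative assumptions, standing in for the optimizer's actual cost estimates.

```python
# Assumed price per machine-hour and relative speed factor per machine type.
MACHINE_TYPES = {
    "small": {"price": 0.10, "speed": 1.0},
    "large": {"price": 0.40, "speed": 3.5},
}

def estimated_runtime_hours(work_units, machine, n_machines):
    """Toy runtime model: work parallelizes across machines, scaled by
    machine speed, plus a fixed 0.1h startup overhead."""
    spec = MACHINE_TYPES[machine]
    return work_units / (spec["speed"] * n_machines) + 0.1

def min_budget_for_deadline(work_units, deadline_hours, max_cluster=32):
    """Return (cost, machine_type, cluster_size) for the cheapest
    configuration meeting the deadline, or None if none does."""
    best = None
    for machine, spec in MACHINE_TYPES.items():
        for n in range(1, max_cluster + 1):
            t = estimated_runtime_hours(work_units, machine, n)
            if t <= deadline_hours:
                cost = spec["price"] * n * t
                if best is None or cost < best[0]:
                    best = (cost, machine, n)
                # In this model, adding machines beyond the first feasible
                # size only raises cost, so stop searching this type.
                break
    return best
```

A real optimizer would additionally enumerate execution plans per configuration and use calibrated per-operator cost models, but the structure of the search is the same: feasibility against the deadline, then minimum cost.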

Advisor(s): Jun Yang & Shivnath Babu