By introducing special support for big matrices at both the storage and execution levels, we run statistical workloads much faster, reducing the number of MapReduce jobs required per workflow to roughly half that of previous approaches. Furthermore, we automatically choose a suitable cluster configuration for carrying out a given workload.
Jobs involving different statistical operations (CPU-, network-, or I/O-intensive) favor different machine types with respect to both running time and monetary cost. The best job execution plan also changes as the cluster changes, so a careful joint choice of execution plan and cluster can yield substantial savings. Because manual plan selection is difficult, we built a cost-based optimizer that solves these optimization problems systematically and automatically provides the user with the best execution plan and cluster provisioning plan. With this optimizer, we can answer interesting questions like: “what is the minimum budget required to execute the workflow by a given deadline?”
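To illustrate the kind of search such an optimizer performs, the sketch below enumerates candidate combinations of execution plan, machine type, and cluster size, and returns the cheapest combination whose estimated running time meets a deadline. It is a minimal illustration only: the machine catalog, the prices and speed factors, and the names `estimate_time`, `estimate_cost`, and `min_budget_for_deadline` are hypothetical placeholders, not the actual cost models used by the optimizer.

```python
from itertools import product
from dataclasses import dataclass

# Hypothetical machine catalog: name -> (hourly price in $, relative speed factor).
# The numbers are illustrative only.
MACHINE_TYPES = {
    "small": (0.10, 1.0),
    "large": (0.40, 3.5),
    "hi-io": (0.55, 4.0),
}

@dataclass
class Plan:
    name: str            # identifier of a candidate execution plan
    base_hours: float    # estimated hours on one "small" machine (assumed cost model)

def estimate_time(plan: Plan, machine: str, n: int) -> float:
    """Placeholder time model: assumes perfect speedup across n machines."""
    _, speed = MACHINE_TYPES[machine]
    return plan.base_hours / (speed * n)

def estimate_cost(plan: Plan, machine: str, n: int) -> float:
    """Monetary cost = hourly price * number of machines * running time."""
    price, _ = MACHINE_TYPES[machine]
    return price * n * estimate_time(plan, machine, n)

def min_budget_for_deadline(plans, deadline_hours, max_machines=64):
    """Enumerate (plan, machine type, cluster size) combinations and return the
    cheapest one whose estimated running time meets the deadline."""
    best = None
    for plan, machine, n in product(plans, MACHINE_TYPES, range(1, max_machines + 1)):
        t = estimate_time(plan, machine, n)
        if t > deadline_hours:
            continue
        c = estimate_cost(plan, machine, n)
        if best is None or c < best[0]:
            best = (c, plan.name, machine, n, t)
    return best

if __name__ == "__main__":
    candidate_plans = [Plan("row-partitioned", 40.0), Plan("block-partitioned", 28.0)]
    # Minimum budget (and the corresponding plan/cluster) to finish within 2 hours.
    print(min_budget_for_deadline(candidate_plans, deadline_hours=2.0))
```

A real optimizer would replace the exhaustive enumeration and the toy time model with calibrated per-operator cost estimates, but the structure of the search, minimizing cost subject to a deadline constraint, is the same.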