Starfish v0.3.0 Tutorial: 4. What-if Analysis

What-if Analysis with the Visualizer

A core capability of Starfish is the ability to answer hypothetical questions about how a MapReduce job would behave when run under different settings. This capability allows you to study and understand the impact of (a) configuration parameter settings, (b) cluster resources, and (c) input data properties on the performance of a MapReduce job. The Settings View of the Visualizer can be used to modify any of these settings.

Visualizer: Settings View for What-if Analysis

The Configuration Parameters table in the Settings View lists the most important parameters that can affect the performance of a MapReduce job. You can edit the parameter values and ask what-if questions. For example, to ask a what-if question of the form "How will the execution time of the job change if the number of reduce tasks changes?", simply modify the mapred.reduce.tasks parameter and click on the Ask What-if Question button. Under the hood, the Visualizer invokes the What-if Engine to generate a virtual job profile for the job in the hypothetical setting.

Visualizer: Timeline View for a virtual MapReduce job

The Cluster Specification table in the Settings View summarizes the cluster resources: number of nodes in the cluster, number of map and reduce slots per node, and the available memory per task. You can edit these values to explore how the cluster resources affect the execution of a MapReduce job.
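
The table's Export option (described below) saves these values to an XML file. As a rough sketch of what a cluster specification encodes, the example below lists the resource properties from the table; the element names here are hypothetical, not the actual Starfish schema, so treat an exported file (or the sample cluster.xml in samples/whatif) as the authoritative format:

<!-- Hypothetical sketch of a cluster specification. Element names are
     illustrative only; export a real file from the Settings View or see
     samples/whatif/cluster.xml for the exact schema. -->
<cluster>
   <numNodes>15</numNodes>
   <mapSlotsPerNode>2</mapSlotsPerNode>
   <reduceSlotsPerNode>2</reduceSlotsPerNode>
   <taskMemoryMB>200</taskMemoryMB>
</cluster>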

The Input Specification table in the Settings View provides a concise representation of the input data properties at the map level. For example, in the figure above, 107 map tasks processed 75MB of input data on average, 216 map tasks processed 38MB, and 3 map tasks processed 26MB; the data was also compressed. The Path Index is the index of the logical input to the job. In most cases, the Path Index is 0 since most jobs have a single input, which could be a file, a directory, or a set of files (specified using globbing). Some jobs, however, have multiple logical inputs. For example, a job that performs a join over two tables has two input paths, each corresponding to one of the joining tables.
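
An input specification exported from this table follows the same idea. The sketch below mirrors the figure above (three groups of map tasks on a single logical input with Path Index 0, processing compressed data); the element and attribute names are hypothetical, and the sample input.xml in samples/whatif shows the actual schema:

<!-- Hypothetical sketch of an input specification. Element and attribute
     names are illustrative only; see samples/whatif/input.xml for the
     actual schema. The values mirror the figure above. -->
<input>
   <spec pathIndex="0" numMapTasks="107" avgSizeMB="75" compressed="true"/>
   <spec pathIndex="0" numMapTasks="216" avgSizeMB="38" compressed="true"/>
   <spec pathIndex="0" numMapTasks="3"   avgSizeMB="26" compressed="true"/>
</input>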

All tables also offer Import and Export functionality for saving and reusing user settings in XML format. Note that the XML file exported from the Configuration Parameters table follows the Hadoop configuration file specification, and hence can be used as is for executing MapReduce jobs.
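
For example, after setting mapred.reduce.tasks to 20, the exported configuration file would look along these lines, following the standard Hadoop format of one property element per parameter:

<?xml version="1.0"?>
<!-- Standard Hadoop configuration file format: one property element
     per configuration parameter listed in the table. -->
<configuration>
   <property>
      <name>mapred.reduce.tasks</name>
      <value>20</value>
   </property>
</configuration>

Such a file can be passed to a job via Hadoop's generic -conf option, or supplied as the optional conf_file argument to bin/whatif (described below).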

What-if Analysis on a Live Hadoop Cluster

The bin/whatif script can be used to answer hypothetical questions of the form: "How will the performance of a MapReduce job executing on the live Hadoop cluster change if I modify some configuration parameter settings and/or the input data paths?".

Usage:

./bin/whatif question job_id hadoop jar jarFile args...

The what-if question can be one of the following:

  1. time: Display the execution time of the predicted job
  2. details: Display the execution details of the predicted job
  3. profile: Display the predicted job profile of the job
  4. timeline: Display the timeline of the predicted job
  5. mappers: Display task information for the map tasks of the predicted job
  6. reducers: Display task information for the reduce tasks of the predicted job

The job_id is the id of the profiled job.

The remaining parameters in the command are identical to those required by ${HADOOP_HOME}/bin/hadoop.

Example:

./bin/whatif details job_2010030839_0000 hadoop jar \
   contrib/examples/hadoop-starfish-examples.jar wordcount \
   -Dmapred.reduce.tasks=20 /input/path /output/path

What-if Analysis on a Hypothetical Hadoop Cluster

The bin/whatif script also provides a command-line interface for asking hypothetical questions regarding configuration parameter settings, cluster resources, and input data properties on a hypothetical cluster.

Usage:

./bin/whatif question profile_file input_file cluster_file \
   [-c conf_file] [-o output_file]

The question can be one of time, details, profile, timeline, mappers, and reducers (same as above).

The profile_file is the generated job profile XML file.

The input_file is the input specifications XML file.

The cluster_file is the cluster specifications XML file.

The conf_file is an optional job configuration XML file.

The output_file is an optional file to write the output to.

The following example displays the execution details of the predicted job using the configuration settings specified in conf.xml, the cluster resources specified in virtual-cluster.xml, and the input data properties specified in virtual-input.xml. The samples/whatif directory in the Starfish release contains sample input.xml and cluster.xml files.

./bin/whatif details profile-wordcount.xml virtual-input.xml \
   virtual-cluster.xml -c conf.xml