Starfish v0.3.0 Tutorial: 5. Job Optimization
Job Optimization with the Visualizer
Perhaps the most important functionality of the Visualizer comes from how it can use the Cost-based Optimizer to find good configuration settings for executing a MapReduce job on a Hadoop cluster.
The Configuration Parameters table in the Settings View lists the parameter values selected by the Cost-based Optimizer. The configuration settings can be exported to an XML file for use when the same program is run in the future. At the same time, all other views provided by the Visualizer can be used to analyze the behavior of the job under the suggested configuration settings.
The Cluster Specification and Input Specification tables in the Settings View contain the cluster resources and input data properties, respectively. Both tables are editable (as in the What-if Analysis), allowing the user to find the optimal configuration settings for different cluster resources and input data properties.
Job Optimization on a Live Hadoop Cluster
The bin/optimize script can be used to find the optimal configuration settings before executing a MapReduce job, as well as to execute the job with the suggested settings.
- Set the following (optional) optimization options in bin/config.sh:
- EXCLUDE_PARAMETERS: A comma-separated list of parameters to exclude from optimization
- OUTPUT_LOCATION: Where to print out the best configuration (Options: stdout, stderr, file_path)
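As a sketch, the relevant lines in bin/config.sh might look like the following (the parameter names and values shown are illustrative choices, not defaults):

```shell
# Illustrative optimization options for bin/config.sh
# (the excluded parameters and output location here are examples)
EXCLUDE_PARAMETERS="io.sort.mb,io.sort.record.percent"
OUTPUT_LOCATION=stdout
```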
- Execute the bin/optimize script:
./bin/optimize mode job_id hadoop jar jarFile args...
The mode can be one of the following:
- recommend: Find and print out the configuration settings suggested by the Cost-based Optimizer
- run: Run the job with the configuration settings suggested by the Cost-based Optimizer
The job_id is the job id of the profiled job.
The remaining parameters in the command are identical to the parameters required by ${HADOOP_HOME}/bin/hadoop.
Examples:
- Print on the console the configuration parameter settings suggested by the Cost-based Optimizer for a WordCount MapReduce job:

  ./bin/optimize recommend job_2010030839_0000 hadoop jar \
      contrib/examples/hadoop-starfish-examples.jar wordcount \
      /input/path /output/path
- Store the configuration parameter settings suggested by the Cost-based Optimizer for a WordCount MapReduce job in the my-conf.xml file in the local working directory:

  ./bin/optimize recommend job_2010030839_0000 hadoop jar \
      contrib/examples/hadoop-starfish-examples.jar wordcount \
      -Dstarfish.job.optimizer.output="my-conf.xml" \
      /input/path /output/path
- Execute a WordCount MapReduce job using the configuration parameter settings automatically suggested by the Cost-based Optimizer:

  ./bin/optimize run job_2010030839_0000 hadoop jar \
      contrib/examples/hadoop-starfish-examples.jar wordcount \
      /input/path /output/path
- Execute a WordCount MapReduce job using the configuration parameter settings automatically suggested by the Cost-based Optimizer, excluding the io.sort.mb parameter from the optimization space:

  ./bin/optimize run job_2010030839_0000 hadoop jar \
      contrib/examples/hadoop-starfish-examples.jar wordcount \
      -Dstarfish.job.optimizer.exclude.parameters="io.sort.mb" \
      /input/path /output/path
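The exported settings file (my-conf.xml in the example above) uses Hadoop's standard configuration XML layout: a configuration element containing property elements with name/value pairs. One quick way to inspect a recommended value from the shell is sketched below against an inlined sample file, since no real job is run here (the io.sort.mb value of 150 is invented for illustration):

```shell
# Create a sample file with the layout of an exported configuration file
# (the io.sort.mb value is made up for illustration).
cat > sample-conf.xml <<'EOF'
<configuration>
  <property><name>io.sort.mb</name><value>150</value></property>
</configuration>
EOF

# Pull out a single parameter's value with sed
# (a rough sketch, not a general XML parser).
value=$(sed -n 's:.*<name>io.sort.mb</name><value>\([^<]*\)</value>.*:\1:p' sample-conf.xml)
echo "$value"
```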
Job Optimization on a Hypothetical Hadoop Cluster
The bin/optimize script also provides a command-line interface for finding the optimal configuration parameter settings given hypothetical cluster resources and input data properties, i.e., for a cluster other than the one on which the job was profiled.
Usage:
./bin/optimize profile_file input_file cluster_file [-c conf_file] [-o output_file]
The profile_file is the generated job profile XML file.
The input_file is the input specifications XML file.
The cluster_file is the cluster specifications XML file.
The conf_file is an optional job configuration XML file.
The output_file is an optional file to write the output to.
The following example displays the optimal configuration parameter settings using the cluster resources specified in virtual-cluster.xml and the input data properties specified in virtual-input.xml. The samples/whatif directory in the Starfish release contains sample input.xml and cluster.xml files.
./bin/optimize profile_file virtual-input.xml virtual-cluster.xml -c conf.xml