Starfish v0.3.0 Tutorial: 6. Utilities

Gather job execution files

When a Hadoop job is executed using the default bin/hadoop script (or some other way), it might be useful to gather its execution files after it completes, so that you can analyze the job using Starfish (see Job Analysis).

Usage:

./bin/utilities gather <hadoop_job_id_1> [<hadoop_job_id_N>]

The parameter <hadoop_job_id_1> is the job ID of an executed MapReduce job.

The parameter <hadoop_job_id_N> is an optional job ID that defines the end of a range of job IDs whose execution files will be gathered.
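For example, assuming job_201101052045_0021 through job_201101052045_0025 are the IDs of completed jobs (hypothetical IDs shown here for illustration), the following command gathers the execution files for the entire range:

./bin/utilities gather job_201101052045_0021 job_201101052045_0025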

Adjust execution profiles

The overhead of profiling compression using BTrace is prohibitively expensive and interferes with profiling the other execution costs. Hence, we cannot directly profile compression. To alleviate this problem, we can indirectly estimate the cost of compression by profiling the same job twice: once with compression disabled and once with compression enabled. Then, we can "adjust" the two profiles to generate adjusted profiles that account for the presence of compression.

Usage:

./bin/utilities adjust <hadoop_job_id_1> <hadoop_job_id_2>

The parameter <hadoop_job_id_1> is the job ID of the job profiled WITHOUT compression.

The parameter <hadoop_job_id_2> is the job ID of the job profiled WITH compression.
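For example, assuming job_201101052045_0030 was profiled with compression disabled and job_201101052045_0031 was the same job profiled with compression enabled (hypothetical IDs shown here for illustration):

./bin/utilities adjust job_201101052045_0030 job_201101052045_0031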

Get the cluster information

We can get the cluster information to learn some details about the cluster or to create the XML file needed for asking hypothetical questions (see What-if Analysis).

Usage:

./bin/utilities cluster {info|xml} [<output_file>]

The parameter value info is used to print out the cluster information in a human-readable text format.

The parameter value xml is used to print out the cluster information in XML format.

The parameter <output_file> represents a file path for storing the cluster information.
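For example, to store the cluster information in XML format in a file named cluster.xml (a hypothetical file name, suitable for later use in a what-if analysis):

./bin/utilities cluster xml cluster.xml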

Get the input information

We can get the input information to learn some details about the input data or to create the XML file needed for asking hypothetical questions (see What-if Analysis).

Usage:

./bin/utilities input <conf_xml_file> [<output_file>] [-libjars <jars_list>]

The parameter <conf_xml_file> represents the Hadoop XML configuration file containing the input format and input path(s).

The parameter <output_file> represents a file path for storing the input information.

The parameter <jars_list> represents a comma-separated list of jar files that should be added to the classpath.
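For example, assuming the job configuration was saved in a file named wordcount_conf.xml (a hypothetical file name), the following stores the input information in input.xml:

./bin/utilities input wordcount_conf.xml input.xml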