Starfish v0.3.0 Tutorial: 3. Job Analysis

Analyzing a MapReduce Job with the Visualizer

The visualizer/visualize.sh script launches a graphical user interface for analyzing past MapReduce job executions. To start the Visualizer, simply execute:

./visualizer/visualize.sh

The sole input to the Visualizer is the directory PROFILER_OUTPUT_DIR specified in bin/config.sh. Alternatively, you can point it at a directory of job histories that you collected yourself; see Job History Collection.
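For reference, the relevant setting lives in bin/config.sh. A minimal sketch of what the entry might look like (the exact variable layout in your copy of config.sh may differ, and the path shown is only a placeholder):

    # Directory that the Visualizer and bin/analyze scan for job profiles
    # (path below is only an example)
    PROFILER_OUTPUT_DIR=/path/to/profiler/output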

Visualizer: Select the input directory

The Starfish Visualizer provides a table with all the MapReduce jobs found in the input directory. The three buttons at the bottom correspond to the three main functionalities provided by Starfish, namely, job analysis, what-if analysis, and job optimization.

Visualizer: List of MapReduce jobs

For each functionality, the Visualizer offers five different views:

Visualizer: Timeline View
  1. Timeline views show the execution timeline of map and reduce tasks that ran during a MapReduce job execution.
  2. Data-skew views are used to identify the presence of data skew in the input and output data for map and reduce tasks.
  3. Data-flow views help visualize the flow of data among the nodes of a Hadoop cluster, and between the map and reduce tasks of a job.
  4. Profile views present the detailed information exposed by the job profiles, including the phase timings within the tasks.
  5. Settings views list the configuration parameter settings, cluster setup, and the input data properties during job execution.

Visualizer: Data Flow View
Visualizer: Data Skew View

Visualizer: Settings View
Visualizer: Profile View

Analyzing a MapReduce Job at the Console

The bin/analyze script provides a command line interface for analyzing past MapReduce job executions.

Usage:

./bin/analyze hadoop mode job_id [output_file]

The mode represents the analysis requested and can be one of the following:

  1. list_all: List all available jobs
  2. list_stats: List basic statistics for all available jobs
  3. details: Display the execution details of a job
  4. profile: Display the job profile of a job
  5. profile_xml: Display the job profile of a job in XML format
  6. timeline: Display the timeline of task execution in a job
  7. mappers: Display task information for all map tasks in a job
  8. reducers: Display task information for all reduce tasks in a job
  9. cluster: Display the cluster information
  10. transfers_all: Display all data transfers among the tasks in a job
  11. transfers_map: Display aggregated data transfers from each map task
  12. transfers_red: Display aggregated data transfers to each reduce task

The job_id is the ID of the job of interest; it is not needed for the list_all and list_stats modes.

The output_file is an optional file in which to store the output.
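For instance, the task timeline of a job can be written to a file by passing this optional last argument (the output file name here is only illustrative):

    ./bin/analyze hadoop timeline job_2010030839_0000 timeline.txt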

We provide a few basic examples of using bin/analyze below. Please refer to docs/analyze.readme for the full documentation.

  1. List basic statistical information regarding all jobs found in the PROFILER_OUTPUT_DIR directory specified in bin/config.sh:
    ./bin/analyze hadoop list_stats
  2. Display the execution details of job job_2010030839_0000:
    ./bin/analyze hadoop details job_2010030839_0000
  3. List information regarding the data transfers among the map and reduce tasks for job job_2010030839_0000:
    ./bin/analyze hadoop transfers_all job_2010030839_0000
  4. Display the job profile for job job_2010030839_0000:
    ./bin/analyze hadoop profile job_2010030839_0000
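
Since all modes share the same calling convention, bin/analyze is easy to script. A minimal sketch (the output file names are illustrative) that saves several views of a single job to separate files:

    # Save several analysis views of one job, one file per mode
    JOB=job_2010030839_0000
    for MODE in details timeline mappers reducers; do
        ./bin/analyze hadoop "$MODE" "$JOB" "${JOB}.${MODE}.txt"
    done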