Starfish v0.3.0 Tutorial: 3. Job Analysis
Analyzing a MapReduce Job with the Visualizer
The visualizer/visualize.sh script launches a graphical user interface for analyzing past MapReduce job executions. To start the Visualizer, simply execute:
./visualizer/visualize.sh
The sole input to the Visualizer is the directory PROFILER_OUTPUT_DIR specified in bin/config.sh. Alternatively, you can provide a directory of job histories that you have collected yourself; see Job History Collection.
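The relevant setting in bin/config.sh might look like the excerpt below. The variable name comes from this tutorial; the directory path is purely illustrative and should point at your own profiler output:

```shell
# bin/config.sh (excerpt) -- directory holding the collected profiling output.
# The path below is illustrative; replace it with your own location.
PROFILER_OUTPUT_DIR=/home/hadoop/starfish/profiler_output
```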
The Starfish Visualizer provides a table with all the MapReduce jobs found in the input directory. The three buttons at the bottom correspond to the three main functionalities provided by Starfish, namely, job analysis, what-if analysis, and job optimization.
For each functionality, the Visualizer offers five different views:
- Timeline views show the execution timeline of map and reduce tasks that ran during a MapReduce job execution.
- Data-skew views are used to identify the presence of data skew in the input and output data for map and reduce tasks.
- Data-flow views help visualize the flow of data among the nodes of a Hadoop cluster, and between the map and reduce tasks of a job.
- Profile views present the detailed information exposed by the job profiles, including the phase timings within the tasks.
- Settings views list the configuration parameter settings, cluster setup, and the input data properties during job execution.
Analyzing a MapReduce Job at the Console
The bin/analyze script provides a command line interface for analyzing past MapReduce job executions.
Usage:
./bin/analyze hadoop mode [job_id] [output_file]
The mode represents the analysis requested and can be one of the following:
- list_all: List all available jobs
- list_stats: List basic statistics for all available jobs
- details: Display the execution details of a job
- profile: Display the job profile of a job
- profile_xml: Display the job profile of a job in XML format
- timeline: Display the timeline of task execution in a job
- mappers: Display task information for all map tasks in a job
- reducers: Display task information for all reduce tasks in a job
- cluster: Display the cluster information
- transfers_all: Display all data transfers among the tasks in a job
- transfers_map: Display aggregated data transfers from each map task
- transfers_red: Display aggregated data transfers to each reduce task
The job_id is the id of the job of interest; it is not needed for the list_all and list_stats modes.
The output_file is an optional file in which to store the output.
We provide a few basic examples of using the bin/analyze script below. Please refer to docs/analyze.readme for the full documentation.
- List basic statistical information regarding all jobs found in the PROFILER_OUTPUT_DIR directory specified in bin/config.sh:
  ./bin/analyze hadoop list_stats
- List detailed information from executing the job job_2010030839_0000:
  ./bin/analyze hadoop details job_2010030839_0000
- List information regarding the data transfers among the map and reduce tasks for job job_2010030839_0000:
  ./bin/analyze hadoop transfers_all job_2010030839_0000
- Display the job profile for job job_2010030839_0000:
  ./bin/analyze hadoop profile job_2010030839_0000
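Because the analysis runs entirely from the command line, several modes can be combined in a small shell script, writing each mode's output to its own file via the optional output_file argument. The sketch below is a dry run that only prints the commands it would execute, so it does not require a Starfish installation; the job id and file names are illustrative:

```shell
# Dry-run sketch: build one bin/analyze command per analysis mode for a job,
# each directing its output to a separate file. Replace 'echo' with real
# execution once Starfish is installed. The job id is illustrative.
JOB=job_2010030839_0000
for MODE in details profile timeline mappers reducers; do
  echo ./bin/analyze hadoop "$MODE" "$JOB" "${MODE}_${JOB}.txt"
done
```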