Starfish Hadoop Log Analyzer

Prerequisites

  1. Currently only job history logs generated by Apache Hadoop v0.20.x and v1.0.3 are supported.
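
You can check which version a cluster is running with Hadoop's built-in version command:

    # Prints the Hadoop version (the first output line, e.g. "Hadoop 0.20.2")
    $ hadoop version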

Steps

  1. Download the latest Starfish Hadoop Log Analyzer from the Downloads page.
  2. Job History Collection: Collect the job history files for the completed jobs that you would like to analyze (see the example after this list).
  3. Job Analysis: Obtain a deep understanding of a MapReduce program's runtime behavior and diagnose bottlenecks during job execution.
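
For step 2, the sketch below shows one way to gather the files, assuming the default Apache Hadoop v0.20.x locations; the output path is a placeholder, and your cluster's layout may differ. Each completed job leaves a job history file plus a matching job configuration XML file, and both should be collected.

    # One way to gather job history files, assuming default v0.20.x locations.
    mkdir -p ~/starfish-history

    # Option 1: copy from the JobTracker's local job history directory
    # (each job leaves a history file plus a <jobid>_conf.xml file).
    cp $HADOOP_HOME/logs/history/* ~/starfish-history/

    # Option 2: fetch the per-job copy that Hadoop writes under the job's
    # output directory in HDFS ("/user/hadoop/output" is a placeholder path).
    hadoop fs -get '/user/hadoop/output/_logs/history/*' ~/starfish-history/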

Starfish Profiler, What-If Engine and Cost-based Optimizer

Prerequisites

Before moving forward, please ensure that the following prerequisites are met:

  1. The Hadoop cluster uses Apache Hadoop v0.20.2 or v0.20.203.0.
  2. Optional: JavaFX version 1.2.3 SDK is installed (required only by the Starfish Visualizer).

You must also set the following environment variables on the machine from which you will be using Starfish (an example follows the list):

  1. JAVA_HOME: Contains the installation directory of the Java JDK
  2. HADOOP_HOME: Contains the installation directory of Hadoop
  3. JAVAFX_HOME: Contains the installation directory of JavaFX (required only by the Starfish Visualizer)
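
For example, in a Bash shell (the paths below are placeholders; adjust them to match your installation):

    # Example values only; adjust the paths to your installation.
    export JAVA_HOME=/usr/lib/jvm/java-6-sun
    export HADOOP_HOME=/usr/local/hadoop-0.20.2
    export JAVAFX_HOME=/opt/javafx-sdk1.2   # only needed for the Visualizer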

Overview

This tutorial contains compilation, installation, and usage instructions for the Starfish analytics system. It is designed to provide an overview of the various Starfish components. For the full documentation, please refer to the docs directory in the Starfish release.

The tutorial is divided into the following six sections (a sketch of the overall workflow follows the list):

  1. Installation: Explains the installation and compilation process of Starfish.
  2. Job Profiling: Used to collect detailed statistical information from unmodified MapReduce programs; this information supports job analysis, what-if analysis, and ultimately job optimization.
  3. Job Analysis: Used to obtain a deep understanding of a MapReduce program's runtime behavior, as well as to diagnose bottlenecks during job execution.
  4. What-if Analysis: Used to study the effects of configuration parameters, cluster resources, and input data properties on the performance of a MapReduce job, without actually running the job.
  5. Job Optimization: Used to find the optimal configuration parameter settings for a MapReduce job, as well as to help understand why the current settings are possibly suboptimal.
  6. Utilities: Used to execute various utility functions.
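
As a rough illustration of how sections 2 through 5 fit together, a profile collected while a job runs feeds the analysis, what-if, and optimization steps. The command names and arguments below are hypothetical placeholders, not the actual Starfish interface; the real commands are documented in the docs directory of the release.

    # Hypothetical workflow sketch; see the docs directory for the actual
    # Starfish commands and arguments.
    bin/profile hadoop jar wordcount.jar input output   # 2. collect a job profile
    bin/analyze <job_id>                                # 3. inspect job behavior
    bin/whatif <job_id> new_conf.xml                    # 4. predict performance under new settings
    bin/optimize <job_id>                               # 5. search for better configuration settings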

The examples and directory references in the remainder of this tutorial assume that the working directory is the starfish directory.