Preliminary Exam Talks

Output Performance Analysis on a Supercomputer

Speaker: Bing Xie
bingxie at cs.duke.edu
Date: Tuesday, April 30, 2013
Time: 2:00pm - 4:00pm
Location: D344 LSRC, Duke

Abstract

Supercomputer I/O loads are often dominated by writes. Supercomputer I/O systems are designed to absorb these bursty outputs efficiently through massive parallelism. However, the delivered write bandwidth often falls well below the peak, slowing applications and reducing system throughput. This problem motivates the development of middleware that adapts an application's I/O access patterns to realize the performance potential of supercomputer file systems.

We propose to understand the output absorption behavior of supercomputer file systems by characterizing the performance of output bursts in a production supercomputer. To achieve this goal, we propose a statistical benchmarking methodology that obtains the distributions of write bandwidth across samples of compute nodes, disks, and time intervals. We apply this methodology to Titan and Spider, a production petascale facility housed at the Oak Ridge Leadership Computing Facility (OLCF), and quantify the frequency and severity of contention and other transient system conditions.
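
As a rough illustration of what summarizing a write-bandwidth distribution might look like, the sketch below computes percentiles over a set of bandwidth samples and the fraction of samples that fall well below the median. It is not the methodology presented in the talk: the function names, the 0.5 slowdown threshold, and the sample values are all hypothetical and chosen only for illustration.

```python
import statistics


def summarize_bandwidth(samples_mb_s):
    """Summarize write-bandwidth samples (MB/s) with a few percentiles.

    Hypothetical helper: the choice of percentiles is an assumption.
    """
    ordered = sorted(samples_mb_s)
    # With n=100, quantiles() returns 99 cut points;
    # index 4 is the 5th percentile, 49 the median, 94 the 95th.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "p5": cuts[4],
        "median": cuts[49],
        "p95": cuts[94],
        "max": ordered[-1],
    }


def slow_fraction(samples_mb_s, threshold_ratio=0.5):
    """Fraction of samples slower than threshold_ratio times the median.

    The 0.5 default is an illustrative assumption, not a measured value.
    """
    median = statistics.median(samples_mb_s)
    slow = [s for s in samples_mb_s if s < threshold_ratio * median]
    return len(slow) / len(samples_mb_s)


# Synthetic samples: eight bursts at full speed, two degraded bursts.
samples = [100.0] * 8 + [40.0, 30.0]
summary = summarize_bandwidth(samples)
fraction = slow_fraction(samples)
```

In practice the samples would be grouped by compute node, disk, and time interval before summarizing, so that each group yields its own distribution.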

In this talk, we first present our published work on characterizing output bottlenecks in Titan/Spider, introduce our statistical benchmarking methodology, and summarize the benchmarking results. We then discuss ongoing work and open problems: balancing benchmarking cost and accuracy in noisy production environments; monitoring and diagnosing the health of large-scale I/O systems; and extending our work to other parallel storage systems.

Advisor(s): Jeffrey Chase
Alvin Lebeck, Shivnath Babu, Benjamin Lee