Abstract
We aim to understand the output absorption behavior of supercomputer file systems by characterizing the performance of output bursts on a production supercomputer. To this end, we propose a statistical benchmarking methodology that obtains the distributions of write bandwidth across samples of compute nodes, disks, and time intervals. We apply this methodology to Titan and its Spider file system, a production petascale facility housed at the Oak Ridge Leadership Computing Facility (OLCF), and quantify the frequency and severity of contention and other transient system conditions.
In this talk, we first present our published work on characterizing output bottlenecks in Titan/Spider, introduce our statistical benchmarking methodology, and summarize the benchmarking results. We then discuss ongoing work and open problems: balancing benchmarking cost against accuracy in noisy production environments; monitoring and diagnosing the health of large-scale I/O systems; and extending our work to other parallel storage systems.