As VLSI technology improvements continue to widen the gap between processor and main memory cycle times, cache performance becomes increasingly important to overall system performance. Cache memories help alleviate the cycle time disparity, but only for programs that exhibit sufficient spatial and temporal locality. Programs with unruly access patterns spend much of their time transferring data to and from the cache. To fully exploit the performance potential of fast processors, programmers must explicitly consider cache behavior, restructuring their codes to increase locality. As these fast processors proliferate, techniques for improving cache performance must move beyond the supercomputer and multiprocessor communities and into the mainstream of computing.

In this paper, we examine some of the techniques that programmers can use to improve cache performance. We show how to use CProf, a cache profiler, to identify cache performance bottlenecks and gain insight into their origin. This insight helps programmers understand which of the well-known program transformations are likely to improve cache performance. Using CProf and a "cookbook" of simple transformations, we show how to tune the cache performance of six of the SPEC92 benchmarks. By restructuring the source code, we greatly improve cache behavior and achieve execution time speedups ranging from 1.02 to 3.46.
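As a rough illustration of the kind of source-level restructuring meant here, the sketch below shows loop interchange, one widely known locality transformation. It is not taken from the paper or from the SPEC92 sources; the function names and the flat row-major matrix layout are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sum an n-by-n matrix stored row by row in a flat
   array, so element (i, j) lives at a[i * n + j]. */

/* Before: the inner loop walks down a column. Consecutive accesses are
   n elements apart, so each cache block fetched contributes little
   before it may be evicted (poor spatial locality). */
double sum_column_order(const double *a, size_t n)
{
    assert(a != NULL);
    double sum = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += a[i * n + j];
    return sum;
}

/* After loop interchange: the inner loop walks along a row. Consecutive
   accesses are adjacent in memory, so every element of a fetched cache
   block is used before moving on. */
double sum_row_order(const double *a, size_t n)
{
    assert(a != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += a[i * n + j];
    return sum;
}
```

Both functions visit the same elements; only the traversal order differs. Because the additions happen in a different order, floating-point results can differ in the last bits for general inputs, and any speedup depends on the matrix size relative to the cache.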

CProf is available as part of the Wisconsin Architectural Research Tool Set (WARTS).

Cache Profiling and the SPEC Benchmarks: A Case Study. Alvin R. Lebeck and David A. Wood. IEEE Computer, October 1994, pages 15-26.