

Journal of Instruction-Level Parallelism 1 (2004) XX-YY Submitted 5/03; published 6/04

The Efficacy of Software Prefetching and Locality Optimizations on Future Memory Systems*

Abdel-Hameed Badawy# absalam@eng.umd.edu
Aneesh Aggarwal# aneesh@eng.umd.edu
Donald Yeung# yeung@eng.umd.edu
Chau-Wen Tseng## tseng@cs.umd.edu

# Electrical and Computer Engineering Dept., ## Computer Science Dept., University of Maryland, College Park.

Abstract

Software prefetching and locality optimizations are techniques for overcoming the speed gap between processor and memory. In this paper, we provide a comprehensive summary of current software prefetching and locality optimization techniques, and evaluate the impact of memory trends on the effectiveness of these techniques for three types of applications: regular scientific codes, irregular scientific codes, and pointer-chasing codes. We find that for many applications, software prefetching outperforms locality optimizations when there is sufficient memory bandwidth, but locality optimizations outperform software prefetching under bandwidth-limited conditions. The break-even point (for 1 GHz processors) occurs at roughly 2.26 GBytes/sec on today's memory systems, and will increase on future memory systems. We also study the interactions between software prefetching and locality optimizations when applied in concert. Naively combining the techniques provides robustness to changes in memory bandwidth and latency, but does not yield additional performance gains. We propose and evaluate several algorithms to better integrate software prefetching and locality optimizations, including a modified tiling algorithm, padding for prefetching, and index prefetching. Finally, we investigate the interactions of stride-based hardware prefetching with our software techniques. We find that combining hardware and software prefetching yields similar performance to software prefetching alone, and that locality optimizations enable stride-based hardware prefetching for benchmarks that do not normally exhibit striding.

1. Introduction

Current microprocessors spend a large percentage of execution time on memory access stalls, even with large on-chip caches. Since processor speeds are growing at a greater rate than memory speeds, we expect memory access costs to become even more important in the future. Computer architects have been battling this memory wall problem [2] by designing ever larger and more sophisticated caches. Although caches are extremely effective, they are not the complete solution. Other techniques are required to fully address the memory wall problem.

Two promising approaches for improving memory performance are software prefetching and locality optimizations. The first executes explicit prefetch instructions to begin loading data from memory to cache. As long as prefetching begins early enough and the data is not evicted prior to its use, memory access latency can be completely hidden. However, as latency tolerance raises processor throughput, memory bandwidth usage rises as well, since prefetching increases memory traffic. In comparison, locality optimizations use compiler or run-time transformations to change the computation order and/or data layout of a program to increase the probability it

©2004 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.

