Abstract: Current microprocessors incorporate techniques to exploit instruction-level parallelism (ILP). However, previous work has shown that these ILP techniques are less effective in removing memory stall time than CPU time, making the memory system a greater bottleneck in ILP-based systems than in previous-generation systems. These deficiencies arise largely because applications present limited opportunities for an out-of-order issue processor to overlap multiple read misses, the dominant source of memory stalls. This work proposes code transformations to increase parallelism in the memory system by overlapping multiple read misses within the same instruction window, while preserving cache locality. We present an analysis and transformation framework suitable for compiler implementation. Our simulation experiments show execution time reductions averaging 20% in a multiprocessor and 30% in a uniprocessor. A substantial part of these reductions comes from increases in memory parallelism. We see similar benefits on a Convex Exemplar.