Abstract: Data-parallel compilers have long aimed to equal the performance of carefully hand-optimized parallel codes. For tightly-coupled applications based on line sweeps, this goal has been particularly elusive. In the Rice dHPF compiler, we have developed a wide spectrum of optimizations that enable us to closely approach hand-coded performance for tightly-coupled line-sweep applications, including the NAS SP and BT benchmark codes. From lightly-modified copies of standard serial versions of these benchmarks, dHPF generates MPI-based parallel code that is within 4% of the performance of the hand-crafted MPI implementations of these codes for a 102³ problem size (Class B) on 64 processors. We describe and quantitatively evaluate the impact of the partitioning, communication, and memory hierarchy optimizations implemented by dHPF that enable us to approach hand-coded performance with compiler-generated parallel code.