Abstract: Wire delay and power consumption are primary obstacles to the continued scaling of microprocessor performance. Fundamentally, both issues are addressed by the emerging breed of single-chip, tiled microarchitectures, including Raw [1], Trips [2], Scale [3], Wavescalar [4], and Synchroscalar [5], that replicate programmable processing elements with small amounts of memory and communicate via on-chip networks characterized by extremely low latencies and high bandwidth. Our goal is to show that tiled microarchitectures permit energy-efficient, high-performance computations when algorithms are mapped properly. To that end, we propose a decoupled systolic architecture as a canonical tiled microarchitecture that supports the resource requirements of decoupled systolic algorithms designed specifically for this architecture. We develop an analytical framework for reasoning about the efficiency and performance of decoupled systolic algorithms. In particular, we define stream algorithms as a class of decoupled systolic algorithms that achieve 100% computational efficiency asymptotically for large numbers of processors. We focus our attention on the class of regularly structured computations that form the foundation of scientific computing and digital signal processing.