Abstract: Simultaneous Multithreading (SMT) processors achieve high processor throughput at the expense of single-thread performance. This paper investigates resource allocation policies for SMT processors that preserve, as much as possible, the single-thread performance of designated "foreground" threads, while still permitting other "background" threads to share resources. Since background threads on such an SMT machine have a near-zero performance impact on foreground threads, we refer to the background threads as transparent threads. Transparent threads are ideal for performing low-priority or non-critical computations, with applications in process scheduling, subordinate multithreading, and on-line performance monitoring.

To realize transparent threads, we propose three mechanisms for maintaining the transparency of background threads: slot prioritization, background thread instruction-window partitioning, and background thread flushing. In addition, we propose three mechanisms to boost background thread performance without sacrificing transparency: aggressive fetch partitioning, foreground thread instruction-window partitioning, and foreground thread flushing. We implement our mechanisms on a detailed simulator of an SMT processor, and evaluate them using 8 benchmarks, including 7 from the SPEC CPU2000 suite. Our results show that when cache and branch predictor interference are factored out, background threads introduce less than 1% performance degradation on the foreground thread. Furthermore, maintaining the transparency of background threads reduces their throughput by only 23% relative to an equal-priority scheme.

To demonstrate the usefulness of transparent threads, we study Transparent Software Prefetching (TSP), an implementation of software data prefetching using transparent threads. Due to its near-zero overhead, TSP enables prefetch instrumentation for all loads in a program, eliminating the need for profiling. TSP, without any profile information, achieves a 9.41% gain across 6 SPEC benchmarks, whereas conventional software prefetching guided by cache-miss profiles increases performance by only 2.47%.