Publisher: SISSA, Scuola Internazionale Superiore di Studi Avanzati
Abstract: At the time of Lattice 2010, we were about to announce a distribution of the code (QUDA 0.3) that supported both Wilson/clover and improved staggered quarks for computation on a single GPU. Multi-GPU code was running for both solvers, but with the restriction that the grid could be partitioned only in the time dimension. In the past year, we have developed code that allows us to cut the lattice in all four dimensions. This allows us to scale computations to the order of 100 GPUs, yielding multi-teraflop performance. We will present results for both types of solvers on GPU clusters and for other kernels important for physics projects. We also compare the performance and cost-effectiveness of full application codes running on CPUs with those of our GPU-accelerated code.
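As a rough illustration of what cutting the lattice in all four dimensions changes, the sketch below compares the per-GPU sub-lattice and the number of boundary sites it must exchange under time-only and four-dimensional partitioning. It is not taken from QUDA: the function names, the lattice size, the GPU grids, and the one-site-deep halo are all assumptions chosen for illustration, and a real solver may need a deeper halo.

```python
from math import prod


def local_dims(lattice, grid):
    """Split each lattice extent evenly across the GPU grid (hypothetical helper)."""
    for extent, parts in zip(lattice, grid):
        if extent % parts != 0:
            raise ValueError(f"extent {extent} is not divisible by {parts}")
    return tuple(extent // parts for extent, parts in zip(lattice, grid))


def halo_sites(local, grid):
    """Count boundary sites one GPU exchanges per sweep.

    Assumes a one-site-deep halo on both faces of every dimension that is
    actually partitioned (parts > 1); unpartitioned dimensions cost nothing.
    """
    volume = prod(local)
    return sum(2 * volume // extent
               for extent, parts in zip(local, grid) if parts > 1)


if __name__ == "__main__":
    lattice = (32, 32, 32, 256)  # hypothetical X, Y, Z, T extents
    # Both grids use 64 GPUs: time-only versus all four dimensions.
    for grid in [(1, 1, 1, 64), (2, 2, 2, 8)]:
        local = local_dims(lattice, grid)
        halo = halo_sites(local, grid)
        print(grid, local, halo, halo / prod(local))
```

With time-only partitioning the local time extent shrinks as GPUs are added while each boundary face stays the full spatial volume, so the ratio of exchanged sites to local sites grows quickly; spreading the cuts over four dimensions keeps the sub-lattice closer to hypercubic.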