TY - GEN
T1 - Hardware support for OpenMP collective operations
AU - Kim, Soohong P.
AU - Midkiff, Samuel P.
AU - Dietz, Henry G.
PY - 2010
Y1 - 2010
N2 - Efficient implementation of OpenMP collective operations (e.g., barriers and reductions) is essential for good performance from OpenMP programs. State-of-the-art on-chip networks and block-based cache coherence protocols used in shared-memory Chip MultiProcessors (CMPs) are inefficient for implementing these collective operations. The performance of CMPs can be seriously degraded by the multitude of memory requests and coherence messages required to implement collective operations. To provide efficient support for OpenMP collective operations, this paper presents a CMP-AFN architecture and Instruction Set Architecture (ISA) extensions that augment a conventional shared-memory CMP with a tightly integrated Aggregate Function Network (AFN) that implements low-latency collectives without using or interfering with the memory hierarchy. For a modest increase in circuit complexity, traffic within a CMP's internal network is dramatically reduced, improving the performance of caches and reducing power consumption. Full-system simulations of 16-core CMPs show that a CMP-AFN significantly outperforms the reference design, eliminating more than 60% of memory accesses and more than 70% of private L1 data cache misses in both the EPCC OpenMP microbenchmarks and SPEC OMP benchmarks.
AB - Efficient implementation of OpenMP collective operations (e.g., barriers and reductions) is essential for good performance from OpenMP programs. State-of-the-art on-chip networks and block-based cache coherence protocols used in shared-memory Chip MultiProcessors (CMPs) are inefficient for implementing these collective operations. The performance of CMPs can be seriously degraded by the multitude of memory requests and coherence messages required to implement collective operations. To provide efficient support for OpenMP collective operations, this paper presents a CMP-AFN architecture and Instruction Set Architecture (ISA) extensions that augment a conventional shared-memory CMP with a tightly integrated Aggregate Function Network (AFN) that implements low-latency collectives without using or interfering with the memory hierarchy. For a modest increase in circuit complexity, traffic within a CMP's internal network is dramatically reduced, improving the performance of caches and reducing power consumption. Full-system simulations of 16-core CMPs show that a CMP-AFN significantly outperforms the reference design, eliminating more than 60% of memory accesses and more than 70% of private L1 data cache misses in both the EPCC OpenMP microbenchmarks and SPEC OMP benchmarks.
UR - http://www.scopus.com/inward/record.url?scp=77954407791&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77954407791&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-13374-9_3
DO - 10.1007/978-3-642-13374-9_3
M3 - Conference contribution
AN - SCOPUS:77954407791
SN - 3642133738
SN - 9783642133732
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 31
EP - 49
BT - Languages and Compilers for Parallel Computing - 22nd International Workshop, LCPC 2009, Revised Selected Papers
T2 - 22nd International Workshop on Languages and Compilers for Parallel Computing, LCPC 2009
Y2 - 8 October 2009 through 10 October 2009
ER -