The concept of “data parallelism” is a pervasive force throughout parallel processing. Although a certain level of processing-element autonomy can help performance, many parallel algorithms, applications, and compiler analysis techniques focus on identifying a set of data objects that can be processed using loosely synchronous parallelism. Thus, it is not surprising that many communication libraries support at least a few synchronized aggregate operations on data. Over the past few years, we have developed eleven different types of PAPERS (Purdue’s Adapter for Parallel Execution and Rapid Synchronization) hardware specifically to implement aggregate functions efficiently for clusters of PCs or workstations. The Aggregate Function Application Program Interface (AFAPI) library was initially designed as a portable high-level interface to the various types of PAPERS cluster hardware, so one would expect it to perform well on this custom hardware, and it does. In this paper, we show that the AFAPI is also an efficient programming model for other types of parallel systems, especially shared-memory multiprocessors. For many operations, AFAPI can outperform threads libraries and other more traditional shared-memory programming models.