Digital halftoning is a crucial technique used in digital printers to convert a continuous-tone image into a pattern of black and white dots. Halftoning is used since printers have a limited availability of inks and cannot reproduce all the color intensities in a continuous image. Error Diffusion is an algorithm in halftoning that iteratively quantizes pixels in a neighborhood dependent fashion. This manuscript focuses on the development, design and Hardware Description Language (HDL) functional and performance simulation validation of a parallel scalable hardware architecture for high performance implementation of a high quality Stacked Error Diffusion algorithm. A CMYK printer, utilizing the high quality error diffusion algorithm, would be required to execute error diffusion 16 times per pixel, resulting in a potentially high computational cost. The algorithm, originally described in 'C', requires a significant processing time when implemented on a conventional single Central Processing Unit (CPU) based computer system. Thus, a new scalable high performance parallel hardware processor architecture is developed to implement the algorithm and is implemented to and tested on a single Programmable Logic Device (PLD) based Field Programmable Gate Array (FPGA) chip. There is a significant decrease in the run time of the algorithm when run on the newly proposed parallel architecture implemented to FPGA technology compared to execution on a single CPU based system.