Taming the cost of coherency for multicore systems

Innovation Matters


When building multicore systems, computer architects must increase performance while exploiting the data-sharing benefits of parallelism. In designing the basic building block for the BlueGene/P supercomputer, a four-way multicore node, our architects turned to a filtering technique that recognizes and then eliminates unnecessary coherence requests and the cost they incur.



As multicore systems scale to higher core counts, the cost of the coherence traffic exchanged between cores grows more severe. In a snooping-based coherence system, each core must inspect the memory traffic of every other core. For BlueGene/P’s write-invalidate protocol, for example, every write request must be snooped.


Hence, as the number of cores increases, the impact of coherence requests waiting to be processed scales up commensurately. Indeed, as each of the n nodes in a multicore system must process all other (n-1) nodes’ memory requests, the number of coherence actions scales with O(n²). The number of coherence actions will affect the overall performance of processors because these coherence requests interfere with a core’s access to its own cache.
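The quadratic growth can be made concrete with a small count. This is a minimal sketch, assuming every core issues the same number of writes and that each write is snooped by all other cores; the function name and the write count are illustrative, not taken from the BlueGene/P design.

```python
def coherence_actions(n_cores: int, writes_per_core: int) -> int:
    """Count snoop lookups when every write is broadcast to all other cores."""
    # Each write issued by one core must be inspected by the other n - 1 cores,
    # so the total is n * (n - 1) * writes, which grows as O(n^2).
    return n_cores * (n_cores - 1) * writes_per_core


for n in (2, 4, 8, 16):
    print(n, coherence_actions(n, writes_per_core=1000))
```

Doubling the core count roughly quadruples the number of snoop lookups, which is why filtering becomes more valuable as systems scale.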

Figure 1. The snooping scenario

The four processors pictured here snoop on all transactions generated by other processors and the DMA engine. The snoop unit contains snoop filters that can filter out the majority of snoop requests received from other processors and the DMA engine.

One way to reduce the impact of coherence requests is to use dual-port cache directories that let snoop requests and cache accesses proceed in parallel. This solution, however, would come at the cost of a significant power and area increase (compared to a single-ported cache directory). Even so, system performance might be affected adversely as n cores must queue to use a single snoop directory port shared by all of the caches.

To increase performance while reducing power dissipation, the designers of the BlueGene/P multicore compute node turned to filtering out unnecessary coherence requests. In this scenario, each processor is shielded from unnecessary snoop requests by a novel snoop filter unit first introduced in the BlueGene/P system. The new snoop filter unit contains multiple filter engines, each adapted to a particular memory access pattern. Stream registers track a series of accesses by observing a processor’s own memory read requests. Snoop caches remember recently received coherence requests that have already been processed and, therefore, do not need to be repeated. Finally, a software-configurable range filter lets software exclude private data areas from snoop processing.
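To illustrate how a stream register can summarize a processor's own reads, here is a simplified sketch. It assumes a single base/mask pair and ignores cache evictions; the class and method names are hypothetical, and the actual hardware organization is the one described in the publications listed at the end of this article, not this code.

```python
class StreamRegister:
    """Conservative summary of the lines a core has loaded (hypothetical model)."""

    def __init__(self, addr_bits: int = 32):
        self.full_mask = (1 << addr_bits) - 1
        self.base = None            # first line address observed
        self.mask = self.full_mask  # 1 = bit matches base in every tracked line

    def observe_load(self, line_addr: int) -> None:
        """Record a line address taken from the core's own read request."""
        if self.base is None:
            self.base = line_addr
        else:
            # Clear every bit position where the new address differs from base.
            self.mask &= ~(self.base ^ line_addr) & self.full_mask

    def may_be_cached(self, snoop_addr: int) -> bool:
        """Return False only when the snooped line was certainly never loaded."""
        if self.base is None:
            return False
        return ((snoop_addr ^ self.base) & self.mask) == 0


reg = StreamRegister()
reg.observe_load(0x1000)
reg.observe_load(0x1040)
print(reg.may_be_cached(0x1040))  # tracked line: snoop must be forwarded
print(reg.may_be_cached(0x2000))  # differs in a still-masked bit: can be filtered
```

The summary can only become less precise as more addresses are observed, so it may forward a snoop that was not needed but never drops one that was.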

Where our research is going
Because of global financial and energy concerns, we have a mandate to make computers use less power. We are excited that the system we built became the most power-efficient supercomputer, as reflected in its top rank at product launch in the “Green500,” an important new metric that will energize the information technology community to optimize energy efficiency in computers.

Figure 2: Maximizing snoop efficiency


Each snoop unit contains one port filter per coherence traffic participant. The port filter consists of multiple snoop filters that work together to maximize snoop efficiency.
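One way to picture filters working together is that a snoop is forwarded to the cache only if no filter can rule it out. The sketch below combines a range filter with a small snoop cache; all class names, the half-open range format, and the LRU replacement policy are assumptions made for illustration rather than details of the BlueGene/P hardware.

```python
from collections import OrderedDict


class RangeFilter:
    """Drops snoops that fall in software-configured excluded ranges."""

    def __init__(self, excluded):
        self.excluded = list(excluded)  # half-open (lo, hi) address ranges

    def can_drop(self, addr: int) -> bool:
        return any(lo <= addr < hi for lo, hi in self.excluded)


class SnoopCache:
    """Remembers lines already invalidated by an earlier snoop (LRU, assumed)."""

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self.lines = OrderedDict()

    def can_drop(self, addr: int) -> bool:
        return addr in self.lines

    def record_snoop(self, addr: int) -> None:
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used entry

    def observe_local_load(self, addr: int) -> None:
        # The line may be cached again, so later snoops must not be dropped.
        self.lines.pop(addr, None)


class PortFilter:
    """Forwards a snoop only when no member filter can rule it out."""

    def __init__(self, range_filter: RangeFilter, snoop_cache: SnoopCache):
        self.range_filter = range_filter
        self.snoop_cache = snoop_cache

    def handle_snoop(self, addr: int) -> bool:
        """Return True if the snoop must be forwarded to the cache."""
        if self.range_filter.can_drop(addr) or self.snoop_cache.can_drop(addr):
            return False
        self.snoop_cache.record_snoop(addr)
        return True


pf = PortFilter(RangeFilter([(0x8000, 0x9000)]), SnoopCache(capacity=2))
print(pf.handle_snoop(0x100))   # first snoop for this line is forwarded
print(pf.handle_snoop(0x100))   # repeat is dropped by the snoop cache
print(pf.handle_snoop(0x8800))  # excluded range is dropped by the range filter
```

Because each filter only ever claims that a snoop is safe to drop, combining them with a logical OR keeps the result correct while raising the overall filter rate.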

Initial performance analysis of the design point was performed using trace-based simulation. As shown in Figure 3, this analysis showed that up to 98 percent of snoop requests could be filtered out for the popular SPLASH-2 benchmark suite. The result was a significant improvement in system performance, power, energy and energy-delay characteristics, as demonstrated by application measurements performed using the UMT2K* application on BlueGene/P hardware after system completion (shown in Figure 4).

Figure 3. Initial performance analysis of the design point using trace-based simulation

Using trace-based analysis, we modeled the efficiency of snoop filtering and found that for common benchmarks, such as the SPLASH-2 benchmark suite, snoop filtering could filter out up to 98 percent of snoop requests.
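The idea behind such a trace-based estimate can be sketched in a few lines. The example below assumes a hypothetical trace format of ("load", line) and ("snoop", line) events seen by one core, and it models an ideal filter that knows exactly which lines are cached; it is not the simulator used for Figure 3.

```python
def filter_rate(trace) -> float:
    """Fraction of snoops that an ideal filter could drop for one core."""
    cached = set()
    snoops = 0
    dropped = 0
    for kind, line in trace:
        if kind == "load":
            cached.add(line)
        elif kind == "snoop":
            snoops += 1
            if line in cached:
                # Write-invalidate: the local copy is removed by this snoop.
                cached.discard(line)
            else:
                # The line is not cached locally, so the snoop was unnecessary.
                dropped += 1
    return dropped / snoops if snoops else 0.0


trace = [("load", 1), ("snoop", 2), ("snoop", 1), ("snoop", 1), ("snoop", 3)]
print(filter_rate(trace))  # 0.75
```

An ideal filter gives an upper bound; a real filter with finite storage would be measured the same way, by replaying the trace and counting the snoops it drops.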

Figure 4: BlueGene/P hardware measurements: With/without snoop filtering


With these BlueGene/P hardware measurements comparing system operation with and without snoop filtering, we find that snoop filtering offers a 10 percent speedup and a reduction in overall system power and energy, as well as in the energy-delay and energy-delay² metrics.
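For readers unfamiliar with these metrics, they are conventionally derived from average power and runtime. The sketch below uses hypothetical power and runtime values chosen only to show the arithmetic for a 10 percent speedup at unchanged power; they are not BlueGene/P measurements.

```python
def metrics(power_watts: float, runtime_s: float) -> dict:
    """Energy, energy-delay, and energy-delay-squared for one run."""
    energy = power_watts * runtime_s
    return {
        "energy": energy,                         # joules
        "energy_delay": energy * runtime_s,       # joule-seconds
        "energy_delay2": energy * runtime_s ** 2  # joule-seconds squared
    }


# Hypothetical runs: same power, runtime shortened by a 1.1x speedup.
baseline = metrics(power_watts=100.0, runtime_s=110.0)
filtered = metrics(power_watts=100.0, runtime_s=100.0)
for name in baseline:
    print(name, filtered[name] / baseline[name])
```

Because delay enters the higher-order metrics more than once, a speedup improves energy-delay and energy-delay-squared more strongly than it improves energy alone, even before any power reduction is counted.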

* UMT2K is a 3D, deterministic, multi-group photon transport program for unstructured meshes.

This article was written by Valentina Salapura on behalf of the BlueGene team.

Publications

Valentina Salapura. Scaling Up Next Generation Supercomputers. Keynote at ACM International Conference on Computing Frontiers 2008, Ischia, Italy, May 2008.

Valentina Salapura, Matthias Blumrich and Alan Gara. Design and Implementation of the Blue Gene/P Snoop Filter. HPCA'08: The 14th International Symposium on High-Performance Computer Architecture, February 2008.

Matthias Blumrich, Valentina Salapura and Alan Gara. Exploring the Architecture of a Stream Register-Based Snoop Filter, Transactions on HiPEAC, Volume 3, Issue 2, 2008.

Valentina Salapura, Matthias Blumrich and Alan Gara. Improving the Accuracy of Snoop Filtering Using Stream Registers, MEDEA Workshop in conjunction with PACT 2007 Conference, September 2007.

Valentina Salapura. Next generation supercomputers: Exploiting innovative massively parallel system architecture to facilitate breakthrough scientific discoveries. Invited talk at Grace Hopper Celebration of Women in Computing 2007 (GHC 2007), Orlando, Florida, October 2007.

Last updated on December 1, 2008

Innovator's corner  

Valentina Salapura, Researcher

What's the potential for the work you're doing?
Supercomputers are the prime enablers of computational science. By simulating systems, we can analyze the behavior of natural processes. We can gain insight into areas previously not accessible via traditional sciences, such as physical and chemical processes at the nano-scale, so that ultimately we can understand the creation of matter and the structure of the universe, as well as the causes of disease.

What is the most interesting part of your research?
Computer architecture is experiencing tectonic shifts. The way we think of computers will change drastically. Until recently, building high-performance systems did not address power concerns. Building a high-performance system today requires that we worry about power right from the beginning. Other challenges are on the horizon with the generation of new technologies: Feature size and voltage concerns make designing reliable systems both more exciting and more challenging.

Who or what inspired you to go into this field?
My father was a physicist, so I was surrounded by science from a young age. I was always interested in how science can improve people’s lives, and information technology appeared to be a key discipline.

What is your favorite invention of all time?
The printing press has had profound influence on the way knowledge and education can be spread. As a result, knowledge has become more accessible and democratic.