Distributed Memory Management Units Architecture for NoC-based CMPs.
ABSTRACT Network on Chip (NoC) is considered a promising interconnection paradigm for future chip multiprocessors (CMPs). As the number of processing elements (PEs) on chip keeps growing, the delay caused by simultaneous memory references from these PEs is emerging as a serious performance bottleneck. A major part of this delay comes from the Memory Management Unit (MMU), owing to its centralized structure. In this paper, we propose a novel distributed MMU architecture for NoC-based CMPs that effectively reduces this bottleneck compared with a traditional centralized MMU. We discuss the benefits of this architecture in terms of TLB hit rate, network communication efficiency, memory bandwidth, and coherence. Experimental results show that the distributed MMU structure significantly improves network throughput balance and lowers communication delay.
ABSTRACT: On-chip distributed memory has emerged as a promising memory organization for future many-core systems, since it efficiently exploits memory-level parallelism and lightens the load on each memory module by providing a number of memory interfaces comparable to the number of on-chip cores. The packet-based memory access model (PDMA) provides a scalable and flexible solution for distributed memory management, but suffers from complicated and costly on-chip network protocol translation and massive interference among packets, which lead to unpredictable performance. In this paper we propose a direct distributed memory access (DDMA) model, in which local cores access remote memory directly via remote-to-local virtualization, without network protocol translation. From the perspective of a local core, remote memory controllers (MCs) can be manipulated directly by accessing the local agent MC, which is responsible for accessing remote memory through high-performance inter-tile communication. We further discuss the architectural support required by the DDMA model, including the memory interface design, the workflow, and the protocols involved. Simulation results from executing the PARSEC benchmarks show that our DDMA architecture outperforms PDMA in both average memory access latency and IPC, by 17.8% and 16.6% respectively on average. In addition, DDMA manages congested memory traffic better: reducing bandwidth while running memory-intensive SPEC2006 workloads incurs only an 18.9% performance penalty, compared with 38.3% for PDMA.
Journal of Parallel and Distributed Computing 01/2013; DOI:10.1016/j.jpdc.2013.11.004 · 1.01 Impact Factor