Distributed Memory Management Units Architecture for NoC-based CMPs
Network-on-Chip (NoC) is considered a promising paradigm for the interconnection of future chip multiprocessors (CMPs). As the number of processing elements (PEs) on chip keeps growing, the delay caused by simultaneous memory references from these PEs is emerging as a serious performance bottleneck. A major part of this delay comes from the Memory Management Unit (MMU) due to its centralized structure. In this paper, we propose a novel distributed MMU architecture for NoC-based CMPs, which effectively reduces this bottleneck compared with a traditional centralized MMU. We discuss the benefits of this architecture in terms of TLB hit rate, network communication efficiency, memory bandwidth, and coherence. Experimental results show that the distributed MMU structure significantly improves network throughput balance and reduces communication delay.
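The core idea can be illustrated with a small sketch (a hypothetical illustration, not the paper's actual design): translation requests are partitioned across several MMU nodes, e.g. by hashing the virtual page number (VPN), so that concurrent requests from many PEs spread out instead of queuing at one centralized MMU. The node count and hash function here are assumptions for the example.

```python
# Hypothetical sketch: distributing address-translation requests across
# several MMU nodes instead of a single centralized MMU.

NUM_MMU_NODES = 4  # assumed: e.g. one MMU node per quadrant of the mesh

def mmu_node_for(vpn: int) -> int:
    """Map a virtual page number to the MMU node that owns its translation."""
    return vpn % NUM_MMU_NODES  # assumed interleaving policy

def route_translations(vpns):
    """Count how many translation requests each MMU node receives."""
    load = [0] * NUM_MMU_NODES
    for vpn in vpns:
        load[mmu_node_for(vpn)] += 1
    return load

# With a centralized MMU all 100 requests would queue at one node;
# here they spread evenly across the four nodes.
print(route_translations(range(100)))  # -> [25, 25, 25, 25]
```

Any interleaving that keeps a page's translation at a fixed node works; the even spread is what reduces the contention delay the abstract describes.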
- With ongoing technology scaling, networks-on-chip will have to be organized hierarchically in order to cope with their growing size. Privileged nodes carry out dedicated tasks essential to establish and coordinate proper functionality within and among hierarchical units. In order to mitigate the impact of a privileged node's failure, redundancy has to be established by a suitable distribution of privileged nodes. This contribution presents an integer linear programming (ILP) formulation for finding distributions that minimize hardware cost under redundancy constraints, or that maximize redundancy for a given cost, for arbitrary NoC topologies. For mesh and mesh torus, optimal solutions are obtained with up to 289 nodes. For larger regular networks, a technique based on the generation and assembly of regular tiles is proposed. A characterization of the conditions for optimality of tile-based networks is provided.
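A toy version of the optimization problem can make the redundancy constraint concrete. The sketch below uses exhaustive search rather than the paper's ILP formulation, and the mesh size, redundancy degree K, and coverage radius R are assumptions for illustration: find the fewest privileged nodes such that every node has at least K privileged nodes within R hops.

```python
from itertools import combinations

# Toy illustration (exhaustive search, not the paper's ILP): place the
# fewest privileged nodes on a small mesh so that every node has at
# least K privileged nodes within R hops (the redundancy constraint).
N, K, R = 3, 2, 2          # assumed: 3x3 mesh, redundancy 2, radius 2 hops
nodes = [(x, y) for x in range(N) for y in range(N)]

def hops(a, b):
    """Manhattan (hop) distance in a mesh NoC."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def feasible(priv):
    """Every node must see at least K privileged nodes within R hops."""
    return all(sum(hops(n, p) <= R for p in priv) >= K for n in nodes)

def min_privileged():
    """Smallest privileged-node set meeting the redundancy constraint."""
    for size in range(1, len(nodes) + 1):
        for priv in combinations(nodes, size):
            if feasible(priv):
                return priv
    return None

print(min_privileged())
```

An ILP solver replaces this enumeration with binary placement variables and one coverage inequality per node, which is what makes the 289-node instances in the abstract tractable.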
- On-chip distributed memory has emerged as a promising memory organization for future many-core systems, since it efficiently exploits memory-level parallelism and can lighten the load on each memory module by providing a number of memory interfaces comparable to the number of on-chip cores. The packet-based distributed memory access (PDMA) model provides a scalable and flexible solution for distributed memory management, but it suffers from complicated and costly on-chip network protocol translation and massive interference among packets, which lead to unpredictable performance. In this paper we propose a direct distributed memory access (DDMA) model, in which remote memory can be directly accessed by local cores via remote-to-local virtualization, without network protocol translation. From the perspective of local cores, remote memory controllers (MCs) can be manipulated directly by accessing the local agent MC, which is responsible for accessing remote memory through high-performance inter-tile communication. We further discuss detailed architectural support for the DDMA model, including the memory interface design, workflow, and the protocols involved. Simulation results of executing PARSEC benchmarks show that our DDMA architecture outperforms PDMA in terms of both average memory access latency and IPC, by 17.8% and 16.6% respectively on average. Moreover, DDMA manages congested memory traffic better: a reduction of bandwidth while running memory-intensive SPEC2006 workloads incurs only an 18.9% performance penalty, compared with 38.3% for PDMA.
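The remote-to-local virtualization idea can be sketched as a simple address-window translation (a hypothetical illustration, not the paper's actual interface): the local agent MC exposes one address window per remote tile, so a core issues an ordinary local memory access and the agent resolves it to a remote MC without any packet-protocol translation. The window size and tile mapping below are assumptions for the example.

```python
# Hypothetical sketch of DDMA-style remote-to-local virtualization:
# each address window of the local agent MC stands in for one remote
# tile's memory controller.

WINDOW_SIZE = 1 << 20  # assumed: 1 MiB window per remote MC
# window index -> id of the remote tile owning that memory (assumed mapping)
WINDOW_TO_TILE = {0: 2, 1: 5, 2: 7}

def translate(local_addr: int):
    """Resolve a local agent-MC address to (remote tile id, remote offset)."""
    window, offset = divmod(local_addr, WINDOW_SIZE)
    if window not in WINDOW_TO_TILE:
        raise ValueError("address outside any remote window")
    return WINDOW_TO_TILE[window], offset

# A load at local address 0x180000 falls in window 1 -> tile 5
print(translate(0x180000))  # -> (5, 524288)
```

Because the core sees only an ordinary local address, no packet-format conversion sits on the access path, which is the source of the latency and IPC gains the abstract reports.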