Article

Memory Systems and Interconnects for Scale-Out Servers

Abstract

The information revolution of the last decade has been fueled by the digitization of almost all human activities through a wide range of Internet services. The backbone of this information age is the scale-out datacenter, which must collect, store, and process massive amounts of data. These datacenters distribute vast datasets across a large number of servers, typically into memory-resident shards, so as to maintain strict quality-of-service guarantees. While data is driving the skyrocketing demand for scale-out servers, processor and memory manufacturers have reached fundamental efficiency limits and can no longer increase server energy efficiency at a sufficient pace. As a result, energy has emerged as the main obstacle to the scalability of information technology (IT), with huge economic implications. Delivering sustainable IT calls for a paradigm shift in computer system design. As memory has taken a central role in IT infrastructure, memory-centric architectures are required to fully utilize IT's costly memory investment. In response, processor architects are resorting to manycore architectures to leverage the abundant request-level parallelism found in data-centric applications. Manycore processors fully utilize available memory resources, thereby increasing IT efficiency by almost an order of magnitude. Because manycore server chips execute a large number of concurrent requests, they exhibit a high incidence of accesses to the last-level cache (LLC) for fetching instructions (due to large instruction footprints) and to off-chip memory for accessing dataset objects (due to a lack of temporal reuse in on-chip caches). As a result, on-chip interconnects and the memory system are emerging as major performance and energy-efficiency bottlenecks in servers. This thesis seeks to architect on-chip interconnects and memory systems that are tuned to the requirements of memory-centric scale-out servers. By studying a wide range of data-centric applications, we uncover application phenomena common across them and examine their implications for on-chip network and off-chip memory traffic. Finally, we propose specialized on-chip interconnects and memory systems that leverage common traffic characteristics, thereby improving server throughput and energy efficiency.
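
To make the bandwidth pressure described above concrete, the following back-of-the-envelope Python sketch (not taken from the thesis; the core counts, miss rates, and channel figures are all assumed values) contrasts the aggregate off-chip traffic a manycore chip can generate with the bandwidth a handful of DDR channels can supply.

def offchip_bandwidth_demand_gbs(cores, freq_ghz, ipc, llc_mpki, line_bytes=64):
    """Aggregate off-chip traffic (GB/s) for a hypothetical manycore chip."""
    instructions_per_sec = cores * freq_ghz * 1e9 * ipc
    misses_per_sec = instructions_per_sec * (llc_mpki / 1000.0)
    return misses_per_sec * line_bytes / 1e9

# Assumed figures: 64 cores at 2 GHz, IPC of 0.8, 20 combined instruction and
# data LLC misses per kilo-instruction, 64 B cache lines.
demand = offchip_bandwidth_demand_gbs(cores=64, freq_ghz=2.0, ipc=0.8, llc_mpki=20)
supply = 4 * 25.6  # e.g., four DDR4-3200 channels at ~25.6 GB/s each (assumed)
print(f"off-chip demand ~{demand:.0f} GB/s vs. supply ~{supply:.0f} GB/s")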

... The MAC operation power is modeled assuming 3× more current than a typical gapless read [56]. We assume that each GDDR6 memory controller for two channels consumes 314.6 mW [108] and each BOOM RISC-V core consumes 250 mW [9]. We implement the RTL of the remaining components in the CXL controller and synthesize it using a TSMC 28nm technology library and the Synopsys Design Compiler [104]. ...
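
As a rough illustration of how such per-component figures combine, the following Python sketch tallies the power of a hypothetical CXL memory device from the numbers quoted above; the component counts and the synthesized-logic power are placeholders, not the paper's actual configuration.

GDDR6_CTRL_MW = 314.6   # per memory controller serving two channels [108]
BOOM_CORE_MW = 250.0    # per BOOM RISC-V core [9]

def cxl_device_power_w(num_channels, num_cores, ctrl_logic_mw=500.0):
    """Rough power estimate (W) for one CXL device: controllers + cores + misc logic."""
    controllers = (num_channels + 1) // 2          # one controller per two GDDR6 channels
    total_mw = (controllers * GDDR6_CTRL_MW        # memory controllers
                + num_cores * BOOM_CORE_MW         # RISC-V cores
                + ctrl_logic_mw)                   # synthesized CXL-controller logic (assumed)
    return total_mw / 1000.0

print(f"~{cxl_device_power_w(num_channels=8, num_cores=4):.2f} W per device")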
Preprint
Large Language Model (LLM) inference generates one token at a time in an autoregressive manner, which exhibits notably lower operational intensity than earlier Machine Learning (ML) models such as encoder-only transformers and Convolutional Neural Networks. At the same time, LLMs possess large parameter sizes and use key-value caches to store context information. Modern LLMs support context windows of up to 1 million tokens to generate versatile text, audio, and video content. A large key-value cache unique to each prompt requires a large memory capacity, limiting the inference batch size. Both low operational intensity and limited batch size necessitate high memory bandwidth. However, contemporary hardware systems for ML model deployment, such as GPUs and TPUs, are primarily optimized for compute throughput. This mismatch challenges the efficient deployment of advanced LLMs and makes users pay for expensive compute resources that are poorly utilized for memory-bound LLM inference tasks. We propose CENT, a CXL-ENabled GPU-Free sysTem for LLM inference, which harnesses CXL memory expansion capabilities to accommodate substantial LLM sizes, and utilizes near-bank processing units to deliver high memory bandwidth, eliminating the need for expensive GPUs. CENT exploits a scalable CXL network to support peer-to-peer and collective communication primitives across CXL devices. We implement various parallelism strategies to distribute LLMs across these devices. Compared to GPU baselines with maximum supported batch sizes and similar average power, CENT achieves 2.3× higher throughput and consumes 2.3× less energy. CENT enhances the Total Cost of Ownership (TCO), generating 5.2× more tokens per dollar than GPUs.
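
The batch-size limit the abstract refers to follows from the standard transformer key-value-cache footprint calculation; the Python sketch below applies it to an assumed Llama-2-70B-like model shape and an assumed memory budget, and is not CENT's own sizing.

def kv_cache_gib(layers, kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Per-request KV-cache size in GiB (keys + values, FP16)."""
    elems = 2 * layers * kv_heads * head_dim * context_len
    return elems * bytes_per_elem / 2**30

# Assumed Llama-2-70B-like shape with grouped-query attention and a 32K context.
per_request = kv_cache_gib(layers=80, kv_heads=8, head_dim=128, context_len=32_000)
free_mem_gib = 80  # assumed device memory left over after the weights
print(f"KV cache ~{per_request:.1f} GiB per request -> batch size <= {int(free_mem_gib // per_request)}")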
... While the memory controller and interface require negligible power compared to the processor, we include them for completeness. We estimate controller and interface power per DDR5 channel to be 0.5W and 0.6W, respectively [57], or 13W in total for a baseline processor with 12 channels. Similarly, PCIe 5.0's interface power is ∼0.2W per lane [4], or 77W for the 384 lanes required to support COAXIAL's 48 DDR5 channels. ...
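
The interface-power figures above follow from simple per-channel and per-lane arithmetic; the Python sketch below reproduces them under the assumption of 8 PCIe 5.0 lanes per DDR5-channel equivalent of bandwidth.

DDR5_CTRL_W, DDR5_PHY_W = 0.5, 0.6   # controller and interface power per channel [57]
PCIE5_LANE_W = 0.2                   # PCIe 5.0 interface power per lane [4]

baseline_channels = 12
coaxial_channels = 48
lanes_per_channel = 384 // coaxial_channels   # 8 lanes per DDR5-channel equivalent (assumed)

baseline_if_w = baseline_channels * (DDR5_CTRL_W + DDR5_PHY_W)
coaxial_if_w = coaxial_channels * lanes_per_channel * PCIE5_LANE_W

print(f"baseline DDR5 interfaces: ~{baseline_if_w:.0f} W")   # ~13 W
print(f"CoaXiaL CXL interfaces:   ~{coaxial_if_w:.0f} W")    # ~77 W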
Preprint
The memory system is a major performance determinant for server processors. Ever-growing core counts and datasets demand higher bandwidth and capacity as well as lower latency from the memory system. To keep up with growing demands, DDR, the dominant processor interface to memory over the past two decades, has offered higher bandwidth with every generation. However, because each parallel DDR interface requires a large number of on-chip pins, the processor's memory bandwidth is ultimately restrained by its pin count, which is a scarce resource. With limited bandwidth, multiple memory requests typically contend for each memory channel, resulting in significant queuing delays that often overshadow DRAM's service time and degrade performance. We present CoaXiaL, a server design that overcomes memory bandwidth limitations by replacing all DDR interfaces to the processor with the more pin-efficient CXL interface. The widespread adoption and industrial momentum of CXL make such a transition possible, offering 4× higher bandwidth per pin compared to DDR at a modest latency overhead. We demonstrate that, for a broad range of workloads, CXL's latency premium is more than offset by its higher bandwidth. As CoaXiaL distributes memory requests across more channels, it drastically reduces queuing delays and thereby both the average value and variance of memory access latency. Our evaluation with a variety of workloads shows that CoaXiaL improves the performance of manycore throughput-oriented servers by 1.52× on average and by up to 3×.
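
The claim that more channels shrink queuing delay can be illustrated with a textbook M/M/1 approximation; the Python sketch below uses assumed arrival rates and per-request service times and is only a qualitative stand-in for the paper's evaluation.

def avg_memory_latency_ns(total_req_rate_gps, channels, service_time_ns=15.0):
    """Mean per-request latency (service + queuing) under an M/M/1 approximation."""
    arrival_per_channel = total_req_rate_gps * 1e9 / channels   # requests/s per channel
    service_rate = 1e9 / service_time_ns                        # requests/s per channel
    utilization = arrival_per_channel / service_rate
    if utilization >= 1.0:
        return float("inf")                                     # channel saturated
    return service_time_ns / (1.0 - utilization)                # W = 1 / (mu - lambda)

# Assumed: 0.6 billion memory requests/s chip-wide, 15 ns per-request channel occupancy.
for ch in (12, 48):
    print(f"{ch} channels: ~{avg_memory_latency_ns(0.6, ch):.0f} ns average latency")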
... They play a pivotal role in ensuring the performance and power scalability of several manycore chips, as they provide the path to performance-critical LLC-resident instructions. Communication power dissipation is emerging as a significant fraction of the total chip power budget [2], [3]. To meet the high data rates and low energy-per-bit requirements projected by the International Technology Roadmap for Semiconductors (ITRS), total losses, comprising conductor and dielectric losses, must be minimized at current and future technology nodes [4], [5]. ...
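
One reason conductor losses grow at the projected data rates is that the copper skin depth shrinks toward the scale of the surface roughness; the short Python sketch below computes skin depth versus frequency using the standard formula, with an assumed room-temperature copper resistivity.

import math

RHO_CU = 1.68e-8           # copper resistivity at room temperature, ohm*m (assumed)
MU_0 = 4 * math.pi * 1e-7  # vacuum permeability, H/m

def skin_depth_nm(freq_hz):
    """Skin depth of copper in nanometers at the given frequency."""
    return math.sqrt(RHO_CU / (math.pi * freq_hz * MU_0)) * 1e9

for f_ghz in (1, 5, 10, 18):
    print(f"{f_ghz:>2} GHz: skin depth ~{skin_depth_nm(f_ghz * 1e9):.0f} nm")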
Article
In planar on-chip copper interconnects, conductor losses due to surface roughness demand explicit consideration for accurate modeling of their performance metrics. This is quite pertinent for high-performance manycore processors/servers, where on-chip interconnects are increasingly emerging as one of the key performance bottlenecks. This paper presents a novel analytical model for parameter extraction in current and future on-chip interconnects. Our proposed model aids in analyzing the impact of spatial and vertical surface roughness on their electrical performance. Our analysis clearly depicts that as technology nodes scale down, the effect of surface roughness becomes dominant and cannot be ignored. Based on AFM images of fabricated ultra-thin copper sheets, we have extracted roughness parameters to define realistic surface profiles using the well-known Mandelbrot-Weierstrass (MW) fractal function. For our analysis, we have considered four current and future interconnect technology nodes (45 nm, 22 nm, 13 nm, and 7 nm) and evaluated the impact of surface roughness on typical performance metrics such as delay, energy, and bandwidth. Results obtained using our model are verified against the industry-standard field solver Ansys HFSS as well as available experimental data, exhibiting accuracy within 9%. We present signal integrity analysis using eye diagrams at 1 Gbps, 5 Gbps, 10 Gbps, and 18 Gbps bit rates to quantify the increase in frequency-dependent losses due to surface roughness. Finally, simulating a standard three-line on-chip interconnect structure, we also report the computational overhead incurred for different values of roughness and technology nodes.
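
For readers unfamiliar with fractal surface descriptions, the following Python sketch generates a 1-D roughness profile using one common form of the Weierstrass-Mandelbrot function with assumed parameters; it is illustrative only and does not reproduce the authors' AFM-based extraction.

import math

def wm_profile_nm(x_nm, D=1.5, G_nm=1.0, gamma=1.5, n_terms=30):
    """Height z(x) of a 1-D Weierstrass-Mandelbrot profile, fractal dimension 1 < D < 2."""
    return G_nm ** (D - 1) * sum(
        math.cos(2 * math.pi * gamma ** n * x_nm) / gamma ** ((2 - D) * n)
        for n in range(n_terms)
    )

# Sample the profile every 0.5 nm and report its RMS roughness.
heights = [wm_profile_nm(i * 0.5) for i in range(200)]
rms_nm = math.sqrt(sum(z * z for z in heights) / len(heights))
print(f"RMS roughness of the sampled profile: ~{rms_nm:.2f} nm")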
Article
This paper presents an overview of the problem of surface roughness in ultra-scaled copper (Cu) interconnects. Surface roughness can severely degrade the electrical and thermal performance of Cu interconnects. This penalty has largely been ignored, resulting in fairly optimistic models and estimates. It is in this context that this paper and our ongoing work gain significance. The authors make an attempt to present the big picture with reference to interconnect surface roughness and its implications on various design metrics.