Mattan Erez's research while affiliated with UT Dots and other places

Publications (115)

Preprint
We make three observations in modern processors: (1) LLC capacity is getting larger (up to 1GB); (2) core counts are increasing (up to 128 cores), accumulating a more significant amount of private L2 cache capacity on the chip; and (3) overall processor utilization in the cloud remains very low despite many efforts, leaving many large private cache...
Preprint
The security goals of cloud providers and users include memory confidentiality and integrity, which requires implementing Replay-Attack protection (RAP). RAP can be achieved using integrity trees or mutually authenticated channels. Integrity trees incur significant performance overheads and are impractical for protecting large memories. Mutually au...
Article
We aim to reduce contention caused by multiple aggressive prefetchers on shared resources (e.g., LLC and memory bandwidth) with a multi-agent reinforcement learning scheme. The agent finds what prefetchers to use and determines how aggressive they should be at any time during the execution. To do so, we utilize a highly scalable action branching ag...
Preprint
High load latency that results from deep cache hierarchies and relatively slow main memory is an important limiter of single-thread performance. Data prefetch helps reduce this latency by fetching data up the hierarchy before it is requested by load instructions. However, data prefetching has been shown to be imperfect in many situations. We propose cac...
Preprint
DL inference queries play an important role in diverse internet services and a large fraction of datacenter cycles are spent on processing DL inference queries. Specifically, the matrix-matrix multiplication (GEMM) operations of fully-connected MLP layers dominate many inference tasks. We find that the GEMM operations for datacenter DL inference ta...
Preprint
Resistive memories have limited lifetime caused by limited write endurance and highly non-uniform write access patterns. Two main techniques to mitigate endurance-related memory failures are 1) wear-leveling, to evenly distribute the writes across the entire memory, and 2) fault tolerance, to correct memory cell failures. However, one of the main o...
Preprint
Modern recommendation systems rely on real-valued embeddings of categorical features. Increasing the dimension of embedding vectors improves model accuracy but comes at a high cost to model size. We introduce a multi-layer embedding training (MLET) architecture that trains embeddings via a sequence of linear layers to derive superior embedding accu...
Preprint
Modern deep learning models have high memory and computation cost. To make them fast and memory-cost efficient, structured model pruning is commonly used. We find that pruning a model using a common training accelerator with large systolic arrays is extremely performance-inefficient. To make a systolic array efficient for pruning and training, we p...
Conference Paper
State-of-the-art convolutional neural networks (CNNs) used in vision applications have large models with numerous weights. Training these models is very compute- and memory-resource intensive. Much research has been done on pruning or compressing these models to reduce the cost of inference, but little work has addressed the costs of training. We f...
Conference Paper
Timing errors are a growing concern for system resilience as technology continues to scale. It is problematic to use low-fidelity errors such as single-bit flips to model realistic timing errors. We address the lack of a holistic methodology and tool for evaluating the resilience of applications against timing errors. The proposed technique is able to ra...
Preprint
Near-data accelerators (NDAs) that are integrated with main memory have the potential for significant power and performance benefits. Fully realizing these benefits requires the large available memory capacity to be shared between the host and the NDAs in a way that permits both regular memory access by some applications and accelerating others wit...
Conference Paper
Future High-Performance Computing (HPC) systems will likely be composed of accelerator-dense heterogeneous computers because accelerators are able to deliver higher performance at lower costs, socket counts and energy consumption. Such accelerator-dense nodes pose a reliability challenge because preserving a large amount of state within accelerator...
Preprint
Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. In particular, convolution layers account for the majority of the execution time of CNN training, and GPUs are commonly used to accelerate these layer workloads. GPU design optimization for efficient CNN training acceleration requires the accur...
Preprint
GPUs offer orders-of-magnitude higher memory bandwidth than traditional CPU-only systems. However, GPU device memory tends to be relatively small and the memory capacity cannot be increased by the user. This paper describes Buddy Compression, a scheme to increase both the effective GPU memory capacity and bandwidth while avoiding the downsides of...
Preprint
Model pruning is a popular mechanism to make a network more efficient for inference. In this paper, we explore the use of pruning to also make the training of such neural networks more efficient. Unlike all prior model pruning methods that sparsify a pre-trained model and then prune it, we train the network from scratch, while gradually and structu...
Preprint
Training convolutional neural networks (CNNs) requires intense computations and high memory bandwidth. We find that bandwidth today is over-provisioned because most memory accesses in CNN training can be eliminated by rearranging computation to better utilize on-chip buffers and avoid traffic resulting from large per-layer memory footprints. We int...
Conference Paper
Strategies to detect, correct, or mitigate the impact of soft errors rely on error-injection experiments. For efficient evaluation, these experiments typically inject errors in software by sampling errors from a candidate distribution. Most often, these strategies randomly select and flip one bit in the output of an instruction. While single-bit f...
Article
Current memory technology has hit a wall trying to scale to meet the increasing demands of modern client and datacenter systems. Data compression is a promising solution to this problem. Several compressed memory systems have been proposed in the past years. Unfortunately, a reasonable methodology to evaluate these systems is missing. In this paper...
Article
We present Lux, a distributed multi-GPU system that achieves fast graph processing by exploiting the aggregate memory bandwidth of multiple GPUs and taking advantage of locality in the memory hierarchy of multi-GPU clusters. Lux provides two execution models that optimize algorithmic efficiency and enable important GPU optimizations, respectively....
Conference Paper
In this paper, we introduce the Do-It-Yourself virtual memory translation (DVMT) architecture as a flexible complement for current hardware-fixed translation flows. DVMT decouples the virtual-to-physical mapping process from the access permissions, giving applications freedom in choosing mapping schemes, while maintaining security within the operat...
Article
In this paper, we introduce the Do-It-Yourself virtual memory translation (DVMT) architecture as a flexible complement for current hardware-fixed translation flows. DVMT decouples the virtual-to-physical mapping process from the access permissions, giving applications freedom in choosing mapping schemes, while maintaining security within the operat...
Conference Paper
Latency-critical applications suffer from both average performance degradation and reduced completion time predictability when collocated with batch tasks. Such variation forces the system to overprovision resources to ensure Quality of Service (QoS) for latency-critical tasks, degrading overall system throughput. We explore the causes of this vari...
Article
As key applications become more data-intensive and the computational throughput of processors increases, the amount of data to be transferred in modern memory subsystems grows. Increasing physical bandwidth to keep up with the demand growth is challenging, however, due to strict area and energy limitations. This paper presents a novel and lightweig...
Article
Increasing transfer rates and decreasing I/O voltage levels make signals more vulnerable to transmission errors. While the data in computer memory are well-protected by modern error checking and correcting (ECC) codes, the clock, control, command, and address (CCCA) signals are weakly protected or even unprotected such that transmission errors leav...
Article
Memory system reliability is a serious concern in many systems today, and is becoming more worrisome as technology scales and system size grows. Stronger fault tolerance capability is therefore desirable, but often comes at high cost. In this paper, we propose a low-cost, fault-aware, hardware-only resilience mechanism, RelaxFault, that repairs the...
Article
Latency-critical applications suffer from both average performance degradation and reduced completion time predictability when collocated with batch tasks. Such variation forces the system to overprovision resources to ensure Quality of Service (QoS) for latency-critical tasks, degrading overall system throughput. We explore the causes of this vari...
Article
Latency-critical applications suffer from both average performance degradation and reduced completion time predictability when collocated with batch tasks. Such variation forces the system to overprovision resources to ensure Quality of Service (QoS) for latency-critical tasks, degrading overall system throughput. We explore the causes of this vari...
Article
Latency-critical applications suffer from both average performance degradation and reduced completion time predictability when collocated with batch tasks. Such variation forces the system to overprovision resources to ensure Quality of Service (QoS) for latency-critical tasks, degrading overall system throughput. We explore the causes of this vari...
Article
Memory hierarchies in modern computing systems work well for workloads that exhibit temporal data locality. Data that is accessed frequently is brought closer to the computing cores, allowing faster access times, higher bandwidth, and reduced transmission energy. Many applications that work on big data, however, read data only once. When running th...
Conference Paper
Adaptive-granularity memory architectures have been considered mainly because of main memory bottleneck and power efficiency. Meanwhile, highly reliable protection schemes are getting popular especially in large computing systems. Unfortunately, conventional ECC mechanisms including Chipkill require a large number of symbols to guarantee strong pro...
Conference Paper
Because main memory is vulnerable to errors and failures, large-scale systems and critical servers utilize error checking and correcting (ECC) mechanisms to meet their reliability requirements. We propose a novel mechanism, Frugal ECC (FECC), that combines ECC with fine-grained compression to provide versatile protection that can be both stronger a...
Article
In this paper, we demonstrate an energy-reduction strategy that overcomes the stochastic switching characteristics of the spin-torque-transfer magnetic-RAM (STT-RAM) write operation and propose a write completion circuit needed for it. In contrast to the traditional worst case approach, which fixes the write duration for all cells, the proposed wri...
Article
Growing computer system sizes and levels of integration have made memory reliability a primary concern, necessitating strong memory error protection. As such, large-scale systems typically employ error checking and correcting codes to trade redundant storage and bandwidth for increased reliability. While stronger memory protection will be needed to...
Article
GPUs employ massive multithreading and fast context switching to provide high throughput and hide memory latency. Multithreading can increase contention for various system resources, however, that may result in suboptimal utilization of shared resources. Previous research has proposed variants of throttling thread-level parallelism to reduce cache...
Article
Memory errors have been a major source of system failures and fault rates may rise even further as memory continues to scale. This increasing fault rate, especially when combined with advent of integrated on-package memories, may exceed the capabilities of traditional fault tolerance mechanisms or significantly increase their overhead. In this pape...
Patent
Atomic memory access requests are handled using a variety of systems and methods. According to one example method, a data-processing circuit having an address-request generator that issues requests to a common memory implements a method of processing the requests using a memory-access intervention circuit coupled between the generator and the commo...
Patent
Methods and apparatus for restoring a meta predictor system upon detecting a branch or binary misprediction, are disclosed. An example apparatus may include a base misprediction history register to store a set of misprediction history values each indicating whether a previous branch prediction taken by a previous branch instruction was predicted co...
Conference Paper
As GPUs' compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarse-grained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache meta-data storage. These coarse-grained memory accesses, however, are a poor match for emerging GPU a...
Conference Paper
In this paper we demonstrate an energy-reduction strategy that relies on the stochastic long-tail nature of the STT-RAM write operation. To move away from the traditional worst-case approach, the per-cell write process is continuously monitored and is terminated as soon as each cell's state matches the written state. Since the average write duratio...
Conference Paper
Current GPUs maintain high programmability by abstracting the SIMD nature of the hardware as independent concurrent threads of control with hardware responsible for generating predicate masks to utilize the SIMD hardware for different flows of control. This dynamic masking leads to poor utilization of SIMD resources when the control of different th...
Article
We present here a report produced by a workshop on ‘Addressing failures in exascale computing’ held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software laye...
Conference Paper
Current graphics processing units (GPUs) utilize the single instruction multiple thread (SIMT) execution model. With SIMT, a group of logical threads executes such that all threads in the group execute a single common instruction on a particular cycle. To enable control flow to diverge within the group of threads, GPUs partially serialize execution...
Article
This paper describes and evaluates a scalable and efficient resilience scheme based on the concept of containment domains. Containment domains are a programming construct that enable applications to express resilience needs and to interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schem...
Conference Paper
This paper describes and evaluates a scalable and efficient resilience scheme based on the concept of containment domains. Containment domains are a programming construct that enable applications to express resilience needs and to interact with the system to tune and specialize error detection, state preservation and restoration, and recovery schem...
Article
A significant portion of the energy dissipated in modern integrated circuits is consumed by the overhead associated with timing guardbands that ensure reliable execution. Timing speculation, where the pipeline operates at an unsafe voltage with any rare errors detected and resolved by the architecture, has been demonstrated to significantly improve...
Article
Chip multiprocessors enable continued performance scaling with increasingly many cores per chip. As the throughput of computation outpaces available memory bandwidth, however, the system bottleneck will shift to main memory. We present a memory system, the dynamic granularity memory system (DGMS), which avoids unnecessary data transfers, saves powe...
Article
Wide SIMD-based GPUs have evolved into a promising platform for running general purpose workloads. Current programmable GPUs allow even code with irregular control to execute well on their SIMD pipelines. To do this, each SIMD lane is considered to execute a logical thread where hardware ensures that control flow is accurate by automatically applyi...
Conference Paper
Diverse IP cores are integrated on a modern system-on-chip and share resources. Off-chip memory bandwidth is often the scarcest resource and requires careful allocation. Two of the most important cores, the CPU and the GPU, can both simultaneously demand high bandwidth. We demonstrate that conventional quality-of-service allocation techniques can s...
Article
Wide SIMD-based GPUs have evolved into a promising platform for running general purpose workloads. Current programmable GPUs allow even code with irregular control to execute well on their SIMD pipelines. To do this, each SIMD lane is considered to execute a logical thread where hardware ensures that control flow is accurate by automatically applyi...
Article
Free-p—fine-grained remapping with error checking and correcting (ECC) and embedded pointers—remaps worn-out nonvolatile RAM (NVRAM) blocks at a fine granularity without requiring large dedicated storage and protects NVRAM against both hard and soft errors. Furthermore, Free-p can be implemented purely in the memory controller, avoiding custom NVRA...
Article
Near-threshold computing exhibits improved energy efficiency compared to nominal super-threshold operation [1, 2]. Two critical bottlenecks prevent mainstream adoption of low-VDD operation: degraded logic delay resulting in significantly lower throughput than at super-threshold, and excessive, unpredictable delay variation caused by increased sensi...
Article
Modern memory systems rely on spatial locality to provide high bandwidth while minimizing memory device power and cost. The trend of increasing the number of cores that share memory, however, decreases apparent spatial locality because access streams from independent threads are interleaved. Memory access scheduling recovers only a fraction of the...
Article
While Moore's law scaling continues to double transistor density every technology generation, new design challenges are introduced. One of these challenges is variation, resulting in deviations in the behavior of transistors, most importantly in switching delays. These exaggerated delays widen the gap between the average and the worst case behavior...
Conference Paper
We propose adaptive granularity to combine the best of fine-grained and coarse-grained memory accesses. We augment virtual memory to allow each page to specify its preferred granularity of access based on spatial locality and error-tolerance tradeoffs. We use sector caches and sub-ranked memory systems to implement adaptive granularity. We also sho...
Article
We propose adaptive granularity to combine the best of fine-grained and coarse-grained memory accesses. We augment virtual memory to allow each page to specify its preferred granularity of access based on spatial locality and error-tolerance tradeoffs. We use sector caches and sub-ranked memory systems to implement adaptive granularity. We also sho...
Article
This report describes diverse error detection mechanisms that can be utilized within a resilient system to protect applications against various types of errors and faults, both hard and soft. These detection mechanisms have different overhead costs in terms of energy, performance, and area, and also differ in their error coverage, complexity, and p...
Article
Networks-on-chip (NoCs) are used in a growing number of SoCs and multi-core processors. Because messages compete for the NoC’s shared resources, quality of service and resource allocation are major concerns for system designers. In particular, a model for the properties of packet delivery through the network is desirable. We present a methodology f...
Article
This paper presents a cache-aware Bloom Filter algorithm with improved cache behavior and a lower false positive rate compared to prior work. The algorithm relies on the power-of-two choice principle to provide a better distribution of set elements in a Blocked Bloom Filter. Instead of choosing a single block, we insert new elements into the least...
Conference Paper
Emerging non-volatile memories such as phase-change RAM (PCRAM) offer significant advantages but suffer from write endurance problems. However, prior solutions are oblivious to soft errors (recently raised as a potential issue even for PCRAM) and are incompatible with high-level fault tolerance techniques such as chipkill. To additionally address s...
Article
Virtualized error checking and correcting (ECC) is a scheme that virtualizes memory-error correction. Unlike traditional uniform ECC, which provides a fixed level of error tolerance, virtualized ECC enables flexible memory protection by mapping redundant information needed for correcting errors onto the memory namespace. Additionally, virtualized E...
Conference Paper
We propose adaptive granularity to combine the best of fine-grained and coarse-grained memory accesses. We augment virtual memory to allow each page to specify its preferred granularity of access based on spatial locality and error-tolerance tradeoffs. We use sector caches and sub-ranked memory systems to implement adaptive granularity. We also sho...
Conference Paper
Scaling process technology necessitates the introduction of wide design-time guard bands that ensure lifetime reliability as circuits wear out over time. In this paper, we show how to utilize this knowledge of the guard band and a predictive model to absolutely improve processor power consumption and lifetime without impacting the processor perform...
Conference Paper
We present a general scheme for virtualizing main memory error-correction mechanisms, which map redundant information needed to correct errors into the memory namespace itself. We rely on this basic idea, which increases flexibility to increase error protection capabilities, improve power efficiency, and reduce system cost; with only small performa...
Article
While Moore's law scaling continues to double transistor density every technology generation, supply voltage reduction has essentially stopped, increasing both power density and total energy consumed in conventional microprocessors. Therefore, future processors will require an architecture that can: a) take advantage of the massive amount of transi...
Conference Paper
We present ECC FIFO, a mechanism enabling two-tiered last-level cache error protection using an arbitrarily strong tier-2 code without increasing on-chip storage. Instead of adding redundant ECC information to each cache line, our ECC FIFO mechanism off-loads the extra information to off-chip DRAM. We augment each cache line with a tier-1 code, wh...
Article
Networks on chip must deliver high bandwidth at low latencies while keeping within a tight power envelope. Using express virtual channels (EVCs) for flow control improves energy-delay throughput by letting packets bypass intermediate routers, but EVCs have key limitations. Nochi (NoC with hybrid interconnect) overcomes these limitations by transporting da...
Chapter
Stream processors, like other multi-core architectures, partition their functional units and storage into multiple processing elements. In contrast to typical architectures, which contain symmetric general-purpose cores and a cache hierarchy, stream processors have a significantly leaner design. Stream processors are specifically designed for the st...
Article
As processor core counts increase, networks-on-chip (NoCs) are becoming an increasingly popular interconnection fabric due to their ability to supply high bandwidth. However, NoCs need to deliver this high bandwidth at low latencies, while keeping within a tight power envelope. In this paper, we present a novel NoC with hybrid interconnect that lev...
Article
This paper presents a novel technique, Memory Mapped ECC, which reduces the cost of providing error correction for SRAM caches. It is important to limit such overheads as processor resources become constrained and error propensity increases. The continuing decrease in SRAM cell size and the growing capacity of caches increases the likelihood of err...
Conference Paper
This paper presents a novel technique, Memory Mapped ECC, which reduces the cost of providing error correction for SRAM caches. It is important to limit such overheads as processor resources become constrained and error propensity increases. The continuing decrease in SRAM cell size and the growing capacity of caches increases the likelihood of...
Conference Paper
Networks-on-chip (NoCs) are used in a growing number of SoCs and multi-core processors, increasing the need for accurate and efficient modeling to aid the design of these highly-integrated systems. Towards this modeling goal, we present a methodology for packet-level static timing analysis in NoCs. Our methodology enables quick and accurate gauging...
Conference Paper
As processor core counts increase, networks-on-chip (NoCs) are becoming an increasingly popular interconnection fabric due to their ability to supply high bandwidth. However, NoCs need to deliver this high bandwidth at low latencies, while keeping within a tight power envelope. In this paper, we present a novel NoC with hybrid interconnect that lev...
Conference Paper
There has recently been much interest in stream processing, both in industry (e.g., Cell, NVIDIA G80, ATI R580) and academia (e.g., Stanford Merrimac, MIT RAW), with stream programs becoming increasingly popular for both media and more general-purpose computing. Although a special style of programming called stream programming is needed to target t...
Conference Paper
This paper explores the scalability of the Stream Processor architecture along the instruction-, data-, and thread-level parallelism dimensions. We develop detailed VLSI-cost and processor-performance models for a multi-threaded Stream Processor and evaluate the tradeoffs, in both functionality and hardware costs, of mechanisms that exploit th...
Conference Paper
The recent emergence of compute-intensive stream processors such as the Cell Broadband Engine, Stanford's Merrimac, and ClearSpeed's CSX600 has made them attractive platforms for scientific high-performance computing. Unstructured mesh and graph applications are an important class of numerical algorithms used in the scientific computing domain,...
Conference Paper
We present a compiler for machines with an explicitly managed memory hierarchy and suggest that a primary role of any compiler for such architectures is to manipulate and schedule a hierarchy of bulk operations at varying scales of the application and of the machine. We evaluate the performance of our compiler using several benchmarks running on...
Conference Paper
We present Sequoia, a programming language designed to facilitate the development of memory hierarchy aware parallel programs that remain portable across modern machines featuring different memory hierarchy configurations. Sequoia abstractly exposes hierarchical memory in the programming model and provides language mechanisms to describe communicat...
Article
Data-parallel memory systems must maintain a large number of outstanding memory references to fully use increasing DRAM bandwidth in the presence of rising latencies. Additionally, throughput is increasingly sensitive to the reference patterns due to the rising latency of issuing DRAM commands, switching between reads and writes, and prechargin...