Krste Asanovic's research while affiliated with University of California and other places

Publications (248)

Preprint
Hardware enclaves rely on a disjoint memory model, which maps each physical address to an enclave to achieve strong memory isolation. However, this severely limits the performance and programmability of enclave programs. While some prior work proposes enclave memory sharing, it does not provide a formal model or verification of its designs. This...
Article
This work presents a RISC-V system-on-chip (SoC) with eight application cores containing programmable-precision vector accelerators. The SoC is built by using a generator-based design methodology, which enables the integration of open-source and project-specific building blocks to develop differentiated functionality. The digital component generato...
Article
Given the complexity of modern systems-on-chip (SoCs), hardware-assisted verification is an integral part of the chip-design process. However, chip designers often need to choose between richly featured but expensive emulation platforms or faster, cheaper, but less debuggable FPGA prototyping solutions. FireSim, an open-source, FPGA-accelerated har...
Article
This article presents a framework, Genesis (GENomE analySIS), to efficiently and flexibly accelerate generic data manipulation operations that have become performance bottlenecks in the genomic data processing pipeline utilizing FPGAs-as-a-service. Genesis conceptualizes genomic data as a very large relational database and uses extended SQL as a do...
Article
This work demonstrates a dual-core RISC-V system-on-chip (SoC) with integrated fine-grain power management. The 28-nm fully depleted silicon-on-insulator (FD-SOI) SoC integrates switched-capacitor voltage converters and 4-Gb/s off-chip serial links. The SoC runs applications with operating system support on dual RISC-V Rocket cores with vector acce...
Article
This letter presents a RISC-V System-on-Chip (SoC) with fully integrated switched-capacitor DC–DC converters, adaptive clock generators, mixed-precision floating-point vector accelerators, a 5-Gb/s serial memory interface, and an integrated power management unit (PMU) manufactured in 28-nm FD-SOI. The vector accelerator improves performance and ene...
Article
Trusted execution environments (TEEs) are a growing part of the security ecosystem. Unfortunately, widely available TEEs are hampered by closed designs and a lack of flexibility. We outline the challenges to TEEs, advocate for extensible and portable open TEEs, and detail current efforts.
Preprint
We explore applying the Monte Carlo Tree Search (MCTS) algorithm in a notoriously difficult task: tuning programs for high-performance deep learning and image processing. We build our framework on top of Halide and show that MCTS can outperform the state-of-the-art beam-search algorithm. Unlike beam search, which is guided by greedy intermediate pe...
Article
Continued improvement in computing efficiency requires functional specialization of hardware designs. Agile hardware design methodologies have been proposed to alleviate the increased design costs of custom silicon architectures, but their practice thus far has been accompanied by challenges in integration and validation of complex systems-on-a-c...
Chapter
Improving the energy efficiency of processor systems-on-chip (SoCs) is key to improving their performance and utility. The FD-SOI silicon process enables integrated systems that can deliver dramatic improvements in energy efficiency through system integration. This chapter presents the Raven-3 and Raven-4 testchips, fully integrated and fully featu...
Preprint
The performance of the code a compiler generates depends on the order in which it applies the optimization passes. Choosing a good order, often referred to as the phase-ordering problem, is NP-hard. As a result, existing solutions rely on a variety of heuristics. In this paper, we evaluate a new technique to address the phase-ordering pr...
Preprint
Advances in deep learning and neural networks have resulted in the rapid development of hardware accelerators that support them. A large majority of ASIC accelerators, however, target a single hardware design point to accelerate the main computational kernels of deep neural networks such as convolutions or matrix multiplication. On the other hand,...
Conference Paper
This paper presents a novel runtime power modeling methodology which automatically identifies key signals for power dissipation of any RTL design. The toggle-pattern matrix is constructed with the VCD dumps from a training set, where each signal is represented as a high-dimensional point. By clustering signals showing similar switching activities,...
Preprint
One of the key challenges arising when compilers vectorize loops for today's SIMD-compatible architectures is whether to vectorize and/or interleave. Then, the compiler has to determine how many instructions to pack together and how many loop iterations to interleave. Compilers are designed today to use fixed-cost models that rely on heuristics to...
Article
Deep Learning is arguably the most rapidly evolving research area in recent years. As a result, it is not surprising that the design of state-of-the-art deep neural net models often proceeds without much consideration of the latest hardware targets, and the design of neural net accelerators proceeds without much consideration of the characteristics...
Preprint
The recent advancements in deep reinforcement learning have opened new horizons and opportunities to tackle various problems in system optimization. Such problems are generally tailored to delayed, aggregated, and sequential rewards, which is an inherent behavior in the reinforcement learning setting, where an agent collects rewards while exploring...
Preprint
Trusted execution environments (TEEs) are becoming a requirement across a wide range of platforms, from embedded sensors to cloud servers, which encompass a wide range of cost and power constraints as well as security threat models. Unfortunately, each of the current vendor-specific TEEs makes a fixed choice in each of the design dimensions of depl...
Conference Paper
We describe our experience developing and promoting a set of open-source tools and IP over the last 9 years, including the Chisel hardware construction language, the Rocket Chip SoC generator, and the BAG analog layout generator.
Article
In this article, we present FireSim, an open-source simulation platform that enables fast cycle-exact microarchitectural simulation of large scale-out clusters by combining FPGA-accelerated simulation of silicon-proven RTL designs with scalable, distributed network simulation, running on a public-cloud host platform. By introducing automation and h...
Article
Many workloads are written in garbage-collected languages and GC consumes a significant fraction of resources for these workloads. We propose to decrease this overhead by moving GC into a small hardware accelerator that is located close to the memory controller and performs GC more efficiently than a CPU. We first show a general design of such a GC...
Article
Enclaves have emerged as a particularly compelling primitive to implement trusted execution environments: strongly isolated sensitive user-mode processes in a largely untrusted software environment. While the threat models employed by various enclave systems differ, the high-level guarantees they offer are essentially the same: attestation of an en...
Conference Paper
Recent work in FPGA-accelerated simulation of ASICs has shown that much of a simulator can be automatically generated from ASIC RTL. Alas, these works rely on simple models of the outer cache hierarchy and DRAM, as mapping ASIC RTL for these components into an FPGA fabric is too complex and resource intensive. To improve FPGA simulation model accur...
Article
Architecture-level assist techniques enable low-voltage operation by tolerating errors in SRAM-based caches. A line recycling (LR) technique is proposed to reuse faulty cache lines that fail at low voltages to correct errors with only 0.77% level-2 (L2) area overhead. LR can either save 33% of cache capacity loss from line disable or allow further...
Article
BROOM is a resilient, wide-voltage-range implementation of an open-source out-of-order (OoO) RISC-V processor implemented in an ASIC flow. A 28 nm test-chip contains a BOOM OoO core and a 1-MiB level-2 (L2) cache, enhanced with architectural error tolerance for low-voltage operation. It was implemented by using an agile design methodology, where th...
Preprint
The performance of the code generated by a compiler depends on the order in which the optimization passes are applied. In the context of high-level synthesis, the quality of the generated circuit relates directly to the code generated by the front-end compiler. Unfortunately, choosing a good order--often referred to as the phase-ordering problem--i...
Preprint
Enclaves have emerged as a particularly compelling primitive to implement trusted execution environments: strongly isolated sensitive user-mode processes in a largely untrusted software environment. While the threat models employed by various enclave systems differ, the high-level guarantees they offer are essentially the same: attestation of an en...
Conference Paper
Deep Learning is arguably the most rapidly evolving research area in recent years. As a result, it is not surprising that the design of state-of-the-art deep neural net models proceeds without much consideration of the latest hardware targets, and the design of neural net accelerators proceeds without much consideration of the characteristics of the...
Preprint
Deep Learning is arguably the most rapidly evolving research area in recent years. As a result, it is not surprising that the design of state-of-the-art deep neural net models proceeds without much consideration of the latest hardware targets, and the design of neural net accelerators proceeds without much consideration of the characteristics of the...
Article
Reducing the operating voltage of digital systems improves energy efficiency, and the minimum operating voltage of a system (Vmin) is commonly limited by SRAM bitcells. Common techniques to lower SRAM Vmin focus on using circuit-level peri...
Article
This chapter studies the problem of traversing large graphs using the breadth-first search order on distributed-memory supercomputers. We consider both the traditional level-synchronous top-down algorithm as well as the recently discovered direction optimizing algorithm. We analyze the performance and scalability trade-offs in using different local...
Conference Paper
The public cloud is moving to a Platform-as-a-Service model where services such as data management, machine learning or image classification are provided by the cloud operator while applications are written in high-level languages and leverage these services. Managed languages such as Java, Python or Scala are widely used in this setting. However,...
Article
This paper presents a RISC-V system-on-chip (SoC) with integrated voltage regulation, adaptive clocking, and power management implemented in a 28 nm fully depleted silicon-on-insulator process. A fully integrated simultaneous-switching switched-capacitor DC-DC converter supplies an application core using a clock from a free-running adaptive clock g...
Conference Paper
In this work, we provide an overview of the technology and architecture of a microprocessor chip with optical I/O. Zero-change photonics integration enabled the chip to be fabricated in a commercial electronics CMOS foundry.
Article
This report makes the case that a well-designed Reduced Instruction Set Computer (RISC) can match, and even exceed, the performance and code density of existing commercial Complex Instruction Set Computers (CISC) while maintaining the simplicity and cost-effectiveness that underpins the original RISC goals. We begin by comparing the dynamic instruc...
Conference Paper
High-performance embedded processors are frequently designed as arrays of small, in-order scalar cores, even when their workloads exhibit high degrees of data-level parallelism (DLP). We show that these multiple instruction, multiple data (MIMD) systems can be made more efficient by instead directly exploiting DLP using a modern vector architecture...
Article
This paper presents a sample-based energy simulation methodology that enables fast and accurate estimations of performance and average power for arbitrary RTL designs. Our approach uses an FPGA to simultaneously simulate the performance of an RTL design and to collect samples containing exact RTL state snapshots. Each snapshot is then replayed in g...
Conference Paper
Many distributed workloads in today's data centers are written in managed languages such as Java or Ruby. Examples include big data frameworks such as Hadoop, data stores such as Cassandra or applications such as the SOLR search engine. These workloads typically run across many independent language runtime systems on different nodes. This setup rep...
Article
Many distributed workloads in today's data centers are written in managed languages such as Java or Ruby. Examples include big data frameworks such as Hadoop, data stores such as Cassandra or applications such as the SOLR search engine. These workloads typically run across many independent language runtime systems on different nodes. This setup rep...
Article
The final phase of CMOS technology scaling provides continued increases in already vast transistor counts, but only minimal improvements in energy efficiency, thus requiring innovation in circuits and architectures. However, even huge teams are struggling to complete large, complex designs on schedule using traditional rigid development flows. This...
Article
This work demonstrates a RISC-V vector microprocessor implemented in 28 nm FDSOI with fully integrated simultaneous-switching switched-capacitor DC–DC (SC DC–DC) converters and adaptive clocking that generates four on-chip voltages between 0.45 and 1 V using only 1.0 V core and 1.8 V IO voltage inputs. The converters achieve high efficiency at the...
Article
Data transport across short electrical wires is limited by both bandwidth and power density, which creates a performance bottleneck for semiconductor microchips in modern computer systems, from mobile phones to large-scale data centres. These limitations can be overcome by using optical communications based on chip-scale electronic-photonic system...
Conference Paper
As new applications for graph algorithms emerge, there has been a great deal of research interest in improving graph processing. However, it is often difficult to understand how these new contributions improve performance. Execution time, the most commonly reported metric, distinguishes which alternative is the fastest but does not give any insight...
Conference Paper
Designing hardware specialized for a target application domain dramatically improves energy efficiency. However, the design of specialized hardware is hampered by high cost, dominated by the effort needed to design, validate, and verify the custom integrated circuit. An agile design approach, discussed here, relies on hardware generators coupled wi...
Article
We present a graph processing benchmark suite targeting shared memory platforms. The goal of this benchmark is to help standardize graph processing evaluations, making it easier to compare different research efforts and quantify improvements. The benchmark not only specifies kernels, input graphs, and evaluation methodologies, but it also provides...
Conference Paper
This article consists of a collection of slides from the authors' conference presentation. The topics discussed included: Motivation/Raven Project Goals; On-Chip Switched Capacitor DC-DC Converters; Raven3 Chip Architecture; Raven3 Implementation; Raven3 Evaluation; and RISC-V Chip Building at UC Berkeley.
Conference Paper
This work demonstrates a RISC-V vector microprocessor implemented in 28nm FDSOI with fully-integrated non-interleaved switched-capacitor DCDC (SC-DCDC) converters and adaptive clocking that generates four on-chip voltages between 0.5V and 1V using only 1.0V core and 1.8V IO voltage inputs. The design pushes the capabilities of dynamic voltage scali...
Article
Motivated by rapid software and hardware innovation in warehouse-scale computing (WSC), we visit the problem of warehouse-scale network design evaluation. A WSC is composed of about 30 arrays or clusters, each of which contains about 3000 servers, leading to a total of about 100,000 servers per WSC. We found many prior experiments have been conduct...