Andrew A. Chien

  • Doctor of Engineering
  • University of Chicago

About

334 Publications
18,283 Reads
10,389 Citations
Introduction
Current institution
University of Chicago

Publications (334)
Article
With zettabytes of data generated daily and growing exponentially, our study finds that the power needed for data reliability and management is an increasing fraction of storage system power. Thus storage management is an important contributor to data center (DC) energy use and carbon footprint. We study the University of Chicago's high energy physics st...
Article
Generative AI models, exemplified by ChatGPT, DALL-E 2, and Stable Diffusion, are exciting new applications consuming growing quantities of computing. We study the compute, energy, and carbon impacts of generative AI inference. Using ChatGPT as an exemplar, we create a workload model and compare request direction approaches (Local, Balance, CarbonMin), as...
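To make the request-direction idea concrete, here is a minimal sketch of what a CarbonMin-style director might look like: each request goes to the region with the lowest current carbon intensity that still has spare capacity. The region names, capacities, and intensities are illustrative assumptions, not values or code from the paper.

```python
# Illustrative sketch of a CarbonMin-style request director: send each request
# to the region with the lowest current carbon intensity that still has spare
# capacity. Region names, capacities, and intensities are made-up assumptions.
from dataclasses import dataclass

@dataclass
class Region:
    name: str
    carbon_gco2_per_kwh: float   # current grid carbon intensity
    capacity: int                # max concurrent requests
    in_flight: int = 0           # requests currently being served

def route_request(regions: list[Region]) -> Region:
    """Pick the lowest-carbon region that still has spare capacity."""
    candidates = [r for r in regions if r.in_flight < r.capacity]
    if not candidates:
        raise RuntimeError("all regions saturated")
    best = min(candidates, key=lambda r: r.carbon_gco2_per_kwh)
    best.in_flight += 1
    return best

if __name__ == "__main__":
    regions = [
        Region("us-midwest", carbon_gco2_per_kwh=450.0, capacity=100),
        Region("us-west", carbon_gco2_per_kwh=120.0, capacity=80),
        Region("eu-north", carbon_gco2_per_kwh=40.0, capacity=60),
    ]
    print(route_request(regions).name)  # -> "eu-north" while it has capacity
```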
Article
The serverless (or Function-as-a-Service) compute model enables new applications with dynamic scaling. However, all current serverless systems are best-effort, and as we prove, this means they cannot guarantee hard real-time deadlines, rendering them unsuitable for real-time applications. We analyze a proposed extension of the serverless model that...
Article
In April 2022, the California Independent System Operator (CAISO) power grid reached momentary peaks of 100% renewable energy for the first time. A year later, momentary 100% supply from renewables is no longer news, and CAISO reported record wind and solar renewable curtailment of 606 GWh (March 2023) and 686 GWh (April 2023). In addition, pea...
Preprint
Facing growing concerns about power consumption and carbon emissions, cloud providers are adapting datacenter loads to reduce carbon emissions. With datacenters exceeding 100MW, they can affect grid dynamics, so doing this without damaging grid performance is difficult. We study power adaptation algorithms that use grid metrics and seek to reduce d...
Chapter
Datacenter scheduling research often treats resources as a constant quantity, but increasingly external factors shape capacity dynamically and beyond the control of an operator. Based on emerging examples, we define a new, open research challenge: the variable capacity resource scheduling problem. The objective here is effective resource utilizat...
Article
Data centers owned and operated by large companies have a high power consumption that is expected to increase in the future. However, the ability to shift computing loads geographically and in time can provide flexibility to the power grid. We introduce the concept of virtual links to capture space-time load flexibility provided by geographically-d...
Conference Paper
Today's serverless provides "function-as-a-service" with dynamic scaling and fine-grained resource charging, enabling new cloud applications. Serverless functions are invoked as a best-effort service. We propose an extension to serverless, called real-time serverless that provides an invocation rate guarantee, a service-level objective (SLO) specif...
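As a rough illustration of an invocation-rate guarantee, the sketch below uses a token bucket to decide whether an invocation falls within a declared rate SLO; the guaranteed/best-effort split and all parameters are assumptions for illustration, not the paper's design.

```python
# Minimal sketch of an invocation-rate SLO check using a token bucket:
# invocations within the declared rate/burst are admitted as guaranteed,
# the rest fall back to best-effort. Names and the two-tier split are
# illustrative assumptions, not the real-time serverless design itself.
import time

class RateSLO:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.burst = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def admit(self) -> str:
        """Return 'guaranteed' if within the declared rate, else 'best-effort'."""
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return "guaranteed"
        return "best-effort"

slo = RateSLO(rate_per_sec=100.0, burst=20)
print(slo.admit())
```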
Preprint
Data centers owned and operated by large companies have a high power consumption and this is expected to increase in the future. However, the ability to shift computing loads geographically and in time can provide flexibility to the power grid. We introduce the concept of virtual links to capture space-time load flexibility provided by geographical...
Article
The data science revolution and growing popularity of data lakes make efficient processing of raw data increasingly important. To address this, we propose the ACCelerated Operators for Raw Data Analysis (ACCORDA) architecture. By extending the operator interface (subtype with encoding) and employing a uniform runtime worker model, ACCORDA integrate...
Conference Paper
Exascale systems face high error rates due to increasing scale (10⁹ cores), software complexity, and rising memory error rates. Increasingly, errors escape immediate hardware-level detection, silently corrupting application states. Such latent errors can often be detected by application-level tests, but typically at long latencies. We propose a new a...
Chapter
This paper presents a model and performance study for Algorithm-Based Focused Recovery (ABFR) applied to N-body computations, subject to latent errors. We make a detailed comparison with the classical Checkpoint/Restart (CR) approach. While the model applies to general frameworks, the performance study is limited to perfect binary trees, due to the...
Article
Power consumption and associated carbon emissions are increasingly critical challenges for large-scale computing. Recent research proposes exploiting stranded power - uneconomic renewable power - for green supercomputing in a system called Zero-Carbon Cloud (ZCCloud) [1], [2], [3]. These efforts studied production supercomputing workloads on strand...
Article
Flash storage has become the mainstream destination for storage users. However, SSDs do not always deliver the performance that users expect. The core culprit of flash performance instability is the well-known garbage collection (GC) process, which causes long delays as the SSD cannot serve (blocks) incoming I/Os, which then induces the long tail l...
Conference Paper
Big data analytic applications give rise to large-scale extract-transform-load (ETL) as a fundamental step to transform new data into a native representation. ETL workloads pose significant performance challenges on conventional architectures, so we propose the design of the unstructured data processor (UDP), a software programmable accelerator tha...
Conference Paper
Full-text available
MittOS provides operating system support to cut millisecond-level tail latencies for data-parallel applications. In MittOS, we advocate a new principle that the operating system should quickly reject IOs that cannot be promptly served. To achieve this, MittOS exposes a fast rejecting SLO-aware interface wherein applications can provide their SLOs (e.g....
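A hedged sketch of the fast-reject principle, assuming a simple queue-depth latency estimate: if the predicted wait would exceed the caller's SLO, the IO is rejected immediately so the caller can fail over to a replica. This mimics the idea only; it is not the MittOS kernel interface.

```python
# Sketch of SLO-aware fast rejection: estimate the wait from current queue
# depth and reject immediately if it would exceed the caller's latency SLO,
# letting the caller retry on another replica instead of waiting. The device
# model and return convention are assumptions, not the MittOS API.
from collections import deque

class SLOAwareDevice:
    def __init__(self, per_io_ms: float):
        self.per_io_ms = per_io_ms
        self.queue = deque()

    def submit_read(self, block: int, slo_ms: float):
        predicted_wait = len(self.queue) * self.per_io_ms
        if predicted_wait > slo_ms:
            return None  # fast reject: caller should fail over to another replica
        self.queue.append(block)
        return predicted_wait + self.per_io_ms  # predicted completion latency

dev = SLOAwareDevice(per_io_ms=2.0)
for b in range(10):
    dev.queue.append(b)                    # simulate a busy device
print(dev.submit_read(42, slo_ms=5.0))     # -> None (rejected; ~20 ms predicted wait)
```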
Conference Paper
Traditional cloud stacks are designed to tolerate random, small-scale failures, and can successfully deliver highly-available cloud services and interactive services to end users. However, they fail to survive large-scale disruptions caused by major power outages, cyber-attacks, or region/zone failures. Such disruptions trigger cascading failure...
Article
Full-text available
As power grids incorporate increased renewable generation such as wind and solar, their variability creates growing challenges for grid stability and efficiency. We study two facets: power the grid is unable to accept (curtailment), and power that is assigned zero economic value by the grid (negative or zero price). Collectively we term these "stra...
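A toy sketch of the stranded-power accounting described here, assuming hourly series of curtailed energy, market price, and delivered generation; all numbers are fabricated for illustration and are not from the study.

```python
# Rough sketch of "stranded power" accounting: energy the grid curtails, plus
# energy generated during hours when the market price is zero or negative.
# The hourly series below are fabricated toy values.
def stranded_energy_mwh(curtailed_mwh, prices_per_mwh, generated_mwh):
    """Sum curtailed energy and energy delivered during zero/negative-price hours."""
    total = 0.0
    for curtailed, price, generated in zip(curtailed_mwh, prices_per_mwh, generated_mwh):
        total += curtailed                 # power the grid could not accept
        if price <= 0.0:
            total += generated             # power valued at zero or less
    return total

print(stranded_energy_mwh(
    curtailed_mwh=[0.0, 120.0, 300.0],
    prices_per_mwh=[25.0, -4.0, 0.0],
    generated_mwh=[900.0, 950.0, 880.0],
))  # -> 2250.0
```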
Preprint
As power grids incorporate increased renewable generation such as wind and solar, their variability creates growing challenges for grid stability and efficiency. We study two facets: power the grid is unable to accept (curtailment), and power that is assigned zero economic value by the grid (negative or zero price). Collectively we term these "stra...
Technical Report
Full-text available
This report summarizes runtime system challenges for exascale computing that follow from the fundamental challenges for exascale systems that have been well studied in past reports, e.g., [6, 33, 34, 32, 24]. Some of the key exascale challenges that pertain to runtime systems include parallelism, energy efficiency, memory hierarchies, data movemen...
Article
Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and...
Conference Paper
Classical models for distributed systems focus on fine-grained resources and failures (e.g. single processes and fail-stop or byzantine). Cheap hardware, the rise of the worldwide web, and now the rise of cloud computing have transformed distributed systems practice. Instead of single resources, application sites have tens of thousands of virtual machin...
Conference Paper
Low-cost image and depth sensors (RGBD) promise a wealth of new applications as mobile computing devices become aware of the 3D structure of their environs. However, while sensors are now cheap and readily available, the computational demands for even basic 3D services such as model-building and tracking are significant. We assess these requirement...
Article
Power consumption (supply, heat, cost) and associated carbon emissions (environmental impact) are increasingly critical challenges in scaling supercomputing to Exascale and beyond. We propose to exploit stranded power, renewable energy that has no value to the power grid, for scaling supercomputers in a system called Zero-Carbon Cloud (ZCCloud), showing that st...
Conference Paper
Resilience has become a major concern in high-performance computing (HPC) systems. Addressing the increasing risk of latent errors (or silent data corruption) is one of the biggest challenges. Multi-version checkpointing, which keeps multiple versions of the application state, has been proposed as a solution and has been implemented in Global V...
Preprint
We analyze how both traditional data center integration and dispatchable load integration affect power grid efficiency. We use detailed network models, parallel optimization solvers, and thousands of renewable generation scenarios to perform our analysis. Our analysis reveals that significant spillage and stranded power will be observed in power gr...
Conference Paper
An early step in measuring jitter in communication signals is locating the transitions, the points in time when the waveform changes between logic levels. Transition localization can be the most time-consuming step in jitter measurement because it is the last step where every sample must be processed. We transform the localization FSM (finite state...
Article
We analyze how both traditional data center integration and dispatchable load integration affect grid dynamics and efficiency. Our analysis uses parallel optimization solvers, thousands of renewable generation scenarios, and thousands of simulation runs. These enable us to perform sophisticated analysis, including rigorous grid performance modelin...
Article
Customized architecture is widely recognized as an important approach for improved performance and energy efficiency. To balance generality and customization benefit, researchers have proposed to federate heterogeneous micro-engines. Using the 10x10 architecture and an integrated image and vision benchmark as a case study, we explore the performance...
Conference Paper
Full-text available
We propose the Unified Automata Processor (UAP), a new architecture that provides general and efficient support for finite automata (FA). The UAP supports a wide range of existing finite automata models (DFAs, NFAs, A-DFAs, JFAs, counting-DFAs, and counting-NFAs), and future novel FA models. Evaluation on realistic workloads shows that UAP implemen...
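For readers unfamiliar with the automata models listed, the toy below evaluates the simplest of them, a table-driven DFA; it illustrates the abstraction only and has no relation to the UAP hardware design.

```python
# Toy table-driven DFA evaluator, only to illustrate the DFA model named
# above; it bears no relation to the UAP architecture. The example machine
# accepts strings containing the substring "ab".
def run_dfa(transitions, start, accepting, text):
    state = start
    for ch in text:
        state = transitions.get((state, ch), start)   # missing edge: restart
        if state in accepting:
            return True
    return state in accepting

transitions = {
    (0, "a"): 1,
    (1, "a"): 1,
    (1, "b"): 2,
}
print(run_dfa(transitions, start=0, accepting={2}, text="xxaab"))  # -> True
```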
Article
Full-text available
Variability is an ongoing challenge to growth of large-scale renewable power generation, posing challenges for the power grid and ambitious renewable portfolio standards. The authors propose Zero-Carbon Cloud (ZCCloud), a new high-value, dispatchable demand for renewables that improves their economic viability. Initial studies show that ZCCloud can...
Conference Paper
We explore the post-recovery efficiency of shrinking and non-shrinking recovery schemes on high performance computing systems using a synthetic benchmark. We study the impact of network topology on post-recovery communication performance. Our experiments on the IBM BG/Q System Mira show that shrinking recovery can deliver up to 7.5 % better efficie...
Conference Paper
Resilience is a growing concern for large-scale simulations. As failures become more frequent, alternatives to global checkpointing that limit the extent of needed recovery become more desirable. Additionally, platforms will differ in both error rates and types; therefore, a flexible and customizable recovery strategy will be extremely helpful to t...
Article
In exascale systems, increasing error rate - particularly silent data corruption - is a major concern. The Global View Resilience (GVR) system builds a new model of application resilience on versioned global arrays. These arrays can be exploited for flexible, application-specific error checking and recovery. We explore a fundamental challenge to the...
Conference Paper
Graph processing is important for a growing range of applications. Current performance studies of parallel graph computation employ a large variety of algorithms and graphs. To explore their robustness, we characterize behavior variation across algorithms and graph structures at different scales. Our results show that graph computation behaviors, w...
Conference Paper
We consider the use of non-volatile memories in the form of burst buffers for resilience in supercomputers. Their cost and limited lifetime demand effective use and appropriate provisioning. We develop an analytic model for the behavior of workloads on systems with burst buffers, and use it to explore questions of cost-effective provisioning, and m...
Conference Paper
Energy is a critical challenge in computing performance. Due to "word size creep", modern CPUs are inefficient for short-data-element processing. We propose and evaluate a new microarchitecture called "Bit-Nibble-Byte" (BnB). We describe our design, which includes both long fixed-point vectors as well as novel variable-length instructions. To...
Article
Exascale studies project reliability challenges for future high-performance computing (HPC) systems. We propose the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We describe GVR's interfaces to distributed arrays, versionin...
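A minimal sketch of the versioned-array concept, assuming an illustrative in-memory wrapper rather than the actual GVR interface: the application names versions at points it trusts and rolls back when an application-level check detects corruption.

```python
# Illustrative sketch of application-controlled versioning: write into an
# array, snapshot named versions at trusted points, and roll back to a
# trusted version when a latent error is detected. This mimics the concept
# only; it is not the GVR library API.
import numpy as np

class VersionedArray:
    def __init__(self, size: int):
        self.data = np.zeros(size)
        self.versions: dict[str, np.ndarray] = {}

    def version(self, name: str):
        """Snapshot the current state under an application-chosen name."""
        self.versions[name] = self.data.copy()

    def restore(self, name: str):
        """Roll back to a previously named (trusted) version."""
        self.data = self.versions[name].copy()

arr = VersionedArray(4)
arr.data[:] = [1.0, 2.0, 3.0, 4.0]
arr.version("after_step_10")
arr.data[2] = 1e300            # pretend a silent corruption happened here
arr.restore("after_step_10")   # application-level check caught it; roll back
print(arr.data)                # -> [1. 2. 3. 4.]
```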
Article
Accommodating large tally data can be a challenging problem for Monte Carlo neutron transport simulations. Current approaches include either simple data replication, or are based on application-controlled decomposition such as domain partitioning or client/server models, which are limited by either memory cost or performance loss. We propose and an...
Article
Accelerators have long been used to improve the performance and energy efficiency of embedded signal processing systems relying on Fast Fourier Transforms (FFTs). We explore the benefits of processor-integrated FFT accelerators, characterizing their performance and energy efficiency for current and future memory architectures. First, we consider de...
Conference Paper
Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answe...
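To make "single bit flip" concrete, the helper below flips one bit of a float64's IEEE-754 representation; it is a generic fault-injection illustration, not the experimental harness used in this work.

```python
# Flip one bit of a double's IEEE-754 representation to emulate a silent
# data corruption in a solver value. Generic illustration only.
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip bit `bit` (0..63) of the IEEE-754 double representation of x."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]

print(flip_bit(1.0, 52))   # flip the lowest exponent bit -> 0.5
print(flip_bit(1.0, 0))    # flip the lowest mantissa bit -> 1.0000000000000002
```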
Article
Moore's law has accurately predicted roughly biennial doubling of component capacity at minimal cost for almost 50 years. Recent flash memory scaling exhibits increased density, but reduced write and read lifetimes effectively constitute an ending of Moore's law. However, new resilience techniques, including adaptive management algorithms, and stor...
Conference Paper
The end of Dennard scaling has made energy-efficiency a critical challenge in the continued increase of computing performance. An important approach to increasing energy-efficiency is hardware customization. In this study we explore the opportunity for energy-efficiency via memory hierarchy customization and present a methodology to identify applic...
Conference Paper
The scaling of semiconductor technology and increasing power concerns combined with system scale make fault management a growing concern in high performance computing systems. Greater variety of errors, higher error rates, longer detection intervals, and "silent" errors are all expected. Traditional checkpointing models and systems assume that erro...
Conference Paper
We study the performance portability of OpenCL across diverse architectures including NVIDIA GPU, Intel Ivy Bridge CPU, and AMD Fusion APU. We present detailed performance analysis at assembly level on three exemplar OpenCL benchmarks: SGEMM, SpMV, and FFT. We also identify a number of tuning knobs that are critical to performance portability, incl...
Article
Chip power consumption has reached its limits, leading to the flattening of single-core performance. We propose the 10x10 processor, a federated heterogeneous multi-core architecture, where each core is an ensemble of u-engines (micro-engines, similar to accelerators) specialized for different workload groups to achieve dramatically higher energy e...
Article
Full-text available
We present here a report produced by a workshop on ‘Addressing failures in exascale computing’ held in Park City, Utah, 4–11 August 2012. The charter of this workshop was to establish a common taxonomy about resilience across all the levels in a computing system, discuss existing knowledge on resilience across the various hardware and software laye...
Conference Paper
Emerging exascale architectures bring forth new challenges related to heterogeneous systems power, energy, cost, and resilience. These new challenges require a shift from conventional paradigms in understanding how to best exploit and optimize these features and limitations. Our objective is to identify the top few dominant characteristics in a set...
Conference Paper
Amdahl's law has been one of the factors influencing speedup in high performance computing over the last few decades. While Amdahl's approach of optimizing the 10% of the code where 90% of the execution time is spent has worked very well in the past, new challenges related to emerging exascale heterogeneous architectures, combined with stringent p...
Conference Paper
To ensure reliability, long-running and large-scale computations have long used checkpoint-and-restart techniques to preserve computational progress in case of soft or hard failures. These techniques can incur significant overhead, consuming as much as 15% of an application's resources for the US DOE's leadership-class systems, and these overheads...
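For context on such overheads, Young's classic approximation gives the checkpoint interval that minimizes wasted time, tau ≈ sqrt(2·C·M) for checkpoint cost C and mean time between failures M; the numbers below are illustrative and not taken from the paper.

```python
# Young's approximation for the checkpoint interval that minimizes wasted
# time: tau ~= sqrt(2 * C * M), where C is checkpoint cost and M is MTBF.
# The parameters below are illustrative, not from the paper.
import math

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

C, M = 600.0, 24 * 3600.0             # 10-minute checkpoints, 1-day MTBF
tau = young_interval(C, M)
overhead = C / tau                     # rough fraction of time spent checkpointing
print(f"checkpoint every {tau/3600:.1f} h, ~{overhead:.0%} overhead")
```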
Article
Full-text available
Two decades of microprocessor architecture driven by quantitative 90/10 optimization has delivered an extraordinary 1000-fold improvement in microprocessor performance, enabled by transistor scaling which improved density, speed, and energy. Recent generations of technology have produced limited benefits in transistor speed and power, so as a resul...
Article
Before making statements about microprocessor trends 10 years out (Micro 2006), it might be useful to revisit our past statements (Gelsinger et al. 1989, 1991) about the microprocessor of today and the microprocessor of 2000. Then we can see where we have been right and where wrong. This retrospective will reveal important trends that promise to giv...
Article
The Distributed Virtual Computer (DVC) is the key unifying element of the OptIPuter software architecture. It provides a simple, clean abstraction for applications or higher-level middleware, allowing them to use lambda-grids with the same convenience as a VPN. The DVC is successful because it employs integrated network and end-resource selection,...
Conference Paper
With an increasing number of available resources in large-scale distributed environments, a key challenge is resource selection. Fortunately, several middleware systems provide resource selection services. However, a user is still faced with a difficult question: "What should I ask for?" Since most users end up using naïve and suboptimal resource s...
Conference Paper
A critical challenge for wide-area configurable networks is the definition and widespread acceptance of a Network Information Model (NIM). When a network comprises multiple domains, intelligent information sharing is required for a provider to maintain a competitive advantage and for customers to use a provider's network and make good resource selection...
Conference Paper
Emerging large-scale scientific applications require access to large data objects with high and robust performance. We propose RobuSTore, a storage architecture that combines erasure codes and speculative access mechanisms for parallel writes and reads in distributed environments. The mechanisms can effectively aggregate the bandwidth from a large numb...
Article
Desktop Grids are popular platforms for high throughput applications, but due to their inherent resource volatility it is difficult to exploit them for applications that require rapid turnaround. Efficient desktop Grid execution of short-lived applications is an attractive proposition and we claim that it is achievable via intelligent resource sele...
Article
Desktop grids, which use the idle cycles of many desktop PCs, are one of the largest distributed systems in the world. Despite the popularity and success of many desktop grid projects, the heterogeneity and volatility of hosts within desktop grids have been poorly understood. Yet, resource characterization is essential for accurate simulation and m...
