About
66 Publications
19,880 Reads
1,389 Citations
Current institution: Netflix
Current position: Developer
Publications (66)
This edition of the “Practitioners’ Digest” features recent papers on artificial intelligence (AI) and machine learning (ML), along with papers on tech debt, energy consumption, and collaboration between industry and academia.
This edition of the “Practitioners’ Digest” covers an eclectic mix of topics from cannabis usage to world politics to large language models.
This edition of the “Practitioners’ Digest” summarizes five recently published conference and journal papers on the topic of infrastructure-as-code.
DevOps connects development, delivery, and operations and thus facilitates fluid collaboration across these traditionally separated silos. As an agile method, DevOps is now used across industries and is no longer limited to IT services and specific technologies. Lorin Hochstein and I present a brief overview of DevOps best practices. A Netflix c...
In honor of this issue’s theme of “Artificial Intelligence (AI) Engineering—Realizing the Potential of AI,” this edition of the “Practitioners’ Digest” brings you recent papers on AI and machine learning (ML) engineering, which we believe will be of interest to practitioners. These papers were published in the First International Conference on AI E...
The theme of this issue is “Bots in Software Engineering,” and we’ve collected a number of recent papers about bots that interact with source code repositories. These papers were published at the fourth International Workshop on Bots in Software Engineering (BotSE ’22), the 37th ACM/SIGAPP Symposium on Applied Computing (SAC ’22), the International...
In keeping with the issue theme of multiconcern assurance, this month’s column features summaries of related papers published recently in the 2021 International Conference on Computer Safety, Reliability, and Security (SAFECOMP 2021), the 26th European Conference on Pattern Languages of Programs (EuroPLoP’21), and the 2021 International Conference...
This article reports papers about technical debt (TD) from the 2021 IEEE/Association for Computing Machinery (ACM) International Conference on Technical Debt (TechDebt’21), the 43rd IEEE/ACM International Conference on Software Engineering: Journal First Track (ICSE-JF’21), the 43rd IEEE/ACM International Conference on Software Engineering: Softwar...
Following along with the theme of this issue of IEEE Software, this column reports papers about digital twins from the 2021 Evaluation and Assessment in Software Engineering (EASE’21) conference, the 2021 International Conference on Software Engineering (ICSE 2021), the 2021 International Symposium on Software Engineering for Adaptive and Self-Managing...
Distributed systems often face transient errors and localized component degradation and failure. Verifying that the overall system remains healthy in the face of such failures is challenging. At Netflix, we have built a platform for automatically generating and executing chaos experiments, which check how well the production system can handle compo...
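To make the idea of an automatically executed chaos experiment concrete, here is a minimal Python sketch. The field names, the pass/fail rule, and the simulated metric readings are illustrative assumptions of mine; they are not the Chaos Automation Platform's actual API.

```python
from dataclasses import dataclass
import random

@dataclass
class ChaosExperiment:
    """Minimal description of one failure-injection experiment.

    Field names are illustrative assumptions, not the platform's real API.
    """
    target_service: str        # non-critical service whose calls will be failed
    traffic_fraction: float    # share of production requests routed into the experiment
    steady_state_metric: str   # business metric compared between the two groups
    tolerance: float           # allowed relative drop before the experiment fails

def evaluate(control_metric: float, treatment_metric: float,
             experiment: ChaosExperiment) -> bool:
    """Pass if the treatment group's metric stays within tolerance of the control group."""
    relative_drop = (control_metric - treatment_metric) / control_metric
    return relative_drop <= experiment.tolerance

exp = ChaosExperiment(target_service="recommendations-cache",   # hypothetical service name
                      traffic_fraction=0.01,
                      steady_state_metric="stream_starts_per_second",
                      tolerance=0.02)

# Simulated metric readings stand in for real production telemetry.
control, treatment = 1000.0, 995.0 + random.uniform(-5.0, 5.0)
print("experiment passed:", evaluate(control, treatment, exp))
```

The evaluate function only mimics the core comparison described in these abstracts: inject a failure into a non-critical service for a small slice of traffic, then check that a steady-state business metric in the treatment group stays close to the control group.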
While Chaos Engineering has gained currency in the Site Reliability Engineering community, service and business owners are often nervous about experimenting in production. Proving the benefits of Chaos Engineering to these stakeholders before implementing a program can be challenging. We present the business case for Chaos Engineering, through both...
Modern software-based services are implemented as distributed systems with complex behavior and failure modes. Many large tech organizations are using experimentation to verify the reliability of such systems. We use the term "Chaos Engineering" to refer to this approach, and discuss the underlying principles and how to use it to run experiments.
The Netflix video streaming system is composed of many interacting services. In such a large system, failures in individual services are not uncommon. This paper describes the Chaos Automation Platform, a system for running failure injection experiments on the production system to verify that failures in non-critical services do not result in syste...
Large-scale distributed systems must be built to anticipate and mitigate a variety of hardware and software failures. In order to build confidence that fault-tolerant systems are correctly implemented, Netflix (and similar enterprises) regularly run failure drills in which faults are deliberately injected in their production system. The combinatori...
The Netflix video streaming system is composed of many interacting services. In such a large system, failures in individual services are not uncommon. This paper describes the Chaos Automation Platform, a system for running failure injection experiments on the production system to verify that failures in non-critical services do not result in syste...
One of the most significant changes in the software industry over the past two decades has been the transition from standalone applications to networked applications, known as the software-as-a-service model. Whether users are interacting with these services through a web browser or a custom app, on a laptop, desktop, mobile phone, tablet, or even...
Modern software-based services are implemented as distributed systems with complex behavior and failure modes. Many large tech organizations are using experimentation to verify such systems' reliability. Netflix engineers call this approach chaos engineering. They've determined several principles underlying it and have used it to run experiments. T...
21st Century Smart Manufacturing (SM) is manufacturing in which all information is available when it is needed, where it is needed, and in the form it is most useful [1,2] to drive optimal actions and responses. The 21st Century SM enterprise is data driven, knowledge enabled, and model rich with visibility across the enterprise (internal and exter...
In virtual organizations, such as Open Source Software (OSS) communities, we expect that the impressions members have about each other play an important role in fostering effective collaboration. However, there is little empirical evidence about how peer impressions form and change in virtual organizations. This paper reports the results from a sur...
Design, deploy, and maintain your own private or public Infrastructure as a Service (IaaS), using the open source OpenStack platform. In this practical guide, experienced developers and OpenStack contributors show you how to build clouds based on reference architectures, as well as how to perform daily administration tasks.
Clouds establish a different division of responsibilities between platform operators and users than has traditionally existed in computing infrastructure. In private clouds, where all participants belong to the same organization, this creates new barriers to effective communication and resource usage. In this paper, we present poncho, a tool that implem...
Computational Science and Engineering (CSE) software supports a wide variety of domains including nuclear physics, crash simulation, satellite data processing, fluid dynamics, climate modeling, bioinformatics, and vehicle development. The increasing importance of CSE software motivates the need to identify and understand appropriate software enginee...
Scientists and engineers devote considerable effort to developing large, complex codes to solve important problems. However, while they often develop useful code, many scientists and engineers are frequently unaware of how various software engineering practices can help them write better code. This article presents the results of a survey on this t...
Cloud computing services, which allow users to lease time on remote computer systems, must be particularly attractive to smaller engineering organizations that use engineering simulation software. Such organizations have occasional need for substantial computing power but may lack the budget and in-house expertise to purchase and maintain such reso...
Software is often the dominant cost associated with developing DoD High Performance Embedded Computing (HPEC) systems. Historically there has been no quantifiable methodology for comparing the difficulty of developing code on different HPEC systems and trading off ease of development vs. execution performance. The DARPA High Productivity Computin...
Cloud computing services, which allow users to lease time on remote computer systems, must be particularly attractive to smaller engineering organizations that use engineering simulation software. Such organizations have occasional need for substantial computing power but may lack the budget and in-house expertise to purchase and maintain such reso...
Current cloud computing infrastructure typically assumes a homogeneous collection of commodity hardware, with details about hardware variation intentionally hidden from users. In this paper, we present our approach for extending the traditional notions of cloud computing to provide a cloud-based access model to clusters that contain a heterogeneous...
All compiled software systems require a build system: a set of scripts to invoke compilers and linkers to generate the final executable binaries. For scientific software, these build scripts can become extremely complex. Anecdotes suggest that scientific programmers have long been dissatisfied with the current software build tool chains. In this pa...
Computational Science and Engineering (CSE) software supports a wide variety of domains including nuclear physics, crash simulation, satellite data processing, fluid dynamics, climate modeling, bioinformatics, and vehicle development. The increase in the importance of CSE software motivates the need to identify and understand appropriate software engineering (SE...
The majority of scientific software is distributed as source code. As the number of library dependencies and supported platforms increases, so does the complexity of describing the rules for configuring and building software. In this project, we have performed an empirical study of the magnitude of the build problem by examining the development his...
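As a rough illustration of how build-related churn can be estimated from a development history, the following Python sketch counts how many commits in a Git repository touch common build files. The file-name patterns and the parsing of `git log` output are simplifying assumptions of mine, not the study's actual methodology.

```python
import subprocess

# Illustrative patterns for "build" files; real projects may use others.
BUILD_SUFFIXES = ("Makefile", "CMakeLists.txt", "configure.ac", "configure.in", ".cmake", ".m4")

def build_churn(repo_path="."):
    """Return (total commits, commits touching build files) for a Git repository."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%H", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout
    commits = [c for c in log.split("\n\n") if c.strip()]
    touches_build = 0
    for commit in commits:
        files = commit.splitlines()[1:]  # first line is the commit hash
        if any(f.endswith(BUILD_SUFFIXES) for f in files):
            touches_build += 1
    return len(commits), touches_build

total, build = build_churn(".")
print(f"{build}/{total} commits modified build files")
```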
In this paper, we introduce a semi-automated process called software engineering workflow analysis (SEWA) for developing heuristics that analyze captured data to identify where programmers spend their time. To evaluate our process, we ran two case studies in the domain of high-performance computing to generate programmer workflow models for small p...
The classroom is a valuable resource for conducting software engineering experiments. However, coordinating a family of experiments in classroom environments presents a number of challenges to researchers. Understanding how to run such experiments, developing procedures to collect accurate data, and collecting data that is consistent across multipl...
Web macros give web browser users ways to "program" tedious tasks, allowing those tasks to be repeated more quickly and reliably than when performed by hand. Web macros face dependability problems of their own, however: changes in websites or failure on the part of end-user programmers to anticipate possible macro behaviors can cause macros to...
There is widespread belief in the computer science community that MPI is a difficult and time-intensive approach to developing parallel software. Nevertheless, MPI remains the dominant programming model for HPC systems, and many projects have made effective use of it. It remains unknown how much impact the use of MPI truly has on the productivity o...
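For readers unfamiliar with the message-passing style these studies examine, here is a minimal Python sketch using the mpi4py bindings (assuming mpi4py and an MPI runtime are installed; launch with something like `mpirun -n 4 python sum.py`). It only illustrates the explicit-communication flavor of MPI, not any code from the studies.

```python
# A toy message-passing program: each rank sums a local slice and the
# partial sums are combined with an explicit reduction.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank owns its own slice of the data (here, just a range of integers).
local_values = range(rank * 1000, (rank + 1) * 1000)
local_sum = sum(local_values)

# Explicit communication is the defining feature of the message-passing model.
total = comm.reduce(local_sum, op=MPI.SUM, root=0)

if rank == 0:
    print(f"global sum across {size} ranks: {total}")
```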
Context: Writing software for the current generation of parallel systems requires significant programmer effort, and the community is seeking alternatives that reduce effort while still achieving good performance. Objective: Measure the effect of parallel programming models (message-passing vs. PRAM-like) on programmer effort. Design, setting, and subjec...
Studies of computational scientists developing software for high-performance computing systems indicate that these scientists face unique software engineering issues. Previous failed attempts to transfer SE technologies to this domain haven't always taken these issues into account. To support scientific-software development, the SE community can di...
Computational scientists use computers to simulate physical phenomena in situations where experimentation would be prohibitively expensive or impossible. Advancing scientific research depends on developing software productively, which often must run on high-end computing systems, presenting unique software development challenges. The authors studie...
While there are strong beliefs within the community about whether one particular parallel programming model is easier to use than another, there has been little research to analyze these claims empirically. Currently, the most popular paradigm is message-passing, as implemented by the MPI library [1]. However, MPI is considered to be difficult for...
The evolution of a new technology depends upon a good theoretical basis for developing the technology, as well as upon its experimental validation. In order to provide for this experimentation, we have investigated the creation of a software testbed and the feasibility of using the same testbed for experimenting with a broad set of technologies. Th...
We present an iterative, reading-based methodology for analyzing defects in source code when change history is available. Our bottom-up approach can be applied to build knowledge of recurring defects in a specific domain, even if other sources of defect data such as defect reports and change requests are unavailable, incomplete or at the wrong leve...
While there are strong beliefs within the community about whether one particular parallel programming model is easier to use than another, there has been little research to analyze these claims empirically. Currently, the most popular paradigm is message-passing, as implemented by the MPI library [1]. However, MPI is considered to be difficult for...
This paper describes observations about software development for high end computing that we have made from several environments. We conducted a series of case studies of different types of codes, from academic codes to codes from governmental agencies. Based on those studies, we have formed a series of observations, some common and some different a...
The development of high-performance computing (HPC) programs is crucial to progress in many fields of scientific endeavor. We have run initial studies of the productivity of HPC developers and of techniques for improving that productivity, which have not previously been the subject of significant study. Because of key differences between developmen...
In developing High-Performance Computing (HPC) software, time to solution is an important metric. This metric comprises two main components: the human effort required to develop the software, plus the amount of machine time required to execute it. To date, little empirical work has been done to study the first component: the human effort requ...
Measuring effort accurately and consistently across subjects in a programming experiment can be a surprisingly difficult task. In particular, measures based on self-reported data may differ significantly from measures based on data which is recorded automatically from a subject's computing environment. Since self-reports can be unreliable, and not...
As software systems evolve over time, they invariably undergo changes that can lead to a degeneration of the architecture. Left unchecked, degeneration may reach a level where a complete redesign is necessary, a task that requires significant effort. In this paper, we present a survey of technologies developed by researchers that can be used to com...
Evaluation of High Performance Computing (HPC) systems should take into account software development time productivity in addition to hardware performance, cost, and other factors. We propose a new metric for HPC software development time productivity, defined as the ratio of relative runtime performance to relative programmer effort. This formula...
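A minimal sketch of how a ratio-style productivity measure like this might be computed, assuming a serial baseline as the reference point; the variable names and example numbers are mine, and the paper's exact normalization may differ.

```python
def relative_productivity(ref_runtime, hpc_runtime, ref_effort, hpc_effort):
    """Productivity as (relative runtime performance) / (relative programmer effort).

    Both ratios are taken against a reference (e.g., serial) implementation.
    The exact normalization follows the paper; this is only an illustration.
    """
    relative_speedup = ref_runtime / hpc_runtime   # > 1: the HPC version runs faster
    relative_effort = hpc_effort / ref_effort      # > 1: the HPC version cost more effort
    return relative_speedup / relative_effort

# Example: a 20x speedup obtained at 4x the development effort -> productivity 5.0
print(relative_productivity(ref_runtime=200.0, hpc_runtime=10.0,
                            ref_effort=40.0, hpc_effort=160.0))
```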
We define a metric space to measure the contributions of individual programmers to a software development project. It allows us to measure the distance between the contributions of two different programmers as well as the absolute contribution of each individual programmer. Our metric is based on an action function that provides a picture of how on...
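The following Python sketch shows one way such a metric space could look: each programmer's contribution is a weighted vector over recorded action types, absolute contribution is the vector's magnitude, and the distance between two programmers is the Euclidean distance between their vectors. The action types and weights are illustrative assumptions, not the paper's action function.

```python
import math

# Illustrative weights over recorded action types; not the paper's action function.
WEIGHTS = {"commit": 1.0, "line_added": 0.01, "review": 0.5}

def contribution(actions):
    """Map raw action counts to a weighted contribution vector."""
    return {k: WEIGHTS.get(k, 0.0) * v for k, v in actions.items()}

def magnitude(vec):
    """Absolute contribution of one programmer: the vector's Euclidean norm."""
    return math.sqrt(sum(x * x for x in vec.values()))

def distance(a, b):
    """Distance between two programmers' contributions."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

alice = contribution({"commit": 120, "line_added": 8000, "review": 40})
bob = contribution({"commit": 30, "line_added": 2500, "review": 90})
print(magnitude(alice), magnitude(bob), distance(alice, bob))
```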
In this research, we are developing our understanding of how the high performance computing community develops effective parallel implementations of programs by collecting the folklore within the community. We use this folklore as the basis for a series of experiments, which we expect, will validate or negate these assumptions.
Empirical evidence and technology evaluation are needed to close the gap between the state of the art and the state of the practice in software engineering. However, there are difficulties associated with evaluating technologies based on empirical evidence: insufficient specification of context variables, cost of experimentation, and risks associat...
In the high performance computing domain, the speed of execution of a program has typically been the primary performance metric. But productivity is also of concern to high performance computing developers. In this paper we will discuss the problems of defining and measuring productivity for these machines and we develop a model of productivity...
In developing High-Performance Computing (HPC) software, time to solution is an important metric. This metric comprises two main components: the human effort required to develop the software, plus the amount of machine time required to execute it. To date, little empirical work has been done to study the first component: the human effort requ...
The ability to write programs that execute efficiently on modern parallel computers has not been fully studied. In a DARPA-sponsored project, we are looking at measuring the development time for programs written for high performance computers (HPC). To attack this relatively novel measurement problem, our goal is to initially measure such developme...
The High Dependability Computing Program (HDCP) project is a NASA initiative for increasing the dependability of software-based systems. It investigates how to achieve high dependability by introducing new technologies. We focus on the evaluation of the effectiveness of technologies with respect to dependability. We employ empirical evaluation methods along w...
The ability to write programs that execute efficiently on modern parallel computers has not been fully studied. In this DARPA-sponsored project, we are looking at measuring the development time for programs written for high performance computers (HPC). Our goal is to measure such development time in both student programming (initially), and then la...
Collecting development data automatically is difficult in this era of ubiquitous home computing. This paper describes our efforts in the High Productivity Computing Systems project to better calculate effort data among a set of student programming exercises.
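One common heuristic for recovering effort from automatically captured events is to sum the gaps between consecutive timestamps while discarding gaps long enough to indicate a break. The Python sketch below assumes a 15-minute idle cutoff, which is an arbitrary illustrative choice rather than the project's actual rule.

```python
from datetime import datetime, timedelta

IDLE_CUTOFF = timedelta(minutes=15)  # gaps longer than this are treated as breaks (assumed value)

def active_effort(timestamps):
    """Estimate hands-on effort from timestamped tool events.

    Sums the gaps between consecutive events, ignoring any gap longer than
    IDLE_CUTOFF on the assumption that the programmer had stopped working.
    """
    ts = sorted(timestamps)
    effort = timedelta()
    for prev, cur in zip(ts, ts[1:]):
        gap = cur - prev
        if gap <= IDLE_CUTOFF:
            effort += gap
    return effort

# Example: three edits in the morning, then a long break before the next event.
events = [datetime(2006, 3, 1, 9, 0), datetime(2006, 3, 1, 9, 10),
          datetime(2006, 3, 1, 9, 20), datetime(2006, 3, 1, 13, 0)]
print(active_effort(events))  # 0:20:00 -- the long midday gap is discarded
```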
Software systems evolve over time and undergo changes that can lead to a degeneration of the systems' architecture. Degeneration may eventually reach a level where a complete redesign of the software system is necessary, which is a task that requires significant effort. In this paper, we start by presenting examples of such degeneration and continu...
Collecting development data automatically is difficult in this era of ubiquitous home computing. This paper describes our efforts in the High Productivity Computing Systems project to better calculate effort data among a set of student programming exercises.
This paper proposes a framework for the rapid development of high-level, domain-independent AI strategies targeted at the RoboCup competition. This framework, developed within the Swarm simulation system, provides a layer of abstraction that allows strategies to be easily ported from one domain to another. Additionally, the framework provides a pow...
Programmer ability is important to assess in both research and professional settings, but it is very difficult to measure directly. The objective of this study was to measure the ability of programmers to accurately rate the programming ability of their peers. The participants were computer science students in a senior-level undergraduate programmi...
Inexpensive graphics processors are being applied to the general computational problem as a way to provide supercomputer capabilities at an affordable price. But little research has gone into understanding the difficulty in developing such programs in terms of productivity and reliability of the resulting code. The University of Maryland is underta...
The ability to write programs that execute efficiently on modern parallel computers has not been fully studied. In this DARPA-sponsored project, we are looking at measuring the development time for programs written for high performance computers (HPC). Our goal is to measure such development time in both student programming (initially), and then la...
Ph.D. thesis, Department of Computer Science, University of Maryland, College Park, 2006. Includes bibliographical references.