
Armando Fox
University of California, Berkeley
About
199 Publications · 94,899 Reads
30,720 Citations
Publications (199)
Execution of complex analytic queries on massive semantic graphs is a challenging problem in big-data analytics that requires high-performance parallel computing. In a semantic graph, vertices and edges carry attributes of various types and the analytic queries typically depend on the values of these attributes. Thus, the computation must view the...
As the last standardization effort was done in 2004, the software engineering curriculum is currently being revised. Haven't we reached the point where agile development should be part of all software engineering curricula? And if so, shouldn't new curriculum standards ensure that it is? Thus, the answer to the question in the title of this article...
Developers of rapidly growing applications must be able to anticipate potential scalability problems before they cause performance issues in production environments. A new type of data independence, called scale independence, seeks to address this challenge by guaranteeing a bounded amount of work is required to execute all queries in an applicatio...
High performance is a crucial consideration when executing a complex analytic query on a massive semantic graph. In a semantic graph, vertices and edges carry "attributes" of various types. Analytic queries on semantic graphs typically depend on the values of these attributes; thus, the computation must either view the graph through a "filter" that...
We implement a Communication-Avoiding Recursive Matrix Multiplication algorithm (CARMA): the first communication-optimal parallel algorithm for all dimensions of matrices. The shared-memory version of CARMA is only ~50 lines of code, much simpler than 3D SUMMA [8] or the rectangular version of 2.5D [9], and faster than MKL and ScaLAPACK in practice: Fa...
Communication-optimal algorithms are known for square matrix multiplication. Here, we obtain the first communication-optimal algorithm for all dimensions of rectangular matrices. Combining the dimension-splitting technique of Frigo, Leiserson, Prokop and Ramachandran (1999) with the recursive BFS/DFS approach of Ballard, Demmel, Holtz, Lipshitz a...
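The key idea in the two CARMA entries above is to recursively split whichever of the three matrix dimensions (m, k, n) is currently largest. Below is a minimal sequential sketch of that dimension-splitting recursion; the function name and base-case threshold are illustrative, and the actual algorithm adds the recursive BFS/DFS parallelization the summaries refer to.

```python
import numpy as np

def carma_like_matmul(A, B, threshold=64):
    """Recursively split the largest of the three dimensions (m, k, n),
    in the spirit of CARMA's dimension splitting; sequential sketch only."""
    m, k = A.shape
    _, n = B.shape
    if max(m, k, n) <= threshold:                # small enough: multiply directly
        return A @ B
    if m >= k and m >= n:                        # split the rows of A
        h = m // 2
        return np.vstack([carma_like_matmul(A[:h], B, threshold),
                          carma_like_matmul(A[h:], B, threshold)])
    if n >= k:                                   # split the columns of B
        h = n // 2
        return np.hstack([carma_like_matmul(A, B[:, :h], threshold),
                          carma_like_matmul(A, B[:, h:], threshold)])
    h = k // 2                                   # split the shared dimension k
    return (carma_like_matmul(A[:, :h], B[:h], threshold) +
            carma_like_matmul(A[:, h:], B[h:], threshold))

A, B = np.random.rand(300, 70), np.random.rand(70, 200)
assert np.allclose(carma_like_matmul(A, B), A @ B)
```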
Domain-expert productivity programmers desire scalable application performance, but usually must rely on efficiency programmers who are experts in explicit parallel programming to achieve it. Since such programmers are rare, to maximize reuse of their work we propose encapsulating their strategies in mini-compilers for domain-specific embedded lang...
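As a rough sketch of the mini-compiler idea this entry describes, the toy below embeds a tiny expression language in Python and "specializes" an expression tree into a callable kernel. The class names and the lowering to a Python lambda are invented stand-ins; a real specializer would emit and compile efficiency-level code such as C or CUDA.

```python
import operator

class Expr:
    def __add__(self, other): return BinOp(operator.add, self, other)
    def __mul__(self, other): return BinOp(operator.mul, self, other)

class Var(Expr):
    def __init__(self, name): self.name = name
    def compile(self): return f"env['{self.name}']"

class BinOp(Expr):
    def __init__(self, op, lhs, rhs): self.op, self.lhs, self.rhs = op, lhs, rhs
    def compile(self):
        sym = {operator.add: '+', operator.mul: '*'}[self.op]
        return f"({self.lhs.compile()} {sym} {self.rhs.compile()})"

def specialize(expr):
    # "Compile" the tree once into a reusable kernel; a real mini-compiler
    # would generate, build, and cache native code here instead.
    return eval(f"lambda env: {expr.compile()}")

x, y = Var('x'), Var('y')
kernel = specialize(x * x + y)
print(kernel({'x': 3, 'y': 4}))   # -> 13
```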
Cloud computing and the shift in the software industry toward Software as a Service (SaaS) using Agile development have led to tools and techniques that are a much better match to the classroom than earlier software development methods. While new college graduates are good at coding and debugging, employers complain about other missing skills that...
Though crowdsourcing holds great promise, many struggle with framing tasks and determining which members of the crowd should be recruited to obtain reliable output. In some cases, expert knowledge is desired but, given the time and cost constraints of the problem, may not be available. In this case, it would be beneficial to augment the expert inpu...
Newly-released web applications often succumb to a "Success Disaster," where overloaded database machines and resulting high response times destroy a previously good user experience. Unfortunately, the data independence provided by a traditional relational database system, while useful for agile development, only exacerbates the problem by hiding p...
We present the Scalable Nucleotide Alignment Program (SNAP), a new short and long read aligner that is both more accurate (i.e., aligns more reads with fewer errors) and 10-100x faster than state-of-the-art tools such as BWA. Unlike recent aligners based on the Burrows-Wheeler transform, SNAP uses a simple hash index of short seed sequences from th...
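A minimal sketch of the seed-hash-index idea in the SNAP summary, assuming a toy reference string: index every fixed-length substring ("seed") of the reference, then let each seed of a read vote for a candidate alignment position. The seed length and the voting rule are illustrative simplifications of what the aligner actually does.

```python
from collections import defaultdict

def build_seed_index(reference, seed_len=8):
    """Map every seed_len-mer of the reference to its positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index

def align_read(read, index, seed_len=8):
    """Each seed hit votes for one candidate start position of the read."""
    votes = defaultdict(int)
    for offset in range(len(read) - seed_len + 1):
        for hit in index.get(read[offset:offset + seed_len], ()):
            votes[hit - offset] += 1
    return max(votes, key=votes.get) if votes else None

ref = "ACGTACGTTAGCCGATTACAGGTTACCA"
idx = build_seed_index(ref)
print(align_read("GCCGATTACA", idx))   # -> 10, where the read occurs in ref
```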
Elasticity of cloud computing environments provides an economic incentive for automatic resource allocation of stateful systems running in the cloud. However, these systems have to meet strict performance Service-Level Objectives (SLOs) expressed using upper percentiles of request latency, such as the 99th. Such latency measurements are very noisy,...
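To make the measurement problem above concrete: upper-percentile latency is noisy from one window to the next, so a controller usually smooths it before acting on it. A small illustration follows; the window size, percentile, and smoothing factor are arbitrary choices for the example, not values from the paper.

```python
import random

def percentile(samples, p):
    s = sorted(samples)
    return s[min(len(s) - 1, int(p / 100.0 * len(s)))]

random.seed(0)
ewma, alpha = None, 0.3                     # exponentially weighted moving average
for epoch in range(6):
    latencies = [random.expovariate(1 / 20.0) for _ in range(200)]  # ms
    p99 = percentile(latencies, 99)
    ewma = p99 if ewma is None else alpha * p99 + (1 - alpha) * ewma
    print(f"epoch {epoch}: raw p99 = {p99:6.1f} ms, smoothed = {ewma:6.1f} ms")
```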
Many scientists would love access to large-scale computational resources but find that the programming demands of using a supercomputer—as well as the cost and queuing time—are too daunting. Privately owned cloud computers—large data centers filled with computers that mainly run their company's software—are now becoming available to outside users,...
Typically, scientists with computational needs prefer to use high-level languages such as Python or MATLAB; however, large computationally-intensive problems must eventually be recoded in a low-level language such as C or Fortran by expert programmers in order to achieve sufficient performance. In addition, multiple strategies may exist for mappin...
Application console logs are a ubiquitous tool for diagnosing system failures and anomalies. While several techniques exist to interpret logs, describing and assessing log quality remains relatively unexplored. In this paper, we describe an abstract graphical representation of console logs called the identifier graph and a visualization based on th...
We describe our early experience in applying our console log mining techniques [19, 20] to logs from production Google systems with thousands of nodes. This data set is five orders of magnitude in size and contains almost 20 times as many message types as the Hadoop data set we used in [19]. It also has many properties that are unique to large...
The ParLab at Berkeley, UPCRC-Illinois, and the Pervasive Parallel Laboratory at Stanford are studying how to make parallel programming succeed given industry's recent shift to multicore computing. All three centers assume that future microprocessors will have hundreds of cores and are working on applications, programming environments, and architec...
Cloud computing, the long-held dream of computing as a utility, has the potential to transform a large part of the IT industry, making software even more attractive as a service and shaping the way IT hardware is designed and purchased. Developers with innovative ideas for new Internet services no longer require the large capital outlays in hardwar...
Evaluating the resiliency of stateful Internet services to significant workload spikes and data hotspots requires realistic workload traces that are usually very difficult to obtain. A popular approach is to create a workload model and generate synthetic workload, however, there exists no characterization and model of stateful spikes. In this paper...
Large-scale, user-facing applications are increasingly moving from relational databases to distributed key/value stores for high-request-rate, low-latency workloads. Often, this move is motivated not only by key/value stores' ability to scale simply by adding more hardware, but also by the easy-to-understand, predictable performance they provide for...
A recent trend for data-intensive computations is to use pay-as-you-go execution environments that scale transparently to the user. However, providers of such environments must tackle the challenge of configuring their system to provide maximal performance while minimizing the cost of resources used. In this paper, we use statistical models to pred...
Large-scale websites are increasingly moving from relational databases to distributed key-value stores for high request rate, low latency workloads. Often this move is motivated not only by key-value stores' ability to scale simply by adding more hardware, but also by the easy-to-understand, predictable performance they provide for all operations. W...
Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, for they often consist of the voluminous intermixing of messages from many software components written by independent developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime probl...
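One way to picture that methodology: parse each traced request into a vector of per-message-type counts, then flag requests whose vectors deviate from the norm. The toy below uses a simple distance-from-mean test with made-up counts and an arbitrary threshold; the actual work uses more principled techniques such as PCA.

```python
import numpy as np

# rows = traced requests, columns = counts of each parsed message type
counts = np.array([[3, 1, 0], [3, 1, 0], [3, 1, 0], [3, 1, 0],
                   [0, 5, 4]])                 # last request looks unusual
dist = np.linalg.norm(counts - counts.mean(axis=0), axis=1)
threshold = dist.mean() + 1.5 * dist.std()     # arbitrary cutoff for the demo
print("anomalous requests:", np.nonzero(dist > threshold)[0])   # -> [4]
```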
When a performance crisis occurs in a datacenter, rapid recovery requires quickly recognizing whether a similar incident occurred before, in which case a known remedy may apply, or whether the problem is new, in which case new troubleshooting is necessary. To address this issue we propose a new and efficient representation of the datacenter's sta...
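A sketch of what such a representation might look like, with invented data: summarize each epoch of per-server metrics as a compact quantile vector (a "fingerprint") and match a new crisis against labeled past fingerprints by nearest neighbor. The quantiles, metric shapes, and distance below are illustrative, not the paper's actual construction.

```python
import numpy as np

def fingerprint(metrics):
    """metrics: (servers x metrics) array for one epoch -> flat quantile vector."""
    return np.percentile(metrics, [25, 50, 95], axis=0).ravel()

rng = np.random.default_rng(1)
past = {                                          # labeled fingerprints of old crises
    "overloaded DB":   fingerprint(rng.normal(90, 5, (40, 3))),
    "bad config push": fingerprint(rng.normal(30, 20, (40, 3))),
}
crisis = fingerprint(rng.normal(88, 6, (40, 3)))  # new incident, resembles the first
label = min(past, key=lambda k: np.linalg.norm(past[k] - crisis))
print("most similar past crisis:", label)         # -> overloaded DB
```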
Today's productivity programmers, such as scientists who need to write code to do science, are typically forced to choose between productive and maintainable code with modest performance (e.g. Python plus native libraries such as SciPy [SciPy]) or complex, brittle, hardware-specific code that entangles application logic with performance concerns b...
We describe a novel application of data mining and statistical learning methods to automatically monitor and detect abnormal execution traces from console logs in an online setting. Different from existing solutions, we use a two-stage detection system. The first stage uses frequent pattern mining and distribution estimation techniques to c...
The Stanford Interactive Workspaces project developed a set of technologies for integrating multiple devices in a co-located workspace, based on a few basic principles: 1. The interactions should maximize the potential for “fluency” of the users, reducing as much as possible the need to shift attention from the content of the work to the mechanism....
Collaborative web applications such as Facebook, Flickr and Yelp present new challenges for storing and querying large amounts of data. As users and developers are focused more on performance than single copy consistency or the ability to perform ad-hoc queries, there exists an opportunity for a highly-scalable system tailored specifically for rela...
Horizontally-scalable Internet services on clusters of commodity computers appear to be a great fit for automatic control: there is a target output (service-level agreement), observed output (actual latency), and gain controller (adjusting the number of servers). Yet few datacenters are automated this way in practice, due in part to well-founded sk...
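The control loop named in this entry (SLA target, observed latency, gain controller adjusting server count) is easy to caricature in a few lines. The plant model, gains, and noise below are invented purely to show the loop's shape; real controllers must cope with exactly the messiness that earns the skepticism the entry mentions.

```python
import random

random.seed(2)
target_ms, servers, integral = 100.0, 4, 0.0
kp, ki = 0.02, 0.005                               # illustrative controller gains
for step in range(8):
    load = 400 + 40 * step                         # rising request rate
    latency = load / servers + random.gauss(0, 3)  # crude stand-in for the plant
    error = latency - target_ms
    integral += error
    servers = max(1, round(servers + kp * error + ki * integral))
    print(f"step {step}: latency {latency:6.1f} ms -> {servers} servers")
```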
One of the most challenging aspects of managing a very large data warehouse is identifying how queries will behave before they start executing. Yet knowing their performance characteristics - their runtimes and resource usage - can solve two important problems. First, every database vendor struggles with managing unexpectedly long-running queries....
Multicore architectures have become so complex and di- verse that there is no obvious path to achieving good per- formance. Hundreds of code transformations, compiler flags, architectural features and optimization parameters result in a search space that can take many machine- months to explore exhaustively. Inspired by successes in the systems com...
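When exhaustive search over transformations and flags would take machine-months, one baseline is to sample the space and keep the best measured variant. The sketch below does exactly that over an invented parameter space, with a fake benchmark function standing in for compile-and-run; a real tuner would time actual generated code.

```python
import random

space = {"unroll": [1, 2, 4, 8], "tile": [16, 32, 64, 128],
         "vectorize": [False, True]}

def benchmark(cfg):
    # Stand-in for compiling and timing one code variant (seconds).
    penalty = abs(cfg["tile"] - 64) / 64 + abs(cfg["unroll"] - 4) / 4
    return 10.0 * (1 + penalty) * (0.7 if cfg["vectorize"] else 1.0)

random.seed(3)
best_cfg, best_time = None, float("inf")
for _ in range(30):                    # tiny space here; real ones are vast
    cfg = {k: random.choice(v) for k, v in space.items()}
    t = benchmark(cfg)
    if t < best_time:
        best_cfg, best_time = cfg, t
print(best_cfg, f"{best_time:.1f} s")
```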
Horizontally scalable Internet services present an opportunity to use automatic resource allocation strategies for system management in the datacenter. In most of the previous work, a controller employs a performance model of the system to make decisions about the optimal allocation of resources. However, these models are usually trained offline or...
The console logs generated by an application contain messages that the application developers believed would be useful in debugging or monitoring the application. Despite the ubiquity and large size of these logs, they are rarely exploited in a systematic way for monitoring and debugging because they are not readily machine-parsable. In this p...
Previous work showed that statistical analysis techniques could successfully be used to construct compact signatures of distinct operational problems in Internet server systems. Because signatures are amenable to well-known similarity search techniques, they can be used as a way to index past problems and identify particular operational problem...
Although there is prior work on energy conservation in datacenters, we identify a new approach based on the synergy between virtual machines and statistical machine learning, and we observe that constrained energy conservation can improve hardware reliability. We give initial results on a cluster that reduces energy costs by a factor of 5, redu...
Web 2.0 applications place new and different demands on servers compared to their Web 1.0 counterparts. Simultaneously, the definitive arrival of pay-as-you-go "cloud computing" and the proliferation of application development stacks present new and different degrees of freedom in deploying and tuning software-as-a-service. We first identify...
Among the many branches of computer science, ubiquitous computing enjoys an unusually distinguished history of creating and deploying prototypes. Why is this? A tempting answer is that many ubicomp researchers and practitioners have backgrounds in subjects such as HCI and systems--areas with a strong focus on learning from deploying working prototy...
Despite significant efforts in the field of Autonomic Computing, system operators will still play a critical role in administering Internet services for many years to come. However, very little is known about how system operators work, what tools they use and how we can make them more efficient. In this paper we study the practices of operat...
Space communication systems are plagued by complexity resulting in limited access to orbiting satellites and high mission operation costs that ultimately reduce mission yields and capabilities. In particular, ground stations, the access point between space and terrestrial networks, suffer from monolithic designs, narrow interfaces, and reliability...
In this project, we design and implement flow control in a J2EE application server by applying control theory and dynamic probabilistic scheduling. The goal is to regulate workload at the web front end to prevent overloading the shared database, while keeping fairness over all requests. Since workload in an enterprise application has a much larg...
We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristic from a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously-observed state. This allows operators to id...
Most Internet services (e-commerce, search engines, etc.) suffer faults. Quickly detecting these faults can be the largest bottleneck in improving availability of the system. We present Pinpoint, a methodology for automating fault detection in Internet services by: 1) observing low-level internal structural behaviors of the service; 2) modeling the...
Web applications suffer from software and configuration faults that lower their availability. Recovering from failure is dominated by the time interval between when these faults appear and when they are detected by site operators. We introduce a set of tools that augment the ability of operators to perceive the presence of failure: an automatic ano...
Given that hardware fails, software has bugs, and human operators make mistakes, researchers must increasingly consider recovery-oriented approaches to dependability. The articles in this issue's theme section describe how a range of techniques based on these perspectives can augment and complement other efforts to improve dependability.
Recent research activity (2, 12, 27, 10, 1) has shown encouraging results for performance debugging and failure diagnosis and detection in systems by using approaches based on automatically inducing models and deriving correlations from observed data. We believe that maximizing the potential of this line of research will require surmounting som...
Our ability to design and deploy large complex systems is outpacing our ability to understand their behavior. How do we detect and recover from "heisenbugs", which account for up to 40% of failures in complex Internet systems, without extensive application-specific coding? Which users were affected, and for how long? How do...
In this paper we show how to reduce downtime of J2EE applications by rapidly and automatically recovering from transient and intermittent software failures, without requiring application modifications. Our prototype combines three application-agnostic techniques: macroanalysis for fault detection and localization, microrebooting for rapid recovery,...
Violations of service level objectives (SLO) in Internet services are urgent conditions requiring immediate attention. Previously we explored (I. Cohen et al., 2004) an approach for identifying which low-level system properties were correlated to high-level SLO violations (the metric attribution problem). The approach is based on automatically indu...
Building systems to recover fast may be more productive than aiming for systems that never fail. Because recovery is not immune to failure either, the authors advocate multiple lines of defense in managing failures.
The increasing popularity of XML Web services motivates us to examine if it is feasible to substitute one vendor service for another when using a Web-based application, assuming that these services are "derived from" a common base. If such substitution were possible, end users could use the same application with a variety of back-end vendor service...
It is by now motherhood-and-apple-pie that complex distributed Internet services form the basis not only of e-commerce but increasingly of mission-critical network-based applications. What is new is that the workload and internal architecture of three-tier enterprise applications presents the opportunity for a new approach to keeping them running in...
Developers write Web service composition programs in terms of functionalities (e.g., "WebSearch") to postpone choosing which services of the same functionality to invoke (Google or Yahoo). We provide a higher level of abstraction than this for higher reuse. We express high-level "patterns" (e.g., "SearchAndCollectData") as both objects that can be...
From its inception, plenty of vision has existed in ubiquitous computing. This issue concentrates on how to realize the software infrastructure for that vision: on building and evaluating system software. Ubiquitous computing raises major challenges for system software researchers, mainly because of the heterogeneity and volatility that characteriz...
Cluster hash tables (CHTs) are a key persistent-storage component of many large-scale Internet services due to their high performance and scalability. We show that a correctly-designed CHT can also be as easy to manage as a farm of stateless servers. Specifically, we trade away some consistency to obtain reboot-based recovery that is simple, mainta...
A significant fraction of software failures in large-scale Internet systems are cured by rebooting, even when the exact failure causes are unknown. However, rebooting can be expensive, causing nontrivial service disruption or downtime even when clusters and failover are employed. In this work we separate process recovery from data recovery to enabl...
Microreboots restart fine-grained components of software systems "with a clean slate," and only take a fraction of the time needed for full system reboot. Microreboots provide an application-generic recovery technique for Internet services, which can be supported entirely in middleware and requires no changes to the applications or any a priori kno...
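A toy rendering of the microreboot pattern, with invented component names: restart one component with a clean slate rather than rebooting the whole service. Real microreboots operate on application components inside middleware (e.g., an EJB container), not on plain Python objects.

```python
class Component:
    def __init__(self, name):
        self.name = name
        self.start()
    def start(self):
        self.state = {}                  # clean slate: discard possibly-corrupt state
        self.healthy = True
        print(f"{self.name}: (re)started")

class Service:
    def __init__(self):
        self.components = {n: Component(n) for n in ("cart", "catalog")}
    def microreboot(self, name):
        self.components[name].start()    # restart just this component

svc = Service()
svc.components["cart"].healthy = False   # simulated transient failure
svc.microreboot("cart")                  # recover in a fraction of a full reboot
```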
Operating systems are complex and their behavior depends on many factors. Source code, if available, does not directly help understand the OS's behavior, as the behavior depends on actual workloads and external inputs. Runtime profiling is a key technique for understanding the behavior and mutual-influence of modern OS components. Such profiling i...
We present a new approach to managing failures and evolution in large, complex distributed systems using runtime paths. We use the paths that requests follow as they move through the system as our core abstraction, and our "macro" approach focuses on component interactions rather than the details of the components themselves. Paths record component...
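To make the path abstraction concrete: record, for each request, the sequence of components it touches, then look for sequences that occur rarely. The component names and rarity cutoff below are invented for illustration.

```python
from collections import Counter

paths = Counter()
def record(components):                  # one observed request path
    paths[tuple(components)] += 1

for _ in range(50):
    record(["web", "app", "db"])         # the common, healthy path
record(["web", "app", "cache", "db"])
record(["web", "app"])                   # request died before reaching the db

total = sum(paths.values())
for path, n in paths.items():
    if n / total < 0.05:                 # rare path: a candidate failure or change
        print("suspicious path:", " -> ".join(path), f"({n}x)")
```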
The current interest in programming models and software infrastructures to support ubiquitous and environmental computing is heightened by the falling cost of hardware and the ubiquity of local-area wireless networking technologies. Interactive workspaces are technologically augmented team-project rooms that represent a specific sub-domain of ubiqu...
The cost and complexity of administration of large systems has come to dominate their total cost of ownership. Stateless and soft-state components, such as Web servers or network routers, are relatively easy to manage: capacity can be scaled incrementally by adding more nodes, rebalancing of load after failover is easy, and reactive or proact...
Ubiquitous computing environments accrete slowly over time rather than springing into existence all at once. Mechanisms are needed for incremental integration: the problem of how to incrementally add or modify behaviors in existing ubicomp environments. Examples include adding new input modalities and choreographing the behavior of existing independen...
Complex distributed Internet services form the basis not only of e-commerce but increasingly of mission-critical network-based applications. What is new is that the workload and internal architecture of three-tier enterprise applications presents the opportunity for a new approach to keeping them running in the face of many common recoverable failu...
Even with reasonable overprovisioning, today's Internet application clusters are unable to handle major traffic spikes and flash crowds. As an alternative to fixed-size, dedicated clusters, we propose a dynamically-shared application cluster model based on virtual machines. The system is dubbed "OnCall" for the extra computing capacity that is alwa...
High availability of internet systems can be achieved either through high mean-time-to-failure (MTTF) or low mean-time-to-recovery (MTTR). Traditionally, system designers have focused on maximizing MTTF as a way of providing high availability. However, recent work in recovery-oriented computing has emphasized recovery from failures, rather than fai...
Pinpoint is an application-generic framework for detecting and localizing likely application-level failures in component-based Internet services. Pinpoint assumes that most of the system is working correctly most of the time, builds a model of this behavior, and searches for deviations from this model. Pinpoint does not rely on a priori application...
Most current interactive group workspaces are prohibitively expensive and difficult to install and use. At the same time, the demand for such spaces is rising dramatically along with the increasing number of electronic media-based meetings, presentations, projects and papers. Teamspace is a prototype for a public interactive workspace designed to b...
Ubiquitous computing embraces both nomadic computing and infrastructure-rich interactive workspaces. Although most effort in these areas to date has focused narrowly on one domain, there are interesting challenges at their intersection. To start a discussion on how some of the benefits of interactive workspaces can be provided to nomadic users, we...
We introduce macroanalysis, an approach used to infer the high-level properties of dynamic, distributed systems, and an indispensable tool when faced with tasks where local context and individual component details are insufficient. We present a new methodology, runtime path analysis, where paths are traced through software components and then aggregated...
Crash-only programs crash safely and recover quickly. There is only one way to stop such software---by crashing it---and only one way to bring it up---by initiating recovery. Crash-only systems are built from crash-only components, and the use of transparent component-level retries hides intra-system component crashes from end users. In this paper...
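A minimal sketch of the crash-only discipline, using a subprocess as a stand-in for a crash-only component: the only way to stop it is to kill it, and the only way to start it is the recovery path, so recovery code is exercised on every startup.

```python
import subprocess, sys, time

def start_worker():
    # Startup IS recovery: the worker rebuilds whatever state it needs.
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(2)"])

worker = start_worker()
for _ in range(3):
    time.sleep(1)
    if worker.poll() is not None:        # worker crashed (or exited)
        worker = start_worker()          # bring it back via the recovery path
worker.kill()                            # the one way to stop a crash-only worker
```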
Security is an important and open issue in ubiquitous computing. Recent work has focused either on developing models of security suitable for various ubiquitous computing environments, or on the coupling of context-awareness with traditional security mechanisms to provide context-aware security. In this paper, we examine a specific instance of ubiq...
Even after decades of software engineering research, complex computer systems still fail. This paper makes the case for increasing research emphasis on dependability and, specifically, on improving availability by reducing time-to-recover. All software fails at some point, so systems must be able to recover from failures. Recovery itself can fail to...
This paper demonstrates that the dependability of generic, evolving J2EE applications can be enhanced through a combination of a few recovery-oriented techniques. Our goal is to reduce downtime by automatically and efficiently recovering from a broad class of transient software failures without having to modify applications. We describe here the in...
Automatic failure-path inference (AFPI) is an application-generic, automatic technique for dynamically discovering the failure dependency graphs of componentized Internet applications. AFPI's first phase is invasive, and relies on controlled fault injection to determine failure propagation; this phase requires no a priori knowledge of the applicati...
The paper examines various methods of self-repair system design and describes a low-component-density fault detection method. A computer program utilizing the SPIF method is used to perform the design of monitoring circuitry for several example problems. A self-repairing computer using SPIF monitoring is described to demonstrate the advantages o...
We are rapidly entering a world in which people equip themselves with a small constellation of mobile devices and pass through environments rich in embedded technology. However, we are still a long way from harnessing the power of this technology in a way that seamlessly and invisibly assists users in their day-to-day activities. Bringing pervasive...
To programmatically discover and interact with services in ubiquitous computing environments, an application needs to solve two problems: (1) is it semantically meaningful to interact with a service? If the task is "printing a file", a printer service would be appropriate, but a screen rendering service or CD player service would not. (2) If yes, w...
The dynamism and heterogeneity in ubicomp environments on both short and long time scales implies that middleware platforms for these environments need to be designed ground up for portability, extensibility and robustness. In this paper, we describe how we met these requirements in iROS, a middleware platform for a class of ubicomp environments, t...
The first mass-produced pervasive computing devices are starting to appear---the AutoPC, the Internet-connected ScreenFridge, and the combination Microwave Oven/Home Banking terminal. Although taken separately they appear bizarre, we believe they will play an important role in a world of pervasive computing. Specifically, these devices will accept...