Daniel A. Reed

University of Iowa, Iowa City, Iowa, United States

Publications (182) · 78.44 Total Impact

  • Source
    John Arquilla · Daniel A. Reed

    Preview · Article · Sep 2015 · Communications of the ACM
  • Al Geist · Daniel A. Reed
    ABSTRACT: Commodity clusters revolutionized high-performance computing when they first appeared two decades ago. As scale and complexity have grown, new challenges in reliability and systemic resilience, energy efficiency and optimization, and software complexity have emerged that suggest the need for a re-evaluation of current approaches. This paper reviews the state of the art and reflects on some of the challenges likely to be faced when building trans-petascale computing systems, using insights and perspectives drawn from operational experience and community debates.
    No preview · Article · Aug 2015 · International Journal of High Performance Computing Applications
  • Source
    Daniel A. Reed · Jack Dongarra
    ABSTRACT: Daniel A. Reed and Jack Dongarra state that scientific discovery and engineering innovation require unifying traditionally separated high-performance computing and big data analytics. Big-data machine learning and predictive data analytics have been considered the fourth paradigm of science, allowing researchers to extract insights from both scientific instruments and computational simulations. A rich ecosystem of hardware and software has emerged for big-data analytics, similar to that of high-performance computing.
    Full-text · Article · Jul 2015 · Communications of the ACM
  • Source

    Full-text · Dataset · Dec 2012
  • Source
    Daniel A. Reed · Dennis B. Gannon · James R. Larus
    ABSTRACT: New and compelling ideas are transforming the future of computing, bringing about a plethora of changes that have significant implications for our profession and our society and raising some profound technical questions. This Web extra video interview features Dan Reed of Microsoft giving us a sense of how new cloud architectures and cloud capabilities will begin to move computer science education, research, and thinking in whole new directions.
    Full-text · Article · Jan 2012 · Computer
  • Source
    ABSTRACT: Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta-/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.
    Full-text · Article · Feb 2011 · International Journal of High Performance Computing Applications
  • Roger S. Barga · Dennis Gannon · Daniel A. Reed
    ABSTRACT: Extending the capabilities of PC, Web, and mobile applications through on-demand cloud services will significantly broaden the research community's capabilities, accelerating the pace of engineering and scientific discovery in this age of data-driven research. The net effect will be the democratization of research capabilities that are now available only to the most elite scientists. To make this vision a reality, the computer systems research community must develop new approaches to building client-plus-cloud applications to support a new type of science, and many technical challenges exist.
    No preview · Article · Jan 2011 · IEEE Internet Computing
  • Source
    ABSTRACT: Achieving high performance for distributed I/O on a wide-area network continues to be an elusive holy grail. Despite enhancements in network hardware as well as software stacks, achieving high performance remains a challenge. In this paper, our worldwide team took a completely new and non-traditional approach to distributed I/O, called ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing, which applies application-specific transformations that reduce data to metadata orders of magnitude smaller before performing the actual I/O. Specifically, this paper details our experiences in deploying a large-scale system to facilitate the discovery of missing genes and constructing a genome similarity tree by encapsulating the mpiBLAST sequence-search algorithm into ParaMEDIC. The overall project involved nine computational sites spread across the U.S. and generated more than a petabyte of data that was 'teleported' to a large-scale facility in Tokyo for storage.
    Full-text · Article · Nov 2010 · Concurrency and Computation Practice and Experience
  • Source
    ABSTRACT: Existing supercomputers have hundreds of thousands of processor cores, and future systems may have hundreds of millions. Developers need detailed performance measurements to tune their applications and to exploit these systems fully. However, extreme scales pose unique challenges for performance-tuning tools, which can generate significant volumes of I/O. Compute-to-I/O ratios have increased drastically as systems have grown, and the I/O systems of large machines can handle the peak load from only a small fraction of cores. Tool developers need efficient techniques to analyze and to reduce performance data from large numbers of cores. We introduce CAPEK, a novel parallel clustering algorithm that enables in-situ analysis of performance data at run time. Our algorithm scales sub-linearly to 131,072 processes, running in less than one second even at that scale, which is fast enough for on-line use in production runs. The CAPEK implementation is fully generic and can be used for many types of analysis. We demonstrate its application to statistical trace sampling. Specifically, we use our algorithm to efficiently compute stratified sampling strategies for traces at run time. We show that such stratification can result in data-volume reduction of up to four orders of magnitude on current large-scale systems, with potential for greater reductions for future extreme-scale systems. (See the stratified-sampling sketch following this publication list.)
    Preview · Conference Paper · Jan 2010
  • Source
    ABSTRACT: There has been a recent interest in modularized shipping containers as the building block for data centers. However, there are no published results on the different design tradeoffs it offers. In this paper we investigate a model where such a container is never serviced during its deployment lifetime, say 3 years, for hardware faults. Instead, the hardware is over-provisioned in the beginning and failures are handled gracefully by software. The reasons vary from ease of accounting and management to increased design flexibility owing to its sealed and service-free nature. We present a preliminary model for performance, reliability and cost for such service-less containerized solutions. There are a number of design choices/policies for over-provisioning the containers. For instance, as a function of dead servers and incoming workload we could decide which servers to selectively turn on/off while still maintaining a desired level of performance. While evaluating each such choice is challenging, we demonstrate that arriving at the best and worst-case design is tractable. We further demonstrate that projected lifetimes of these extreme cases are very close (within 10%) to each other. One way to interpret this reliability number is, the utility of keeping machines as cold spares within the container, in anticipation of server failures, is not too different than starting out with all machines active. So as we engineer the containers in sophisticated ways for cost and performance, we can arrive at the associated reliability estimates using a simpler more-tractable approximation. We demonstrate that these bounds are robust to general distributions for failure times of servers. We hope that this paper stirs up a number of research investigations geared towards understanding these next generation data center building blocks. This involves both improving the models and corroborating them with field data.
    Full-text · Article · Jul 2009
  • Lavanya Ramakrishnan · Daniel A. Reed
    ABSTRACT: High performance and distributed computing systems such as petascale, grid and cloud infrastructure are increasingly used for running scientific models and business services. These systems experience large availability variations through hardware and software failures. Resource providers need to account for these variations while providing the required QoS at appropriate costs in dynamic resource and application environments. Although the performance and reliability of these systems have been studied separately, there has been little analysis of the lost Quality of Service (QoS) experienced with varying availability levels. In this paper, we present a resource performability model to estimate lost performance and corresponding cost considerations with varying availability levels. We use the resulting model in a multi-phase planning approach for scheduling a set of deadline-sensitive meteorological workflows atop grid and cloud resources to trade off performance, reliability and cost. We use simulation results driven by failure data collected over the lifetime of high performance systems to demonstrate how the proposed scheme better accounts for resource availability. (See the performability sketch following this publication list.)
    No preview · Article · Jan 2009 · Cluster Computing
  • Source
    Gopi Kandaswamy · Anirban Mandal · Daniel A. Reed
    ABSTRACT: In this paper, we describe the design and implementation of two mechanisms for fault-tolerance and recovery for complex scientific workflows on computational grids. We present our algorithms for over-provisioning and migration, which are our primary strategies for fault-tolerance. We consider application performance models, resource reliability models, network latency and bandwidth, and queue wait times for batch queues on compute resources in determining the correct fault-tolerance strategy. Our goal is to balance reliability and performance in the presence of soft real-time constraints like deadlines and expected success probabilities, and to do it in a way that is transparent to scientists. We have evaluated our strategies by developing a Fault-Tolerance and Recovery (FTR) service and deploying it as a part of the Linked Environments for Atmospheric Discovery (LEAD) production infrastructure. Results from real usage scenarios in LEAD show that the failure rate of individual steps in workflows decreases from about 30% to 5% by using our fault-tolerance strategies.
    Preview · Conference Paper · Jun 2008
  • Source
    T. Gamblin · R. Fowler · D.A. Reed
    ABSTRACT: Emerging petascale systems will have many hundreds of thousands of processors, but traditional task-level tracing tools already fail to scale to much smaller systems because the I/O backbones of these systems cannot handle the peak load offered by their cores. Complete event traces of all processes are thus infeasible. To retain the benefits of detailed performance measurement while reducing volume of collected data, we developed AMPL, a general-purpose toolkit that reduces data volume using stratified sampling. We adopt a scalable sampling strategy, since the sample size required to measure a system varies sub-linearly with process count. By grouping, or stratifying, processes that behave similarly, we can further reduce data overhead while also providing insight into an application's behavior. In this paper, we describe the AMPL toolkit and we report our experiences using it on large-scale scientific applications. We show that AMPL can successfully reduce the overhead of tracing scientific applications by an order of magnitude or more, and we show that our tool scales sub-linearly, so the improvement will be more dramatic on petascale machines. Finally, we illustrate the use of AMPL to monitor applications by performance-equivalent strata, and we show that this technique can allow for further reductions in trace data volume and traced execution time.
    Preview · Conference Paper · May 2008
  • Emma S. Buneci · Daniel A. Reed
    ABSTRACT: Grids promote new modes of scientific collaboration and discovery by connecting distributed instruments, data and computing facilities. Because many resources are shared, application performance can vary widely and unexpectedly. We describe a novel performance analysis framework that reasons temporally and qualitatively about performance data from multiple monitoring levels and sources. The framework periodically analyzes application performance states by generating and interpreting signatures containing structural and temporal features from time-series data. Signatures are compared to expected behaviors and in case of mismatches, the framework hints at causes of degraded performance, based on unexpected behavior characteristics previously learned by application exposure to known performance stress factors. Experiments with two scientific applications reveal signatures that have distinct characteristics during well-performing versus poor-performing executions. The ability to automatically and compactly generate signatures capturing fundamental differences between good and poor application performance states is essential to improving the quality of service for Grid applications.
    No preview · Conference Paper · Jan 2008
  • Source
    ABSTRACT: Good load balance is crucial on very large parallel systems, but the most sophisticated algorithms introduce dynamic imbalances through adaptation in domain decomposition or use of adaptive solvers. To observe and diagnose imbalance, developers need system-wide, temporally-ordered measurements from full-scale runs. This potentially requires data collection from multiple code regions on all processors over the entire execution. Doing this instrumentation naively can, in combination with the application itself, exceed available I/O bandwidth and storage capacity, and can induce severe behavioral perturbations. We present and evaluate a novel technique for scalable, low-error load balance measurement. This uses a parallel wavelet transform and other parallel encoding methods. We show that our technique collects and reconstructs system-wide measurements with low error. Compression time scales sublinearly with system size, and the compressed data volume is several orders of magnitude smaller than the raw data. The overhead is low enough for online use in a production environment. (See the Haar-wavelet sketch following this publication list.)
    Preview · Conference Paper · Jan 2008
  • Source

    Full-text · Conference Paper · Jan 2008
  • Source
    Lavanya Ramakrishnan · Daniel A. Reed
    ABSTRACT: Scientific applications have diverse characteristics and resource requirements. When combined with the complexity of underlying distributed resources on which they execute (e.g. Grid, cloud computing), these applications can experience significant performance fluctuations as machine reliability varies. Although the performance and reliability of cluster and Grid systems have been studied separately, there has been little analysis of the lost Quality of Service (QoS) experienced with varying availability levels. To enable a dynamic environment that can account for such changes while providing required QoS, next generation tools will need extensible application interfaces that allow users to qualitatively express performance and reliability requirements for the underlying systems. In this paper, we use the concept of performability to capture the degraded performance that might result from varying resource availability. We apply the resulting model to workflow planning and fault tolerance strategies. We present experimental data to validate our model and use simulation results driven by failure data from real HPC systems to demonstrate how the proposed scheme better accounts for resource availability.
    Full-text · Conference Paper · Jan 2008
  • Source
    Dennis Gannon · Beth Plale · Daniel A. Reed
    ABSTRACT: An e-Science Grid Gateway is a portal that allows a scientific collaboration to use the resources of a Grid in a way that frees them from the complex details of Grid software and middleware. The goal of such a gateway is to allow the users access to community data and applications that can be used in the language of their science. Each user has a private data and metadata space, access to data provenance and tools to use or compose experimental workflows that combine standard data analysis, simulation and post-processing tools. In this talk we will describe the underlying Grid service architecture for such an eScience gateway. In this paper we will describe some of the challenges that confront the design of Grid Gateways and we will outline a few new research directions.
    Full-text · Conference Paper · Nov 2007
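
Illustrative Code Sketches

The CAPEK and AMPL entries above both rest on stratified sampling of performance data: group processes that behave similarly, then trace only a small sample from each group. The sketch below is a minimal illustration of that idea under stated assumptions, not the papers' algorithms: the sequential k-means is a stand-in for CAPEK's parallel clustering, and the metric names, cluster count, and tracing budget are invented for the example.

```python
# Minimal sketch of stratified trace sampling (illustrative only).
# The k-means below is a sequential stand-in, not CAPEK; metric names,
# cluster count, and the 64-process tracing budget are assumptions.
import numpy as np

def stratify(metrics: np.ndarray, k: int, iters: int = 20) -> np.ndarray:
    """Group processes with similar behavior (simple k-means stand-in)."""
    rng = np.random.default_rng(0)
    centers = metrics[rng.choice(len(metrics), k, replace=False)]
    labels = np.zeros(len(metrics), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((metrics[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = metrics[labels == j].mean(axis=0)
    return labels

def allocate_budget(labels: np.ndarray, values: np.ndarray, budget: int) -> dict:
    """Neyman-style allocation: sample each stratum in proportion to size * std."""
    weights = {int(j): (labels == j).sum() * values[labels == j].std()
               for j in np.unique(labels)}
    total = sum(weights.values()) or 1.0
    return {j: max(1, int(round(budget * w / total))) for j, w in weights.items()}

if __name__ == "__main__":
    # Hypothetical per-process metrics: (mean MPI wait fraction, cache miss rate).
    rng = np.random.default_rng(1)
    metrics = np.vstack([rng.normal(m, 0.05, (4096, 2)) for m in (0.2, 0.5, 0.9)])
    labels = stratify(metrics, k=3)
    print("processes to trace per stratum:",
          allocate_budget(labels, metrics[:, 0], budget=64))
```

In a production tool the clustering and sampling would run in parallel and in situ at run time, as the abstracts describe; the point here is only how a fixed tracing budget can be split across behaviorally similar strata.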
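
The performability entries above (the Cluster Computing article and the 2008 conference paper with Lavanya Ramakrishnan) weight delivered performance by resource availability. The sketch below is one minimal reading of that idea, assuming independent node failures, a linear speedup model, and invented MTBF/MTTR figures; none of these numbers come from the papers.

```python
# Minimal performability sketch: expected performance weighted by the
# probability of each availability state. MTBF/MTTR values, independence,
# and the linear speedup model are illustrative assumptions.
from math import comb

def node_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single node."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def expected_throughput(n_nodes: int, per_node_rate: float, a: float,
                        min_nodes: int) -> float:
    """Expected work rate of an n-node pool, assuming independent failures
    and that the workflow only runs when at least `min_nodes` are up."""
    total = 0.0
    for k in range(min_nodes, n_nodes + 1):
        p_k = comb(n_nodes, k) * a**k * (1 - a)**(n_nodes - k)
        total += p_k * k * per_node_rate        # linear speedup assumption
    return total

if __name__ == "__main__":
    a = node_availability(mtbf_hours=720.0, mttr_hours=8.0)
    ideal = 64 * 10.0                           # 64 nodes, 10 jobs/hour each
    perf = expected_throughput(64, 10.0, a, min_nodes=48)
    print(f"availability={a:.3f}, lost QoS={(1 - perf / ideal):.1%}")
```

The "lost QoS" printed at the end is simply the gap between the availability-weighted throughput and the ideal throughput of a fully available pool.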
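
The load-balance measurement entry above compresses system-wide measurements with a parallel wavelet transform plus further encoding. The sequential Haar sketch below illustrates only the basic transform-threshold-reconstruct step; the parallel transform, the additional encoding stages, and the synthetic load signal here are assumptions, not the paper's implementation.

```python
# Sequential stand-in for wavelet-based compression of per-process load data:
# 1-D Haar transform, keep the largest coefficients, reconstruct.
import numpy as np

def haar_forward(x: np.ndarray) -> np.ndarray:
    """Full 1-D Haar decomposition (length must be a power of two)."""
    out, n = x.astype(float), len(x)
    while n > 1:
        half = n // 2
        a = (out[:n:2] + out[1:n:2]) / np.sqrt(2)   # averages
        d = (out[:n:2] - out[1:n:2]) / np.sqrt(2)   # details
        out[:half], out[half:n] = a, d
        n = half
    return out

def haar_inverse(c: np.ndarray) -> np.ndarray:
    """Invert haar_forward."""
    out, n = c.astype(float), 1
    while n < len(c):
        a, d = out[:n].copy(), out[n:2 * n].copy()
        out[0:2 * n:2] = (a + d) / np.sqrt(2)
        out[1:2 * n:2] = (a - d) / np.sqrt(2)
        n *= 2
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical per-process load values for 4096 processes.
    load = np.sin(np.linspace(0, 8, 4096)) + rng.normal(0, 0.05, 4096)
    coeffs = haar_forward(load)
    keep = 64                                        # retain 64 of 4096 coefficients
    thresh = np.sort(np.abs(coeffs))[-keep]
    sparse = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)
    err = np.abs(haar_inverse(sparse) - load).max()
    print(f"kept {keep} of {len(load)} coefficients, max reconstruction error {err:.3f}")
```

Keeping 64 of 4,096 coefficients corresponds to a 64:1 reduction in this toy example; the paper's technique adds parallel collection and further encoding on top of this basic step.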

Publication Stats

5k Citations
78.44 Total Impact Points

Institutions

  • 2015
    • University of Iowa
      Iowa City, Iowa, United States
    • Oak Ridge National Laboratory
      Oak Ridge, Tennessee, United States
  • 2008-2012
    • Microsoft
      Redmond, Washington, United States
  • 2011
    • The University of Tennessee Medical Center at Knoxville
      Knoxville, Tennessee, United States
  • 2005-2008
    • University of North Carolina at Chapel Hill
      • Renaissance Computing Institute
      North Carolina, United States
  • 2004
    • University of North Carolina at Charlotte
      Charlotte, North Carolina, United States
    • North Carolina State University
      Raleigh, North Carolina, United States
  • 1985-2003
    • University of Illinois, Urbana-Champaign
      • Department of Computer Science
      • Department of Geology
      Urbana, Illinois, United States
  • 2002
    • Urbana University
      Urbana, Illinois, United States
  • 1995
    • Rice University
      • Department of Computer Science
      Houston, Texas, United States
  • 1987
    • Bureau of Materials & Physical Research
      Springfield, Illinois, United States