Daniel A. Reed

Microsoft, Redmond, Washington, United States

Publications (176) · 64.72 Total Impact Points

  • D.A. Reed, D.B. Gannon, J.R. Larus
    ABSTRACT: New and compelling ideas are transforming the future of computing, bringing about a plethora of changes that have significant implications for our profession and our society and raising some profound technical questions. This Web extra video interview features Dan Reed of Microsoft giving us a sense of how new cloud architectures and cloud capabilities will begin to move computer science education, research, and thinking in whole new directions.
    Computer 01/2012; 45(1):25-30. DOI:10.1109/MC.2011.327 · 1.44 Impact Factor
  • ABSTRACT: Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and graphics processing units. This report describes the work of the community to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.
    International Journal of High Performance Computing Applications 01/2011; 25:3-60. DOI:10.1177/1094342010391989 · 1.63 Impact Factor
  • Roger S. Barga, Dennis Gannon, Daniel A. Reed
    ABSTRACT: Extending the capabilities of PC, Web, and mobile applications through on-demand cloud services will significantly broaden the research community's capabilities, accelerating the pace of engineering and scientific discovery in this age of data-driven research. The net effect will be the democratization of research capabilities that are now available only to the most elite scientists. To make this vision a reality, the computer systems research community must develop new approaches to building client-plus-cloud applications to support a new type of science, and many technical challenges exist.
    IEEE Internet Computing 01/2011; 15(1):72-75. DOI:10.1109/MIC.2011.20 · 2.00 Impact Factor
  • ABSTRACT: Achieving high performance for distributed I/O on a wide-area network continues to be an elusive holy grail. Despite enhancements in network hardware as well as software stacks, achieving high performance remains a challenge. In this paper, our worldwide team took a completely new and non-traditional approach to distributed I/O, called ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing, by utilizing application-specific transformation of data to orders-of-magnitude smaller metadata before performing the actual I/O. Specifically, this paper details our experiences in deploying a large-scale system to facilitate the discovery of missing genes and the construction of a genome similarity tree by encapsulating the mpiBLAST sequence-search algorithm into ParaMEDIC. The overall project involved nine computational sites spread across the U.S. and generated more than a petabyte of data that was 'teleported' to a large-scale facility in Tokyo for storage.
    Concurrency and Computation Practice and Experience 11/2010; 22:2266-2281. DOI:10.1002/cpe.1590 · 0.78 Impact Factor
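    The core idea above is to convert bulky application output into compact, application-specific metadata before it crosses the wide-area network, and to regenerate the full output near the storage site. Below is a minimal Python sketch of that pattern only; the record layout and the summarize/regenerate functions are hypothetical stand-ins, not the paper's mpiBLAST integration.

```python
# Illustrative sketch of the ParaMEDIC idea: ship compact metadata instead of
# bulky results, then recreate the full output at the remote storage site.
# The record format and regeneration step are hypothetical, not mpiBLAST's.

from dataclasses import dataclass

@dataclass
class MatchMetadata:
    query_id: str      # which query produced the hit
    db_sequence: str   # which database sequence it hit
    start: int         # offset of the hit in the database sequence
    length: int        # extent of the hit
    score: float       # alignment score

def summarize(full_results):
    """Compute side: reduce verbose alignment output to small metadata records."""
    return [MatchMetadata(r["query"], r["subject"], r["start"], r["len"], r["score"])
            for r in full_results]

def regenerate(metadata, database):
    """Storage side: rebuild the verbose output from metadata plus the local
    copy of the sequence database, avoiding bulk wide-area I/O."""
    out = []
    for m in metadata:
        region = database[m.db_sequence][m.start:m.start + m.length]
        out.append(f"{m.query_id} matches {m.db_sequence} "
                   f"[{m.start}:{m.start + m.length}] score={m.score}\n{region}")
    return "\n".join(out)

if __name__ == "__main__":
    database = {"chr1": "ACGTACGTGGCCTTAACGT"}
    full_results = [{"query": "q1", "subject": "chr1", "start": 4, "len": 6, "score": 42.0}]
    meta = summarize(full_results)      # small: this is what crosses the WAN
    print(regenerate(meta, database))   # large output is recreated remotely
```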
  • ABSTRACT: Existing supercomputers have hundreds of thousands of processor cores, and future systems may have hundreds of millions. Developers need detailed performance measurements to tune their applications and to exploit these systems fully. However, extreme scales pose unique challenges for performance-tuning tools, which can generate significant volumes of I/O. Compute-to-I/O ratios have increased drastically as systems have grown, and the I/O systems of large machines can handle the peak load from only a small fraction of cores. Tool developers need efficient techniques to analyze and to reduce performance data from large numbers of cores. We introduce CAPEK, a novel parallel clustering algorithm that enables in-situ analysis of performance data at run time. Our algorithm scales sub-linearly to 131,072 processes, running in less than one second even at that scale, which is fast enough for on-line use in production runs. The CAPEK implementation is fully generic and can be used for many types of analysis. We demonstrate its application to statistical trace sampling. Specifically, we use our algorithm to efficiently compute stratified sampling strategies for traces at run time. We show that such stratification can result in data-volume reduction of up to four orders of magnitude on current large-scale systems, with potential for greater reductions on future extreme-scale systems.
    Proceedings of the 24th International Conference on Supercomputing, 2010, Tsukuba, Ibaraki, Japan, June 2-4, 2010; 01/2010
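    A rough sketch of how clustering can drive stratified trace sampling, in the spirit of the abstract above: group processes by similarity of their per-process metrics, then trace only a few representatives per group. scikit-learn's KMeans stands in for CAPEK's parallel clustering, and Neyman-style allocation is one plausible way to size each stratum; both are assumptions, not the paper's algorithm.

```python
# Cluster processes by behavior, then sample a few per cluster for detailed
# tracing. KMeans is a stand-in for CAPEK; Neyman allocation sizes strata.

import numpy as np
from sklearn.cluster import KMeans

def stratified_sample(metrics, n_clusters=4, total_samples=32, seed=0):
    """Group processes with similar metric vectors, then trace a few per group."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(metrics)

    # Neyman-style allocation: sample more from large, high-variance strata.
    weights = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        spread = metrics[members].std() if len(members) > 1 else 0.0
        weights.append(len(members) * max(spread, 1e-9))
    weights = np.array(weights) / sum(weights)

    chosen = []
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        k = min(len(members), max(1, int(round(weights[c] * total_samples))))
        chosen.extend(rng.choice(members, size=k, replace=False))
    return sorted(chosen)   # ranks that would emit detailed traces

if __name__ == "__main__":
    fake_metrics = np.random.default_rng(1).normal(size=(1024, 8))  # 1024 "processes"
    print(stratified_sample(fake_metrics)[:10])
```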
  • ABSTRACT: There has been recent interest in modularized shipping containers as the building block for data centers. However, there are no published results on the different design tradeoffs they offer. In this paper we investigate a model where such a container is never serviced for hardware faults during its deployment lifetime, say 3 years. Instead, the hardware is over-provisioned in the beginning and failures are handled gracefully by software. The reasons vary from ease of accounting and management to increased design flexibility owing to its sealed and service-free nature. We present a preliminary model for performance, reliability and cost for such service-less containerized solutions. There are a number of design choices/policies for over-provisioning the containers. For instance, as a function of dead servers and incoming workload, we could decide which servers to selectively turn on or off while still maintaining a desired level of performance. While evaluating each such choice is challenging, we demonstrate that arriving at the best- and worst-case designs is tractable. We further demonstrate that the projected lifetimes of these extreme cases are very close (within 10%) to each other. One way to interpret this reliability number is that the utility of keeping machines as cold spares within the container, in anticipation of server failures, is not too different from starting out with all machines active. So as we engineer the containers in sophisticated ways for cost and performance, we can arrive at the associated reliability estimates using a simpler, more tractable approximation. We demonstrate that these bounds are robust to general distributions for failure times of servers. We hope that this paper stirs up a number of research investigations geared towards understanding these next-generation data center building blocks. This involves both improving the models and corroborating them with field data.
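    A toy Monte Carlo version of the comparison discussed above: estimate how long a sealed container can sustain a required number of live servers under an all-active policy versus a cold-spare policy. The exponential failure times, spare-activation rule, and all numbers are illustrative assumptions, not the paper's model.

```python
# Monte Carlo sketch of container "lifetime" (time until live servers drop
# below the performance threshold), comparing two over-provisioning policies.

import heapq
import random

def lifetime_all_active(n_servers, n_required, mttf, rng):
    """Policy A: every server is powered on from day one, no repairs."""
    failures = sorted(rng.expovariate(1.0 / mttf) for _ in range(n_servers))
    # Lifetime ends at the failure that drops the live count below n_required.
    return failures[n_servers - n_required]

def lifetime_cold_spares(n_servers, n_required, mttf, rng):
    """Policy B: keep only n_required servers on; cold spares cannot fail until activated."""
    spares = n_servers - n_required
    active = [rng.expovariate(1.0 / mttf) for _ in range(n_required)]
    heapq.heapify(active)
    while spares > 0:
        now = heapq.heappop(active)                                 # next failure
        heapq.heappush(active, now + rng.expovariate(1.0 / mttf))   # activate a spare
        spares -= 1
    return heapq.heappop(active)   # no spare left: the next failure ends the lifetime

if __name__ == "__main__":
    rng, trials = random.Random(0), 2000
    a = sum(lifetime_all_active(120, 100, 3.0, rng) for _ in range(trials)) / trials
    b = sum(lifetime_cold_spares(120, 100, 3.0, rng) for _ in range(trials)) / trials
    print(f"mean lifetime (years): all-active={a:.2f}, cold-spares={b:.2f}")
```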
  • Lavanya Ramakrishnan, Daniel A. Reed
    ABSTRACT: High-performance and distributed computing systems such as petascale, grid and cloud infrastructures are increasingly used for running scientific models and business services. These systems experience large availability variations through hardware and software failures. Resource providers need to account for these variations while providing the required QoS at appropriate costs in dynamic resource and application environments. Although the performance and reliability of these systems have been studied separately, there has been little analysis of the lost Quality of Service (QoS) experienced with varying availability levels. In this paper, we present a resource performability model to estimate lost performance and corresponding cost considerations with varying availability levels. We use the resulting model in a multi-phase planning approach for scheduling a set of deadline-sensitive meteorological workflows atop grid and cloud resources to trade off performance, reliability and cost. We use simulation results driven by failure data collected over the lifetime of high-performance systems to demonstrate how the proposed scheme better accounts for resource availability.
    Cluster Computing 01/2009; 12(3):1-14. DOI:10.1007/s10586-009-0078-y · 0.95 Impact Factor
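    Performability, as used above, weights the performance delivered in each availability state by the probability of being in that state. A minimal worked example follows, with made-up states and throughputs (not the paper's data).

```python
# Minimal performability sketch: expected delivered performance is the
# state-probability-weighted sum of per-state throughput.

def performability(states):
    """states: list of (probability, delivered_throughput) pairs."""
    assert abs(sum(p for p, _ in states) - 1.0) < 1e-9
    return sum(p * r for p, r in states)

# Example: a resource pool that is fully up 92% of the time, degraded 6%,
# and effectively down 2% of the time.
states = [(0.92, 100.0),   # all nodes available: 100 jobs/hour
          (0.06, 60.0),    # partial outage: 60 jobs/hour
          (0.02, 0.0)]     # outage: no useful work

expected = performability(states)
lost = 100.0 - expected    # "lost QoS" relative to the ideal, fully available system
print(f"expected throughput = {expected:.1f} jobs/h, lost = {lost:.1f} jobs/h")
```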
  • G. Kandaswamy, A. Mandal, D.A. Reed
    ABSTRACT: In this paper, we describe the design and implementation of two mechanisms for fault-tolerance and recovery for complex scientific workflows on computational grids. We present our algorithms for over-provisioning and migration, which are our primary strategies for fault-tolerance. We consider application performance models, resource reliability models, network latency and bandwidth and queue wait times for batch-queues on compute resources for determining the correct fault-tolerance strategy. Our goal is to balance reliability and performance in the presence of soft real-time constraints like deadlines and expected success probabilities, and to do it in a way that is transparent to scientists. We have evaluated our strategies by developing a Fault-Tolerance and Recovery (FTR) service and deploying it as a part of the Linked Environments for Atmospheric Discovery (LEAD) production infrastructure. Results from real usage scenarios in LEAD show that the failure rate of individual steps in workflows decreases from about 30% to 5% by using our fault-tolerance strategies.
    Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on; 06/2008
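    One way to picture the over-provisioning strategy described above: replicate a workflow step across enough independent resources that the combined success probability meets a target before the deadline. The candidate resources, the independence assumption, and the selection heuristic below are illustrative, not the FTR service's actual models.

```python
# Sketch of an over-provisioning decision: pick on-time resources until the
# combined probability that at least one copy succeeds reaches the target.

def replicas_needed(candidates, target_success, deadline):
    """candidates: dicts with predicted 'runtime', 'queue_wait', 'p_success'.
    Returns a set of on-time resources meeting the success target, or None."""
    on_time = [c for c in candidates if c["queue_wait"] + c["runtime"] <= deadline]
    on_time.sort(key=lambda c: c["p_success"], reverse=True)
    chosen, p_fail = [], 1.0
    for c in on_time:
        chosen.append(c["name"])
        p_fail *= (1.0 - c["p_success"])        # all chosen copies fail independently
        if 1.0 - p_fail >= target_success:
            return chosen
    return None   # target unreachable; a migration/fallback policy would kick in

if __name__ == "__main__":
    candidates = [
        {"name": "clusterA", "queue_wait": 10, "runtime": 40, "p_success": 0.80},
        {"name": "clusterB", "queue_wait": 30, "runtime": 35, "p_success": 0.70},
        {"name": "clusterC", "queue_wait": 5,  "runtime": 90, "p_success": 0.95},
    ]
    print(replicas_needed(candidates, target_success=0.9, deadline=80))
```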
  • T. Gamblin, R. Fowler, D.A. Reed
    ABSTRACT: Emerging petascale systems will have many hundreds of thousands of processors, but traditional task-level tracing tools already fail to scale to much smaller systems because the I/O backbones of these systems cannot handle the peak load offered by their cores. Complete event traces of all processes are thus infeasible. To retain the benefits of detailed performance measurement while reducing volume of collected data, we developed AMPL, a general-purpose toolkit that reduces data volume using stratified sampling. We adopt a scalable sampling strategy, since the sample size required to measure a system varies sub-linearly with process count. By grouping, or stratifying, processes that behave similarly, we can further reduce data overhead while also providing insight into an application's behavior. In this paper, we describe the AMPL toolkit and we report our experiences using it on large-scale scientific applications. We show that AMPL can successfully reduce the overhead of tracing scientific applications by an order of magnitude or more, and we show that our tool scales sub-linearly, so the improvement will be more dramatic on petascale machines. Finally, we illustrate the use of AMPL to monitor applications by performance-equivalent strata, and we show that this technique can allow for further reductions in trace data volume and traced execution time.
    Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on; 05/2008
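    The claim that the required sample size grows sub-linearly with process count follows from standard survey-sampling arithmetic; the sketch below uses the textbook sample-size formula with a finite-population correction (the variance and error targets are made-up values, not AMPL's defaults).

```python
# Required sample size plateaus as the process count grows, which is why
# sampled monitoring scales: n0 = (z*s/e)^2, corrected for finite population.

from math import ceil

def required_sample(n_processes, stddev, error, z=1.96):
    """Processes to trace so the mean metric is within `error` at ~95% confidence."""
    n0 = (z * stddev / error) ** 2                       # infinite-population size
    return ceil(n0 / (1.0 + (n0 - 1.0) / n_processes))   # finite-population correction

for n in (1_000, 10_000, 100_000, 1_000_000):
    print(n, required_sample(n, stddev=15.0, error=1.0))
```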
  • Emma S. Buneci, Daniel A. Reed
    ABSTRACT: Grids promote new modes of scientific collaboration and discovery by connecting distributed instruments, data and computing facilities. Because many resources are shared, application performance can vary widely and unexpectedly. We describe a novel performance analysis framework that reasons temporally and qualitatively about performance data from multiple monitoring levels and sources. The framework periodically analyzes application performance states by generating and interpreting signatures containing structural and temporal features from time-series data. Signatures are compared to expected behaviors and in case of mismatches, the framework hints at causes of degraded performance, based on unexpected behavior characteristics previously learned by application exposure to known performance stress factors. Experiments with two scientific applications reveal signatures that have distinct characteristics during well-performing versus poor-performing executions. The ability to automatically and compactly generate signatures capturing fundamental differences between good and poor application performance states is essential to improving the quality of service for Grid applications.
    Proceedings of the ACM/IEEE Conference on High Performance Computing, SC 2008, November 15-21, 2008, Austin, Texas, USA; 01/2008
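    A toy illustration of signature-based diagnosis in the spirit of the framework above: summarize a metric time series as a short string of qualitative per-segment trends and flag mismatches against the expected signature. The specific features (fitted slopes per segment) are an assumption for illustration, not the framework's feature set.

```python
# Qualitative temporal "signature": label each segment of a metric series as
# up/down/flat and compare against the expected behavior.

import numpy as np

def signature(series, segments=4, flat_tol=0.01):
    """Label each segment 'u' (up), 'd' (down), or 'f' (flat) by its fitted slope."""
    labels = []
    for chunk in np.array_split(np.asarray(series, dtype=float), segments):
        x = np.arange(len(chunk))
        slope = np.polyfit(x, chunk, 1)[0] / (abs(chunk.mean()) + 1e-9)
        labels.append("u" if slope > flat_tol else "d" if slope < -flat_tol else "f")
    return "".join(labels)

def diagnose(observed, expected):
    sig = signature(observed)
    return "ok" if sig == expected else f"mismatch: got {sig}, expected {expected}"

if __name__ == "__main__":
    healthy = [100 + i * 0.01 for i in range(200)]                 # steady throughput
    degraded = [100 - max(0, i - 100) * 1.0 for i in range(200)]   # late slowdown
    print(diagnose(healthy, "ffff"))
    print(diagnose(degraded, "ffff"))
```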
  • ABSTRACT: Good load balance is crucial on very large parallel systems, but the most sophisticated algorithms introduce dynamic imbalances through adaptation in domain decomposition or use of adaptive solvers. To observe and diagnose imbalance, developers need system-wide, temporally-ordered measurements from full-scale runs. This potentially requires data collection from multiple code regions on all processors over the entire execution. Doing this instrumentation naively can, in combination with the application itself, exceed available I/O bandwidth and storage capacity, and can induce severe behavioral perturbations. We present and evaluate a novel technique for scalable, low-error load balance measurement. This uses a parallel wavelet transform and other parallel encoding methods. We show that our technique collects and reconstructs system-wide measurements with low error. Compression time scales sublinearly with system size and data volume is several orders of magnitude smaller than the raw data. The overhead is low enough for online use in a production environment.
    Proceedings of the ACM/IEEE Conference on High Performance Computing, SC 2008, November 15-21, 2008, Austin, Texas, USA; 01/2008
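    A single-node sketch of the compression idea: apply a Haar wavelet transform to a per-process load vector, keep only the largest coefficients, and reconstruct. The parallel transform and the encoders used in the paper are replaced here by a plain in-memory Haar transform, and the data are synthetic.

```python
# Haar-wavelet compression of a load vector: transform, drop small
# coefficients, reconstruct, and report the error.

import numpy as np

def haar_forward(x):
    x = np.asarray(x, dtype=float)          # length must be a power of two here
    coeffs = []
    while len(x) > 1:
        coeffs.append((x[0::2] - x[1::2]) / np.sqrt(2.0))   # detail coefficients
        x = (x[0::2] + x[1::2]) / np.sqrt(2.0)              # running approximation
    coeffs.append(x)
    return coeffs

def haar_inverse(coeffs):
    x = coeffs[-1]
    for diff in reversed(coeffs[:-1]):
        out = np.empty(2 * len(x))
        out[0::2] = (x + diff) / np.sqrt(2.0)
        out[1::2] = (x - diff) / np.sqrt(2.0)
        x = out
    return x

def compress(load, keep_fraction=0.05):
    coeffs = haar_forward(load)
    cutoff = np.quantile(np.abs(np.concatenate(coeffs)), 1.0 - keep_fraction)
    return [np.where(np.abs(c) >= cutoff, c, 0.0) for c in coeffs]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    load = 1.0 + 0.01 * rng.standard_normal(4096)   # mostly balanced load...
    load[1024:1040] += 0.5                          # ...with one hot region
    recon = haar_inverse(compress(load))
    print(f"max reconstruction error with 5% of coefficients: "
          f"{np.max(np.abs(recon - load)):.4f}")
```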
  • Proceedings of the 23rd Annual International Supercomputing Conference (ISC); 01/2008
  • Lavanya Ramakrishnan, Daniel A. Reed
    ABSTRACT: Scientific applications have diverse characteristics and resource requirements. When combined with the complexity of underlying distributed resources on which they execute (e.g. Grid, cloud computing), these applications can experience significant performance fluctuations as machine reliability varies. Although the performance and reliability of cluster and Grid systems have been studied separately, there has been little analysis of the lost Quality of Service (QoS) experienced with varying availability levels. To enable a dynamic environment that can account for such changes while providing required QoS, next generation tools will need extensible application interfaces that allow users to qualitatively express performance and reliability requirements for the underlying systems. In this paper, we use the concept of performability to capture the degraded performance that might result from varying resource availability. We apply the resulting model to workflow planning and fault tolerance strategies. We present experimental data to validate our model and use simulation results driven by failure data from real HPC systems to demonstrate how the proposed scheme better accounts for resource availability.
    Proceedings of the 17th International Symposium on High-Performance Distributed Computing (HPDC-17 2008), 23-27 June 2008, Boston, MA, USA; 01/2008
  • ABSTRACT: We report on some of the interactions between two SciDAC projects: the National Computational Infrastructure for Lattice Gauge Theory (USQCD) and the Performance Engineering Research Institute (PERI). Many modern scientific programs consistently report the need for faster computational resources to maintain global competitiveness. However, as the size and complexity of emerging high-end computing (HEC) systems continue to rise, achieving good performance on such systems is becoming ever more challenging. In order to take full advantage of the resources, it is crucial to understand the characteristics of relevant scientific applications and the systems these applications are running on. Using tools developed under PERI and by other performance measurement researchers, we studied the performance of two applications, MILC and Chroma, on several high-performance computing systems at DOE laboratories. In the case of Chroma, we discuss how the use of C++ and modern software engineering and programming methods are driving the evolution of performance tools.
    Journal of Physics Conference Series 08/2007; 78(1):012083. DOI:10.1088/1742-6596/78/1/012083
  • Nancy Tran, Daniel A. Reed
    ABSTRACT: This study examined the interplay among processor speed, cluster interconnect and file I/O, using parallel applications to quantify interactions. We focused on a common case where multiple compute nodes communicate with a single master node for file accesses. We constructed a predictive model that used time characteristics critical for application performance to estimate the number of nodes beyond which further performance improvement became unattainable. Predictions were experimentally validated with NAMD [12, 14], a representative parallel application designed for molecular dynamics simulation. Such predictions can help guide decision making to improve machine allocations for parallel codes in large clusters.
    21st International Parallel and Distributed Processing Symposium (IPDPS 2007), Proceedings, 26-30 March 2007, Long Beach, California, USA; 01/2007
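    A back-of-the-envelope version of the kind of model the abstract describes: total time as parallel compute time plus file I/O serialized through a single master node plus per-node coordination overhead, so that beyond some node count adding nodes stops helping. All constants below are made up for illustration.

```python
# Simple scaling model: runtime(N) = compute/N + fixed master-node I/O + c*N.
# The "knee" is the node count where doubling nodes no longer pays off.

def predicted_time(nodes, compute_work=3600.0, node_speed=1.0,
                   io_volume_gb=200.0, master_bw_gbs=1.0, per_node_overhead=0.05):
    compute = compute_work / (nodes * node_speed)   # perfectly parallel part
    io = io_volume_gb / master_bw_gbs               # serialized at the master node
    coordination = per_node_overhead * nodes        # grows with the node count
    return compute + io + coordination

def knee(max_nodes=4096, tolerance=0.01):
    """Smallest node count where doubling nodes improves runtime by < tolerance."""
    n = 1
    while n * 2 <= max_nodes:
        t_now, t_next = predicted_time(n), predicted_time(2 * n)
        if (t_now - t_next) / t_now < tolerance:
            return n
        n *= 2
    return max_nodes

if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        print(f"{n:5d} nodes -> {predicted_time(n):7.1f} s")
    print("diminishing returns beyond ~", knee(), "nodes")
```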
  • Dennis Gannon, Beth Plale, Daniel A. Reed
    ABSTRACT: An e-Science Grid Gateway is a portal that allows a scientific collaboration to use the resources of a Grid in a way that frees them from the complex details of Grid software and middleware. The goal of such a gateway is to allow users access to community data and applications that can be used in the language of their science. Each user has a private data and metadata space, access to data provenance, and tools to use or compose experimental workflows that combine standard data analysis, simulation and post-processing tools. In this talk we describe the underlying Grid service architecture for such an e-Science gateway. In this paper we describe some of the challenges that confront the design of Grid Gateways and outline a few new research directions.
    On the Move to Meaningful Internet Systems 2007: CoopIS, DOA, ODBASE, GADA, and IS, OTM Confederated International Conferences CoopIS, DOA, ODBASE, GADA, and IS 2007, Vilamoura, Portugal, November 25-30, 2007, Proceedings, Part II; 01/2007
  • ABSTRACT: The explosive growth of biological and biomedical research requires a new set of software tools and a computational environment that includes high-performance computing and large-scale data management. These software environments need to be easy to use and scalable to support the diverse requirements of educators and researchers. This paper describes the design philosophy and architecture of a bioinformatics portal that operates atop standard Grid infrastructure and tools such as the Open Grid Computing Environment (OGCE) and the Globus Toolkit. The Bioportal integrates domain-specific tools and standard community tools to provide an integrated, collaborative environment. We also discuss our experiences with deploying the portal for educational and research users in North Carolina and as part of the NSF TeraGrid.

Publication Stats

5k Citations
64.72 Total Impact Points

Institutions

  • 2008–2012
    • Microsoft
      Redmond, Washington, United States
  • 2005–2008
    • University of North Carolina at Chapel Hill
      • Renaissance Computing Institute
      Chapel Hill, North Carolina, United States
  • 2004–2005
    • University of North Carolina at Charlotte
      Charlotte, North Carolina, United States
  • 1987–2005
    • Bureau of Materials & Physical Research
      Springfield, Illinois, United States
  • 1985–2004
    • University of Illinois, Urbana-Champaign
      • National Center for Supercomputing Applications
      • Department of Computer Science
      • Department of Geology
      Urbana, Illinois, United States
  • 2002
    • University of California, Santa Cruz
      • Department of Computer Engineering
      Santa Cruz, California, United States
  • 1995
    • Rice University
      • Department of Computer Science
      Houston, Texas, United States