Scott B. Baden

University of California, San Diego, San Diego, California, United States

Publications (104) · 35.86 total impact


  • No preview · Conference Paper · Oct 2015
  • ABSTRACT: A recent trend in modern high-performance computing environments is the introduction of powerful, energy-efficient hardware accelerators such as GPUs and Xeon Phi coprocessors. These specialized computing devices coexist with CPUs and are optimized for highly parallel applications. In regular computing-intensive applications with predictable data access patterns, these devices often far outperform CPUs and thus relegate the latter to pure control functions instead of computations. For irregular applications, however, the performance gap can be much smaller and is sometimes even reversed. Thus, maximizing the overall performance on heterogeneous systems requires making full use of all available computational resources, including both accelerators and CPUs.
    No preview · Article · Jul 2015 · IEEE Micro
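The "use all resources" idea above can be sketched in a few lines: split the work between CPU and accelerator in proportion to their measured throughput. This is a hypothetical illustration, not the paper's actual partitioning scheme; the function name and measurements are invented for the example.

```python
def partition(total_items, throughputs):
    """Split work among devices proportionally to measured throughput.

    throughputs: items/sec per device (hypothetical measurements).
    Remainders from integer truncation go to the fastest device, so
    every item is assigned exactly once.
    """
    total_rate = sum(throughputs)
    shares = [int(total_items * r / total_rate) for r in throughputs]
    shares[throughputs.index(max(throughputs))] += total_items - sum(shares)
    return shares

# Example: a GPU measured 3x faster than the CPU on a regular kernel.
print(partition(1000, [250.0, 750.0]))  # [250, 750]
```

For irregular applications, where the CPU/GPU gap narrows or reverses, the measured throughputs (and hence the split) can look very different from the regular case.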
  • Source
    Mohammed Sourouri · Tor Gillberg · Scott B. Baden · Xing Cai
    ABSTRACT: In the context of multiple GPUs that share the same PCIe bus, we propose a new communication scheme that leads to a more effective overlap of communication and computation. Multiple CUDA streams and OpenMP threads are adopted so that data can simultaneously be sent and received. A representative 3D stencil example is used to demonstrate the effectiveness of our scheme. We compare the performance of our new scheme with an MPI-based state-of-the-art scheme. Results show that our approach outperforms the state-of-the-art scheme, being up to 1.85× faster. However, our performance results also indicate that the current underlying PCIe bus architecture needs improvements to handle the future scenario of many GPUs per node.
    Full-text · Conference Paper · Dec 2014
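The overlap pattern described above can be sketched in pure Python, with a thread standing in for the asynchronous CUDA-stream transfer: the interior update proceeds while the halo exchange is in flight, and the boundary update waits for the transfer to complete. The function names and the trivial "+1" update are illustrative only.

```python
import threading
import time

def halo_exchange(send_buf, recv_buf):
    # Stand-in for a PCIe transfer: copy with an artificial delay.
    time.sleep(0.01)
    recv_buf[:] = send_buf

def overlapped_step(interior, boundary, neighbor_boundary):
    """Update the interior while the halo exchange is in flight,
    then finish the boundary once the transfer completes."""
    recv = [0] * len(neighbor_boundary)
    t = threading.Thread(target=halo_exchange, args=(neighbor_boundary, recv))
    t.start()
    new_interior = [x + 1 for x in interior]   # independent of the halo
    t.join()                                    # wait for the "transfer"
    new_boundary = [b + h for b, h in zip(boundary, recv)]
    return new_interior, new_boundary
```

The real scheme uses multiple CUDA streams and OpenMP threads so sends and receives proceed simultaneously; the sketch only captures the compute/transfer overlap.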
  • Han Suk Kim · Didem Unat · Scott B. Baden · Juergen P. Schulze
    ABSTRACT: Automatic viewpoint selection algorithms try to optimize the view of a data set to best show its features. They are often based on information theoretic frameworks. Although many algorithms have shown useful results, they often take several seconds to produce a result because they render the scene from a variety of viewpoints and analyze the result. In this article, we propose a new algorithm for volume data sets that dramatically reduces the running time. Our entire algorithm takes less than a second, which allows it to be integrated into real-time volume-rendering applications. The interactive performance is achieved by solving a maximization problem with a small sample of the data set, instead of rendering it from a variety of directions. We compare performance results of our algorithm to state-of-the-art approaches and show that our algorithm achieves comparable results for the resulting viewpoints. Furthermore, we apply our algorithm to multichannel volume data sets.
    No preview · Article · Jul 2013 · Information Visualization
  • ABSTRACT: Face detection is a key component in applications such as security surveillance and human–computer interaction systems, and real-time recognition is essential in many scenarios. The Viola–Jones algorithm is an attractive means of meeting the real time requirement, and has been widely implemented on custom hardware, FPGAs and GPUs. We demonstrate a GPU implementation that achieves competitive performance, but with low development costs. Our solution treats the irregularity inherent to the algorithm using a novel dynamic warp scheduling approach that eliminates thread divergence. This new scheme also employs a thread pool mechanism, which significantly alleviates the cost of creating, switching, and terminating threads. Compared to static thread scheduling, our dynamic warp scheduling approach reduces the execution time by a factor of 3. To maximize detection throughput, we also run on multiple GPUs, realizing 95.6 FPS on 5 Fermi GPUs.
    No preview · Article · May 2013 · Journal of Parallel and Distributed Computing
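The divergence problem the scheduler addresses can be illustrated in simplified, sequential form (this is not the paper's CUDA code): between cascade stages, surviving candidate windows are compacted into a dense list, so no worker ever evaluates a window that an earlier stage already rejected. The stage predicates below are invented placeholders.

```python
def cascade(windows, stages):
    """Run candidate windows through a cascade of classifier stages.

    Instead of keeping every window and flagging rejected ones (which
    on a GPU leaves diverged, idle threads in each warp), each stage
    compacts the list so the next stage only sees survivors.
    """
    survivors = list(windows)
    for stage in stages:
        survivors = [w for w in survivors if stage(w)]
    return survivors

# Hypothetical stages: cheap tests first, expensive ones later.
stages = [lambda w: w % 2 == 0, lambda w: w > 4]
print(cascade(range(10), stages))  # [6, 8]
```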
  • Source
    Alden King · Scott Baden
    ABSTRACT: Object oriented application libraries targeted to a specific application domain are an attractive means of reducing the software development time for sophisticated high performance applications. However, libraries can have the drawback of high abstraction penalties. We describe a domain specific, source-to-source translator that eliminates abstraction penalties in an array class library used to analyze turbulent flow simulation data. Our translator effectively flattens the abstractions, yielding performance within 75% of C code that uses primitive C arrays and no user-defined abstractions.
    Preview · Article · Dec 2012 · Procedia Computer Science
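To make "flattening the abstractions" concrete, here is a toy source rewrite in the same spirit: an array-class accessor call is replaced by raw flat indexing, removing the method-call overhead. The `get` method name, the `_data` suffix, and the row-major layout are all invented for illustration; they are not the library's actual API.

```python
import re

def flatten_get(source, ncols):
    """Rewrite 'name.get(i, j)' into flat raw-array indexing
    'name_data[(i)*NCOLS + (j)]' -- a toy version of the kind of
    rewrite a source-to-source translator performs when it flattens
    an array-class abstraction into primitive arrays."""
    pattern = re.compile(r"(\w+)\.get\(\s*([^,]+?)\s*,\s*([^)]+?)\s*\)")
    return pattern.sub(rf"\1_data[(\2)*{ncols} + (\3)]", source)

print(flatten_get("s = s + grid.get(i, j+1)", 64))
# s = s + grid_data[(i)*64 + (j+1)]
```

A real translator works on an AST rather than regexes, of course; the point is only that after the rewrite the compiler sees plain pointer arithmetic it can optimize.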
  • Tan Nguyen · Pietro Cicotti · Eric Bylaska · Dan Quinlan · Scott B. Baden
    ABSTRACT: We present Bamboo, a custom source-to-source translator that transforms MPI C source into a data-driven form that automatically overlaps communication with available computation. Running on up to 98,304 processors of NERSC's Hopper system, we observe that Bamboo's overlap capability speeds up MPI implementations of a 3D Jacobi iterative solver and Cannon's matrix multiplication. Bamboo's generated code meets or exceeds the performance of hand-optimized MPI, which includes split-phase coding, the method classically employed to hide communication. We achieved our results with only modest amounts of programmer annotation and no intrusive reprogramming of the original application source.
    No preview · Conference Paper · Nov 2012
  • Didem Unat · Jun Zhou · Yifeng Cui · Scott B. Baden · Xing Cai
    ABSTRACT: GPUs provide impressive computing power, but GPU programming can be challenging. Here, an experience in porting real-world earthquake code to Nvidia GPUs is described. Specifically, an annotation-based programming model, called Mint, and its accompanying source-to-source translator are used to automatically generate CUDA source code and simplify the exploration of performance tradeoffs.
    No preview · Article · May 2012 · Computing in Science and Engineering
  • Source
    G T Balls · S B Baden · T M Bartol · T J Sejnowski
    ABSTRACT: We describe an experimentation environment that enables large-scale numerical simulations of neural microphysiology to be fed back onto living neurons in vitro via dynamic whole-cell patch clamping, in effect making living neurons and simulated neurons part of the same neural circuit. Owing to high computational demands, the experimental testbed will be dispersed over a local area network comprising several high-performance computing resources. Parallel execution, including feedback between the simulation components, will be managed by Tarragon, a programming model and runtime library that supports asynchronous data-driven execution. Tarragon's execution model matches the underlying dynamics of Monte Carlo simulation of diffusive processes, and it masks the long network latencies entailed in coupled dispersed simulations. We discuss Tarragon and show how its data-driven execution model can be used to dynamically feed back the results of a neural circuit simulation onto living cells in order to better understand the underlying signaling pathways between and within living cells.
    Full-text · Article · Apr 2012
  • Source
    ABSTRACT: Modern large-scale scientific computation problems must execute in a parallel computational environment to achieve acceptable performance. Target parallel environments range from the largest tightly coupled supercomputers to heterogeneous clusters of workstations. Grid technologies make Internet execution more likely. Hierarchical and heterogeneous systems are increasingly common. Processing and communication capabilities can be nonuniform, non-dedicated, transient, or unreliable. Even when targeting homogeneous computing environments, each environment may differ in the number of processors per node, the relative costs of computation, communication, and memory access, and the availability of programming paradigms and software tools. Architecture-aware computation requires knowledge of the computing environment and software performance characteristics, and tools to make use of this knowledge. These challenges may be addressed by compilers, low-level tools, dynamic load balancing or solution procedures, middleware layers, high-level software development techniques, and the choice of programming languages and paradigms. Computation and communication may be reordered. Data or computation may be replicated, or a load imbalance may be tolerated to avoid costly communication. This paper samples a variety of approaches to architecture-aware parallel computation. Key words: architecture-aware computing, cluster computing, grid computing.
    Full-text · Article · Mar 2012
  • Source
    Han Suk Kim · Didem Unat · Scott B. Baden · Jürgen P. Schulze
    ABSTRACT: We propose a new algorithm for automatic viewpoint selection for volume data sets. While most previous algorithms depend on information-theoretic frameworks, our algorithm focuses solely on the data itself, without off-line rendering steps, and finds a view direction that shows the data set's features well. The algorithm consists of two main steps: feature selection and viewpoint selection. The feature selection step is an extension of the 2D Harris interest point detection algorithm. This step selects corner and/or high-intensity points as features, which captures the overall structures and local details. The second step, viewpoint selection, takes this set and finds a direction that lays out those points such that the variance of the projected points is maximized, which can be formulated as a Principal Component Analysis (PCA) problem. The PCA solution guarantees that surfaces with detected corner points are less likely to be degenerate, and it minimizes occlusion between them. Our entire algorithm takes less than a second, which allows it to be integrated into real-time volume rendering applications where users can modify the volume with transfer functions, because the optimized viewpoint depends on the transfer function.
    Preview · Article · Jan 2012 · Proceedings of SPIE - The International Society for Optical Engineering
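The PCA step above can be sketched without any rendering at all: find the direction of maximum variance of the feature-point cloud by power iteration on its 3x3 covariance matrix. This is a minimal sketch of the variance-maximization idea, not the paper's implementation, and it omits the Harris-based feature selection.

```python
def principal_direction(points, iters=100):
    """Return the direction of maximum variance of a 3D point cloud
    (the first principal component), found by power iteration on the
    3x3 covariance matrix. Viewing along a direction orthogonal to
    this axis spreads the projected feature points out most widely."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n
            for j in range(3)] for i in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(3)) for i in range(3)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

# A cloud stretched along x: the principal axis should be ~(+-1, 0, 0).
axis = principal_direction([(-3, 0, 0), (3, 0, 0), (0, 1, 0), (0, -1, 0)])
```

Because only a small sample of feature points enters the computation, this step runs in well under a second, which is what makes the interactive use case feasible.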
  • Source
    ABSTRACT: Semi-local functionals commonly used in density functional theory (DFT) studies of solids usually fail to reproduce localized states such as trapped holes, polarons, excitons, and solitons. This failure is ascribed to self-interaction, which creates a Coulomb barrier to localization. Pragmatic approaches in which the exchange-correlation functionals are augmented with a small amount of exact exchange (hybrid DFT, e.g., B3LYP and PBE0) have shown promise in rectifying this type of failure, as well as producing more accurate band gaps and reaction barriers. The evaluation of exact exchange is challenging for large, solid-state systems with periodic boundary conditions, especially when plane-wave basis sets are used. We have developed parallel algorithms for implementing exact exchange in a pseudopotential plane-wave DFT program, and we have implemented them in the NWChem program package. The technique developed can readily be employed in Γ-point plane-wave DFT programs. Furthermore, atomic forces and stresses are straightforward to implement, making it applicable to both confined and extended systems, as well as to Car-Parrinello ab initio molecular dynamics simulations. This method has been applied to several systems for which conventional DFT methods do not work well, including calculations of band gaps in oxides and the electronic structure of a charge-trapped state in the Fe(II)-containing mica, annite.
    Full-text · Article · Jan 2011 · Journal of Computational Chemistry
  • Source
    ABSTRACT: Systems with hardware accelerators speed up applications by offloading certain compute operations that can run faster on accelerators. Thus, it is not surprising that many of the Top500 supercomputers use accelerators. However, in addition to procurement cost, significant programming and porting effort is required to realize the potential benefit of such accelerators. Hence, before building such a system it is prudent to answer the question 'what is the projected performance benefit from accelerators for workloads of interest?' We address this question by way of a performance-modeling framework, which predicts realizable application performance on accelerators speedily and accurately without going to the considerable effort of porting and tuning.
    Full-text · Article · Jan 2011 · International Journal of High Performance Computing Applications
  • Source
    Didem Unat · Xing Cai · Scott B. Baden
    ABSTRACT: We present Mint, a programming model that enables the non-expert to enjoy the performance benefits of hand coded CUDA without becoming entangled in the details. Mint targets stencil methods, which are an important class of scientific applications. We have implemented the Mint programming model with a source-to-source translator that generates optimized CUDA C from traditional C source. The translator relies on annotations to guide translation at a high level. The set of pragmas is small, and the model is compact and simple. Yet, Mint is able to deliver performance competitive with painstakingly hand-optimized CUDA. We show that, for a set of widely used stencil kernels, Mint realized 80% of the performance obtained from aggressively optimized CUDA on the 200 series NVIDIA GPUs. Our optimizations target three dimensional kernels, which present a daunting array of optimizations.
    Full-text · Conference Paper · Jan 2011
  • Source
    Alden King · Eric Arobone · Scott B. Baden · Sutanu Sarkar
    ABSTRACT: In many respects, numerical simulations involving solutions to partial differential equations have replaced physical experimentation. However, few tools are available to sift through the deluge of data. We present Saaz, a query framework to analyze the simulation results of multi-scale physical phenomena which admit mathematical rules for characterizing features of interest. Saaz provides high-level primitives that free the domain-scientist to concentrate more on scientific discovery and less on code implementation and maintenance. It supports user-defined domain-specific query operations which may be subsequently composed into more complex queries. While Saaz supports offline processing of queries, we explore here the online capabilities by attaching Saaz to a running simulation, improving the simulation's effective temporal resolution. We discuss analysis for a computational fluid dynamics simulation of turbulent flow running on a cluster.
    Preview · Conference Paper · Jan 2011
  • Source
    P. Cicotti · S.B. Baden
    ABSTRACT: In current practice, scientific programmers and HPC users are required to develop code that exposes a high degree of parallelism, exhibits high locality, dynamically adapts to the available resources, and hides communication latency. Hiding communication latency is crucial to realizing the potential of today's distributed memory machines with highly parallel processing modules, and technological trends indicate that communication latencies will continue to be an issue as the performance gap between computation and communication widens. However, under Bulk Synchronous Parallel models, the predominant paradigm in scientific computing, scheduling is embedded into the application code. All the phases of a computation are defined and laid out as a linear sequence of operations, limiting overlap and the program's ability to adapt to communication delays. In this paper we present an alternative model, called Tarragon, to overcome the limitations of Bulk Synchronous Parallelism. Tarragon, which is based on dataflow, targets latency-tolerant scientific computations. Tarragon supports a task-dependency graph abstraction in which tasks, the basic unit of computation, are organized as a graph according to their data dependencies, i.e., task precedence. In addition to the task graph, Tarragon supports metadata abstractions, annotations to the task graph, to express locality information and scheduling policies to improve performance. Tarragon's functionality and underlying programming methodology are demonstrated on three classes of computations used in scientific domains: structured grids, sparse linear algebra, and dynamic programming. In the application studies, Tarragon implementations achieve high performance, in many cases exceeding the performance of equivalent latency-tolerant, hard-coded MPI implementations.
    Preview · Conference Paper · Jan 2011
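The task-dependency graph abstraction can be sketched with a Kahn-style scheduler: a task fires as soon as all of its predecessors have completed, rather than at a fixed point in a linear program order. This is a minimal single-threaded illustration of the dataflow idea, not Tarragon's runtime; the task names are invented.

```python
from collections import deque

def run_task_graph(tasks, deps):
    """Execute tasks in dataflow order: a task becomes runnable as
    soon as all of its predecessors have completed.
    tasks: {name: callable}; deps: {name: set of predecessor names}.
    Returns the completion order."""
    remaining = {t: set(deps.get(t, ())) for t in tasks}
    ready = deque(t for t, d in remaining.items() if not d)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                       # run the task body
        order.append(t)
        for u, d in remaining.items():   # release dependents
            if t in d:
                d.discard(t)
                if not d:
                    ready.append(u)
    return order

# Hypothetical stencil-style graph: the boundary update must wait for
# both the halo receive and the interior update, which are independent.
log = []
tasks = {t: (lambda t=t: log.append(t)) for t in ("halo", "interior", "boundary")}
order = run_task_graph(tasks, {"boundary": {"halo", "interior"}})
```

In a real runtime the two independent tasks execute concurrently, which is where the latency tolerance comes from; the sketch only captures the precedence-driven firing rule.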
  • Source
    ABSTRACT: An overview of the parallel algorithms for ab initio molecular dynamics (AIMD) used in the NWChem program package is presented, including recent developments for computing exact exchange. These algorithms make use of a two-dimensional processor geometry proposed by Gygi et al. for use in AIMD algorithms. Using this strategy, a highly scalable algorithm for exact exchange has been developed and incorporated into AIMD. This new algorithm for exact exchange employs an incomplete butterfly to overcome the bottleneck associated with the exact exchange term, and it makes judicious use of data replication. Initial testing has shown that this algorithm can scale to over 20,000 CPUs even for a modest-size simulation.
    Full-text · Article · Sep 2010 · Journal of Physics Conference Series
  • Source
    Fred V. Lionetti · Andrew D. McCulloch · Scott B. Baden
    ABSTRACT: Large and complex systems of ordinary differential equations (ODEs) arise in diverse areas of science and engineering, and pose special challenges on a streaming processor owing to the large amount of state they manipulate. We describe a set of domain-specific source transformations on CUDA C that improved performance by 6.7× on a system of ODEs arising in cardiac electrophysiology running on the NVIDIA GTX 295, without requiring expert knowledge of the GPU. Our transformations should apply to a wide range of reaction-diffusion systems.
    Preview · Conference Paper · Aug 2010
  • Source
    ABSTRACT: Face detection is an important aspect for biometrics, video surveillance and human computer interaction. We present a multi-GPU implementation of the Viola-Jones face detection algorithm that meets the performance of the fastest known FPGA implementation. The GPU design offers far lower development costs, but the FPGA implementation consumes less power. We discuss the performance programming required to realize our design, and describe future research directions.
    Full-text · Conference Paper · Jun 2010
  • Source
    Pietro Cicotti · Xiaoye S Li · Scott B Baden
    ABSTRACT: We developed a Performance Modeling Tools (PMTOOLS) library to enable simulation-based performance modeling for parallel sparse linear algebra algorithms. The library includes micro-benchmarks for calibrating the system's parameters, functions for collecting and retrieving performance data, and a cache simulator for modeling the detailed memory system activities. Using these tools, we have built simulation modules to model and predict the performance of different variants of parallel sparse LU and Cholesky factorization algorithms. We validated the simulated results against the existing implementation in SuperLU_DIST, and showed that our performance prediction errors are only 6.1% and 6.6% on 64 processors of an IBM POWER5 and a Cray XT4, respectively. More importantly, we have successfully used this simulation framework to forecast the performance of different algorithm choices, and it has helped in prototyping new algorithm implementations.
    Preview · Article · Jan 2010 · Advances in Parallel Computing
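The cache-simulator component can be illustrated with a tiny direct-mapped model: replay an address trace and count hits and misses per cache line. The geometry parameters below are illustrative defaults, not the calibrated machine models from the paper, and a real simulator would also model associativity and replacement.

```python
def simulate_cache(addresses, num_lines=64, line_bytes=64):
    """Count hits/misses of an address trace in a direct-mapped cache,
    the kind of memory-system model a simulation-based performance
    predictor calibrates against micro-benchmarks."""
    tags = [None] * num_lines
    hits = misses = 0
    for addr in addresses:
        block = addr // line_bytes       # which cache block
        line = block % num_lines         # direct-mapped placement
        if tags[line] == block:
            hits += 1
        else:
            misses += 1
            tags[line] = block           # evict and fill
    return hits, misses

# A unit-stride sweep mostly hits after the first touch of each line.
print(simulate_cache(range(0, 1024, 8)))  # (112, 16)
```

Feeding such per-kernel hit/miss counts into calibrated cost parameters is the essence of predicting a factorization variant's runtime before implementing it.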

Publication Stats

1k Citations
35.86 Total Impact Points

Institutions

  • 1995-2015
    • University of California, San Diego
      • Department of Computer Science and Engineering (CSE)
      • Department of Bioengineering
      San Diego, California, United States
  • 1994
    • California State University
      • College of Engineering & Computer Sciences
      Long Beach, California, United States
  • 1990
    • University of California, Berkeley
      Berkeley, California, United States