Scott B. Baden

University of California, San Diego, San Diego, CA, United States


Publications (88) · 20.73 Total Impact

  • ABSTRACT: Face detection is a key component in applications such as security surveillance and human–computer interaction systems, and real-time recognition is essential in many scenarios. The Viola–Jones algorithm is an attractive means of meeting the real-time requirement, and has been widely implemented on custom hardware, FPGAs, and GPUs. We demonstrate a GPU implementation that achieves competitive performance, but with low development costs. Our solution treats the irregularity inherent to the algorithm using a novel dynamic warp scheduling approach that eliminates thread divergence. This scheme also employs a thread pool mechanism, which significantly reduces the cost of creating, switching, and terminating threads. Compared to static thread scheduling, our dynamic warp scheduling approach reduces execution time by a factor of 3. To maximize detection throughput, we also run on multiple GPUs, realizing 95.6 FPS on 5 Fermi GPUs.
    Journal of Parallel and Distributed Computing 05/2013; 73(5):677–685. · 1.12 Impact Factor
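The dynamic warp scheduling idea above can be sketched abstractly: instead of statically partitioning detection windows among threads, idle workers pull the next window from a shared queue, so one expensive window no longer stalls an entire partition. The toy Python model below (all names hypothetical; it models per-window costs, not actual CUDA warps) illustrates why the dynamic scheme wins:

```python
from collections import deque

def detect_static(windows, n_threads, classify):
    # Static scheduling: windows are pre-assigned in fixed chunks, so a
    # thread whose chunk contains the expensive windows becomes the
    # critical path while the others sit idle.
    chunks = [windows[i::n_threads] for i in range(n_threads)]
    per_thread = [sum(classify(w) for w in chunk) for chunk in chunks]
    return max(per_thread)  # execution time ~ slowest thread

def detect_dynamic(windows, n_threads, classify):
    # Dynamic scheduling: a shared work queue; whichever thread becomes
    # free next grabs the next window, keeping the load balanced.
    queue = deque(windows)
    per_thread = [0] * n_threads
    while queue:
        t = per_thread.index(min(per_thread))  # next thread to go idle
        per_thread[t] += classify(queue.popleft())
    return max(per_thread)
```

With a skewed cost distribution (a few windows surviving many classifier stages), the dynamic variant finishes earlier because cheap windows flow around the expensive one.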
  • ABSTRACT: We propose a new algorithm for automatic viewpoint selection for volume data sets. While most previous algorithms depend on information-theoretic frameworks, our algorithm focuses solely on the data itself, without off-line rendering steps, and finds a view direction that shows the data set's features well. The algorithm consists of two main steps: feature selection and viewpoint selection. The feature-selection step is an extension of the 2D Harris interest point detection algorithm. This step selects corner and/or high-intensity points as features, which capture the overall structure and local details. The second step, viewpoint selection, takes this set and finds a direction that lays out those points so that the variance of the projected points is maximized, which can be formulated as a Principal Component Analysis (PCA) problem. The PCA solution guarantees that surfaces with detected corner points are less likely to be degenerate, and it minimizes occlusion between them. Our entire algorithm takes less than a second, which allows it to be integrated into real-time volume rendering applications where users can modify the volume with transfer functions, because the optimized viewpoint depends on the transfer function.
    Proc SPIE 01/2012;
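The PCA step can be illustrated with a small sketch. Assuming (one reading of the abstract) that the view direction is taken along the feature points' axis of least variance, so that their projection onto the image plane has maximal spread, a shifted power iteration on the 3×3 covariance matrix recovers that axis; all function names here are hypothetical:

```python
def covariance(points):
    # 3x3 covariance matrix of a list of 3D feature points.
    n = len(points)
    mean = [sum(p[i] for p in points) / n for i in range(3)]
    cov = [[0.0] * 3 for _ in range(3)]
    for p in points:
        d = [p[i] - mean[i] for i in range(3)]
        for i in range(3):
            for j in range(3):
                cov[i][j] += d[i] * d[j] / n
    return cov

def normalize(v):
    s = sum(x * x for x in v) ** 0.5
    return [x / s for x in v]

def view_direction(points, iters=200):
    # Power iteration on (tr(C)*I - C) converges to the eigenvector of C
    # with the smallest eigenvalue: viewing along the least-variance axis
    # spreads the feature points out most in the image plane.
    c = covariance(points)
    shift = c[0][0] + c[1][1] + c[2][2]
    m = [[(shift if i == j else 0.0) - c[i][j] for j in range(3)]
         for i in range(3)]
    v = normalize([1.0, 1.0, 1.0])
    for _ in range(iters):
        v = normalize([sum(m[i][j] * v[j] for j in range(3))
                       for i in range(3)])
    return v
```

For a point cloud spread in the x-y plane and nearly flat in z, the sketch returns a direction close to the z axis.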
  • Didem Unat, Jun Zhou, Yifeng Cui, Scott B. Baden, Xing Cai
    ABSTRACT: GPUs provide impressive computing power, but GPU programming can be challenging. Here, an experience in porting real-world earthquake code to Nvidia GPUs is described. Specifically, an annotation-based programming model, called Mint, and its accompanying source-to-source translator are used to automatically generate CUDA source code and simplify the exploration of performance tradeoffs.
    Computing in Science and Engineering 01/2012; 14(3):48-59. · 1.73 Impact Factor
  • ABSTRACT: We present Bamboo, a custom source-to-source translator that transforms MPI C source into a data-driven form that automatically overlaps communication with available computation. Running on up to 98,304 processors of NERSC's Hopper system, we observe that Bamboo's overlap capability speeds up MPI implementations of a 3D Jacobi iterative solver and Cannon's matrix multiplication. Bamboo's generated code meets or exceeds the performance of hand-optimized MPI, which includes split-phase coding, the method classically employed to hide communication. We achieved our results with only modest amounts of programmer annotation and no intrusive reprogramming of the original application source.
    High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for; 01/2012
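The data-driven form described above can be caricatured as tasks that fire as soon as their inbound messages arrive, so ready computation overlaps communication still in flight, rather than the split-phase "post receives / compute interior / wait / compute boundary" pattern. A minimal sketch (hypothetical names; the actual Bamboo translator operates on MPI C source):

```python
class DataDrivenExecutor:
    # Each task declares the messages it needs and runs when the last one
    # arrives; tasks with no pending messages run immediately, overlapping
    # with communication that is still outstanding for other tasks.
    def __init__(self):
        self.waiting = {}   # task name -> set of messages still missing
        self.actions = {}   # task name -> callable to run when ready

    def add_task(self, name, needs, action):
        self.waiting[name] = set(needs)
        self.actions[name] = action
        if not self.waiting[name]:
            self.actions[name]()          # no dependencies: fire at once

    def deliver(self, name, message):
        # Called as messages arrive off the network.
        self.waiting[name].discard(message)
        if not self.waiting[name]:
            self.actions[name]()
```

In a Jacobi sweep, the interior-update task has no message dependencies and fires immediately, while each boundary task fires only when its halo arrives.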
  • ABSTRACT: Semi-local functionals commonly used in density functional theory (DFT) studies of solids usually fail to reproduce localized states such as trapped holes, polarons, excitons, and solitons. This failure is ascribed to self-interaction, which creates a Coulomb barrier to localization. Pragmatic approaches in which the exchange-correlation functionals are augmented with a small amount of exact exchange (hybrid DFT, e.g., B3LYP and PBE0) have shown promise in rectifying this type of failure, as well as producing more accurate band gaps and reaction barriers. The evaluation of exact exchange is challenging for large, solid-state systems with periodic boundary conditions, especially when plane-wave basis sets are used. We have developed parallel algorithms for implementing exact exchange in a pseudopotential plane-wave DFT program, and we have implemented them in the NWChem program package. The technique developed can readily be employed in Γ-point plane-wave DFT programs. Furthermore, atomic forces and stresses are straightforward to implement, making the method applicable to both confined and extended systems, as well as to Car-Parrinello ab initio molecular dynamics simulations. This method has been applied to several systems for which conventional DFT methods do not work well, including calculations of band gaps in oxides and the electronic structure of a charge-trapped state in the Fe(II)-containing mica, annite.
    Journal of Computational Chemistry 01/2011; 32(1):54-69. · 3.84 Impact Factor
  • P. Cicotti, S.B. Baden
    ABSTRACT: In current practice, scientific programmers and HPC users are required to develop code that exposes a high degree of parallelism, exhibits high locality, dynamically adapts to the available resources, and hides communication latency. Hiding communication latency is crucial to realizing the potential of today's distributed-memory machines with highly parallel processing modules, and technological trends indicate that communication latency will remain an issue as the performance gap between computation and communication widens. However, under Bulk Synchronous Parallel models, the predominant paradigm in scientific computing, scheduling is embedded in the application code: all phases of a computation are defined and laid out as a linear sequence of operations, limiting overlap and the program's ability to adapt to communication delays. In this paper we present an alternative model, called Tarragon, to overcome the limitations of Bulk Synchronous Parallelism. Tarragon, which is based on dataflow, targets latency-tolerant scientific computations. Tarragon supports a task-dependency graph abstraction in which tasks, the basic unit of computation, are organized as a graph according to their data dependencies, i.e., task precedence. In addition to the task graph, Tarragon supports metadata abstractions, annotations to the task graph, to express locality information and scheduling policies to improve performance. Tarragon's functionality and underlying programming methodology are demonstrated on three classes of computations used in scientific domains: structured grids, sparse linear algebra, and dynamic programming. In the application studies, Tarragon implementations achieve high performance, in many cases exceeding the performance of equivalent latency-tolerant, hand-coded MPI implementations.
    Data-Flow Execution Models for Extreme Scale Computing (DFM), 2011 First Workshop on; 01/2011
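The task-graph-plus-metadata idea can be sketched in a few lines: tasks form a DAG by data dependency, and metadata annotations (here, a hypothetical per-task priority) steer which ready task runs next without changing the dependencies themselves. This is a schematic illustration, not Tarragon's actual API:

```python
import heapq

def run_task_graph(deps, priority):
    # deps: task -> list of predecessor tasks (the data dependencies).
    # priority: metadata annotation; among ready tasks, higher runs first.
    remaining = {t: set(d) for t, d in deps.items()}
    succs = {t: [] for t in deps}
    for t, d in deps.items():
        for p in d:
            succs[p].append(t)
    # Min-heap keyed on negated priority -> highest-priority task pops first.
    ready = [(-priority[t], t) for t, d in remaining.items() if not d]
    heapq.heapify(ready)
    order = []
    while ready:
        _, t = heapq.heappop(ready)
        order.append(t)                    # "execute" the task
        for s in succs[t]:
            remaining[s].discard(t)
            if not remaining[s]:           # last dependency satisfied
                heapq.heappush(ready, (-priority[s], s))
    return order
```

Swapping the priority annotation reorders execution among independent tasks (e.g., favoring tasks on the critical path) while the dependency structure guarantees correctness.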
  • ABSTRACT: Systems with hardware accelerators speed up applications by offloading certain compute operations that run faster on accelerators. Thus, it is not surprising that many Top500 supercomputers use accelerators. However, in addition to procurement cost, significant programming and porting effort is required to realize the potential benefit of such accelerators. Hence, before building such a system it is prudent to answer the question 'What is the projected performance benefit from accelerators for workloads of interest?' We address this question by way of a performance-modeling framework, which predicts realizable application performance on accelerators quickly and accurately, without the considerable effort of porting and tuning.
    International Journal of High Performance Computing Applications 01/2011; 27(2). · 1.30 Impact Factor
  • Didem Unat, Xing Cai, Scott B. Baden
    ABSTRACT: We present Mint, a programming model that enables the non-expert to enjoy the performance benefits of hand coded CUDA without becoming entangled in the details. Mint targets stencil methods, which are an important class of scientific applications. We have implemented the Mint programming model with a source-to-source translator that generates optimized CUDA C from traditional C source. The translator relies on annotations to guide translation at a high level. The set of pragmas is small, and the model is compact and simple. Yet, Mint is able to deliver performance competitive with painstakingly hand-optimized CUDA. We show that, for a set of widely used stencil kernels, Mint realized 80% of the performance obtained from aggressively optimized CUDA on the 200 series NVIDIA GPUs. Our optimizations target three dimensional kernels, which present a daunting array of optimizations.
    Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31 - June 04, 2011; 01/2011
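Stencil methods, the class of computation Mint targets, update each grid point from a fixed neighborhood. The sketch below shows the kind of loop nest Mint's pragmas annotate, written in plain Python for illustration (Mint itself operates on C source and emits optimized CUDA; a 2D 5-point stencil is shown rather than the paper's 3D kernels):

```python
def jacobi_step(u):
    # One sweep of a 5-point Jacobi stencil: each interior point becomes
    # the average of its four neighbors; boundary values are left fixed.
    n, m = len(u), len(u[0])
    v = [row[:] for row in u]           # double buffering: read u, write v
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            v[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                              + u[i][j - 1] + u[i][j + 1])
    return v
```

In the Mint model, an annotation on this nest is enough for the translator to generate the CUDA kernel, thread-block decomposition, and memory optimizations.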
  • ABSTRACT: An overview of the parallel algorithms for ab initio molecular dynamics (AIMD) used in the NWChem program package is presented, including recent developments for computing exact exchange. These algorithms make use of a two-dimensional processor geometry proposed by Gygi et al. for use in AIMD algorithms. Using this strategy, a highly scalable algorithm for exact exchange has been developed and incorporated into AIMD. This new algorithm employs an incomplete butterfly to overcome the bottleneck associated with the exact exchange term, and it makes judicious use of data replication. Initial testing has shown that it can scale to over 20,000 CPUs, even for a modest-size simulation.
    Journal of Physics Conference Series 09/2010; 180(1).
  • ABSTRACT: Face detection is an important component of biometrics, video surveillance, and human-computer interaction. We present a multi-GPU implementation of the Viola-Jones face detection algorithm that matches the performance of the fastest known FPGA implementation. The GPU design offers far lower development costs, but the FPGA implementation consumes less power. We discuss the performance programming required to realize our design, and describe future research directions.
    Field-Programmable Custom Computing Machines (FCCM), 2010 18th IEEE Annual International Symposium on; 06/2010
  • ABSTRACT: Large and complex systems of ordinary differential equations (ODEs) arise in diverse areas of science and engineering, and pose special challenges on a streaming processor owing to the large amount of state they manipulate. We describe a set of domain-specific source transformations on CUDA C that improved performance by a factor of 6.7 on a system of ODEs arising in cardiac electrophysiology running on the NVIDIA GTX 295, without requiring expert knowledge of the GPU. Our transformations should apply to a wide range of reaction-diffusion systems.
    Euro-Par 2010 - Parallel Processing, 16th International Euro-Par Conference, Ischia, Italy, August 31 - September 3, 2010, Proceedings, Part I; 01/2010
  • D. Unat, T. Hromadka, S.B. Baden
    ABSTRACT: A current challenge in scientific computing is how to curb the growth of simulation datasets without losing valuable information. While wavelet-based methods are popular, they require that data be decompressed before it can be analyzed, for example, when identifying time-dependent structures in turbulent flows. We present adaptive coarsening, an adaptive subsampling compression strategy that enables the compressed data product to be directly manipulated in memory without requiring costly decompression. We demonstrate compression factors of up to 8 in turbulent flow simulations in three dimensions. Our compression strategy produces a non-progressive multiresolution representation, subdividing the dataset into fixed-size regions and compressing each region independently.
    Data Compression Conference, 2009. DCC '09.; 04/2009
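The region-by-region strategy above can be sketched in one dimension: split the signal into fixed-size regions and, per region, keep only every other sample when the dropped samples can be rebuilt by interpolation within a tolerance; regions that fail the test are stored at full resolution. This is an illustrative sketch under a simple linear-interpolation error test, not the paper's actual criterion:

```python
def adaptive_coarsen(data, region, tol):
    # Per-region adaptive subsampling: coarsened regions stay directly
    # usable in memory (no decompression step), as in adaptive coarsening.
    out = []
    for start in range(0, len(data), region):
        block = data[start:start + region]
        # Worst-case error of rebuilding dropped (odd-index) samples by
        # averaging their surviving neighbors.
        err = max(
            (abs(block[i] - (block[i - 1] + block[i + 1]) / 2)
             for i in range(1, len(block) - 1, 2)),
            default=0.0,
        )
        if err <= tol:
            out.append(("coarse", block[::2]))   # keep every other sample
        else:
            out.append(("full", block))          # region too rough: keep all
    return out
```

Smooth regions compress by 2x per pass; rough regions pay nothing in accuracy, which is the adaptivity the abstract describes.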
  • Jacob Sorensen, Scott B. Baden
    ABSTRACT: Reformulating an algorithm to mask communication delays is crucial to maintaining scalability, but traditional solutions embed the overlap strategy in the application. We present an alternative approach, based on dataflow, that factors the overlap strategy out of the application. Using this approach we are able to reduce communication delays, meeting and in many cases exceeding the performance obtained with traditional hand-coded applications.
    Computational Science - ICCS 2009, 9th International Conference, Baton Rouge, LA, USA, May 25-27, 2009, Proceedings, Part I; 01/2009
  • ABSTRACT: Many important physiological processes operate at time and space scales far beyond those accessible to atom-realistic simulations, and yet discrete stochastic rather than continuum methods may best represent finite numbers of molecules interacting in complex cellular spaces. We describe and validate new tools and algorithms developed for a new version of the MCell simulation program (MCell3), which supports generalized Monte Carlo modeling of diffusion and chemical reaction in solution, on surfaces representing membranes, and combinations thereof. A new syntax for describing the spatial directionality of surface reactions is introduced, along with optimizations and algorithms that can substantially reduce computational costs (e.g., event scheduling, variable time and space steps). Examples of simple reactions in simple spaces are validated by comparison to analytic solutions. We thus show how spatially realistic Monte Carlo simulations of biological systems can be far more cost-effective than is often assumed, and provide a level of accuracy and insight beyond that of continuum methods.
    SIAM Journal on Scientific Computing 10/2008; 30(6):3126. · 1.95 Impact Factor
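The validation-against-analytic-solutions idea can be illustrated with the simplest discrete stochastic model: independent random walkers whose ensemble statistics approach the continuum diffusion result. This toy sketch (hypothetical names; vastly simpler than MCell's 3D ray-marching scheme) checks mean squared displacement against the expected value:

```python
import random

def diffuse_msd(n_particles, n_steps, step, seed=0):
    # 1D unbiased random walk for each molecule; for n_steps steps of
    # size `step`, the continuum prediction is <x^2> = n_steps * step**2,
    # and the Monte Carlo estimate converges to it as n_particles grows.
    rng = random.Random(seed)
    positions = [0.0] * n_particles
    for _ in range(n_steps):
        for i in range(n_particles):
            positions[i] += step if rng.random() < 0.5 else -step
    return sum(x * x for x in positions) / n_particles
```

With 2000 particles and 100 steps, the estimate lands close to the predicted value of 100, the same style of check the paper applies to reactions in simple geometries.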
  • ABSTRACT: In today's many-core era, the interconnection network has become a key factor in the performance of a computer system. In this paper, we propose a design flow to discover the best topology in terms of communication latency and physical constraints. First, a set of representative candidate topologies is generated for the interconnection networks among computing chips; then an efficient multi-commodity flow algorithm is devised to evaluate their performance. The experiments show that the best topologies identified by our algorithm achieve better average latency than existing networks.
    2008 International Conference on Computer-Aided Design (ICCAD'08), November 10-13, 2008, San Jose, CA, USA; 01/2008
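The generate-then-evaluate flow can be sketched with a much cruder latency proxy than the paper's multi-commodity flow formulation: rank candidate edge sets by average shortest-path hop count, a stand-in for communication latency under uniform traffic. All names here are hypothetical:

```python
from collections import deque

def average_hops(n, edges):
    # Average shortest-path hop count over all node pairs (BFS from each
    # source), used here as a simple proxy for communication latency.
    adj = {v: [] for v in range(n)}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    total, pairs = 0, 0
    for src in range(n):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(d for v, d in dist.items() if v > src)
        pairs += n - 1 - src
    return total / pairs

def best_topology(n, candidates):
    # Evaluate each candidate edge set and keep the lowest-latency one.
    return min(candidates, key=lambda edges: average_hops(n, edges))
```

A real flow would also score wiring cost and other physical constraints; this sketch only shows the ranking step.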
  • ABSTRACT: MCell is a Monte Carlo simulator of cell microphysiology, and the scalable variant can be used to study challenging problems of interest to the biological community. MCell can currently model a single synapse out of thousands on a single cell. Petascale technology will enable significant advances in the ability to treat larger structures involving many synapses, with correspondingly more complex behavior. However, there are significant challenges to scaling MCell across two orders of magnitude in performance: increased communication delays and uneven workload concentrations. We discuss software solutions currently under investigation that will accompany us on the path to petascale cell microphysiology.
    Bioinformatics and Bioengineering, 2007. BIBE 2007. Proceedings of the 7th IEEE International Conference on; 11/2007
  • ABSTRACT: We present a second-order accurate algorithm for solving the free-space Poisson's equation on a locally-refined nested grid hierarchy in three dimensions. Our approach is based on linear superposition of local convolutions of localized charge distributions, with the nonlocal coupling represented on coarser grids. The representation of the nonlocal coupling on the local solutions is based on Anderson's Method of Local Corrections and does not require iteration between different resolutions. A distributed-memory parallel implementation of this method is observed to have a computational cost per grid point less than three times that of a standard FFT-based method on a uniform grid of the same resolution, and scales well up to 1024 processors.
    Communications in Applied Mathematics and Computational Science 10/2006; 2.
  • ABSTRACT: High Performance Fortran (HPF) is an effective language for implementing regular data-parallel applications on distributed-memory architectures, but it is not well suited to irregular, block-structured applications such as multiblock and adaptive mesh methods. A solution to this problem is to use a non-HPF SPMD program to coordinate multiple concurrent HPF tasks, each operating on a regular subgrid of an irregular data domain. To this end we have developed an interface between the C++ class library KeLP, which supports irregular, dynamic block-structured applications on distributed systems, and an HPF compiler, SHPF. This allows KeLP to handle the data layout and inter-block communication, and to invoke HPF concurrently on each block. There are a number of advantages to this approach: it combines the strengths of both KeLP and HPF; it is relatively easy to implement; and it involves no extensions to HPF or HPF compilers. This paper describes the KeLP-HPF implementation and programming model, and shows an example KeLP-HPF multiblock solver.
    04/2006: pages 828-839;
  • Pietro Cicotti, Scott B. Baden
    ABSTRACT: Tarragon is an actor-based programming model and library for implementing parallel scientific applications requiring fine-grain asynchronous communication. Tarragon raises the level of abstraction by encapsulating run-time services that manage the actor semantics. The workload is over-decomposed into many virtual processes called WorkUnits. WorkUnits become ready for execution after receiving input; scheduling and communication services coordinate WorkUnit execution and management. In order to maintain balanced workloads, Tarragon automatically monitors the workload distribution and redistributes as needed. Tarragon is novel in its support for metadata describing the run-time virtual process structures used to manage actor semantics. This metadata may be used to guide run-time service policies in order to optimize performance. We are currently applying Tarragon to the MCell cell microphysiology simulator and are considering other applications as well, such as sparse matrix linear algebra.
    Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, November 11-17, 2006, Tampa, FL, USA; 01/2006
  • Tallat M. Shafaat, Scott B. Baden
    ABSTRACT: We present adaptive coarsening, a multi-resolution lossy compression algorithm for scientific datasets. The algorithm provides guaranteed error bounds according to the user's requirements for subsequent post-processing. We demonstrate compression factors of up to an order of magnitude with datasets coming from solutions to time-dependent partial differential equations in one and two dimensions.
    Applied Parallel Computing. State of the Art in Scientific Computing, 8th International Workshop, PARA 2006, Umeå, Sweden, June 18-21, 2006, Revised Selected Papers; 01/2006

Publication Stats

556 Citations
608 Downloads
20.73 Total Impact Points

Institutions

  • 1994–2013
    • University of California, San Diego
      • Department of Computer Science and Engineering (CSE)
      San Diego, CA, United States
  • 2005
    • Lawrence Berkeley National Laboratory
      • Applied Numerical Algorithms Group (ANAG)
      Berkeley, CA, United States
  • 2001
    • CSU Mentor
      Long Beach, California, United States
  • 1995
    • University of San Diego
      San Diego, California, United States