Boyana Norris
University of Oregon · Department of Computer and Information Science

Ph.D.

About

162 Publications
11,676 Reads
1,670 Citations
Introduction
I direct the High-Performance Computing Laboratory (HPCL) at the Department of Computer and Information Science at the University of Oregon. We conduct research in several areas, including optimizing compilers, performance modeling and optimization, parallel algorithms, and software engineering. Example projects include static and dynamic analysis of software for building application performance models, ensuring software quality, or finding security vulnerabilities.
Education
August 1995 - January 2000
University of Illinois, Urbana-Champaign
Field of study
  • computer science
August 1993 - May 1995
Wake Forest University
Field of study
  • computer science

Publications (162)
Article
The Single Source Shortest Path (SSSP) problem is a classic graph theory problem that arises frequently in various practical scenarios; hence, many parallel algorithms have been developed to solve it. However, these algorithms operate on static graphs, whereas many real-world problems are best modeled as dynamic networks, where the structure of the...
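For reference, the static baseline that dynamic SSSP algorithms improve on is classical Dijkstra; a minimal sketch in Python (the adjacency representation is an illustrative assumption, not the paper's):

```python
import heapq

def dijkstra(adj, src):
    """Single-source shortest paths on a static graph.
    adj: dict mapping node -> list of (neighbor, weight) pairs."""
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

# Example: dijkstra({'a': [('b', 1.0)], 'b': [('c', 2.0)], 'c': []}, 'a')
```

A dynamic algorithm avoids rerunning such a computation from scratch whenever edges are inserted or deleted.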
Preprint
We present results from parallelizing the unpacking and clustering steps of the raw data from the silicon strip modules for reconstruction of charged particle tracks. Throughput is further improved by concurrently processing multiple events using nested OpenMP parallelism on CPU or CUDA streams on GPU. The new implementation along with earlier work...
Conference Paper
Full-text available
This work explores the effects of nonassociativity of floating-point addition on Message Passing Interface (MPI) reduction operations. Previous work indicates floating-point summation error comprises two independent factors: error based on the summation algorithm and error based on the summands themselves. We find evidence to suggest, for MPI...
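The nonassociativity in question is easy to reproduce outside MPI; a minimal sketch of the underlying effect (not the paper's experimental setup):

```python
# Floating-point addition is not associative: grouping changes the result.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- the 1.0 is absorbed by the large magnitude

# An MPI reduction's grouping depends on its reduction tree, so the same
# summands combined in a different order can produce a different sum:
print(sum([1e16, 1.0, -1e16]))  # 0.0
print(sum([1e16, -1e16, 1.0]))  # 1.0
```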
Preprint
Full-text available
Performance models can be very useful for understanding the behavior of applications and hence can help guide design and optimization decisions. Unfortunately, performance modeling of nontrivial computations typically requires significant expertise and human effort. Moreover, even when performed by experts, it is necessarily limited in scope, accur...
Preprint
Network embedding is an important step in many different computations based on graph data. However, existing approaches are limited to small or middle size graphs with fewer than a million edges. In practice, web or social network graphs are orders of magnitude larger, thus making most current methods impractical for very large graphs. To address t...
Preprint
One of the most computationally challenging problems expected for the High-Luminosity Large Hadron Collider (HL-LHC) is determining the trajectory of charged particles during event reconstruction. Algorithms used at the LHC today rely on Kalman filtering, which builds physical trajectories incrementally while incorporating material effects and erro...
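Several of the tracking papers listed here build on the Kalman filter recursion; a generic linear predict/update step in numpy (the matrices here are placeholders, not the detector-specific models):

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of a linear Kalman filter.
    x: state estimate, P: state covariance, z: new measurement,
    F: state transition, H: measurement model, Q, R: process/measurement noise."""
    # Predict: propagate the state and its covariance forward.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: blend the prediction with the new measurement.
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```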
Article
Full-text available
In the High-Luminosity Large Hadron Collider (HL-LHC), one of the most challenging computational problems is expected to be finding and fitting charged-particle tracks during event reconstruction. The methods currently in use at the LHC are based on the Kalman filter. Such methods have been shown to be robust and to provide good physics performance, bot...
Preprint
One of the most computationally challenging problems expected for the High-Luminosity Large Hadron Collider (HL-LHC) is finding and fitting particle tracks during event reconstruction. Algorithms used at the LHC today rely on Kalman filtering, which builds physical trajectories incrementally while incorporating material effects and error estimation...
Preprint
Neutrinos are particles that interact rarely, so identifying them requires large detectors which produce lots of data. Processing this data with the computing power available is becoming more difficult as the detectors increase in size to reach their physics goals. In liquid argon time projection chambers (TPCs) the charged particles from neutrino...
Article
Full-text available
Scientific discovery increasingly relies on computation through simulations, analytics, and machine and deep learning. Of these, simulations on high-performance computing (HPC) platforms have been the cornerstone of scientific computing for more than two decades. However, the development of simulation software has, in general, occurred through accr...
Article
Full-text available
One of the most computationally challenging problems expected for the High-Luminosity Large Hadron Collider (HL-LHC) is finding and fitting particle tracks during event reconstruction. Algorithms used at the LHC today rely on Kalman filtering, which builds physical trajectories incrementally while incorporating material effects and error estimation...
Article
Full-text available
Neutrinos are particles that interact rarely, so identifying them requires large detectors which produce lots of data. Processing this data with the computing power available is becoming more difficult as the detectors increase in size to reach their physics goals. In liquid argon time projection chambers (TPCs) the charged particles from neutrino...
Chapter
Accelerator architectures specialize in executing SIMD (single instruction, multiple data) in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel. CUDAflow is a tool that statically separates CUDA binaries into basic block regions and dynamically m...
Conference Paper
Full-text available
Particle advection, a fundamental building block for many flow visualization algorithms, is very difficult to parallelize efficiently. That said, work requesting is a promising technique to improve parallel performance for particle advection. With this work, we introduce a new work requesting-based method which uses the Lifeline scheduling method....
Preprint
Full-text available
Building particle tracks is the most computationally intense step of event reconstruction at the LHC. With the increased instantaneous luminosity and associated increase in pileup expected from the High-Luminosity LHC, the computational challenge of track finding and fitting requires novel solutions. The current track reconstruction algorithms used...
Preprint
Full-text available
In the High-Luminosity Large Hadron Collider (HL-LHC), one of the most challenging computational problems is expected to be finding and fitting charged-particle tracks during event reconstruction. The methods currently in use at the LHC are based on the Kalman filter. Such methods have been shown to be robust and to provide good physics performance, bot...
Conference Paper
Full-text available
There are nearly one hundred parallel and distributed graph processing packages. Selecting the best package for a given problem is difficult; some packages require GPUs, some are optimized for distributed or shared memory, and some require proprietary compilers or perform better on different hardware. Furthermore, performance may vary wildly depend...
Article
Full-text available
The High-Luminosity Large Hadron Collider at CERN will be characterized by greater pileup of events and higher occupancy, making the track reconstruction even more computationally demanding. Existing algorithms at the LHC are based on Kalman filter techniques with proven excellent physics performance under a variety of conditions. Starting in 2014,...
Preprint
Full-text available
The High-Luminosity Large Hadron Collider at CERN will be characterized by greater pileup of events and higher occupancy, making the track reconstruction even more computationally demanding. Existing algorithms at the LHC are based on Kalman filter techniques with proven excellent physics performance under a variety of conditions. Starting in 2014,...
Article
In this paper, we present a network-based template for analyzing large-scale dynamic data. Specifically, we present a novel shared-memory parallel algorithm for updating tree-based structures, including connected components (CC) and the minimum spanning tree (MST) on dynamic networks. We propose a rooted tree-based data structure to store the edges t...
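A toy version of the incremental connected-components idea, using union-find with path compression; the paper's rooted-tree structure additionally handles edge deletions and MST maintenance, which this sketch does not:

```python
def find(parent, x):
    # Path compression: point nodes directly toward the component root.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def insert_edge(parent, u, v):
    """Incrementally merge two components when a new edge (u, v) arrives."""
    ru, rv = find(parent, u), find(parent, v)
    if ru != rv:
        parent[ru] = rv  # union the two components

parent = {i: i for i in range(6)}
for u, v in [(0, 1), (1, 2), (4, 5)]:
    insert_edge(parent, u, v)
print(find(parent, 0) == find(parent, 2))  # True: same component
print(find(parent, 0) == find(parent, 4))  # False
```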
Article
Autotuning refers to the automatic generation of a search space of possible implementations of a computation that are evaluated through models and/or empirical measurement to identify the most desirable implementation. Autotuning has the potential to dramatically improve the performance portability of petascale and exascale applications. To date, a...
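A bare-bones empirical autotuning loop over a parameter space, assuming a hypothetical run_variant(config) callable that builds and times one implementation; real autotuners prune this search with models and heuristics:

```python
import itertools
import time

def autotune(run_variant, search_space):
    """Exhaustively evaluate a parameter space, keeping the fastest variant.
    run_variant: hypothetical callable that executes one configuration."""
    best, best_time = None, float("inf")
    for params in itertools.product(*search_space.values()):
        config = dict(zip(search_space.keys(), params))
        t0 = time.perf_counter()
        run_variant(config)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best, best_time = config, elapsed
    return best, best_time

# Example space: tile sizes and unroll factors for a loop nest.
space = {"tile": [16, 32, 64], "unroll": [1, 2, 4]}
```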
Article
Accelerator architectures specialize in executing SIMD (single instruction, multiple data) in lockstep. Because the majority of CUDA applications are parallelized loops, control flow information can provide an in-depth characterization of a kernel. CUDAflow is a tool that statically separates CUDA binaries into basic block regions and dynamically m...
Article
The performance model of an application can provide understanding about its runtime behavior on particular hardware. Such information can be analyzed by developers for performance tuning. However, model building and analysis are frequently ignored during software development until performance problems arise, because they require significant expe...
Conference Paper
Intuitive visual representations of architecture capabilities and the performance of applications are critical to enabling effective performance analysis, which in turn guides optimizations. The Roofline Model and its derivatives provide such an intuitive representation of the best achievable performance on a given architecture. The Roofline Toolki...
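The Roofline bound itself is compact: attainable performance is the lesser of peak compute and peak bandwidth times arithmetic intensity. A sketch with illustrative machine numbers:

```python
def roofline(peak_gflops, peak_gbs, arithmetic_intensity):
    """Attainable GFLOP/s for a kernel with the given flops-per-byte ratio."""
    return min(peak_gflops, peak_gbs * arithmetic_intensity)

# Illustrative machine: 1000 GFLOP/s peak compute, 100 GB/s memory bandwidth.
for ai in [0.5, 2.0, 10.0, 50.0]:
    print(ai, roofline(1000.0, 100.0, ai))
# Kernels below 10 flops/byte are bandwidth-bound on this machine.
```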
Article
Full-text available
The rapidly growing number of large network analysis problems has led to the emergence of many parallel and distributed graph processing systems---one survey in 2014 identified over 80. Since then, the landscape has evolved; some packages have become inactive while more are being developed. Determining the best approach for a given problem is infea...
Article
Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition to make a good choice. Similarly, compilers can generate working code, but may miss tuning opportunities by not targeting GPU model...
Conference Paper
We introduce GraphFlow, a big graph framework that is able to encode complex data science experiments as a set of high-level workflows. GraphFlow combines the Spark big data processing platform and the Galaxy workflow management system to offer a set of components for graph processing using a novel interaction model for creating and using complex w...
Article
Scientific and engineering computing rely heavily on linear algebra for large-scale data analysis, modeling and simulation, machine learning, and other applied problems. Sparse linear system solution often dominates the execution time of such applications, prompting the ongoing development of highly optimized iterative algorithms and high-performan...
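As one example of the iterative algorithms referred to, a textbook conjugate gradient solver for symmetric positive definite systems (a sketch, not any particular library's implementation):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve Ax = b for symmetric positive definite A."""
    x = np.zeros(len(b))
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```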
Conference Paper
Full-text available
Solving large, sparse linear systems efficiently is a challenging problem in scientific computing. Accessible, comprehensive, and usable interfaces or tools for high quality code production for this computation are not available. Lighthouse is the first framework that offers an organized taxonomy of software components for linear algebra that enabl...
Conference Paper
Linear algebra provides the building blocks for a wide variety of scientific and engineering simulation codes. Users of these codes face a world of continuously changing algorithms and high-performance implementations. In this paper, we describe new capabilities of our Lighthouse framework, whose goal is to match specific problems in the area of hi...
Conference Paper
Full-text available
Linear algebra provides the building blocks for a wide variety of scientific and engineering simulation codes. Users face a world of continuously developing new algorithms and high-performance implementations of these fundamental calculations. In this paper, we describe new capabilities of our Lighthouse framework, whose goal is to match specific p...
Conference Paper
Tuning codes for GPGPU architectures is challenging because few performance tools can pinpoint the exact causes of execution bottlenecks. While profiling applications can reveal execution behavior with a particular architecture, the abundance of collected information can also overwhelm the user. Moreover, performance counters provide cumulative val...
Conference Paper
Modeling power/energy consumption of HPC applications and analyzing its correlation with application performance characteristics can reveal interesting insights and provide useful recommendations for improving power/energy efficiency. For that purpose, fine-grained tools are needed that can synchronize power/energy measurements with application act...
Article
Producing high-performance implementations from simple, portable computation specifications is a challenge that compilers have tried to address for several decades. More recently, a relatively stable architectural landscape has evolved into a set of increasingly diverging and rapidly changing CPU and accelerator designs, with the main common factor...
Conference Paper
Full-text available
Many excellent open-source and commercial tools enable the detailed measurement of the performance attributes of applications. However, the process of collecting measurement data and analyzing it remains effort-intensive because of differences in tool interfaces and architectures. Furthermore, insufficient standards and automation may result in los...
Technical Report
Full-text available
The Roofline Model and its derivatives provide an intuitive representation of the best achievable performance on a given architecture. The Roofline Toolkit project is a collaboration among researchers at Argonne National Laboratory, Lawrence Berkeley National Laboratory, and the University of Oregon and consists of three main parts: hardware chara...
Article
Full-text available
Various fields of science and engineering rely on linear algebra for large scale data analysis, modeling and simulation, machine learning, and other applied problems. Linear algebra computations often dominate the execution time of such applications. Meanwhile, experts in these domains typically lack the training or time required to develop efficie...
Article
Full-text available
Within the multibody systems literature, few attempts have been made to use automatic differentiation for solving forward multibody dynamics and evaluating its computational efficiency. The most relevant implementations are found in the sensitivity analysis field, but they rarely address automatic differentiation issues in depth. This paper present...
Article
Full-text available
Numerical solutions of nonlinear partial differential equations frequently rely on iterative Newton-Krylov methods, which linearize a finite-difference stencil-based discretization of a problem, producing a sparse matrix with regular structure. Knowledge of this structure can be used to exploit parallelism and locality of reference on modern cache-...
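A sketch of exploiting that diagonal sparsity: store only the nonzero diagonals (the DIA format) and multiply along them; the offset convention and the 1-D Laplacian example are illustrative:

```python
import numpy as np

def dia_matvec(diagonals, offsets, x):
    """y = A @ x for a matrix stored by diagonals (DIA format).
    diagonals[k][i] holds A[i, i + offsets[k]]."""
    n = len(x)
    y = np.zeros(n)
    for diag, off in zip(diagonals, offsets):
        for i in range(n):
            j = i + off
            if 0 <= j < n:
                y[i] += diag[i] * x[j]
    return y

# 1-D Laplacian stencil [-1, 2, -1] stored as three diagonals:
n = 5
diags = [np.full(n, -1.0), np.full(n, 2.0), np.full(n, -1.0)]
print(dia_matvec(diags, [-1, 0, 1], np.ones(n)))  # [1. 0. 0. 0. 1.]
```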
Article
Full-text available
Scientific software applications are increasingly developed by large interdisciplinary teams operating on functional modules organized around a common software framework, which is capable of integrating new functional capabilities without modifying the core of the framework. In such an environment, software correctness and modularity take precedence at...
Article
Full-text available
Large, complex, multi-scale, multi-physics simulation codes, running on high performance computing (HPC) platforms, have become essential to advancing science and engineering. These codes simulate multi-scale, multi-physics phenomena with unprecedented fidelity on petascale platforms, and are used by large communities. Continued ability of these c...
Conference Paper
Emerging exascale architectures bring forth new challenges related to heterogeneous systems power, energy, cost, and resilience. These new challenges require a shift from conventional paradigms in understanding how to best exploit and optimize these features and limitations. Our objective is to identify the top few dominant characteristics in a set...
Conference Paper
Full-text available
Automatic performance tuning of computationally intensive kernels in scientific applications is a promising approach to achieving good performance on different machines while preserving the kernel implementation's readability and portability. A major bottleneck in automatic performance tuning is the computation time required to test a large number...
Conference Paper
Amdahl's law has been one of the factors influencing speedup in high performance computing over the last few decades. While Amdahl's approach of optimizing the 10% of the code where 90% of the execution time is spent has worked very well in the past, new challenges related to emerging exascale heterogeneous architectures, combined with stringent p...
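Amdahl's bound in formula form, with illustrative numbers:

```python
def amdahl_speedup(parallel_fraction, n_procs):
    """Upper bound on speedup when only parallel_fraction of the work scales."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_procs)

# Even with 90% of the runtime parallelized, speedup saturates near 10x:
for n in [2, 8, 64, 1024]:
    print(n, round(amdahl_speedup(0.9, n), 2))
```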
Conference Paper
Full-text available
Finite-difference, stencil-based discretization approaches are widely used in the solution of partial differential equations describing physical phenomena. Newton-Krylov iterative methods commonly used in stencil-based solutions generate matrices that exhibit diagonal sparsity patterns. To exploit these structures on modern GPUs, we extend the stan...
Article
Full-text available
Automatic differentiation is a technique for the rule-based transformation of a subprogram that computes some mathematical function into a subprogram that computes the derivatives of that function. Automatic differentiation algorithms are typically expressed as operating on a weighted term graph called a linearized computational graph. Constructi...
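One standard way to realize such a rule-based derivative transformation is forward-mode AD with dual numbers; a minimal sketch (the paper itself operates on linearized computational graphs, not on dual numbers):

```python
class Dual:
    """Number carrying a value and its derivative w.r.t. one chosen input."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        return Dual(self.val + o.val, self.dot + o.dot)
    def __mul__(self, o):
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.val * o.val,
                    self.dot * o.val + self.val * o.dot)

def f(x, y):
    return x * x + x * y   # d/dx = 2x + y

x = Dual(3.0, 1.0)  # seed: differentiate with respect to x
y = Dual(2.0, 0.0)
print(f(x, y).dot)  # 8.0 = 2*3 + 2
```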
Article
Full-text available
The Build to Order (BTO) system compiles a sequence of matrix and vector operations into a high-performance C program for a given architecture. We focus on optimizing programs where memory traffic is the bottleneck. Loop fusion and data parallelism play an important role in this context, but applying them at every opportunity does not necessarily...
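The fusion idea in miniature: two vector operations that each stream through memory can be combined into one pass over the data; the arrays and operations below are illustrative (in Python the explicit loop is slow, but a compiler fusing the corresponding C loops gets the memory-traffic benefit):

```python
import numpy as np

def unfused(a, b, c):
    # Two passes over the data: the temporary t is written, then re-read.
    t = a + b
    return t * c

def fused(a, b, c):
    # One pass: each element is loaded once and combined immediately,
    # which is what loop fusion achieves in generated C code.
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = (a[i] + b[i]) * c[i]
    return out
```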
Article
Full-text available
Scientific programmers often turn to vendor-tuned Basic Linear Algebra Subprograms (BLAS) to obtain portable high performance. However, many numerical algorithms require several BLAS calls in sequence, and those successive calls result in suboptimal performance. The entire sequence needs to be optimized in concert. Instead of vendor-tuned BLAS, a p...
Conference Paper
Full-text available
A significant number of large optimization problems exhibit structure known as partial separability, for example, least squares problems, where elemental functions are gathered into groups that are then squared. The sparsity of the Jacobian of a partially separable function can be exploited by computing the smaller Jacobians of the elemental functi...
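A small illustration of partial separability: the Jacobian of a vector of elemental functions, each touching only a few variables, assembles from tiny elemental Jacobians (the functions here are made up):

```python
import numpy as np

# Elemental functions, each depending on only a few of the variables:
# g0(x) = x0 * x1,  g1(x) = x1 + x2,  g2(x) = x2 ** 2
# Each entry: (variable indices, function, gradient w.r.t. those variables)
elements = [
    ((0, 1), lambda v: v[0] * v[1], lambda v: [v[1], v[0]]),
    ((1, 2), lambda v: v[0] + v[1], lambda v: [1.0, 1.0]),
    ((2,),   lambda v: v[0] ** 2,   lambda v: [2.0 * v[0]]),
]

def sparse_jacobian(x, elements, n):
    """Assemble the mostly-zero Jacobian row by row from the small
    elemental Jacobians, exploiting partial separability."""
    J = np.zeros((len(elements), n))
    for row, (idx, _, grad) in enumerate(elements):
        v = [x[i] for i in idx]
        for col, g in zip(idx, grad(v)):
            J[row, col] = g
    return J

print(sparse_jacobian(np.array([1.0, 2.0, 3.0]), elements, 3))
```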
Conference Paper
Full-text available
Earth system models rely on past observations and knowledge to simulate future climate states. Because of the inherent complexity, a substantial uncertainty exists in model-based predictions. Evaluation and improvement of model codes are one of the priorities of climate science research. Automatic Differentiation enables analysis of sensitivities o...
Article
Full-text available
Many scientific applications benefit from the accurate and efficient computation of derivatives. Automatically generating these derivative computations from an application's source code offers a competitive alternative to other approaches, such as less accurate numerical approximations or labor-intensive analytical implementations. ADIC2 is a source...
Article
In order to improve carbon cycling within Earth System Models, crop representation for corn, spring wheat, and soybean species has been incorporated into the latest version of the Community Land Model (CLM), the land surface model in the Community Earth System Model. As a means to evaluate and improve the CLM-Crop model, we will determine the sensi...
Conference Paper
High-precision accelerator modeling is essential for particle accelerator design and optimization. However, this modeling presents a significant computational challenge. We discuss performance modeling of and computational quality of service (CQoS) results from Synergia2, an advanced particle accelerator simulation code developed under the ComPASS...
Article
Full-text available
We present interim results of a two-pronged study of the application of the Argonne National Laboratory Blue Gene/P (BG/P) supercomputer to the study of precision cosmology: simulations of the large-scale structure of the universe (computational cosmology) and radiative transfer calculations for Type Ia supernova (SNIa) explosions. We show that por...