Byunghyun Jang

University of Mississippi, Mississippi, United States

Publications (19) · 6.67 Total impact

  • Tuan Ta · Kyoshin Choo · Eh Tan · Byunghyun Jang · Eunseo Choi
    ABSTRACT: DynEarthSol3D (Dynamic Earth Solver in Three Dimensions) is a flexible, open-source finite element solver that models the momentum balance and heat transfer of elasto-visco-plastic material in the Lagrangian form using unstructured meshes. It provides a platform for the study of the long-term deformation of Earth's lithosphere and of various problems in civil and geotechnical engineering. However, the continuous computation and update of a very large mesh imposes an intolerably high computational burden on developers and users in practice. For example, simulating a small input mesh containing around 3,000 elements over 20 million time steps would take more than 10 days on a high-end desktop CPU. In this paper, we explore tightly coupled CPU-GPU heterogeneous processors to address this computational burden by leveraging their new features and developing hardware-architecture-aware optimizations. Our key optimization techniques are three-fold: memory access pattern improvement, data transfer elimination, and kernel launch overhead minimization. Experimental results show that our implementation on a tightly coupled heterogeneous processor outperforms all other alternatives, including a traditional discrete GPU, a quad-core CPU using OpenMP, and a serial implementation, by 67%, 50%, and 154%, respectively, even though the embedded GPU in the heterogeneous processor has significantly fewer cores than the high-end discrete GPU.
    No preview · Article · Mar 2015 · Computers & Geosciences
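    One of the three optimizations named above, data transfer elimination, can be illustrated with mapped (zero-copy) memory, which lets the embedded GPU of a tightly coupled processor work on host memory in place instead of copying buffers back and forth. The sketch below is a minimal stand-alone example of that idea, not the paper's code; the kernel and buffer names are ours.

      // Minimal zero-copy sketch: the GPU operates directly on mapped host
      // memory, so no cudaMemcpy is needed. Illustrative only.
      #include <cstdio>
      #include <cuda_runtime.h>

      __global__ void scale(float *v, float s, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) v[i] *= s;                 // reads/writes host memory in place
      }

      int main() {
          const int n = 1 << 20;
          cudaSetDeviceFlags(cudaDeviceMapHost);          // enable mapped memory
          float *h;
          cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocMapped);
          for (int i = 0; i < n; ++i) h[i] = 1.0f;

          float *d;                                       // device alias, no copy
          cudaHostGetDevicePointer((void **)&d, h, 0);
          scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
          cudaDeviceSynchronize();                        // results visible in h
          printf("%f\n", h[0]);                           // prints 2.000000
          cudaFreeHost(h);
          return 0;
      }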
  • Zhangping Wei · Byunghyun Jang · Yafei Jia
    ABSTRACT: The GPU offers a number of unique benefits to scientific simulation and visualization. Its superior computing capability and its interoperability with graphics libraries are two benefits that make the GPU a platform of choice. In this paper, we present a fast and interactive heat conduction simulator on GPUs using CUDA and OpenGL. The numerical solution of a two-dimensional heat conduction equation is decomposed into two directions, reducing the problem to solving tridiagonal linear systems. To achieve fast simulation, a widely used implicit scheme, alternating direction implicit (ADI), is accelerated on GPUs using GPU-based parallel tridiagonal solvers. We investigate the performance bottlenecks of the solver and optimize it with several methods. In addition, we conduct thorough evaluations of the GPU-based ADI solver's performance with three different tridiagonal solvers. Furthermore, our design takes advantage of efficient CUDA-OpenGL interoperability to make the simulation interactive in real time. The proposed interactive visualization simulator can serve as a building block for numerous advanced emergency management systems in engineering practice.
    No preview · Article · Nov 2014 · Journal of Computational and Applied Mathematics
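    For context, the per-direction work inside an ADI step is a batch of independent tridiagonal solves, one per grid row or column. Below is a minimal CUDA sketch of the simplest GPU mapping, a serial Thomas solve per thread; it is our own illustration rather than the paper's solver, and the row-major layout, names, and the n <= 512 bound are assumptions.

      // Each thread solves one tridiagonal system with the Thomas algorithm.
      // a, b, c: sub/main/super diagonals; d: right-hand side (used as
      // scratch); x: solution. System s occupies indices [s*n, s*n + n).
      #include <cuda_runtime.h>

      __global__ void thomas_rows(const float *a, const float *b, const float *c,
                                  float *d, float *x, int n, int num_systems) {
          int s = blockIdx.x * blockDim.x + threadIdx.x;
          if (s >= num_systems) return;
          const float *as = a + s * n, *bs = b + s * n, *cs = c + s * n;
          float *ds = d + s * n, *xs = x + s * n;

          float cp[512];                           // modified super-diagonal
          cp[0] = cs[0] / bs[0];
          ds[0] = ds[0] / bs[0];
          for (int i = 1; i < n; ++i) {            // forward elimination
              float m = 1.0f / (bs[i] - as[i] * cp[i - 1]);
              cp[i] = cs[i] * m;
              ds[i] = (ds[i] - as[i] * ds[i - 1]) * m;
          }
          xs[n - 1] = ds[n - 1];
          for (int i = n - 2; i >= 0; --i)         // back substitution
              xs[i] = ds[i] - cp[i] * xs[i + 1];
      }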
  • Kyoshin Choo · William Panlener · Byunghyun Jang
    ABSTRACT: Processing elements such as CPUs and GPUs depend on cache technology to bridge the classic processor-memory subsystem performance gap. As GPUs evolve into general-purpose co-processors that share the load with CPUs, good cache design and use become increasingly important. While both CPUs and GPUs must cooperate and perform well, their memory access patterns are very different. On CPUs, only a few threads access memory simultaneously. On GPUs, there is significantly higher memory access contention among thousands of threads. Despite such different behavior, there is little research that investigates the behavior and performance of GPU caches in depth. In this paper, we present an extensive study on the characterization and improvement of GPU cache behavior and performance for general-purpose workloads, using a cycle-accurate ISA-level GPU architectural simulator that models one of the latest GPU architectures, Graphics Core Next (GCN) from AMD. Our study makes the following observations and improvements. First, we observe that the L1 vector data cache hit rate is substantially lower than that of CPU caches. The main culprit is compulsory misses caused by a lack of data reuse among massively simultaneous threads. Second, there is significant memory access contention in the shared L2 data cache, accounting for up to 19% of total accesses for some benchmarks. This contention remains a main performance barrier in the L2 data cache even though its hit rate is high. Third, we demonstrate that memory access coalescing plays a critical role in reducing memory traffic. Finally, we find that inter-workgroup locality exists and can affect cache behavior and performance. Our experimental results show that memory performance can be improved by 1) a shared L1 vector data cache, where multiple compute units share a single cache to exploit inter-workgroup locality and increase data reusability, and 2) clustered workgroup scheduling, where workgroups with consecutive IDs are assigned to the same compute unit.
    No preview · Conference Paper · Jun 2014
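    The coalescing observation above can be made concrete with a toy pair of kernels: the first issues one contiguous span of addresses per warp, while the second scatters lanes across memory so that each lane may touch a separate cache line, multiplying memory traffic. This is our own illustration of the general effect, not code from the paper.

      __global__ void copy_coalesced(const float *in, float *out, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) out[i] = in[i];     // warp covers one contiguous span
      }

      __global__ void copy_strided(const float *in, float *out, int n, int stride) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          int j = (int)(((long long)i * stride) % n);   // scatter lanes
          if (i < n) out[j] = in[j];     // each lane may hit its own cache line
      }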
  • Zhangping Wei · Byunghyun Jang · Yaoxin Zhang · Yafei Jia
    ABSTRACT: We present a parallel Alternating Direction Implicit (ADI) solver on GPUs. Our implementation significantly improves existing implementations in two aspects. First, we address the scalability issue of existing Parallel Cyclic Reduction (PCR) implementations by eliminating their hardware resource constraints. As a result, our parallel ADI solver, which is based on PCR, no longer has a maximum domain size limitation. Second, we optimize the inefficient data accesses of the parallel ADI solver by leveraging hardware texture memory and matrix transpose techniques. These memory optimizations make the already parallelized ADI solver twice as fast, achieving an overall speedup of more than 100 times over a highly optimized CPU version. We also present an analysis of the numerical accuracy of the proposed parallel ADI solver.
    Full-text · Article · Dec 2013 · Procedia Computer Science
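    As background, parallel cyclic reduction eliminates the sub- and super-diagonal couplings in log2(n) steps until every unknown can be solved independently. The kernel below is a compact single-block sketch of that scheme under assumptions of ours (one system per block, n = blockDim.x a power of two, shared memory sized to 4*n floats); the paper's contribution is precisely removing the resource limits this naive version has.

      // Launch as: pcr_single<<<1, n, 4 * n * sizeof(float)>>>(a, b, c, d, x, n);
      __global__ void pcr_single(const float *a_in, const float *b_in,
                                 const float *c_in, const float *d_in,
                                 float *x, int n) {
          extern __shared__ float sm[];
          float *a = sm, *b = a + n, *c = b + n, *d = c + n;
          int i = threadIdx.x;
          a[i] = a_in[i]; b[i] = b_in[i]; c[i] = c_in[i]; d[i] = d_in[i];
          __syncthreads();

          for (int delta = 1; delta < n; delta <<= 1) {
              int lo = i - delta, hi = i + delta;
              float alpha = (lo >= 0) ? -a[i] / b[lo] : 0.0f;
              float beta  = (hi <  n) ? -c[i] / b[hi] : 0.0f;
              float na = (lo >= 0) ? alpha * a[lo] : 0.0f;
              float nc = (hi <  n) ? beta  * c[hi] : 0.0f;
              float nb = b[i] + ((lo >= 0) ? alpha * c[lo] : 0.0f)
                              + ((hi <  n) ? beta  * a[hi] : 0.0f);
              float nd = d[i] + ((lo >= 0) ? alpha * d[lo] : 0.0f)
                              + ((hi <  n) ? beta  * d[hi] : 0.0f);
              __syncthreads();             // all reads of old values done
              a[i] = na; b[i] = nb; c[i] = nc; d[i] = nd;
              __syncthreads();             // all writes of new values done
          }
          x[i] = d[i] / b[i];              // systems are now fully decoupled
      }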
  • Rafael Ubal · Byunghyun Jang · Perhaad Mistry · Dana Schaa · David Kaeli
    ABSTRACT: Accurate simulation is essential for the proper design and evaluation of any computing platform. With the current move toward the CPU-GPU heterogeneous computing era, researchers need a simulation framework that can model both kinds of computing devices and their interaction. In this paper, we present Multi2Sim, an open-source, modular, and fully configurable toolset that enables ISA-level simulation of an x86 CPU and an AMD Evergreen GPU. Focusing on a model of the AMD Radeon 5870 GPU, we address program emulation correctness, as well as architectural simulation accuracy, using AMD's OpenCL benchmark suite. Simulation capabilities are demonstrated with a preliminary architectural exploration study and workload characterization examples. The project source code, benchmark packages, and a detailed user's guide are publicly available at www.multi2sim.org.
    Full-text · Conference Paper · Sep 2012
  • ABSTRACT: Digital Breast Tomosynthesis (DBT) is a technology that mitigates many of the shortcomings associated with traditional mammography. Using multiple low-dose x-ray projections with an iterative maximum-likelihood estimation method, DBT is able to create a high-quality, three-dimensional reconstruction of the breast. However, the viability of DBT depends largely on reducing its execution time to a level acceptable in a clinical setting. In this work we accelerate our DBT algorithm on the latest generation of NVIDIA's CUDA-enabled GPUs, reducing the execution time to under 20 seconds for eight iterations (the number usually required to obtain a clean reconstruction). Moreover, with the execution time substantially decreased, a number of additional benefits can be achieved, such as using redundant computations to prevent inaccuracies or artifacts introduced by transient faults or other memory errors during execution. We also supply the high-level algorithms and thread-mapping strategy (for both the CPU and GPUs) for creating a multiple-GPU version of the algorithm, and discuss how these choices play to the strengths of the GPU architecture.
    Full-text · Article · Dec 2011
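    The iterative core that dominates this kind of reconstruction is a multiplicative maximum-likelihood (MLEM-style) update: forward project the current estimate, compare it against the measured projections, backproject the ratio, and scale the image. The kernel below sketches only that final element-wise update; the projection operators, names, and normalization handling are our assumptions, not the paper's implementation.

      // Hypothetical MLEM update step; backproj holds A^T(y / A x) and
      // norm holds A^T(1), both precomputed by (omitted) projection kernels.
      __global__ void mlem_update(float *image,           // current estimate
                                  const float *backproj,  // backprojected ratio
                                  const float *norm,      // backprojected ones
                                  int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n && norm[i] > 0.0f)
              image[i] *= backproj[i] / norm[i];  // x_j <- x_j * (A^T(y/Ax))_j / (A^T 1)_j
      }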
  • Byunghyun Jang · Dana Schaa · Perhaad Mistry · David Kaeli
    ABSTRACT: The introduction of General-Purpose computation on GPUs (GPGPUs) has changed the landscape for the future of parallel computing. At the core of this phenomenon are massively multithreaded, data-parallel architectures possessing impressive acceleration ratings, offering low-cost supercomputing together with attractive power budgets. Even given the numerous benefits provided by GPGPUs, there remain a number of barriers that delay wider adoption of these architectures. One major issue is the heterogeneous and distributed nature of the memory subsystem commonly found on data-parallel architectures. Application acceleration is highly dependent on being able to utilize the memory subsystem effectively so that all execution units remain busy. In this paper, we present techniques for enhancing the memory efficiency of applications on data-parallel architectures, based on the analysis and characterization of memory access patterns in loop bodies; we target vectorization via data transformation to benefit vector-based architectures (e.g., AMD GPUs) and algorithmic memory selection for scalar-based architectures (e.g., NVIDIA GPUs). We demonstrate the effectiveness of our proposed methods with kernels from a wide range of benchmark suites. For the benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4× and 13.5× over baseline GPU implementations on each platform, respectively) by applying our proposed methodology.
    Full-text · Article · Feb 2011 · IEEE Transactions on Parallel and Distributed Systems
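    A representative instance of the data transformations discussed above is the array-of-structures to structure-of-arrays conversion: with SoA, each field is read at unit stride, which benefits both vector-based and scalar-based GPU memory systems. The pair of kernels below is our own illustrative example, not the paper's code.

      // AoS: each lane strides over whole structs to reach one field.
      struct ParticleAoS { float x, y, z, w; };

      __global__ void push_aos(ParticleAoS *p, float dt, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) p[i].x += dt * p[i].w;   // 16-byte stride per lane
      }

      // SoA: each field is a contiguous array, giving unit-stride,
      // coalesced (and vectorizable) accesses.
      __global__ void push_soa(float *x, const float *w, float dt, int n) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) x[i] += dt * w[i];
      }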
  • ABSTRACT: As general-purpose computing on Graphics Processing Units (GPGPU) matures, increasingly complex scientific applications are being targeted to exploit the data-level parallelism available on a GPU. Implementing physically based simulation on data-parallel hardware requires preprocessing overhead that affects application performance. We discuss our implementation of physics-based data structures that provide significant performance improvements when used on data-parallel hardware. These data structures allow us to maintain a physics-based abstraction of the underlying data, reduce programmer effort, and obtain a 6x-8x speedup over previously implemented GPU kernels.
    Full-text · Conference Paper · Jun 2010
  • ABSTRACT: Loop vectorization, a key feature exploited to obtain high performance on Single Instruction Multiple Data (SIMD) vector architectures, is significantly hindered by irregular memory access patterns in the data stream. This paper describes data transformations that allow us to vectorize loops targeting massively multithreaded data-parallel architectures. We present a mathematical model that captures loop-based memory access patterns and computes the most appropriate data transformations in order to enable vectorization. Our experimental results show that the proposed data transformations can significantly increase the number of loops that can be vectorized and enhance the data-level parallelism of applications. Our results also show that the overhead associated with our data transformations can be easily amortized as the size of the input data set increases. For the set of high-performance benchmark kernels studied, we achieve consistent and significant performance improvements (up to 11.4X) by applying vectorization using our data transformation approach.
    Full-text · Conference Paper · May 2010
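    To make the modeling idea concrete: a loop-body access such as A[c0 + ci*i + cj*j] is fully described by its offset and per-index strides, and a data transformation amounts to rewriting those coefficients; for example, transposing the underlying array swaps the strides so that the vectorized index becomes unit-stride. The toy struct below is our rendering of that idea, not the paper's formulation.

      // Affine model of an access A[c0 + ci*i + cj*j] in a doubly nested loop.
      struct AffineAccess {
          int c0, ci, cj;                     // offset and per-index strides
          __host__ __device__ int at(int i, int j) const {
              return c0 + ci * i + cj * j;
          }
      };

      // If threads vary j but cj != 1, accesses are strided; after transposing
      // the data, the same logical access is described by swapped coefficients,
      // and the thread-varying stride becomes unit-stride.
      __host__ __device__ AffineAccess transpose(AffineAccess a) {
          return AffineAccess{a.c0, a.cj, a.ci};
      }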
  • Malak Alshawabkeh · Byunghyun Jang · David R. Kaeli
    ABSTRACT: The Local Outlier Factor (LOF) is a powerful anomaly detection method used in machine learning and classification. The algorithm defines the notion of a local outlier, in which the degree to which an object is outlying depends on the density of its local neighborhood, and each object can be assigned an LOF which represents the likelihood of that object being an outlier. Although this concept of a local outlier is a useful one, computing LOF values for every data object requires a large number of k-nearest-neighbor queries, and this computational overhead can limit the use of LOF. Given the growing popularity of Graphics Processing Units (GPUs) in general-purpose computing domains and the availability of high-level programming languages designed specifically for general-purpose applications (e.g., CUDA), we apply this parallel computing approach to accelerate LOF. In this paper we explore how to utilize a CUDA-based GPU implementation of the k-nearest-neighbor algorithm to accelerate LOF classification. We achieve more than a 100X speedup over a multi-threaded dual-core CPU implementation. We also consider the input data set size, the neighborhood size (i.e., the value of k), and the feature-space dimension, and report on their impact on execution time.
    Full-text · Conference Paper · Jan 2010
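    The dominant cost in LOF is the k-nearest-neighbor search, which maps naturally to one GPU thread per query point. The brute-force kernel below is a hedged sketch of that stage (the MAX_K bound, data layout, and names are our assumptions, not the paper's code); LOF then derives k-distances and reachability densities from the returned neighbor distances.

      #define MAX_K 32

      // Each thread scans all points for one query and keeps the k smallest
      // squared distances via insertion into a sorted register array.
      __global__ void knn_bruteforce(const float *pts,  // n x dim, row-major
                                     float *knn_dist,   // n x k output
                                     int n, int dim, int k) {
          int q = blockIdx.x * blockDim.x + threadIdx.x;
          if (q >= n || k > MAX_K) return;
          float best[MAX_K];
          for (int j = 0; j < k; ++j) best[j] = 1e30f;

          for (int p = 0; p < n; ++p) {
              if (p == q) continue;
              float d2 = 0.0f;
              for (int c = 0; c < dim; ++c) {          // squared Euclidean
                  float diff = pts[q * dim + c] - pts[p * dim + c];
                  d2 += diff * diff;
              }
              if (d2 < best[k - 1]) {                  // insert, keep sorted
                  int j = k - 1;
                  while (j > 0 && best[j - 1] > d2) { best[j] = best[j - 1]; --j; }
                  best[j] = d2;
              }
          }
          for (int j = 0; j < k; ++j) knn_dist[q * k + j] = best[j];
      }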
  • Byunghyun Jang · Dana Schaa · Perhaad Mistry · David Kaeli

    Full-text · Conference Paper · Jan 2010
  • Byunghyun Jang · David R. Kaeli · Synho Do · Homer H. Pien
    ABSTRACT: Although iterative reconstruction techniques (IRTs) have been shown to produce images of superior quality over conventional filtered back projection (FBP) based algorithms, the use of IRT in a clinical setting has been hampered by the significant computational demands of these algorithms. In this paper we present the results of our efforts to overcome this hurdle by exploiting the combined computational power of multiple graphics processing units (GPUs). We have implemented the forward and backward projection steps of reconstruction on NVIDIA Tesla S870 hardware using CUDA, accelerating forward projection by 71x and backward projection by 137x. We obtain these results with no perceptible difference in image quality between the GPU and serial CPU implementations. This work illustrates the power of commercial off-the-shelf, relatively low-cost GPUs, potentially allowing IRT tomographic image reconstruction to run in near real time, lowering the barrier to entry of IRT and enabling deployment in the clinic.
    Full-text · Conference Paper · Jun 2009
  • David R. Kaeli · Byunghyun Jang · Perhaad Mistry · Dana Schaa
    ABSTRACT: Given the rapid growth in the computational requirements of medical image analysis, Graphics Processing Units (GPUs) have begun to be utilized to address these demands. But even though GPUs are well suited to the underlying processing associated with medical image reconstruction, extracting the full benefits of moving to GPU platforms requires significant programming effort, which presents a fundamental barrier to the adoption of GPU acceleration in a wider range of medical imaging applications. In this paper we describe our experience in accelerating a number of challenging medical imaging applications, and discuss how we utilize profile-guided analysis to reap the full benefits available on GPU platforms. Our work considers different GPU architectures, as well as how to fully exploit the benefits of using multiple GPUs.
    Full-text · Conference Paper · Jun 2009
  • Byunghyun Jang · Synho Do · Homer H. Pien · David R. Kaeli
    ABSTRACT: Optimizing program execution targeted for Graphics Processing Units (GPUs) can be very challenging. Efficiently mapping serial code to a GPU or stream processing platform is a time-consuming task and is greatly hampered by a lack of detail about the underlying hardware. Programmers are left to rely on trial and error to produce optimized code. Recent publication of the underlying instruction set architecture (ISA) of the AMD/ATI GPU has allowed researchers to begin to propose aggressive optimizations. In this work, we present an optimization methodology that utilizes this information to accelerate programs on AMD/ATI GPUs. We start by defining optimization spaces that guide our work. We begin with disassembled machine code and collect program statistics provided by the AMD Graphics Shader Analyzer (GSA) profiling toolset. We explore optimizations targeting three different …
    Full-text · Conference Paper · Jan 2009
  • Byunghyun Jang · Yong-Bin Kim · Fabrizio Lombardi
    ABSTRACT: This paper proposes the control of monomer concentration as a novel improvement to the kinetic Tile Assembly Model (kTAM) to reduce the error rate in DNA self-assembly. Tolerance to errors in this process is very important for manufacturing scaffolds for highly dense ICs; the proposed technique significantly decreases error rates (i.e., it increases error tolerance) by controlling the concentration of the monomers (tiles) for a specific pattern to be assembled. By profiling, this feature is shown to be applicable to different tile sets. A stochastic analysis based on a new state model is presented, and the analysis is extended to the cases of single, double, and triple bonding. The kinetic trap model is modified to account for the different monomer concentrations. Different scenarios (such as dynamic and adaptive) for monomer control are proposed: in the dynamic (adaptive) control case, the concentration of each tile is assessed based on the current (average) demand during growth, as found by profiling the pattern. Significant error rate reductions are found by evaluating the proposed schemes against a scheme with constant concentration. One significant advantage of the proposed schemes is that they do not entail overheads such as an increase in size or slower growth, while still achieving a significant reduction in error rate. Simulation results are provided.
    Preview · Article · Jun 2008 · Journal of Electronic Testing
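    In the kinetic Tile Assembly Model, a tile type attaches at a rate proportional to its monomer concentration and detaches at a rate set by its total bond strength; the schemes above modulate the former per tile type instead of holding it constant. Below is a minimal host-side sketch of those two rates; the constants and the per-type concentration array are illustrative assumptions of ours.

      #include <cmath>

      const double kf = 1e6;        // forward rate constant (assumed value)

      // Attachment rate for tile type t, given per-type concentration conc[t];
      // under the standard constant-concentration kTAM this would be
      // kf * exp(-Gmc) for every type.
      double r_on(const double *conc, int t) { return kf * conc[t]; }

      // Detachment rate for a tile held by `bonds` matches of strength Gse
      // (in kT units), as in the usual kinetic trap model.
      double r_off(int bonds, double Gse) { return kf * std::exp(-bonds * Gse); }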
  • B. Jang · Y.-B. Kim · F. Lombardi
    ABSTRACT: This paper proposes a novel technique based on profiling the monomers for reducing the error rate in DNA self-assembly. The technique utilizes the average concentration of the monomers (tiles) for a specific pattern, as found by profiling its growth. The validity of profiling and the large differences in the concentrations of the monomers are shown to be applicable to different tile sets. To evaluate the error rate, new Markov-based models are proposed that account for the different types of bonding (i.e., single, double, and triple) in the monomers, as a modification of the commonly assumed kinetic trap model. A significant error rate reduction is accomplished compared to a scheme with constant concentration as commonly utilized under the kinetic trap model. Simulation results are provided.
    Preview · Conference Paper · Jan 2007
  • Byunghyun Jang · Yong-Bin Kim · Fabrizio Lombardi
    ABSTRACT: This paper proposes the control of monomer concentration as a novel improvement to the kinetic Tile Assembly Model (kTAM) to reduce the error rate in DNA self-assembly. Tolerance to errors in this process is very important for manufacturing highly dense ICs; the proposed technique significantly decreases error rates (i.e., it increases error tolerance) by controlling the concentration of monomers. A stochastic analysis based on a new state model is presented. Error rate reductions of at least 10% are found by evaluating the proposed scheme against a scheme with constant concentration. One significant advantage of the proposed scheme is that it does not entail overheads such as an increase in size or slower growth, while still achieving a significant reduction in error rate.
    Preview · Conference Paper · Jan 2006
  • B. Jang · M. Choi · N. Park · Y.B. Kim · V. Piuri · F. Lombardi
    ABSTRACT: In this paper, a new architecture of distributed embedded memory cores for SoC is proposed, and an effective memory repair method using the proposed spare line borrowing (software-driven reconfiguration) technique is investigated. It is known that faulty cells in memory cores exhibit spatial locality, also known as fault clustering. This physical phenomenon tends to occur more often as deep submicron technology advances, due to defects that span multiple circuit elements and increasingly sophisticated circuit design. The combination of the new architecture and repair method proposed in this paper enhances fault tolerance in SoCs, especially in the case of fault clustering. This enhancement is obtained through optimal redundancy utilization: spare redundancy in a fault-resistant memory core is used to fix faults in a fault-prone memory core. The effect of the spare line borrowing technique on the reliability of distributed memory cores is analyzed through modeling and extensive parametric simulation.
    Full-text · Conference Paper · Jun 2005
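    The spare-line-borrowing idea can be sketched as a small allocation routine: repair locally while spares remain, otherwise borrow from the core holding the most unused redundancy. The code below is our own simplified rendering under assumed structure and policy details, not the paper's design.

      // A core's local redundancy budget (illustrative fields).
      struct MemCore {
          int spares_total;
          int spares_used;
          int spares_free() const { return spares_total - spares_used; }
      };

      // Returns the index of the core whose spare repairs faulty_core's row,
      // or -1 if the whole distributed spare pool is exhausted.
      int repair_row(MemCore *cores, int num_cores, int faulty_core) {
          if (cores[faulty_core].spares_free() > 0) {   // local repair first
              cores[faulty_core].spares_used++;
              return faulty_core;
          }
          int donor = -1, best = 0;
          for (int i = 0; i < num_cores; ++i)           // borrow from richest donor
              if (i != faulty_core && cores[i].spares_free() > best) {
                  best = cores[i].spares_free();
                  donor = i;
              }
          if (donor >= 0) cores[donor].spares_used++;
          return donor;
      }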
  • Byunghyun Jang · Yong-Bin Kim
    ABSTRACT: In this paper, the modeling and evaluation of a multi-bank SRAM design with dynamic threshold and supply voltage control is presented to reduce leakage power. An SRAM bank, the unit of control, is switched from active mode (low threshold voltage and high supply voltage) to sleep mode (high threshold voltage and low supply voltage) whenever it is not frequently used. The change of modes is based on the temporal and spatial locality of memory accesses. The simulation results show that significant leakage reduction can be achieved by exploiting both spatial and temporal locality while minimizing re-synchronization penalties, when the superbank size is optimized for the characteristics of the application program.
    No preview · Article
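    A toy model of such a bank-level policy reads as follows: a bank idle for longer than a threshold is put to sleep (high threshold voltage, low supply voltage), and a hit on a sleeping bank pays a re-synchronization penalty, which is the trade-off that tuning the superbank size balances. All names and constants below are illustrative assumptions of ours.

      // Per-bank state for the sleep policy (illustrative).
      struct Bank {
          bool asleep = false;
          long long last_access = 0;
      };

      // Records an access at cycle `now` and returns the extra latency paid;
      // banks idle beyond idle_threshold are retired to sleep mode.
      long long access(Bank *banks, int num_banks, int bank,
                       long long now, long long idle_threshold,
                       long long wake_penalty) {
          long long latency = 0;
          if (banks[bank].asleep) {            // must re-synchronize first
              banks[bank].asleep = false;
              latency += wake_penalty;
          }
          banks[bank].last_access = now;
          for (int i = 0; i < num_banks; ++i)  // put idle banks to sleep
              if (!banks[i].asleep && now - banks[i].last_access > idle_threshold)
                  banks[i].asleep = true;
          return latency;
      }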