About
61
Publications
7,685
Reads
438
Citations
Publications (61)
The Vehicle Routing Problem (VRP) is fundamental to logistics operations. Finding optimal solutions for VRPs related to large, real-world operations is computationally expensive. Genetic algorithms (GA) have been used to find good solutions for different types of VRPs but are slow to converge. This work utilizes high-performance computing (HPC) pla...
Lookup operations for in-memory databases are heavily memory-bound because they often rely on pointer-chasing traversals of linked data structures. They also have many branches that are hard to predict due to random key lookups. In this study, we show that although cache misses are the primary bottleneck for these applications, without a method for el...
Indirect memory accesses have irregular access patterns that limit the performance of conventional software and hardware-based prefetchers. To address this problem, we propose the Array Tracking Prefetcher (ATP), which tracks array-based indirect memory accesses using a novel combination of software and hardware. ATP is first configured by special...
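To make the access pattern in the abstract above concrete, here is a minimal sketch of an array-based indirect access of the form `values[index[i]]`. It illustrates only the pattern that makes stride-based prefetching hard; it is not the ATP mechanism, and the names `gather_sum` and `strides` are hypothetical helpers introduced for this illustration.

```python
def gather_sum(values, index):
    """Sum values[index[i]] for every i: an array-based indirect access."""
    total = 0
    for i in range(len(index)):
        # The location read from `values` depends on data loaded from `index`,
        # so consecutive reads of `values` need not follow any fixed stride.
        total += values[index[i]]
    return total


def strides(index):
    """Return the distance between consecutive indirect accesses."""
    return [b - a for a, b in zip(index, index[1:])]


if __name__ == "__main__":
    values = [10, 20, 30, 40]
    index = [3, 0, 2]
    print(gather_sum(values, index))
    print(strides(index))
```

The reads of `index` are sequential, while the reads of `values` jump by whatever distances `strides` reports, which is the irregularity the abstract refers to.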
Lookup operations for in-memory databases are heavily memory-bound because they often rely on pointer-chasing traversals of linked data structures. They are also branch-heavy, with branches that are hard to predict due to random key lookups. In this study, we show that although cache misses are the primary bottleneck for these applications, without a me...
This article presents position statements and a question-and-answer session by panelists at the Fourth Workshop on Computer Architecture Research Directions. The subject of the debate was the use of field-programmable gate arrays versus GPUs in datacenters.
This article presents position statements and a question-and-answer session by panelists at the 4th Workshop on Computer Architecture Research Directions. The subject of the debate was proprietary versus free and open instruction set architectures.
This article presents position statements and a question-and-answer session by panelists at the 4th Workshop on Computer Architecture Research Directions. The subject of the debate was new technologies and their impact on future architectures.
After over two decades of extensive research on branch prediction, branch mispredictions are still an important performance/power bottleneck for today's aggressive processors. In our prior work, to further understand the causes for mispredictions, we presented a source-code based classification of branch mispredictions extending the prior work on p...
Current processors employ aggressive prediction mechanisms to improve performance and reduce power. It is increasingly important to understand and quantify a program's dynamic behavior to effectively design next-generation prediction mechanisms. In this paper, we develop algorithms and mechanisms inspired by DNA discovery tools to analyze and quant...
In recent years, privacy management has become one of the most complex processes in the connected world. Fundamental technologies like GPS, cellular communications, and the Internet have become essential equipment in the modern vehicle. Subsequently, the vehicle became part of this connected world, wherein data are constantly sent and received. Acc...
This paper explores the performance and energy efficiency of CUDA-enabled GPUs and multi-core SIMD CPUs using a set of kernels and full applications. Our implementations efficiently exploit both SIMD and thread-level parallelism on multi-core CPUs and the computational capabilities of CUDA-enabled GPUs. We discuss general optimization techniques fo...
This paper covers the design and FPGA-based prototyping of a full-featured multi-core platform for use in computer architecture research studies. Existing platforms for performing studies include software simulators and hardware-assisted simulators, but there are no modular full-hardware platforms designed to measure a wide range of performance met...
In this panel discussion from the 2009 Workshop on Computer Architecture Research Directions, David August and Keshav Pingali debate whether explicitly parallel programming is a necessary evil for applications programmers, assess the current state of parallel programming models, and discuss possible routes toward finding the programming model for t...
Simulation is an indispensable tool for evaluation and analysis throughout the development cycle of a computer system, and even after the computer system is built. How simulation should evolve as the complexity of computer systems continues to grow is an open question and the subject of this panel from the 2009 Workshop on Computer Architecture Res...
Branch prediction accuracy remains critical for high performance and low power. Prior work has studied the causes of branch mispredictions in order to provide insights into how better branch predictors can be designed. However, most previous work has considered only run-time classification of branch mispredictions, leaving a large number...
Software simulators remain several orders of magnitude slower than the modern microprocessor architectures they simulate. Although various reduced-time simulation tools are available to accurately help pick truncated benchmark simulation, they either come with a need for offline analysis of the benchmarks initially or require many iterative runs of...
Although high branch prediction accuracy is necessary for high performance, it typically comes at the cost of larger predictor tables and/or more complex prediction algorithms. Unfortunately, large predictor tables and complex algorithms require more chip area and have higher power consumption, which precludes their use in embedded processors. As a...
The core of current-generation high-performance multiprocessor systems is out-of-order execution processors with aggressive branch prediction. Despite their relatively high branch prediction accuracy, these processors still execute many memory instructions down mispredicted paths. Previous work that focused on uniprocessors showed that these wrong-...
Due to the long simulation time of the reference input set, computer architects often use reduced time simulation techniques to shorten the simulation time. However, what has not yet been thoroughly evaluated is the accuracy of these techniques relative to the reference input set and with respect to each other. To rectify this deficiency, this pape...
Today, with the increasing popularity of multicore processors, one approach to optimizing the processor's performance is to reduce the execution times of individual applications running on each core by designing and implementing more powerful cores. Another approach, which is the polar opposite of the first, optimizes the processor's performance by...
One of the primary concerns for microprocessor designers has always been balancing power and thermal management while minimizing performance loss. Rather than generate solutions to this dilemma, the advent of multicore chips has raised a host of new challenges. This discussion with Pradip Bose and Kanad Ghose, excerpted from a 2007 CARD Workshop Pa...
How can we ensure that platform hardware, firmware, and software work in concert to withstand rapidly evolving security threats? Architectural innovations bring performance gains but can also create new security vulnerabilities. In this panel discussion, from the 2007 Workshop on Computer Architecture Research Directions, we assess the current stat...
In this paper, we propose a new class of branch predictors, complementary branch predictors, which can be easily added to any branch predictor to improve the overall prediction accuracy. This mechanism differs from conventional branch predictors in that it focuses only on mispredicted branches. As a result, this mechanism has the advantages of scal...
To reduce the simulation time to a tractable amount or due to compilation (or other related) problems, computer architects often simulate only a subset of the benchmarks in a benchmark suite. However, if the architect chooses a subset of benchmarks that is not representative, the subsequent simulation results will, at best, be misleading or, at wor...
The viability of bus interconnection models is explored, using the multiple-valued logic (MVL) paradigm to reduce the cost and energy consumption of off-chip and on-chip address, data and instruction buses within system-on-a-chip platforms. Data can be transferred over the buses using ternary, balanced ternary or quaternary number systems, rather t...
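The number systems named in the abstract above can be sketched as follows. This is a minimal illustration of encoding one value in ternary, balanced ternary, or quaternary digits, and of how many radix-r digits are needed to cover a given binary width. It says nothing about the paper's actual bus models; the function names are hypothetical and chosen for this example only.

```python
def to_radix(value, radix):
    """Digits of a non-negative integer in the given radix, most significant first."""
    if value < 0 or radix < 2:
        raise ValueError("value must be >= 0 and radix must be >= 2")
    if value == 0:
        return [0]
    digits = []
    while value:
        value, digit = divmod(value, radix)
        digits.append(digit)
    return digits[::-1]


def to_balanced_ternary(value):
    """Digits in {-1, 0, 1} for any integer, most significant first."""
    if value == 0:
        return [0]
    digits = []
    while value != 0:
        digit = value % 3          # Python's % always yields 0, 1, or 2 here
        if digit == 2:
            digit = -1             # write 2 as (3 - 1): digit -1 plus a carry
        digits.append(digit)
        value = (value - digit) // 3
    return digits[::-1]


def digits_needed(bits, radix):
    """Smallest digit count whose radix-r range covers every `bits`-bit value."""
    count = 1
    while radix ** count < 2 ** bits:
        count += 1
    return count


if __name__ == "__main__":
    print(to_radix(255, 3), to_radix(255, 4), to_balanced_ternary(5))
    print(digits_needed(32, 3), digits_needed(32, 4))
```

Under these assumptions, a 32-bit value fits in 21 ternary digits or 16 quaternary digits, which is the kind of reduction in digit count that a higher radix offers over binary.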
High-performance multiprocessor systems built around out-of-order processors with aggressive branch predictors execute many memory references that turn out to be on a mispredicted branch path. Previous work that focused on uniprocessors showed that these wrong-path memory references may pollute the caches by bringing in data that are not needed on...
Uniprocessor studies have shown that wrong-path memory references pollute the caches by bringing in data that are not needed for the correct execution path and by evicting useful data or instructions. Additionally, they also increase the amount of cache and memory traffic. On the positive side, however, they may have a prefetching effect for loads...
In this paper, we propose three novel cache models using the multiple-valued logic (MVL) paradigm to reduce the cache data storage area and cache energy consumption for embedded systems. Multiple-valued caches have significant potential for compact and power-efficient cache array design. The cache models differ from each other depending on whether they...
The speculative execution of threads in a multithreaded architecture, plus the branch prediction used in each thread execution unit, allows many instructions to be executed speculatively, that is, before it is known whether they are actually needed by the program. In this study, we examine how the load instructions executed on what turn out to be incorr...
Due to the simulation time of the reference input set, architects often use alternative simulation techniques. Although these alternatives reduce the simulation time, what has not been evaluated is their accuracy relative to the reference input set, and with respect to each other. To rectify this deficiency, this paper uses three methods to charact...
Address correlation is a technique that links the addresses that reference the same data values. Using a detailed source-code level analysis, a recent study (1) revealed that different addresses containing the same data can often be correlated at run-time to eliminate on-chip data cache misses. In this paper, we study the upper-bound performance of...
Concurrent multithreaded architectures exploit both instruction-level and thread-level parallelism through a combination of branch prediction and thread-level control speculation. The resulting speculative issuing of load instructions in these architectures can significantly impact the performance of the memory hierarchy as the system exploits high...
Pointer-intensive and sparse numerical computations typically display irregular memory access behavior. This work presents a mathematical model, called the Self-tuning Adaptive Predictor (SAP), to characterize the behavior of load instructions in procedures with pointer-based data structures by using procedure call boundaries as the fundamental sam...
Concurrent multithreaded architectures exploit both instruction-level and thread-level parallelism in application programs. A single-threaded sequencing mechanism needs speculative execution beyond conditional branches in order to exploit more instruction-level parallelism. In addition, an aggressive multithreaded architecture should also use threa...
As the degree of instruction-level parallelism in superscalar architectures increases, the gap between processor and memory performance continues to grow, requiring more aggressive techniques to increase the performance of the memory system. Several data prefetching techniques have been proposed for hiding the latency of main memory accesses, all of...
Value reuse improves a processor's performance by dynamically caching the results of previous instructions into the value reuse table and reusing those results to bypass the execution of future instructions that have the same opcode and input operands. However, replacing the least recently used entries with the results of the current instructions c...
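The baseline mechanism described in the abstract above, a table keyed by opcode and input operands with least-recently-used replacement, can be sketched in software as follows. This is an illustrative model only, not the hardware design or the replacement policy the paper goes on to propose; the class name `ValueReuseTable`, its `execute` method, and the `compute` callback are assumptions made for this example.

```python
from collections import OrderedDict


class ValueReuseTable:
    """Software model of a value reuse table with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # (opcode, operands) -> cached result
        self.hits = 0
        self.misses = 0

    def execute(self, opcode, operands, compute):
        """Return the result, reusing a cached value when one exists."""
        key = (opcode, tuple(operands))
        if key in self.entries:
            # Same opcode and operands seen before: bypass execution.
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        result = compute(*operands)
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        self.entries[key] = result
        return result


if __name__ == "__main__":
    table = ValueReuseTable(capacity=2)
    table.execute("add", (1, 2), lambda a, b: a + b)
    table.execute("add", (1, 2), lambda a, b: a + b)
    print(table.hits, table.misses)
```

Because every miss inserts a new entry, an instruction that never repeats can still displace one that would have been reused, which is the weakness of plain LRU replacement that the abstract points to.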
Concurrent multithreaded architectures exploit both instruction-level and thread-level parallelism through a combination of branch prediction and thread-level control speculation. The resulting speculative issuing of load instructions in these architectures can significantly impact the performance of the memory hierarchy as the system exploits high...
Relatively little background work has been done to examine the miss behavior of all static and dynamic load instructions, especially in the context of the entire program. This study addresses this gap in knowledge by presenting the whole-program (as opposed to sampling) profiling results for load behavior. Specifically, this study confirms the conc...
We investigate a program phenomenon, Address Correlation, which links addresses that reference the same data. This work shows that different addresses containing the same data can often be correlated at run-time to eliminate a load miss or a partial hit. For ten of the SPEC CPU2000 benchmarks, 57 to 99% of all L1 data cache load misses, and 4 to 85% of...
Mathematical modeling is an important tool for understanding and improving the memory referencing behavior of the programs. For some programs, such as scientific codes operating on dense arrays or matrices, memory accesses exhibit strong regularity. However, pointer-intensive and sparse numerical computations typically display irregular memory acce...
Value reuse improves a processor’s performance by dynamically caching the results of previous instructions and reusing those results to bypass the execution of future instructions that have the same opcode and input operands. However, continually replacing the least recently used entries could eventually fill the value reuse table with instructions...
Value reuse improves a processor's performance by dynamically caching the results of previous instructions and reusing those results to bypass the execution of future instructions that have the same opcode and input operands. However, continually replacing the least recently used entries could eventually fill the value reuse table with instructions...
As the degree of instruction-level parallelism in superscalar architectures increases, the gap between processor and memory performance continues to grow, requiring more aggressive techniques to increase the performance of the memory system. We propose a new technique, which is based on the wrong-path execution of loads far beyond instruction fetch-...
This paper considers the problem of routing and wavelength assignment (RWA) in optical passive star networks with non-uniform traffic load. The problem can be considered as designing a logical topology over an optical passive star physical topology with a given non-uniform traffic. The approach uses the bipartite graphs and the concept of time and...
In this study, scattering of plane electromagnetic waves at the junction formed by a PEC half-plane and a half-plane with anisotropic conductivity is investigated. By using the Fourier transform technique the problem is formulated into a matrix Wiener-Hopf system and an exact closed-form solution is obtained for the most general case by factorizing...
The performance of a processor is limited by the specific bottlenecks that a benchmark exposes while running on that processor. Since the quantification of these bottlenecks can be extremely time-consuming, our prior work proposed using the Plackett and Burman design as a statistically rigorous but time-efficient method of determining the processo...
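For readers unfamiliar with the Plackett and Burman design mentioned in the abstract above, the sketch below builds the standard 12-run, 11-factor design from its usual generator row and estimates main effects from a set of responses. It shows the general statistical technique only. How factors map to processor parameters, and which variant of the design the authors used, are not stated in the truncated abstract, so the function names and the synthetic responses here are assumptions.

```python
# Standard generator row for the 12-run Plackett-Burman design (+1 = high, -1 = low).
PB12_GENERATOR = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]


def plackett_burman_12():
    """Return the 12-run design: 11 cyclic shifts of the generator plus an all-low run."""
    n = len(PB12_GENERATOR)
    rows = [PB12_GENERATOR[n - shift:] + PB12_GENERATOR[:n - shift]
            for shift in range(n)]
    rows.append([-1] * n)
    return rows


def main_effects(design, responses):
    """Mean response at the high level minus mean response at the low level, per factor."""
    half = len(design) / 2  # each factor is high in exactly half of the runs
    factors = len(design[0])
    return [sum(row[j] * y for row, y in zip(design, responses)) / half
            for j in range(factors)]


if __name__ == "__main__":
    design = plackett_burman_12()
    # Synthetic responses in which only factor 0 matters (illustrative data only).
    responses = [10 + 3 * row[0] for row in design]
    print(main_effects(design, responses))
```

The design needs only 12 runs to screen 11 two-level factors, compared with 2048 runs for a full factorial, which is the time saving the abstract alludes to; the cost is that interactions between factors are not separated from main effects.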
In this paper, we explore the potential of bus interconnection models using the Multiple-Valued Logic paradigm to reduce the power consumption of on-chip address and data buses within embedded SoC platforms. Data is sent over the buses using a radix-r number system, i.e., ternary, balanced ternary, or quaternary, rather than binary. This allows more co...
Out-of-order execution processors with aggressive branch prediction are the core of current-generation high-performance multiprocessor systems. Despite their relatively high branch prediction accuracies, these processors still execute many memory instructions on the mispredicted path. These wrong-path memory references pollute the caches and increa...
Thesis (Ph. D.)--University of Minnesota, 2003. Includes bibliographical references (leaves 107-114).