About
139 Publications
7,583 Reads
735 Citations
Publications (139)
Large Language Models (LLMs) are emerging as promising tools in hardware design and verification, with recent advancements suggesting they could fundamentally reshape conventional practices. In this survey, we analyze over 54 research papers to assess the current role of LLMs in enhancing automation, optimization, and innovation within hardware des...
Condition monitoring (CM) is essential for maintaining operational reliability and safety in complex machinery, particularly in robotic systems. Despite the potential of deep learning (DL) in CM, its ‘black box’ nature restricts its broader adoption, especially in mission-critical applications. Addressing this challenge, our research introduces a r...
In many Internet of Things (IoT) applications, knowing the device's location can be quite important for purposes such as asset tracking and inventory management, geolocation services, safety and security, environmental monitoring, and proximity-based interactions. Mobile users often experience mobile services/applications within an indoor or ou...
Optical Network on Chip (ONoC) is now considered a promising alternative to traditional electrical interconnects. Meanwhile, several challenges such as temperature and process variations, aging, crosstalk noise, and insertion loss endanger the data transmission reliability of ONoCs. Many investigations have been made to evaluate the effect of these...
Deep Convolutional Neural Networks (CNNs) have achieved impressive performance in edge detection tasks, but their large number of parameters often leads to high memory and energy costs for implementation on lightweight devices. In this paper, we propose a new architecture, called Efficient Deep-learning Gradients Extraction Network (EDGE-Net), that...
Deep Convolutional Neural Networks (CNNs) have achieved human-level performance in edge detection. However, there have not been enough studies on how to efficiently utilize the parameters of the neural network in edge detection applications. Therefore, the associated memory and energy costs remain high. In this paper, inspired by Depthwise Separabl...
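A minimal PyTorch sketch of the depthwise-separable idea this abstract refers to; the layer sizes below are hypothetical and the code is illustrative only, not the paper's architecture:

    import torch.nn as nn

    def count_params(module):
        return sum(p.numel() for p in module.parameters())

    in_ch, out_ch, k = 64, 128, 3  # hypothetical layer dimensions

    # Standard convolution: roughly in_ch * out_ch * k * k weights.
    standard = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=1)

    # Depthwise-separable replacement: a per-channel (depthwise) k x k
    # convolution followed by a 1x1 pointwise convolution that mixes channels.
    separable = nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=k, padding=1, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )

    print("standard:  ", count_params(standard))   # ~74k parameters
    print("separable: ", count_params(separable))  # ~9k parameters

For these sizes the separable pair needs roughly an eighth of the weights, which is where the memory and energy savings come from.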
Magnetic resonance imaging (MRI) is one of the best imaging techniques that produce high-quality images of objects. The long scan time is one of the biggest challenges in MRI acquisitions. To address this challenge, many researchers have aimed at finding methods to speed up the process. Faster MRI can reduce patient discomfort and motion artifacts....
Capsule Networks (CapsNets) are a generation of image classifiers with proven advantages over Convolutional Neural Networks (CNNs). Better robustness to affine transformation and overlapping image detection are some of the benefits associated with CapsNets. However, CapsNets cannot be classified as a resource-efficient deep learning architecture du...
Over the past several years, I have been interacting with an increasing number of Iranian scientists, including those currently living in Iran as well as others who are being educated elsewhere or have independent positions outside of that country. In all circumstances, the resulting collaborations have extended my own knowledge and allowed me to c...
Automatic Modulation Classification (AMC) is a well-known problem in the Radio Frequency (RF) domain. Solving this problem requires determining the modulation of an RF signal. Once the modulation is determined, the signal could be demodulated making it possible to analyse the signal for various purposes. Deep Neural Networks (DNNs) have recently pr...
A Capsule Network (CapsNet) is a relatively new classifier and one of the possible successors of Convolutional Neural Networks (CNNs). CapsNet maintains the spatial hierarchies between the features and outperforms CNNs at classifying images including overlapping categories. Even though CapsNet works well on small-scale datasets such as MNIST, it fa...
Capsule Network (CapsNet) is among the promising classifiers and a possible successor of the classifiers built based on Convolutional Neural Network (CNN). CapsNet is more accurate than CNNs in detecting images with overlapping categories and those with applied affine transformations. In this work, we propose a deep variant of CapsNet consisting of...
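For context only, a minimal sketch of the squash nonlinearity that capsule layers typically apply to their output vectors; it illustrates the standard CapsNet building block, not the specific deep variant proposed in the paper:

    import torch

    def squash(s, dim=-1, eps=1e-8):
        # Shrinks short vectors toward zero and long vectors toward unit
        # length while preserving direction, so a capsule's length can be
        # read as the probability that its entity is present.
        sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
        scale = sq_norm / (1.0 + sq_norm)
        return scale * s / torch.sqrt(sq_norm + eps)

    # Example: 2 samples, 10 capsules, 16-dimensional capsule outputs.
    caps = torch.randn(2, 10, 16)
    v = squash(caps)
    print(v.shape, float(v.norm(dim=-1).max()))  # all norms stay below 1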
Computer vision has seen wide usage and growing popularity in recent years. One of the areas of computer vision that has been studied is facial emotion recognition, which plays a crucial role in interpersonal communication. This paper tackles the problem of intraclass variances in the face images of emotion recognition datasets. We test the syste...
The Activity and Event Network Model (AEN) is a new security knowledge graph that leverages large dynamic uncertain graph theory to capture and analyze stealthy and long-term attack patterns. Because the graph is expected to become extremely large over time, it can be very challenging for security analysts to navigate it and identify meaningful info...
The focus of this work is to explore the use of quantum annealing solvers for the problem of phase unwrapping of synthetic aperture radar (SAR) images. Although solutions to this problem exist based on network programming, these techniques do not scale well to larger-sized images. Our approach involves formulating the problem as a quadratic unconst...
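The actual phase-unwrapping formulation is problem-specific; purely to illustrate the target form, the toy sketch below minimizes a small quadratic unconstrained binary optimization (QUBO) instance by brute force, with an arbitrary Q matrix rather than one derived from SAR data:

    import itertools
    import numpy as np

    # A QUBO asks for the binary vector x that minimizes x^T Q x.
    # Annealing hardware samples low-energy assignments; for a toy
    # instance we can simply enumerate all 2^n of them exactly.
    Q = np.array([[-1.0,  2.0,  0.0],
                  [ 0.0, -1.0,  2.0],
                  [ 0.0,  0.0, -1.0]])

    best_x, best_e = None, float("inf")
    for bits in itertools.product([0, 1], repeat=Q.shape[0]):
        x = np.array(bits)
        e = x @ Q @ x
        if e < best_e:
            best_x, best_e = x, e

    print("minimizing assignment:", best_x, "energy:", best_e)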
Using Error Detection Code (EDC) and Error Correction Code (ECC) is a noteworthy way to increase the robustness of cache memories against soft errors. EDC enables detecting errors in cache memory while ECC is used to correct erroneous cache blocks. ECCs are often costly as they impose considerable area and energy overhead on cache memory. Reducing this ov...
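As a textbook illustration of the detection-versus-correction distinction (not the protection scheme proposed in the paper), the sketch below uses a single even-parity bit as an EDC and a Hamming(7,4) code as an ECC for a 4-bit word:

    def parity(bits):
        # Error *detection*: a single even-parity bit over the word.
        return sum(bits) % 2

    def hamming74_encode(d):
        # Error *correction*: Hamming(7,4), positions 1..7, parity at 1, 2, 4.
        d1, d2, d3, d4 = d
        p1 = d1 ^ d2 ^ d4
        p2 = d1 ^ d3 ^ d4
        p3 = d2 ^ d3 ^ d4
        return [p1, p2, d1, p3, d2, d3, d4]

    def hamming74_correct(c):
        # The syndrome gives the (1-indexed) position of a single flipped bit.
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
        pos = s1 + 2 * s2 + 4 * s4
        if pos:
            c = c.copy()
            c[pos - 1] ^= 1
        return [c[2], c[4], c[5], c[6]]  # recovered data bits

    data = [1, 0, 1, 1]
    code = hamming74_encode(data)
    code[5] ^= 1                                             # inject a soft error
    assert parity(code) != parity(hamming74_encode(data))    # EDC: detected
    assert hamming74_correct(code) == data                   # ECC: corrected

The extra parity bits of the ECC are exactly the area and energy overhead the abstract refers to.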
The focus of this work is to explore the use of quantum annealing solvers for the problem of phase unwrapping of synthetic aperture radar (SAR) images. Although solutions to this problem exist based on network programming, these techniques do not scale well to larger-sized images. Our approach involves formulating the problem as a quadratic unconst...
The branch predictor unit (BPU) is among the main energy-consuming components in out-of-order (OoO) processors. For integer applications, we find 16% of the processor energy is consumed by the BPU. The BPU is accessed in parallel with the instruction cache before it is known if a fetch group contains control instructions. We find 85% of BPU lookups are...
It has been more than a decade since general-purpose applications targeted GPUs to benefit from the enormous processing power they offer. However, not all applications gain speedup running on GPUs. If an application does not have enough parallel computation to hide memory latency, running it on a GPU will degrade the performance compared to what it...
Using software-managed cache in CUDA programming provides significant potential to improve memory efficiency. Employing this feature requires the programmer to identify data tiles associated with thread blocks and bring them to the cache explicitly. Despite the advantages, the development effort required to exploit this feature can be significant....
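To make the tiling idea concrete, here is the classic shared-memory (software-managed cache) tiled matrix multiply written with Numba's CUDA bindings; it assumes a CUDA-capable GPU and matrix sizes that are multiples of the tile width, and is a generic illustration rather than code from the paper:

    import numpy as np
    from numba import cuda, float32

    TPB = 16  # tile width = threads per block in each dimension

    @cuda.jit
    def matmul_tiled(A, B, C):
        # Each thread block stages one tile of A and one tile of B in the
        # software-managed cache (shared memory) before computing on it.
        sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
        sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
        x, y = cuda.grid(2)
        tx, ty = cuda.threadIdx.x, cuda.threadIdx.y
        acc = 0.0
        for t in range(A.shape[1] // TPB):
            sA[ty, tx] = A[y, t * TPB + tx]   # cooperative tile load
            sB[ty, tx] = B[t * TPB + ty, x]
            cuda.syncthreads()                # wait until the tile is loaded
            for k in range(TPB):
                acc += sA[ty, k] * sB[k, tx]
            cuda.syncthreads()                # wait before overwriting the tile
        if y < C.shape[0] and x < C.shape[1]:
            C[y, x] = acc

    n = 256  # multiple of TPB, so there are no partial tiles to handle
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)
    C = np.zeros((n, n), dtype=np.float32)
    matmul_tiled[(n // TPB, n // TPB), (TPB, TPB)](A, B, C)
    assert np.allclose(C, A @ B, atol=1e-3)

Identifying the per-block tiles, loading them cooperatively, and placing the synchronization barriers is precisely the development effort the abstract describes.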
During recent years, GPU micro-architectures have changed dramatically, evolving into powerful many-core deep-multithreaded platforms for parallel workloads. While important micro-architectural modifications continue to appear in every new generation of these processors, unfortunately, little is known about the details of these innovative designs....
The OpenACC programming model has been developed to simplify accelerator programming and improve development productivity. In this article, we investigate the main limitations faced by OpenACC in harnessing all capabilities of GPU-like accelerators. We build on our findings and discuss the opportunity to exploit a software-managed cache as (i) a fa...
Different applications have different memory and computational demands. Therefore, obtainable performance and energy efficiency on a GPU depend on how well the GPU resources and application demands are balanced. In this study, we present a Neural Network-based predictor to model the power and performance of GPGPU applications. The proposed mode...
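A minimal sketch of the idea using scikit-learn; the kernel features and targets below are synthetic placeholders, not the model, feature set, or data used in the study:

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)

    # Hypothetical per-kernel features: [active warps, memory intensity,
    # branch divergence ratio, shared-memory usage]. Targets are synthetic
    # stand-ins for (power in watts, execution time in ms).
    X = rng.random((200, 4))
    y = np.column_stack([
        60 + 80 * X[:, 0] - 20 * X[:, 2],   # synthetic "power"
        5 + 30 * X[:, 1] + 10 * X[:, 2],    # synthetic "runtime"
    ])

    model = make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0),
    )
    model.fit(X, y)
    print(model.predict(X[:3]))  # predicted (power, runtime) for three kernels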
In this paper we propose a scalable, low-power and highly accurate prediction-based cache coherence solution. We introduce Speculative Multicasting (or simply SM) as a multi-casting cache coherency solution. SM maintains coherency by selectively sending snoop requests to a subset of nodes. This subset is selected by recording previous snoop outcome...
The graphics processing unit (GPU) is the most promising candidate platform for achieving faster improvements in peak processing speed, low latency and high performance. The highly programmable and multithreaded nature of GPUs makes them a remarkable candidate for general purpose computing. However, supporting non-graphics computing on graphics pro...
In this paper we introduce IPMACC, a framework for translating OpenACC applications to CUDA or OpenCL. IPMACC is composed of a set of translators that translate OpenACC for C applications to CUDA or OpenCL. The framework uses the system compiler (e.g. nvcc) for generating the final accelerator binary. The framework can be used for extending the OpenACC AP...
In this paper, we study application behavior in GPGPUs. We investigate how data type impacts performance in different applications. As we show, some applications can, as expected, take significant advantage of small data types. Such applications benefit from small data types as a result of increasing effective cache capacity, reducing memory pressure...
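A trivial NumPy illustration of the footprint side of this argument: for the same element count, narrower data types occupy proportionally less of any fixed-size cache (a generic example, not the paper's GPGPU measurements):

    import numpy as np

    n = 1_000_000
    for dtype in (np.float64, np.float32, np.float16):
        a = np.ones(n, dtype=dtype)
        # Same element count, but the working-set footprint (and hence the
        # share of a fixed-size cache the array occupies) shrinks with the type.
        print(f"{np.dtype(dtype).name:8s} footprint: {a.nbytes / 1e6:.1f} MB")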
Reliability of the current microprocessor technology is seriously challenged by radiation-induced soft errors. Accurate Vulnerability Factor (VF) modeling of system components is crucial in designing cost-effective protection schemes in high-performance processors. Although Statistical Fault Injection (SFI) techniques can be used to provide relativ...
Voltage scaling can reduce power dissipation significantly. SRAM cells (which are traditionally implemented by using six-transistor cells) can limit voltage scaling because of stability concerns. Eight-transistor (8T) cells were proposed to enhance cell stability under voltage scaling. 8T cells, however, suffer from costly write operations caused b...
SIMT accelerators are equipped with thousands of computational resources. Conventional accelerators, however, fail to fully utilize available resources due to branch and memory divergences. This underutilization is manifested in two underlying inefficiencies: pipeline width underutilization and pipeline depth underutilization. Width underutilizatio...
In this work we introduce a hybrid CLA-Ripple Power-aware adder (or simply HICPA) for high performance processors. HICPA is a multi-component adder that saves power by avoiding aggressive usage of the Carry Look-Ahead adder for add operations using small operands. Instead, for small size operands, HICPA uses a small and power efficient Ripple Carry...
The Graphics Processing Unit (GPU) is the most promising candidate platform for a faster rate of improvement in peak processing speed, low latency and high performance. The highly programmable and multithreaded nature of GPUs makes them a remarkable candidate for general purpose computing. However, supporting non-graphics computing on graphics proces...
Software transactional memory (STM) is a promising paradigm which simplifies concurrent programming for chip multiprocessors. Although the implementations of STMs are simple and efficient, they display inconsistent performance: different design decisions lead to systems performing best under different circumstances, often by a large margin. In this...
Previous studies have suggested using drowsy caches to reduce leakage power in caches. Such studies often move an entire cache line in and out of the drowsy mode to reduce leakage power while maintaining performance. In this work we extend previous work and introduce Application Specific Low Leakage Cache (ASL) as an alternative power-aware data ca...
Selecting the right GPU configuration can impact the overall design in many ways. One of the critical parameters in a GPU is warp size. Smaller warps come with branch divergence reduction while larger warps provide better memory coalescing. In this work we are interested in two possible design choices and their impacts on GPUs: using small warps an...
This paper introduces an alternative Fault-Tolerant Power-Aware Hybrid Adder (or simply FARHAD). FARHAD is a highly power-efficient protection solution against errors in applications with a high number of additions. FARHAD, similar to earlier studies, relies on performing add operations twice to detect errors. Unlike previous studies, FARHAD uses an ag...
There are a number of design decisions that impact a GPU's performance. Among such decisions, deciding the right warp size can deeply influence the rest of the design. Small warps reduce the performance penalty associated with branch divergence at the expense of a reduction in memory coalescing. Large warps enhance memory coalescing significantly bu...
GPUs employ thousands of threads per core to achieve high throughput. These threads exhibit localities in control-flow, instruction and data addresses and values. In this study we investigate inter-warp instruction temporal locality and show that during short intervals a significant share of fetched instructions are fetched unnecessarily. This obse...
GPUs spend significant time on synchronization stalls. Such stalls provide ample opportunity to save leakage energy in GPU structures left idle during such periods. In this paper we focus on the register file structure of NVIDIA GPUs and introduce sync-aware low leakage solutions to reduce power. Accordingly, we show that applying the power gating...
The goal of this work is to revisit GPU design and introduce a fast, low-cost and effective approach to optimize resource allocation in future GPUs. We have achieved this goal by using the Plackett-Burman methodology to explore the design space efficiently. We further formulate the design exploration problem as that of a constraint optimization. Ou...
Voltage scaling can reduce power dissipation significantly. SRAM cells (which are traditionally implemented using six-transistor cells) can limit voltage scaling due to stability concerns. Eight-transistor (8T) cells were proposed to enhance cell stability under voltage scaling. 8T cells, however, suffer from costly write operations caused by the c...
In this work we propose using heterogeneous interconnects in power-aware chip multiprocessors (also referred to as Helia). Helia improves energy efficiency in snoop-based chip multiprocessors as it eliminates unnecessary activities in both interconnect and cache. This is achieved by using innovative snoop filtering mechanisms coupled with wire mana...
Modern processors are highly dependent on the speculation of the control flow of the application by means of branch prediction. To provide accurate speculation of the direction of each branch, processors need a complex branch predictor with a large Branch Target Buffer. The branch target buffer is a memory area in which the target addresses of the...
Modern GPUs synchronize threads grouped in warps. The number of threads included in each warp (or warp size) affects divergence, synchronization overhead, and the efficiency of memory access coalescing. Small warps reduce the performance penalty associated with branch and memory divergence at the expense of a reduction in memory coalescing. Large w...
Modern GPUs synchronize threads grouped in a warp at every instruction. This results in improved SIMD efficiency and makes it possible to share fetch and decode resources. The number of threads included in each warp (or warp size) affects divergence, synchronization overhead and the efficiency of memory access coalescing. Small warps reduce the perfo...
There are a number of design decisions that impact a GPU's performance. Among such decisions, deciding the right warp size can deeply influence the rest of the design. Small warps reduce the performance penalty associated with branch divergence at the expense of a reduction in memory coalescing. Large warps enhance memory coalescing significantly bu...
In this work we introduce a hybrid CLA-Ripple Power-aware adder (or simply HICPA) for high performance processors. HICPA is a multi-component adder that saves power by avoiding aggressive usage of the Carry Look-Ahead adder for add operations using small operands. Instead, for small size operands, HICPA uses a small and power efficient Ripple Adder...
In this work we study cache peak temperature variation under different cache access patterns. In particular we show that unbalanced cache access results in higher cache peak temperature. This is the result of frequent accesses made to overused cache sets. Moreover we study cache peak temperature under cache access balancing techniques and show that...
Using Error Detection Code (EDC) and Error Correction Code (ECC) is a noteworthy way to increase the robustness of cache memories against soft errors. EDC enables detecting errors in cache memory while ECC is used to correct erroneous cache blocks. ECCs are often costly as they impose considerable area and energy overhead on cache memory. Reducing this ov...
The authors introduce a history-aware, resource-based dynamic (or simply HARD) scheduler for heterogeneous chip multi-processors (CMPs). HARD relies on recording application resource utilisation and throughput to adaptively change cores for applications during runtime. The authors show that HARD can be configured to achieve both performance and pow...
In this work we study control independence in embedded processors. We classify control-independent instructions into data-dependent and data-independent and measure each group's frequency and behavior. Moreover, we study how control-independent instructions impact power dissipation and resource utilization. We also investigate control in...
Over the past few years, radiation-induced transient errors, also referred to as soft errors, have been a severe threat to the data integrity of high-end and mainstream processors. Recent studies show that cache memories are among the most vulnerable components to soft errors within high-performance processors. Accurate modeling of the Vulnerabilit...
In this paper, we enhance previously suggested vulnerability estimation techniques by presenting a detailed modeling technique based on Input-to-Output Masking (IOM). Moreover, we use our model to compute the System-level Vulnerability Factor (SVF) for data-path components in a high-performance processor. As we show, recently suggested estimation tech...
In this work we introduce Heterogeneous Interconnect for Low Resolution Cache Access (Helia). Helia improves energy efficiency in snoop-based chip multiprocessors as it eliminates unnecessary activities in both interconnect and cache. This is achieved by using innovative snoop filtering mechanisms coupled with wire management techniques. Our optimi...
Modern processors access the branch target buffer (BTB) every cycle to speculate branch target addresses. This aggressive approach improves performance as it results in early identification of target addresses. Unfortunately, such accesses are quite often unnecessary, as there is no control flow instruction among those fetched. In this wor...
Despite the huge potential, value predictors have not been used in modern processors. This is partially due to the complex structures associated with such predictors. In this paper we study value predictors and investigate solutions to reduce storage requirements while imposing negligible coverage cost. Our solutions build on the observation that c...
In this work we introduce power optimizations relying on partial tag comparison (PTC) in snoop-based chip multiprocessors. Our optimizations rely on the observation that detecting tag mismatches in a snoop-based chip multiprocessor does not require aggressively processing the entire tag. In fact, a high percentage of cache mismatches could be detec...
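A back-of-the-envelope simulation of the underlying observation, using random synthetic tags rather than the paper's workloads: comparing only a few low-order tag bits already resolves the large majority of mismatches:

    import random

    random.seed(1)
    TAG_BITS = 20
    PARTIAL_BITS = 4

    def caught_by_partial(a, b, k):
        # A mismatch is resolved by the partial comparison if the tags
        # already differ somewhere in their k low-order bits.
        return (a ^ b) & ((1 << k) - 1) != 0

    pairs = [(random.getrandbits(TAG_BITS), random.getrandbits(TAG_BITS))
             for _ in range(100_000)]
    mismatches = [(a, b) for a, b in pairs if a != b]
    caught = sum(caught_by_partial(a, b, PARTIAL_BITS) for a, b in mismatches)
    print(f"{caught / len(mismatches):.1%} of mismatches detected "
          f"with only {PARTIAL_BITS} of {TAG_BITS} tag bits")

With uniformly random tags roughly 94% of mismatches are caught by a 4-bit comparison, which is the kind of asymmetry partial tag comparison exploits to avoid processing the full tag.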
In this paper, we propose a systematic design methodology in the category of hybrid-CMOS logic style. Our methodology is based on using different basic cells and optimizations. We start with selecting a basic cell including two independent inputs and two complementary outputs. Next we combine this basic cell with various correction and optimization...
Power and temperature management continue to impose challenging issues in high-performance processor design. Processor power density is growing and has made building efficient cooling systems expensive. While Dynamic Thermal Management (DTM) techniques aim at reducing cooling systems cost, previously suggested low-temperature design could potential...
Chip multiprocessors (CMPs) issue write invalidations (WIs) to assure program correctness. In conventional snoop-based protocols, writers broadcast invalidations to all nodes as soon as possible. In this work we show that this approach, while protecting correctness, is inefficient for two reasons. First, many of the invalidated blocks are not ac...
Least recently used (LRU) is a widely used replacement policy as it offers simplicity and relatively acceptable performance. However, there is a considerable performance gap between LRU and Belady's theoretical optimal replacement policy in highly associative caches. We study non-optimal LRU decisions (NODs) in chip multiprocessors and investigate...
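To make the LRU-versus-optimal gap concrete, the small single-set simulation below compares LRU with Belady's offline MIN policy on a synthetic access trace (an illustrative experiment, not the NOD analysis from the paper):

    from collections import OrderedDict
    import random

    def lru_hits(trace, capacity):
        cache, hits = OrderedDict(), 0
        for addr in trace:
            if addr in cache:
                hits += 1
                cache.move_to_end(addr)
            else:
                if len(cache) >= capacity:
                    cache.popitem(last=False)   # evict least recently used
                cache[addr] = True
        return hits

    def belady_hits(trace, capacity):
        # Offline optimal: evict the resident block whose next use is the
        # farthest in the future (or that is never used again).
        hits, cache = 0, set()
        for i, addr in enumerate(trace):
            if addr in cache:
                hits += 1
                continue
            if len(cache) >= capacity:
                def next_use(block):
                    for j in range(i + 1, len(trace)):
                        if trace[j] == block:
                            return j
                    return float("inf")
                cache.remove(max(cache, key=next_use))
            cache.add(addr)
        return hits

    random.seed(0)
    trace = [random.randrange(32) for _ in range(2000)]
    print("LRU hits:   ", lru_hits(trace, 8))
    print("Belady hits:", belady_hits(trace, 8))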
Previously suggested transistor sizing algorithms assume that all input transitions are equally important. In this work we show that this is not an accurate assumption, as input transitions appear with different frequencies. We take advantage of this phenomenon and introduce application-specific transistor sizing. In application-specific transistor...
Intel’s XScale, which has powered many multimedia applications, uses a scoreboard to control instruction execution. The scoreboard stalls the pipeline whenever a source operand or functional unit is needed but not available. While waiting for the availability of the resources, the processor accesses the scoreboard every cycle. Such accesses consume energy...
In software transactional memory (STM) systems, read validation ensures that a transaction always has a consistent view of the memory. Existing read validation policies follow a static approach and use one policy across all applications. However, no single universal read validation policy offers optimal performance across all applications. We propo...
We study non-optimal LRU decisions (NODs) in single processors. We study how NOD frequency changes from one application to another and from one phase to another within an application. Moreover we introduce Hasty and Predictable blocks as more inclusive extensions of previously suggested classifications. We discuss implementation issues and present...
Conventional snoopy-based chip multiprocessors take an aggressive approach, broadcasting snoop requests to all nodes. In addition, each node checks all received requests. This approach reduces the latency of cache-to-cache transfer misses at the expense of increasing power. In this paper we show that a large portion of interconnect/cache transactions...
In this work we modify the conventional row buffer allocation mechanism used in DDR2 SDRAM banks to improve average memory latency and overall processor performance. Our method assigns row buffers to different banks dynamically and by taking into account program cyclic behavior and bank row buffer demand. As we show in this work, memory requests go...
In this work we study how cache complexity impacts energy and performance in high performance processors. Moreover, we estimate cache energy budget for two high performance processors. We calculate energy and latency break-even points for realistic and ideal cache organizations for different applications. We show that design efforts made to reduce...