Amirali Baniasadi
University of Victoria | UVIC · Department of Electrical and Computer Engineering (ECE)

About

139 Publications
7,583 Reads
735 Citations

Publications (139)
Preprint
Full-text available
Large Language Models (LLMs) are emerging as promising tools in hardware design and verification, with recent advancements suggesting they could fundamentally reshape conventional practices. In this survey, we analyze over 54 research papers to assess the current role of LLMs in enhancing automation, optimization, and innovation within hardware des...
Article
Full-text available
Condition monitoring (CM) is essential for maintaining operational reliability and safety in complex machinery, particularly in robotic systems. Despite the potential of deep learning (DL) in CM, its ‘black box’ nature restricts its broader adoption, especially in mission-critical applications. Addressing this challenge, our research introduces a r...
Preprint
Full-text available
In many Internet of Things (IoT) applications, knowing the device's location can be quite important for purposes such as asset tracking and inventory management, geolocation services, safety and security, environmental monitoring, and proximity-based interactions. Mobile users often experience mobile services/applications within an indoor or ou...
Article
Optical Network on Chip (ONoC) is now considered a promising alternative to traditional electrical interconnects. Meanwhile, several challenges such as temperature and process variations, aging, crosstalk noise, and insertion loss endanger the data transmission reliability of ONoCs. Many investigations have been made to evaluate the effect of these...
Article
Deep Convolutional Neural Networks (CNNs) have achieved impressive performance in edge detection tasks, but their large number of parameters often leads to high memory and energy costs for implementation on lightweight devices. In this paper, we propose a new architecture, called Efficient Deep-learning Gradients Extraction Network (EDGE-Net), that...
Conference Paper
Full-text available
Deep Convolutional Neural Networks (CNNs) have achieved human-level performance in edge detection. However, there have not been enough studies on how to efficiently utilize the parameters of the neural network in edge detection applications. Therefore, the associated memory and energy costs remain high. In this paper, inspired by Depthwise Separabl...
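The parameter savings mentioned above come from replacing standard convolutions with depthwise separable ones. A minimal sketch in PyTorch is shown below; the channel counts and surrounding layers are illustrative and are not taken from the paper's architecture.

```python
import torch.nn as nn

def separable_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    """Depthwise 3x3 convolution (one filter per channel) followed by a 1x1 pointwise mix."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# Weight count: 3*3*in_ch + in_ch*out_ch, versus 3*3*in_ch*out_ch for a plain 3x3
# convolution -- the source of the memory and energy savings discussed above.
```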
Conference Paper
Magnetic resonance imaging (MRI) is one of the best imaging techniques that produce high-quality images of objects. The long scan time is one of the biggest challenges in MRI acquisitions. To address this challenge, many researchers have aimed at finding methods to speed up the process. Faster MRI can reduce patient discomfort and motion artifacts....
Chapter
Capsule Networks (CapsNets) are a generation of image classifiers with proven advantages over Convolutional Neural Networks (CNNs). Better robustness to affine transformation and overlapping image detection are some of the benefits associated with CapsNets. However, CapsNets cannot be classified as a resource-efficient deep learning architecture du...
Article
Full-text available
Over the past several years, I have been interacting with an increasing number of Iranian scientists, including those currently living in Iran as well as others who are being educated elsewhere or have independent positions outside of that country. In all circumstances, the resulting collaborations have extended my own knowledge and allowed me to c...
Article
Full-text available
Automatic Modulation Classification (AMC) is a well-known problem in the Radio Frequency (RF) domain. Solving this problem requires determining the modulation of an RF signal. Once the modulation is determined, the signal could be demodulated making it possible to analyse the signal for various purposes. Deep Neural Networks (DNNs) have recently pr...
Article
Full-text available
A Capsule Network (CapsNet) is a relatively new classifier and one of the possible successors of Convolutional Neural Networks (CNNs). CapsNet maintains the spatial hierarchies between the features and outperforms CNNs at classifying images including overlapping categories. Even though CapsNet works well on small-scale datasets such as MNIST, it fa...
Chapter
Capsule Network (CapsNet) is among the promising classifiers and a possible successor of the classifiers built based on Convolutional Neural Network (CNN). CapsNet is more accurate than CNNs in detecting images with overlapping categories and those with applied affine transformations. In this work, we propose a deep variant of CapsNet consisting of...
Conference Paper
Computer vision has become widely used in recent years. One area of computer vision that has been studied is facial emotion recognition, which plays a crucial role in interpersonal communication. This paper tackles the problem of intra-class variance in the face images of emotion recognition datasets. We test the syste...
Conference Paper
The Activity and Event Network Model (AEN) is a new security knowledge graph that leverages large dynamic uncertain graph theory to capture and analyze stealthy and long-term attack patterns. Because the graph is expected to become extremely large over time, it can be very challenging for security analysts to navigate it and identify meaningful info...
Article
Full-text available
The focus of this work is to explore the use of quantum annealing solvers for the problem of phase unwrapping of synthetic aperture radar (SAR) images. Although solutions to this problem exist based on network programming, these techniques do not scale well to larger sized images. Our approach involves formulating the problem as a quadratic unconst...
Preprint
Using Error Detection Code (EDC) and Error Correction Code (ECC) is a noteworthy way to increase cache memory robustness against soft errors. EDC enables detecting errors in cache memory, while ECC is used to correct erroneous cache blocks. ECCs are often costly as they impose considerable area and energy overhead on cache memory. Reducing this ov...
Preprint
Full-text available
The focus of this work is to explore the use of quantum annealing solvers for the problem of phase unwrapping of synthetic aperture radar (SAR) images. Although solutions to this problem exist based on network programming, these techniques do not scale well to larger-sized images. Our approach involves formulating the problem as a quadratic unconst...
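For context, quantum annealers accept objectives in quadratic unconstrained binary optimization (QUBO) form, i.e. minimizing x^T Q x over binary vectors x. The tiny Q below is made up purely to show the form of problem handed to such a solver; the actual matrix built from SAR phase gradients is specific to the paper.

```python
import itertools
import numpy as np

# Toy QUBO instance: minimize x^T Q x over x in {0,1}^3. The entries of Q are
# arbitrary illustrative values, not the phase-unwrapping formulation itself.
Q = np.array([[-1.0, 2.0, 0.0],
              [ 0.0, -1.0, 2.0],
              [ 0.0,  0.0, -1.0]])

def energy(x):
    x = np.asarray(x)
    return float(x @ Q @ x)

# Brute force stands in for the annealer on this tiny example.
best = min(itertools.product([0, 1], repeat=3), key=energy)
print(best, energy(best))   # lowest-energy binary assignment
```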
Article
The branch predictor unit (BPU) is among the main energy-consuming components in out-of-order (OoO) processors. For integer applications, we find 16% of the processor energy is consumed by the BPU. The BPU is accessed in parallel with the instruction cache before it is known if a fetch group contains control instructions. We find 85% of BPU lookups are...
Article
It has been more than a decade since general-purpose applications targeted GPUs to benefit from the enormous processing power they offer. However, not all applications gain speedup running on GPUs. If an application does not have enough parallel computation to hide memory latency, running it on a GPU will degrade the performance compared to what it...
Article
Using software-managed cache in CUDA programming provides significant potential to improve memory efficiency. Employing this feature requires the programmer to identify data tiles associated with thread blocks and bring them to the cache explicitly. Despite the advantages, the development effort required to exploit this feature can be significant....
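The tiling pattern referred to here, where each thread block explicitly stages its data tile in the on-chip software-managed cache (shared memory), is the kind of code that creates the development effort. A minimal sketch in Python/Numba follows; matrix dimensions are assumed to be multiples of the tile width, and all names and sizes are illustrative.

```python
import numpy as np
from numba import cuda, float32

TPB = 16  # tile width = threads per block in each dimension (illustrative)

@cuda.jit
def tiled_matmul(A, B, C):
    # Per-block tiles in shared memory: the explicitly managed cache.
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    row, col = cuda.grid(2)      # this thread's output element
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y

    acc = 0.0
    for t in range(A.shape[1] // TPB):       # walk the K dimension tile by tile
        sA[tx, ty] = A[row, t * TPB + ty]    # each thread stages one element per tile
        sB[tx, ty] = B[t * TPB + tx, col]
        cuda.syncthreads()                   # tile fully resident before use
        for k in range(TPB):
            acc += sA[tx, k] * sB[k, ty]
        cuda.syncthreads()                   # finish with the tile before refilling it
    C[row, col] = acc

# Launch: grid sized so every output element gets a thread (host arrays are
# transferred implicitly by Numba).
M = K = N = 256
A = np.random.rand(M, K).astype(np.float32)
B = np.random.rand(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)
tiled_matmul[(M // TPB, N // TPB), (TPB, TPB)](A, B, C)
```

Choosing the tile shape and writing the staging and synchronization code by hand is exactly the effort the article targets.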
Article
During recent years, GPU micro-architectures have changed dramatically, evolving into powerful many-core deep-multithreaded platforms for parallel workloads. While important micro-architectural modifications continue to appear in every new generation of these processors, unfortunately, little is known about the details of these innovative designs....
Article
The OpenACC programming model has been developed to simplify accelerator programming and improve development productivity. In this article, we investigate the main limitations faced by OpenACC in harnessing all capabilities of GPU-like accelerators. We build on our findings and discuss the opportunity to exploit a software-managed cache as (i) a fa...
Conference Paper
Different applications have different memory and computational demands. Therefore, obtainable performance and energy efficiency on a GPU depend on how well the GPU resources and application demands are balanced. In this study, we present a Neural Network-based predictor to model the power and performance of GPGPU applications. The proposed mode...
Article
In this paper we propose a scalable, low-power and highly accurate prediction-based cache coherence solution. We introduce Speculative Multicasting (or simply SM) as a multi-casting cache coherency solution. SM maintains coherency by selectively sending snoop requests to a subset of nodes. This subset is selected by recording previous snoop outcome...
Article
Full-text available
The graphics processing unit (GPU) is the most promising candidate platform for achieving faster improvements in peak processing speed, low latency and high performance. The highly programmable and multithreaded nature of GPUs makes them a remarkable candidate for general purpose computing. However, supporting non-graphics computing on graphics pro...
Article
Full-text available
In this paper we introduce IPMACC, a framework for translating OpenACC applications to CUDA or OpenCL. IPMACC is composed of a set of translators that translate OpenACC for C applications to CUDA or OpenCL. The framework uses the system compiler (e.g., nvcc) to generate the final accelerator binary. The framework can be used for extending the OpenACC AP...
Conference Paper
In this paper, we study application behavior in GPGPUs. We investigate how data type impacts performance in different applications. As we show, expectedly, some applications can take significant advantage of small data types. Such applications benefit from small data types as a result of increasing cache effective capacity, reducing memory pressure...
Article
Reliability of the current microprocessor technology is seriously challenged by radiation-induced soft errors. Accurate Vulnerability Factor (VF) modeling of system components is crucial in designing cost-effective protection schemes in high-performance processors. Although Statistical Fault Injection (SFI) techniques can be used to provide relativ...
Article
Voltage scaling can reduce power dissipation significantly. SRAM cells (which are traditionally implemented by using six-transistor cells) can limit voltage scaling because of stability concerns. Eight-transistor (8T) cells were proposed to enhance cell stability under voltage scaling. 8T cells, however, suffer from costly write operations caused b...
Article
Full-text available
SIMT accelerators are equipped with thousands of computational resources. Conventional accelerators, however, fail to fully utilize available resources due to branch and memory divergences. This underutilization is manifested in two underlying inefficiencies: pipeline width underutilization and pipeline depth underutilization. Width underutilizatio...
Article
In this work we introduce a hybrid CLA-Ripple Power-aware adder (or simply HICPA) for high performance processors. HICPA is a multi-component adder that saves power by avoiding aggressive usage of the Carry Look-Ahead adder for add operations using small operands. Instead, for small size operands, HICPA uses a small and power efficient Ripple Carry...
Conference Paper
The Graphics Processing Unit (GPU) is the most promising candidate platform for a faster rate of improvement in peak processing speed, low latency and high performance. The highly programmable and multithreaded nature of GPUs makes them a remarkable candidate for general purpose computing. However, supporting non-graphics computing on graphics proces...
Article
Software transactional memory (STM) is a promising paradigm which simplifies concurrent programming for chip multiprocessors. Although the implementations of STMs are simple and efficient, they display inconsistent performance: different design decisions lead to systems performing best under different circumstances, often by a large margin. In this...
Conference Paper
Previous studies have suggested using drowsy caches to reduce leakage power in caches. Such studies often move an entire cache line in and out of the drowsy mode to reduce leakage power while maintaining performance. In this work we extend previous work and introduce Application Specific Low Leakage Cache (ASL) as an alternative power-aware data ca...
Conference Paper
Selecting the right GPU configuration can impact the overall design in many ways. One of the critical parameters in a GPU is warp size. Smaller warps come with branch divergence reduction while larger warps provide better memory coalescing. In this work we are interested in two possible design choices and their impacts on GPUs: using small warps an...
Conference Paper
This paper introduces an alternative Fault-Tolerant Power-Aware Hybrid Adder (or simply FARHAD). FARHAD is a highly power-efficient protection solution against errors in applications with a high number of additions. FARHAD, similar to earlier studies, relies on performing add operations twice to detect errors. Unlike previous studies, FARHAD uses an ag...
Conference Paper
There are a number of design decisions that impact a GPU's performance. Among such decisions deciding the right warp size can deeply influence the rest of the design. Small warps reduce the performance penalty associated with branch divergence at the expense of a reduction in memory coalescing. Large warps enhance memory coalescing significantly bu...
Conference Paper
GPUs employ thousands of threads per core to achieve high throughput. These threads exhibit localities in control-flow, instruction and data addresses and values. In this study we investigate inter-warp instruction temporal locality and show that during short intervals a significant share of fetched instructions are fetched unnecessarily. This obse...
Conference Paper
GPUs spend significant time on synchronization stalls. Such stalls provide ample opportunity to save leakage energy in GPU structures left idle during such periods. In this paper we focus on the register file structure of NVIDIA GPUs and introduce sync-aware low leakage solutions to reduce power. Accordingly, we show that applying the power gating...
Conference Paper
The goal of this work is to revisit GPU design and introduce a fast, low-cost and effective approach to optimize resource allocation in future GPUs. We have achieved this goal by using the Plackett-Burman methodology to explore the design space efficiently. We further formulate the design exploration problem as that of a constraint optimization. Ou...
Conference Paper
Voltage scaling can reduce power dissipation significantly. SRAM cells (which are traditionally implemented using six-transistor cells) can limit voltage scaling due to stability concerns. Eight-transistor (8T) cells were proposed to enhance cell stability under voltage scaling. 8T cells, however, suffer from costly write operations caused by the c...
Article
In this work we propose using heterogeneous interconnects in power-aware chip multiprocessors (also referred to as Helia). Helia improves energy efficiency in snoop-based chip multiprocessors as it eliminates unnecessary activities in both interconnect and cache. This is achieved by using innovative snoop filtering mechanisms coupled with wire mana...
Article
Modern processors are highly dependent on the speculation of the control flow of the application by means of branch prediction. To provide accurate speculation of the direction of each branch, processors need a complex branch predictor with a large Branch Target Buffer. The branch target buffer is a memory area in which the target addresses of the...
Conference Paper
Modern GPUs synchronize threads grouped in warps. The number of threads included in each warp (or warp size) affects divergence, synchronization overhead, and the efficiency of memory access coalescing. Small warps reduce the performance penalty associated with branch and memory divergence at the expense of a reduction in memory coalescing. Large w...
Article
Full-text available
Modern GPUs synchronize threads grouped in a warp at every instruction. This results in improved SIMD efficiency and makes sharing fetch and decode resources possible. The number of threads included in each warp (or warp size) affects divergence, synchronization overhead and the efficiency of memory access coalescing. Small warps reduce the perfo...
Article
Full-text available
There are a number of design decisions that impact a GPU's performance. Among such decisions deciding the right warp size can deeply influence the rest of the design. Small warps reduce the performance penalty associated with branch divergence at the expense of a reduction in memory coalescing. Large warps enhance memory coalescing significantly bu...
Conference Paper
In this work we introduce a hybrid CLA-Ripple Power-aware adder (or simply HICPA) for high performance processors. HICPA is a multi-component adder that saves power by avoiding aggressive usage of the Carry Look-Ahead adder for add operations using small operands. Instead, for small size operands, HICPA uses a small and power efficient Ripple Adder...
Conference Paper
In this work we study cache peak temperature variation under different cache access patterns. In particular we show that unbalanced cache access results in higher cache peak temperature. This is the result of frequent accesses made to overused cache sets. Moreover we study cache peak temperature under cache access balancing techniques and show that...
Conference Paper
Full-text available
Using Error Detection Code (EDC) and Error Correction Code (ECC) is a noteworthy way to increase cache memory robustness against soft errors. EDC enables detecting errors in cache memory, while ECC is used to correct erroneous cache blocks. ECCs are often costly as they impose considerable area and energy overhead on cache memory. Reducing this ov...
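To make the EDC/ECC distinction concrete: a single parity bit (an EDC) can flag that a block is corrupted but cannot say which bit flipped, whereas an ECC such as SECDED adds enough redundancy to correct it. The sketch below is a generic parity example, not the paper's scheme.

```python
def parity(block: int) -> int:
    """Even-parity EDC bit for a cache block represented as an integer."""
    return bin(block).count("1") & 1

def error_detected(block: int, stored_parity: int) -> bool:
    """Detects any odd number of bit flips; correcting them requires an ECC."""
    return parity(block) != stored_parity

clean = 0b1011_0010
assert not error_detected(clean, parity(clean))
assert error_detected(clean ^ 0b0000_1000, parity(clean))  # one flipped bit is caught
```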
Article
Full-text available
The authors introduce a history-aware, resource-based dynamic (or simply HARD) scheduler for heterogeneous chip multi-processors (CMPs). HARD relies on recording application resource utilisation and throughput to adaptively change cores for applications during runtime. The authors show that HARD can be configured to achieve both performance and pow...
Conference Paper
Full-text available
In this work we study control independence in embedded processors. We classify control independent instructions into data dependent and data independent and measure each group's frequency and behavior. Moreover, we study how control independent instructions impact power dissipation and resource utilization. We also investigate control in...
Conference Paper
Full-text available
Over the past few years, radiation-induced transient errors, also referred to as soft errors, have been a severe threat to the data integrity of high-end and mainstream processors. Recent studies show that cache memories are among the most vulnerable components to soft errors within high-performance processors. Accurate modeling of the Vulnerabilit...
Conference Paper
Full-text available
In this paper, we enhance previously suggested vulnerability estimation techniques by presenting a detailed modeling technique based on Input-to-Output Masking (IOM). Moreover, we use our model to compute the System-level Vulnerability Factor (SVF) for data-path components in a high-performance processor. As we show, recently suggested estimation tech...
Conference Paper
In this work we introduce Heterogeneous Interconnect for Low Resolution Cache Access (Helia). Helia improves energy efficiency in snoop-based chip multiprocessors as it eliminates unnecessary activities in both interconnect and cache. This is achieved by using innovative snoop filtering mechanisms coupled with wire management techniques. Our optimi...
Article
Modern processors access the branch target buffer (BTB) every cycle to speculate branch target addresses. This aggressive approach improves performance as it results in early identification of target addresses. Unfortunately, such accesses are quite often unnecessary, as there is no control flow instruction among those fetched. In this wor...
Conference Paper
Despite the huge potential, value predictors have not been used in modern processors. This is partially due to the complex structures associated with such predictors. In this paper we study value predictors and investigate solutions to reduce storage requirements while imposing negligible coverage cost. Our solutions build on the observation that c...
Conference Paper
Full-text available
In this work we introduce power optimizations relying on partial tag comparison (PTC) in snoop-based chip multiprocessors. Our optimizations rely on the observation that detecting tag mismatches in a snoop-based chip multiprocessor does not require aggressively processing the entire tag. In fact, a high percentage of cache mismatches could be detec...
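The idea can be sketched as follows: compare only a few low-order tag bits first, and fall back to a full compare only when those bits match. The widths and lookup structure below are illustrative, not the paper's design.

```python
PARTIAL_BITS = 4                      # width of the partial tag (illustrative)
MASK = (1 << PARTIAL_BITS) - 1

def snoop_hit(stored_tags, incoming_tag):
    """Resolve most snoop mismatches with a narrow compare; confirm hits with the full tag."""
    for tag in stored_tags:
        if (tag & MASK) != (incoming_tag & MASK):
            continue                  # mismatch decided after comparing only PARTIAL_BITS
        if tag == incoming_tag:       # partial match: pay for the full comparison
            return True
    return False
```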
Article
Full-text available
In this paper, we propose a systematic design methodology in the category of hybrid-CMOS logic style. Our methodology is based on using different basic cells and optimizations. We start with selecting a basic cell including two independent inputs and two complementary outputs. Next we combine this basic cell with various correction and optimization...
Article
Power and temperature management continue to pose challenges in high-performance processor design. Processor power density is growing and has made building efficient cooling systems expensive. While Dynamic Thermal Management (DTM) techniques aim at reducing cooling system cost, previously suggested low-temperature design could potential...
Conference Paper
Chip multiprocessors (CMPs) issue write invalidations (WIs) to assure program correctness. In conventional snoop-based protocols, writers broadcast invalidations to all nodes as soon as possible. In this work we show that this approach, while protecting correctness, is inefficient due to two reasons. First, many of the invalidated blocks are not ac...
Conference Paper
Least recently used (LRU) is a widely used replacement policy as it offers simplicity and relatively acceptable performance. However, there is a considerable performance gap between LRU and Belady's theoretical optimal replacement policy in highly associative caches. We study non-optimal LRU decisions (NODs) in chip multiprocessors and investigate...
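A non-optimal LRU decision is an eviction where LRU picks a different victim than Belady's clairvoyant policy would, and pays extra misses for it. The toy simulator below (fully associative cache, made-up access trace) illustrates the gap the abstract refers to.

```python
def misses(trace, capacity, policy):
    """Miss count for a tiny fully associative cache under LRU or Belady ('opt')."""
    cache, miss = [], 0
    for i, block in enumerate(trace):
        if block in cache:
            cache.remove(block)
            cache.append(block)          # refresh recency
            continue
        miss += 1
        if len(cache) == capacity:
            if policy == "lru":
                victim = cache[0]        # least recently used
            else:
                future = trace[i + 1:]   # Belady: evict the block reused farthest away
                victim = max(cache, key=lambda b: future.index(b) if b in future else len(future) + 1)
            cache.remove(victim)
        cache.append(block)
    return miss

trace = ["A", "B", "C", "A", "B", "D", "A", "B", "C", "D"]
print(misses(trace, 3, "lru"), misses(trace, 3, "opt"))   # 6 vs 5 misses on this trace
```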
Conference Paper
Previously suggested transistor sizing algorithms assume that all input transitions are equally important. In this work we show that this is not an accurate assumption as input transitions appear with different frequencies. We take advantage of this phenomenon and introduce application-specific transistor sizing. In application-specific transistor...
Article
Intel's XScale, which has powered many multimedia applications, uses a scoreboard to control instruction execution. The scoreboard stalls the pipeline whenever a source operand or functional unit is needed but not available. While waiting for the resources to become available, the processor accesses the scoreboard every cycle. Such accesses consume energy...
Conference Paper
In software transactional memory (STM) systems, read validation ensures that a transaction always has a consistent view of the memory. Existing read validation policies follow a static approach and use one policy across all applications. However, no single universal read validation policy offers optimal performance across all applications. We propo...
Conference Paper
We study non-optimal LRU decisions (NODs) in single processors. We study how NOD frequency changes from one application to another and from one phase to another within an application. Moreover we introduce Hasty and Predictable blocks as more inclusive extensions of previously suggested classifications. We discuss implementation issues and present...
Article
Conventional snoopy-based chip multiprocessors take an aggressive approach, broadcasting snoop requests to all nodes. In addition, each node checks all received requests. This approach reduces the latency of cache-to-cache transfer misses at the expense of increased power. In this paper we show that a large portion of interconnect/cache transactions...
Conference Paper
In this work we modify the conventional row buffer allocation mechanism used in DDR2 SDRAM banks to improve average memory latency and overall processor performance. Our method assigns row buffers to different banks dynamically, taking into account program cyclic behavior and bank row-buffer demand. As we show in this work, memory requests go...
Article
In this work we study how cache complexity impacts energy and performance in high performance processors. Moreover, we estimate cache energy budget for two high performance processors. We calculate energy and latency break-even points for realistic and ideal cache organizations for different applications. We show that design efforts made to reduce...
