Sied Mehdi Fakhraie’s research while affiliated with University of Tehran and other places


Publications (92)


Fig. 1. The Broken-Booth Multiplier Type0 (a) and Type1 (b) for WL = 12 and VBL = 7. Adopted from [10].
Fig. 2. The percentage of error distribution of Broken-Booth Multiplier Type0 for WL = 10 and VBL = 9.
New Approximate Multiplier for Low Power Digital Signal Processing
  • Preprint

March 2020 · 149 Reads · Sied Mehdi Fakhraie

In this paper, a low-power multiplier is proposed. The proposed multiplier applies the Broken-Array Multiplier approximation method to the conventional modified Booth multiplier. This method reduces the total power consumption of the multiplier by up to 58% at the cost of a small decrease in output accuracy. The proposed multiplier is compared with other approximate multipliers in terms of power consumption and accuracy. Furthermore, to better evaluate its efficiency, the proposed multiplier is used in a 30-tap low-pass FIR filter, and the filter's power consumption and accuracy are compared with those of a filter built with conventional Booth multipliers. The simulation results show a 17.1% power reduction at the cost of only a 0.4 dB decrease in the output SNR.
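As a rough illustration of the idea in this abstract, the sketch below models a radix-4 modified Booth multiplier in which every partial-product bit below a chosen column is dropped. It is a behavioral model under stated assumptions, not the authors' circuit: the function names are hypothetical, the parameter `vbl` borrows the VBL label from the figure captions and is assumed here to mean the column below which bits are omitted, and the handling of negative partial products (plain two's-complement masking) is a simplification.

```python
def booth_radix4_digits(y, wl):
    """Radix-4 modified Booth digits (each in -2..2) of a wl-bit two's-complement y.

    wl must be even. The digits satisfy sum(d[i] * 4**i) == y.
    """
    assert wl % 2 == 0, "sketch assumes an even word length"
    y_bits = y & ((1 << wl) - 1)                         # two's-complement bit pattern
    bits = [0] + [(y_bits >> i) & 1 for i in range(wl)]  # bits[0] is the implicit y[-1] = 0
    # Digit i is y[2i-1] + y[2i] - 2*y[2i+1]; indices are shifted by one in `bits`.
    return [bits[2 * i] + bits[2 * i + 1] - 2 * bits[2 * i + 2] for i in range(wl // 2)]


def broken_booth_multiply(x, y, wl, vbl):
    """Approximate x*y by zeroing partial-product bits in columns below vbl.

    Assumption: each Booth partial product is truncated independently, which
    mimics leaving out the low-order cells of the partial-product array.
    """
    mask = ~((1 << vbl) - 1)            # keeps columns vbl and above
    total = 0
    for i, digit in enumerate(booth_radix4_digits(y, wl)):
        partial = (digit * x) << (2 * i)
        total += partial & mask         # Python ints behave as infinite two's complement
    return total
```

With `vbl = 0` the model reduces to an exact multiplier, for example `broken_booth_multiply(13, -7, 6, 0)` gives -91, while `broken_booth_multiply(13, -7, 6, 3)` gives -96. In this simplified model the result is never larger than the exact product, and the error is bounded by the number of partial products times `2**vbl - 1`.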


Soft Realization: a Bio-inspired Implementation Paradigm

December 2018 · 26 Reads

Researchers traditionally solve computational problems through rigorous and deterministic algorithms, an approach called Hard Computing. These precise algorithms have widely been realized using digital technology as an inherently reliable and accurate implementation platform, in either hardware or software form. This rigid form of implementation, which we refer to as Hard Realization, relies on strict algorithmic accuracy constraints dictated to digital design engineers. Hard realization accepts whatever implementation cost is necessary to preserve computation precision and determinism throughout all design and implementation steps. Despite its prior accomplishments, this conventional paradigm has encountered serious challenges with today's emerging applications and implementation technologies. Unlike traditional hard computing, the emerging soft and bio-inspired algorithms do not rely on fully precise and deterministic computation. Moreover, emerging nanotechnologies face increasing reliability issues that prevent them from being efficiently exploited in hard realization of applications. This article examines Soft Realization, a novel bio-inspired approach, motivated by the internal structure of the brain, to the design and implementation of an important category of applications. The proposed paradigm mitigates major weaknesses of hard realization by (1) alleviating incompatibilities with today's soft and bio-inspired algorithms such as artificial neural networks, fuzzy systems, and human sense signal processing applications, and (2) resolving the destructive inconsistency with unreliable nanotechnologies. Our experimental results on a set of well-known soft applications, implemented using the proposed soft realization paradigm in both reliable and unreliable technologies, indicate that significant energy, delay, and area savings can be obtained compared to the conventional implementation.


Lifetime improvement by exploiting aggressive voltage scaling during runtime of error-resilient applications

November 2017 · 72 Reads · 19 Citations

Integration

Farzaneh Nakhaee · [...]

In this paper, we present an accuracy-aware operating voltage management unit that improves the lifetime of processors by exploiting the error-resilient nature of some applications. This unit is placed in the power management unit of the processor, and its operation is based on aggressive operating voltage reduction during the runtime of error-resilient applications. The unit determines the operating voltage of the processor based on the type of the running application, the predefined minimum acceptable quality, and the operating voltage level specified by the dynamic voltage frequency scaling controller of the processor power management unit. The operating voltage determined by this unit results in lifetime improvement and power reduction at the cost of timing violations that are tolerable by error-resilient applications. In addition, the proposed unit dynamically adjusts the minimum acceptable operating voltage based on the impact of aging mechanisms. The aging mechanisms considered in this work include Negative Bias Temperature Instability, Hot Carrier Injection, Time Dependent Dielectric Breakdown, Thermal Cycling, Electro-Migration, and Stress Migration. The efficacy of the proposed operating voltage management is investigated by applying it to some exact and error-resilient applications. The results show that the proposed unit leads to, on average, a 38.82% lifetime improvement as well as a 41.8% reduction in power consumption.
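The decision described in this abstract can be pictured with a small sketch. Everything concrete in it is an assumption made for illustration: the function name, the use of millivolt integers, the per-application table `quality_by_mv` mapping a candidate voltage to an expected output quality, and the single `floor_mv` value standing in for the aging-adjusted minimum voltage. The abstract does not say how the unit actually represents these quantities.

```python
def select_voltage_mv(dvfs_mv, error_resilient, quality_by_mv, min_quality, floor_mv):
    """Pick an operating voltage in millivolts (hypothetical model).

    dvfs_mv         -- voltage requested by the DVFS controller
    error_resilient -- True if the running application tolerates timing errors
    quality_by_mv   -- assumed table: candidate voltage -> expected output quality
    min_quality     -- predefined minimum acceptable quality
    floor_mv        -- assumed aging-adjusted minimum acceptable voltage
    """
    if not error_resilient:
        return dvfs_mv                      # exact applications keep the DVFS level
    candidates = [
        v for v, quality in quality_by_mv.items()
        if floor_mv <= v <= dvfs_mv and quality >= min_quality
    ]
    # Lowest voltage that still meets the quality target; fall back to DVFS level.
    return min(candidates, default=dvfs_mv)
```

For example, with the table `{1000: 1.0, 900: 0.99, 800: 0.95, 700: 0.80}`, a DVFS level of 1000 mV, a quality target of 0.9, and a floor of 750 mV, the sketch selects 800 mV. Raising the floor to 850 mV, as aging might require, moves the choice to 900 mV.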


Efficient utilization of imprecise computational blocks for hardware implementation of imprecision tolerant applications

March 2017 · 38 Reads · 6 Citations

Microelectronics Journal

Recently, it has been reported that exploiting imprecise arithmetic building blocks such as adders and multipliers significantly improves the digital implementation cost as well as the performance of an important category of systems called Imprecision Tolerant (IT) applications. We have categorized this new type of functional unit as Bio-inspired Imprecise Computational (BIC) blocks. To efficiently exploit BICs in an IT application, however, the traditional hardware design flow should also be customized based on the unique features of this new type of computing block. The most significant modification to the traditional design flow for maximizing the cost-performance of BIC implementations is to verify that the application can tolerate the types of errors that the selected BIC inherently introduces because of its internal structure. We call this “error-behavior compatibility matching” between the system and the selected BIC building blocks. In this paper, we introduce and explain a customized hardware design flow for BICs, focusing on the error-behavior compatibility matching process as the main difference between the traditional and BIC design flows. Two different error-behavior compatibility matching strategies are also introduced, and their applicability is verified by applying them to exploit BICs in the hardware implementation of significant case studies. These include a general multiply-accumulate (MAC) block, the basic building block of many signal processing applications, and an Artificial Neural Network (ANN) as a critical instance of MAC-based IT applications.


Fig. 6: Predicted maximum swing of various nodes in a 122-order FIR filter. 
Fig. 7: Predicted maximum swing of various nodes in a 5-order DF2 high pass filter. 
A Novel Integer-Bit Estimation Scheme in Digital Filters Based on Probabilistic Behavior of Signals in the Internal Nodes

October 2016 · 66 Reads · 1 Citation

IEEE Transactions on Circuits and Systems II: Express Briefs

Accurate estimation of the maximum swing of internal nodes in digital filters is highly desirable, because it translates into a better hardware implementation in terms of area, power, and reliability. In this paper, an accurate semi-analytical method for estimating the internal node swing is presented. We use the multiple word length method, in which the fractional and integer word lengths may differ from node to node. In the proposed method, the probability distribution of the signals is considered and, based on it, an accurate evaluation of the maximum swing of the internal nodes is obtained using only one simple simulation. This analytical method is supported by theorems on the behavior of linear filters, which are presented in this paper. Unlike most previous methods, our method is independent of the input signal; thus, the calculated swings are valid for all kinds of inputs. Simulation results for various types of filters, such as FIR, DF2, and cascade, show that the range interval approximated by the proposed method is tighter than that of other methods, which means that our method can save more area and power in digital filters.
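For context on what "maximum swing" and integer-bit sizing mean, the sketch below computes the classic worst-case (L1-norm) bound for the accumulator nodes of a direct-form FIR filter and converts each bound into a signed integer-bit count. This is a well-known conservative baseline, not the semi-analytical probabilistic method of the paper, and the function names and the direct-form accumulator structure are assumptions made for illustration.

```python
import math


def fir_node_swings(coeffs, input_max=1.0):
    """Worst-case |value| at each accumulator node of a direct-form FIR filter.

    Node k holds the running sum of the first k+1 taps, so for any input with
    |x[n]| <= input_max its magnitude is at most input_max * sum(|c[0..k]|).
    """
    swings, acc = [], 0.0
    for c in coeffs:
        acc += abs(c) * input_max
        swings.append(acc)
    return swings


def integer_bits(swing):
    """Integer bits (sign bit included) so that +/-swing fits in two's complement.

    n bits cover [-2**(n-1), 2**(n-1)), so we need 2**(n-1) > swing.
    """
    if swing <= 0:
        return 1
    return max(1, math.floor(math.log2(swing)) + 2)
```

For the taps `[0.25, 0.5, 0.25]` and inputs bounded by 1.0, the node bounds are `[0.25, 0.75, 1.0]`, which this sizing rule maps to 1, 1, and 2 integer bits. A tighter estimate of the swing, which is what the abstract is after, would allow fewer bits at some nodes.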


Reliability aware throughput management of chip multi-processor architecture via thread migration

April 2016 · 63 Reads · 6 Citations

The Journal of Supercomputing

Integrating a large number of transistors on a single chip significantly improves processor performance. Further performance is achieved by placing multiple CPU cores on a single chip, an arrangement known as the chip multiprocessor (CMP) architecture. On the other hand, miniaturization and the integration of a large number of transistors in new silicon devices such as CMPs increase susceptibility to soft errors and degrade reliability. Previous research has exploited traditional redundancy techniques such as dual- and triple-core redundancy to tolerate faults in CMP architectures, but these methods impose significant performance and energy overheads. In this paper, we present a performance-efficient soft error protection scheme for CMP architectures that is based on simultaneous multithreading. Fortunately, some soft errors are masked at the architectural level and do not cause visible output errors. This masking effect can be used to reduce much of the overhead of reliability enhancement techniques against soft errors. The architectural vulnerability factor (AVF) has recently been widely used to estimate the portion of soft errors that are masked. In this article, we propose a reliability-aware CMP architecture that uses online AVF estimation to determine the level of protection. To meet system reliability demands, the estimated AVF is used to apply partial redundancy against soft errors, which leads to significant performance improvement. We also introduce a dynamic scheduling method for mapping threads onto cores to enhance the total throughput of the CMP architecture. Our dynamic scheduling migrates threads among cores by simultaneously considering the total vulnerability and throughput of the cores. Thread migration balances the load between cores and improves performance. Our experimental results on SPEC CPU2006 show up to 38% improvement in core throughput in different phases of thread migration compared to static mapping of threads onto the cores.



Power Efficient High-Level Synthesis by Centralized and Fine-Grained Clock Gating

December 2015 · 27 Reads · 14 Citations

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

Power is a primary concern in digital circuits, and clock distribution networks are a particularly significant power consumer. Clock gating is therefore an effective technique for saving dynamic power by reducing switching activity. In this paper, we propose centralized and fine-grained microarchitecture-level clock gating for low-power hardware accelerators that are automatically designed by a high-level synthesis (HLS) tool. The basic principle of our idea is to avoid any extra computation for generating clock enable signals and instead to exploit existing signals of the finite state machine to control the datapath clock network. Once the current state of the finite state machine is determined, the clock sub-tree of the current state is enabled and the other sub-trees are disabled, at the cost of a slight increase in circuit area. Our approach is implemented within an HLS design flow for automatic low-power hardware accelerator generation in application-specific integrated circuit design. Experimental results are obtained on a set of representative benchmark programs. Depending on the circuit size and the number of registers, a 47%-86% reduction in power dissipation is observed.
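The gating principle in this abstract, where the FSM state itself decides which clock sub-tree is active, can be illustrated with a toy behavioral model. The names below, the mapping `registers_by_state`, and the use of register clock pulses as a stand-in for clock power are all assumptions made for illustration. The sketch says nothing about the actual HLS flow or the reported power figures.

```python
def clock_enables(state, registers_by_state, all_registers):
    """Clock-enable value per register, derived only from the current FSM state."""
    active = registers_by_state.get(state, ())
    return {reg: reg in active for reg in all_registers}


def clock_pulses(state_trace, registers_by_state, all_registers):
    """Return (ungated, gated) counts of register clock pulses over a state trace.

    Ungated: every register is clocked in every cycle.
    Gated: only the registers in the current state's sub-tree are clocked.
    """
    ungated = len(state_trace) * len(all_registers)
    gated = sum(len(registers_by_state.get(state, ())) for state in state_trace)
    return ungated, gated
```

For four registers `a` to `d`, with state `S0` writing `a`, `S1` writing `b` and `c`, and `S2` writing `d`, the trace `S0, S1, S2, S1` gives 16 ungated pulses against 6 gated ones. No enable logic beyond the state decode is modeled, which mirrors the idea of reusing existing FSM signals.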


A 256-kb 9T Near-Threshold SRAM With 1k Cells per Bitline and Enhanced Write and Read Operations

November 2015 · 181 Reads · 79 Citations

IEEE Transactions on Very Large Scale Integration (VLSI) Systems

In this paper, we present a new 9T SRAM cell that has good write ability and improves read stability at the same time. Simulation results show that the proposed design increases the read static noise margin and the I-ON/I-OFF ratio of the read path by 219% and 113%, respectively, at a supply voltage of 300 mV over a conventional 6T SRAM cell in a 90-nm CMOS technology. The proposed design lets us reduce the minimum operating voltage of the SRAM (VDDmin) to 350 mV, whereas a conventional 6T SRAM cannot operate successfully with an acceptable failure rate at supply voltages below 725 mV. We also compared our design with three other SRAM cells from recent literature. To verify the proposed design, a 256-kb SRAM is designed using the new 9T and conventional 6T SRAM cells. Operating at their minimum possible VDD values, the proposed design decreases write and read power per operation by 92% and 93%, respectively, over its conventional rival. The area of the proposed SRAM cell is 83% larger than that of a conventional 6T cell. However, due to the large I-ON/I-OFF ratio of the read path of the 9T cell, we are able to put 1k cells in each column of the 256-kb SRAM block, which allows the write and read circuitry of each column to be shared among more cells than with the conventional 6T cell. Thus, the area overhead of the 256-kb SRAM based on the new 9T cell is reduced to 37% compared with the 6T SRAM.


Process variation-aware approximation for efficient timing management of digital circuits

September 2015 · 45 Reads · 1 Citation

With the ever-decreasing transistor feature size in recent years, process variation has become a serious challenge in digital system design, mainly due to the undesirable timing uncertainty it introduces in circuits. This variation in the size of circuit components changes the logic and wire delays, which in turn may result in delay faults or clock degradation. In this paper, we propose to use the emerging approximate computing technique to tackle the process variation problem in new technology points. In this method, functional units (sub-blocks) along critical paths are replaced with faster approximate versions to guarantee timing closure in the presence of process variation. Approximation applies not only to critical-path sub-blocks but also to paths whose length is close enough to the critical path length that variation could push their delay above it. Evaluation results on several benchmarks show that this method can provide the required safety margin against delay faults, while its side effect, i.e., accuracy loss, remains low and acceptable.
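The selection rule in this abstract, which approximates the critical path and any path that variation could push past it, can be sketched as a simple filter. The function name, the dictionary of nominal path delays, and the single worst-case fractional slowdown `variation_fraction` are assumptions made for illustration. The abstract does not specify the variation model used in the paper.

```python
def paths_to_approximate(path_delays, variation_fraction):
    """Names of paths whose worst-case delay can reach the nominal critical delay.

    path_delays        -- assumed map: path name -> nominal delay
    variation_fraction -- assumed worst-case relative slowdown (0.1 means +10%)
    """
    critical = max(path_delays.values())
    return sorted(
        name for name, delay in path_delays.items()
        if delay * (1 + variation_fraction) >= critical
    )
```

With nominal delays `{"p0": 10.0, "p1": 9.5, "p2": 8.0}` and a 10% slowdown, the sketch selects `p0` and `p1`. With no variation only the critical path `p0` is selected, and with a 30% slowdown all three paths are.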


Citations (59)


... Similarly, in the case of low-level synthesis, there are methods leading to reducing power consumption. The most important ideas leading to reducing power consumption include the idea of locally reducing the supply voltage [16-19], "power gating" [13-24], or temporarily blocking the clock signal (clock gating) [25][26][27][28][29][30][31]. Alternatively, to block the clock signal, it is sometimes better to only lower the clock frequency in some parts of the circuit [32,33]. ...

Reference:

Multi-Level Sum of Product (SOP) Network Power Optimization Based on Switching Graph
Power Efficient High-Level Synthesis by Centralized and Fine-Grained Clock Gating
  • Citing Article
  • December 2015

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

... Main part of power dissipation in SRAMs is its leakage power; power consumption related to switching of big capacitances on bitlines and word-lines of SRAM array. One effective solution to reduce the power consumption is reducing supply voltage of the chip [3]. However, low voltage operation places several duress on SRAM functionality such as reducing noise margin, increasing sensitivity to process variation, small on to off current ratio, decreasing write ability, etc. ...

A New Tri-State Based Static Random Access Memory (SRAM) with Improved Write-Ability and Read Stability

... In addition, the timing errors encountered in the last row of MACs will not be tackled by TE-Drop, also resulting in an accuracy drop. A naive approach to tackle timing violations is to allow the erroneous data to flow through the successive stages of operations [12], [15]. This technique undermines the effects of the erroneous data in DNN computations, as a large number of timing errors causes a significant drop in the inference accuracy [16]. ...

Lifetime improvement by exploiting aggressive voltage scaling during runtime of error-resilient applications
  • Citing Article
  • November 2017

Integration

... In addition, dedicated systems are needed in many industrial applications to meet lower power and space requirements. Dedicated hardware is more plausible and can offer enormous speed and embeddedness advantages that may be needed in different applications (Fakhraie & Smith, 1997; Reyneri, 1998, 2003). These systems have distribution of computing elements and parallel processing behavior similar to biological systems. ...

VLSI — Compatible Implementations for Artificial Neural Networks
  • Citing Book
  • January 1997

... 1. Evaluating error with meaningful metrics: The first consideration is on examining the target application or system and finding out which error metric can better represent the system inaccuracies [3,4]. Extensive simulations on multiply-accumulate (MAC) units show that these components are more sensitive to the Mean Error (ME) than to other metrics [3]. ...

Efficient utilization of imprecise computational blocks for hardware implementation of imprecision tolerant applications
  • Citing Article
  • March 2017

Microelectronics Journal

... There are several studies that improve the performance of RMT techniques by efficient scheduling of threads on manycore architectures [12, 15-19]. The studies in the literature differentiate the priorities of redundant threads since there is information flow from leaders to trailers in hardware-level RMT techniques. Kalayappan and Sarangi [12] prioritized the leaders over the trailers and pinned them onto the dedicated cores. ...

Reliability aware throughput management of chip multi-processor architecture via thread migration

The Journal of Supercomputing

... It also takes up additional hardware resources and is very time-consuming in terms of system re-engineering [6], [7]. Evolution Hardware has the characteristics of self-organization, self-adaption and self-repair, and thus, it is a good match to fault-tolerant systems [10], [11]. NASA / JPL first started research on the fault tolerance of programmable devices. ...

Implementation of image processing applications with evolutionary fault recovery scheme
  • Citing Conference Paper
  • May 2014

... The effect of slew and jitter in clock network were analyzed by considering the supply noise. A separate procedure is followed for power supply noise variation, which analyzes the slew and jitter [19,20]. A selectively switching the parts of logic network to off condition is proposed. ...

_______
  • Citing Article
  • December 2015

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

... A NEED for fast determining similarity between two multidimensional vectors appears in many areas of signal processing nowadays. This concerns, among others, neural networks implemented in hardware [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], especially so–called self–organizing ones. One of important problems in such networks is how to learn them on silicon in a precise and fast way. ...

VLSI — Compatible Implementations for Artificial Neural Networks
  • Citing Chapter
  • January 1997

... Computation complexity is a major problem of intelligent systems in their industrial aspect [1]. These days, implementation of intelligent systems on DSPs, FPGA, ASIC and other hardware platforms is possible and cost effective through developments in VLSI technology [1][2][3][4]. As an example, DSPs (digital signal processor) are used widely in communications, controls, and speech processing and alike applications [5]. ...

VLSI — Compatible Implementations for Artificial Neural Networks
  • Citing Chapter
  • January 1997