Oscar GustafssonLinköping University | LiU · Department of Electrical Engineering (ISY)
Oscar Gustafsson
PhD
About
212
Publications
49,552
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
2,952
Citations
Introduction
Skills and Expertise
Publications
Publications (212)
The time-to-market of a new product is one of its most crucial factors for success, therefore, reducing this time is of utter importance. However, this reduction must not come at the expense of a less thorough development process. This paper presents a compiler-driven approach for automatically analyzing metrics such as transaction delays or bus th...
A design method for chromatic dispersion compensation filters realized using overlap-save processing in the frequency domain is proposed. Based on the idea to use the values that are normally zero-padded, better results than using optimal time-domain design are obtained without any modification of the overlap-save processing complexity.
Spade is a new open source hardware description language (HDL) designed to increase developer productivity without sacrificing the low-level control offered by HDLs. It is a standalone language which takes inspiration from modern software languages, and adds useful abstractions for common hardware constructs. It also comes with a convenient set of...
Spade is a new open source hardware description language (HDL) designed to increase developer productivity without sacrificing the low-level control offered by HDLs. It is a standalone language which takes inspiration from modern software languages, and adds useful abstractions for common hardware constructs. It also comes with a convenient set of...
FIR filtering realized in frequency domain can use different FFT sizes leading to different arithmetic complexities. The implementation results indicate that not only arithmetic complexities must be considered for minimal power consumption.
Approaches to shift-and-add realization of time-domain chromatic dispersion compensation FIR filters are considered. The coefficient word length has larger impact than filter length on both BER penalty and adder complexity.
User activity detection in grant-free random access massive machine type communication (mMTC) using pilot-hopping sequences can be formulated as solving a non-negative least squares (NNLS) problem. In this work, two architectures using different algorithms to solve the NNLS problem is proposed. The algorithms are implemented using a fully parallel...
Finite word length effects for frequency-domain implementation of chromatic dispersion compensation is analyzed. The results show a significant difference for the dif- ferent factors when it comes to power consumption and receiver penalty.
In this work, we present an approach to alleviate the potential benefit of adder graph algorithms by solving the transposed form of the problem and then transposing the solution. The key contribution is a systematic way to obtain the transposed realization with a minimum number of cascaded adders subject to the input realization. In this way, wide...
User activity detection in grant-free random access massive machine type communication (mMTC) using pilot-hopping sequences can be formulated as solving a non-negative least squares (NNLS) problem. In this work, two architectures using different algorithms to solve the NNLS problem is proposed. The algorithms are implemented using a fully parallel...
The most challenging aspect of particle filtering hardware implementation is the resampling step. This is because of high latency as it can be only partially executed in parallel with the other steps of particle filtering and has no inherent parallelism inside it. To reduce the latency, an improved resampling architecture is proposed which involves...
We perform exploratory ASIC design of key DSP and FEC units for 400-Gbit/s coherent data-center interconnect receivers. In 22-nm CMOS, the considered units together dissipate 5 W, suggesting implementation feasibility in power-constrained form factors.
Prime factor algorithms are beneficial in fully parallel frequency-domain implementation of CDC filters and enable a more continuous scaling of filter lengths. ASIC implementation results in 28-nm CMOS for 60 GBd are provided.
When optimising the vehicle trajectory and powertrain energy management of hybrid electric vehicles, it is important to include look-ahead information such as road conditions and other traffic. One method for doing so is dynamic programming, but the execution time of such an algorithm on a general purpose CPU is too slow for it to be useable in rea...
In this paper, we present the first implementation of a 1 million-point fast Fourier transform (FFT) completely integrated on a single field-programmable gate array (FPGA), without the need for external memory or multiple interconnected FPGAs. The proposed architecture is a pipelined single-delay feedback (SDF) FFT. The architecture includes a spec...
Optimization problem formulation for semi-digital FIR digital-to-analog converter (SDFIR DAC) is investigated in this work. Magnitude and energy metrics with variable coefficient precision are defined for cascaded digital \(\varSigma \varDelta\) modulators, semi-digital FIR filter, and Sinc roll-off frequency response of the DAC. A set of analog me...
A direct digital-to-RF converter (DRFC) is presented in this work. Due to its digital-in-nature design, the DRFC benefits from technology scaling and can be monolithically integrated into advance digital VLSI systems. A fourth-order single-bit quantizer bandpass digital ΣΔ modulator is used preceding the DRFC, resulting in a high in-band signal-to-...
In this paper, we present a systematic approach to design hardware circuits for bit-dimension permutations. The proposed approach is based on decomposing any bit-dimension permutation into elementary bit-exchanges. Such decomposition is proven to achieve the theoretical minimum number of delays required for the permutation. This offers optimum solu...
In this paper, a fast Fourier transform (FFT) hardware architecture optimized for field-programmable gate-arrays (FPGAs) is proposed. We refer to this as the single-stream FPGA-optimized feedforward (SFF) architecture. By using a stage that trades adders for shift registers as compared with the single-path delay feedback (SDF) architecture the effi...
In this work, a scalable and modular architecture for massive MIMO base stations with distributed processing is proposed. New antennas can readily be added by adding a new node as each node handles all the additional involved processing. The architecture supports conjugate beamforming, zero-forcing, and MMSE, where for the two latter cases a centra...
In this paper, an efficient mapping of the pipeline single-path delay feedback (SDF) fast Fourier transform (FFT) architecture to field-programmable gate arrays (FPGAs) is proposed. By considering the architectural features of the target FPGA, significantly better implementation results are obtained. This is illustrated by mapping an R2
<sup xmlns:...
This paper presents a new type of FFT hardware architectures called serial commutator (SC) FFT. The SC FFT is characterized by the use of circuits for bit-dimension permutation of serial data. The proposed architectures are based on the observation that in the radix-2 FFT algorithm only half of the samples at each stage must be rotated. This fact,...
In this brief, we propose a novel approach to implement multiplierless unity-gain single-delay feedback fast Fourier transforms (FFTs). Previous methods achieve unity-gain FFTs by using either complex multipliers or nonunity-gain rotators with additional scaling compensation. Conversely, this brief proposes unity-gain FFTs without compensation circ...
This paper considers frequency-domain implementation of finite-length impulse response filters. In practical fixedpoint arithmetic implementations, the overall system corresponds to a time-varying system which can be represented with either a multirate filter bank, and the corresponding distortion and aliasing functions, or a periodic time-varying...
In this paper a new structure for frequency masking FIR fil-ters is introduced. By using identical model and masking fil-ters (except for the periodicity) folding can be used to efficiently implement the filter by mapping all subfilters to a single hardware structure. The resulting structure will for most cases require fewer multipliers and adders...
This paper analyzes the limits of FFT performance on FPGAs. For this purpose, a FFT generation tool has been developed. This tool is highly parameterizable and allows for generating FFTs with different FFT sizes and amount of parallelization. Experimental results for FFT sizes from 16 to 65536, and 4 to 64 parallel samples have been obtained. They...
In this brief, we present the CORDIC II algorithm. Like previous CORDIC algorithms, the CORDIC II calculates rotations by breaking down the rotation angle into a series of microrotations. However, the CORDIC II algorithm uses a novel angle set, different from the angles used in previous CORDIC algorithms. The new angle set provides a faster converg...
In this brief, we propose how the hardware complexity of arbitrary-order digital multibit error–feedback delta–sigma modulators can be reduced. This is achieved by splitting the combinatorial circuitry of the modulators into two parts, i.e., one producing the modulator output and another producing the error signal fed back. The part producing modul...
This paper presents a new approach to design multiplierless constant rotators. The approach is based on a combined coefficient selection and shift-and-add implementation (CCSSI) for the design of the rotators. First, complete freedom is given to the selection of the coefficients, i.e., no constraints to the coefficients are set in advance and all t...
An oversampled digital-to-analog converter including digital ΣΔ modulator and semi-digital FIR filter can be employed in the transmitter of the VDSL2 technology. To select the optimum set of coefficients for the semi-digital FIR filter, an integer optimization problem is formulated in this work, where the model includes the FIR filter magnitude met...
Logarithmic number system (LNS) is an attractive alternative to realize finite-length impulse response filters because of multiplication in the linear domain being only addition in the logarithmic domain. In the literature, linear coefficients are directly replaced by the logarithmic equivalent. In this paper, an approach to directly optimize the f...
This paper proposes optimal finite-length impulse response (FIR) digital filters, in the least-squares (LS) sense, for compensation of chromatic dispersion (CD) in digital coherent optical receivers. The proposed filters are based on the convex minimization of the energy of the complex error between the frequency responses of the actual CD compensa...
In this work an FIR filter architecture requiring only half the number of multiplications compared to a direct realization is proposed. The proposed filter architecture is independent on coefficient selection and is therefore suitable for the realization of FIR filters where the filter impulse response is not symmetric/anti-symmetric. The filter ar...
This paper presents a reconfigurable FFT architecture for variable-length and multi-streaming WiMax wireless standard. The architecture processes 1 stream of 2048-point FFT, up to 2 streams of 1024-point FFT or up to 4 streams of 512-point FFT. The architecture consists of a modified radix-2 single delay feedback (SDF) FFT. The sampling frequency o...
In this paper, the fixed-point implementation of adjustable fractional-delay filters using the Farrow structure is considered. Based on the observation that the sub-filters approximate differentiators, closed-form expressions for the $L_2$-norm scaling values at the outputs of each sub-filter as well as at the inputs of each delay multiplier are de...
A unified hardware architecture that can be reconfigured to calculate 2, 3, 4, 5, or 7-point DFTs is presented. The architecture is based on the Winograd Fourier transform algorithm and the complexity is equal to a 7-point DFT in terms of adders/subtractors and multipliers plus only seven multiplexers introduced to enable reconfigurability. The pro...
The appearance of radix-2
<sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sup>
was a milestone in the design of pipelined FFT hardware architectures. Later, radix-2
<sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sup>
was extended to radix-2
k
. How...
Rotations by angles that are fractions of the unit circle find applications in e.g. fast Fourier transform (FFT) architectures. In this work we propose a new rotator that consists of a series of stages. Each stage calculates a micro-rotation by an angle corresponding to a power-of-three fractional parts. Using a continuous powers-of-three range, it...
Many contemporary FPGAs have introduced a pre-adder before the hard multipliers, primarily aimed at linear-phase FIR filters. In this work, structural modifications are proposed with the aim of reducing the LUT resource utilization and, finally, using the pre-adder for implementing single path delay feedback pipeline FFTs. The results show that two...
The coefficient decimation technique for reconfigurable FIR filters was recently proposed as a filter structure with low computational complexity. In this brief, we propose to design these filters using linear programming taking all configuration modes into account, instead of only considering the initial reconfiguration mode as in previous works....
In this work we consider optimized twiddle factor multipliers based on shift-and-add-multiplication. We propose a low-complexity structure for twiddle factors with a resolution of 32 points. Furthermore, we propose a slightly modified version of a previously reported multiplier for a resolution of 16 points with lower round-off noise. For completen...
This paper presents a 512-point feedforward FFT architecture for wireless personal area network (WPAN). The architecture processes a continuous flow of 8 samples in parallel, leading to a throughput of 2.64 GSamples/s. The FFT is computed in three stages that use radix-8 butterflies. This radix reduces significantly the number of rotators with resp...
In this work, we consider the computational complexity of different polynomial evaluation schemes. By considering the number of operations of different types, critical path, pipelining complexity, and latency after pipelining, high-level comparisons are obtained. These can then be used to short list suitable candidates for an implementation given t...
When generating a sine table to be used in, e.g., frequency synthesis circuits, a widely used way to assign the table content is to simply take a sine wave with the desired amplitude and quantize it using rounding. This results in uncontrolled rounding of up to 0.5 LSB, causing some noise. In this paper we present a method for increasing the signal...
The complexity of narrow transition band FIR filters is high and can be reduced by using frequency response masking (FRM) techniques. These techniques use a combination of periodic model filters and masking filters. In this paper, we show that time-multiplexed FRM filters achieve lower complexity, not only in terms of multipliers, but also logic el...
This brief presents novel circuits for calculating bit reversal on a series of data. The circuits are simple and consist of buffers and multiplexers connected in series. The circuits are optimum in two senses: they use the minimum number of registers that are necessary for calculating the bit reversal and have minimum latency. This makes them very...
This letter describes an efficient architecture for the computation of fast Fourier transform (FFT) algorithms with single-bit input. The proposed architecture is aimed for the first stages of pipelined FFT architectures, processing one sample per clock cycle, hence making it suiable for real-time FFT computation. Since natural input order pipeline...
The use of rate-compatible error correcting codes offers several advantages as compared to the use of fixed- rate codes: a smooth adaptation to the channel conditions, the possibility of incremental Hybrid ARQ schemes, as well as simplified code representations in the encoder and decoder. In this paper, the implementation of a decoder for rate-comp...
In this work a systematic method to generate all possible fast Fourier transform (FFT) algorithms is proposed based on the relation to binary trees. The binary tree is used to represent the decomposition of a discrete Fourier transform (DFT) into sub-DFTs. The radix is adaptively changed according to compute sub-DFTs in proposed decomposition. In t...
The Discrete wavelet transform (DWT) has been used in a wide range of real-time application. Algebraic integer quantization (AIQ) encoding has been proposed to represent the irrational transform basis of the wavelet transform as polynomials with integer coefficients. In this paper, we suggest to restate these polynomials to obtain simpler coefficie...
Matrix inversion is sensitive towards the number representation used. In this paper simulations of matrix inversion with numbers represented in the fixed-point and logarithmic number systems (LNS) are presented. A software framework has been implemented to allow extensive simulation of finite wordlength matrix inversion. Six different algorithms ha...
When using the fixed-point format for storing a number we store the nearest quantized number. When a number is represented in the logarithmic number system (LNS) we instead of quantization and storing the actual number, quantize and store the logarithm of the number. We could for this purpose use the logarithm of any base but here we will assume th...
Frequency-response masking (FRM) is a set of techniques for lowering the computational complexity of narrow transition band FIR filters. These FRM use a combination of sparse periodic filters and non-sparse filters. In this work we consider the implementation of these filters in a time-multiplexed manner on FPGAs. It is shown that the proposed arch...
In this work we propose a graph based minimum adder depth algorithm for the multiple constant multiplication (MCM) problem. Hence, all multiplier coefficients are here guaranteed to be realized at the theoretically lowest depth possible. The motivation for low adder depth is that this has been shown to be a main factor for the power consumption. An...
We connect to the two recent IEEE Signal Processing Magazine special issues on digital signal processing (DSP) on multicore processors (November 2009 and March 2010) and address an issue that was not addressed in the articles there, which we believe has important consequences for DSP algorithm design in the future. The basic observation that we sta...
In this work we discuss the realization of constant multiplication using a minimum number of carry-save adders. We consider both non-redundant and carry-save representation for the input data. For both cases we present all possible interconnection topologies, using up to six and five adders, respectively. These are sufficient to realize constant mu...
Complex rotations find use in common transforms such as the Discrete Cosine Transform (DCT) and the Discrete Fourier Transform (DFT). In this work we consider low-complexity realization of constant angle rotators based on shifts, adders, and subtracters. The results show that redundant CORDIC and scaled constant multiplication are providing the bes...
The generation of a canonical signed digit representation from a binary representation is revisited. Based on the property that each nonzero digit is surrounded by a zero digit, a hardware-efficient conversion method using bypass instead of carry propagation is proposed. The proposed method requires less area per digit and the required bypass signa...
The use of rate-compatible error correcting codes offers several advantages as compared to the use of fixed-rate codes: a smooth adaptation to the channel conditions, the possibility of incremental Hybrid ARQ schemes, as well as sharing of the encoder and decoder implementations between the codes of different rates. In this paper, the implementatio...
The problem of reconfigurable multiple constant multiplication (ReMCM) is about finding an cost-effective network of shifts, additions, subtractions, and multiplexers to implement the multiplication of a single input variable with one out of several sets of coefficients. Most previous publications only focus on the problem with a single output, whe...
In this work, we investigate the problem of computing any requested set of power terms in parallel using summations trees. This problem occurs in applications like polynomial approximation, Farrow filters (polynomial evaluation part) etc. In the proposed technique, the partial product of each power term is initially computed independently. A redund...
In this paper, we propose higher point FFT (fast Fourier transform) algorithms for a single delay feedback pipelined FFT architecture considering the 4096-point FFT. These algorithms are different from each other in terms of twiddle factor multiplication. Twiddle factor multiplication complexity comparison is presented when implemented on Field-Pro...
In this work, a method for estimation of the switching activity in integrators is presented. To achieve low power, it is always necessary to develop accurate and efficient methods to estimate the switching activity. The switching activities are then used to estimate the power consumption. In our work, the switching activity is first estimated for t...
The core of many DSP tasks is Multiplication of one data with several constants, i.e. in Digital filtering, image processing DCT and DFT. The Modern Portable equipments like Cellular phones and MP3 players which has DSP circuits, involve large number of multiplications of one variable with several constants (MCM) which leads to large area, delay an...
An optimization algorithm for the design of puncturing patterns for low-density parity-check codes is proposed. The algorithm is applied to the base matrix of a quasi-cyclic code, and is expanded for each block size used. Thus, storing puncturing patterns specific to each block size is not required. Using the optimization algorithm, the number of 1...
The power modeling of different realizations of cascaded integrator-comb (CIC) decimation filters has been a subject of several recent works. In this work we have extended these with modeling of leakage power, which is an important factor since the input sample rate may differ several orders of magnitude. Furthermore, we have pointed out the import...
Sub-expression sharing is a technique that can be applied to reduce the complexity of linear time-invariant non-recursive computations by identifying common patterns. It has recently been proposed that it is possible to improve the performance of single and multiple constant multiplication by identifying overlapping digit patterns. In this work we...
In this chapter fundamentals of arithmetic operations and number representations used in DSP systems are discussed. Different relevant number systems are outlined with a focus on fixed-point representations. Structures for accelerating the carry-propagation of addition are discussed, as well as multi-operand addition. For
multiplication, different...
Wireless communication standards are developed at an ever-increasing rate of pace, and significant amounts of effort is put into research for new communication methods and concepts. On the physical layer, such topics include MIMO, cooperative communication, and error control coding, whereas research on the medium access layer includes link control,...
In this paper, we propose equivalent radix-2<sup>2</sup> algorithms and evaluate them based on twiddle factor switching activity for a single delay feedback pipelined FFT architecture. These equivalent pipeline FFT algorithms have the same number of complex multipliers with the same resolution as the radix-2<sup>2</sup>. It is shown that the twiddl...
In this work, a design method for narrow-band and wide-band frequency-response masking FIR filters is proposed. As opposed to most previous works, the design method is not based on a periodic model filter. Instead, the masking filter is designed for a given stopband edge. The model filter design is based on optimizing the sparseness of the filter,...
In this work we consider high-speed FIR filter architectures implemented using, possibly pipelined, carry-save adder trees for accumulating the partial products. In particular we focus on the mapping between partial products and full adders and propose a technique to reduce the number of carry-save adders based on the inherent redundancy of the par...
Analog-to-digital converters based on sigma-delta modulation have shown promising performance, with steadily increasing bandwidth. However, associated with the increasing bandwidth is an increasing modulator sampling rate, which becomes costly to decimate in the digital domain. Several architectures exist for the digital decimation filter, and amon...
In this work a new technique for design of narrow-band and wide-band linear-phase finite-length impulse response (FIR) frequency-response masking based filters is introduced. The technique is based on a sparse FIR filter design method for both the model (bandedge shaping) filter as well as the mask-ing filter using mixed integer linear programming...
In this work we consider three different techniques for avoiding sign-extension in constant multiplication based on shifts, additions, and subtraction. Especially, we consider the multiple constant multiplication (MCM) problem arising in transposed direct form FIR filters. The main advantage of avoiding sign-extension is the reduced load of the sig...
In this work, the trade-offs in FIR filter design are studied. This includes the adder depth for the constant filter coefficients, the number of adders, and the number of delay elements, i.e., the filter order. It is shown that the proposed design algorithm can be used to decrease both the overall arithmetic complexity and the adder depth, possibly...
In this work, we analyze different approaches to store the coefficient twiddle factors for different stages of pipelined Fast Fourier Transforms (FFTs). The analysis is based on complexity comparisons of different algorithms when implemented on Field-Programmable Gate Arrays (FPGAs) and ASIC for different radix-2<sup>i</sup> algorithms. The objecti...
In this work we consider scaling of fractional delay filters using the Farrow structure. Based on the observation that the subfilters approximate the Taylor expansion of a differentiator, we derive estimates of the L<sub>2</sub>-norm scaling values at the outputs of each subfilter as well as at the inputs of each delay multiplier. The scaling value...