Analysis and Implementation of Multiple–Input, Multiple–Output VBLAST Receiver From Area and Power Efficiency Perspective
ABSTRACT This paper presents an analysis of the vertical Bell Laboratories layered space time (VBLAST) receiver used in a multiple-input multiple-output (MIMO) wireless system from the hardware implementation perspective and identifies those processing elements that consume more area and power due to complex signal processing. This paper models a scalable VBLAST receiver based on the minimum mean square error (MMSE) nulling criterion, assuming a block flat fading channel. After identifying the major area and power consuming blocks, this paper proposes two area and power efficient VLSI architectures for the block that computes the pseudoinverse of the channel matrix. This paper discusses different tradeoff issues in both architectures and compares them with the architectures in the literature.
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 14, NO. 11, NOVEMBER 2006 1281
Analysis and Implementation of Multiple–Input,
Multiple–Output VBLAST Receiver From Area and Power
Efficiency Perspective
Zahid Khan, Tughrul Arslan, John S. Thompson, and
Ahmet T. Erdogan
Abstract—This paper presents an analysis of the vertical Bell Laboratories layered space time (VBLAST) receiver used in a multiple-input multiple-output (MIMO) wireless system from the hardware implementation perspective and identifies those processing elements that consume more area and power due to complex signal processing. This paper models a scalable VBLAST receiver based on minimum mean square error (MMSE) nulling criteria assuming a block flat fading channel. After identifying the major area and power consuming blocks, this paper proposes two area and power efficient VLSI architectures for the block that computes pseudoinverse of the channel matrix. This paper discusses different tradeoff issues in both architectures and compares them with the architectures in the literature.
Index Terms—CORDIC, Jacobi rotation, minimum mean square error (MMSE), multiple-input multiple-output (MIMO) wireless system, pseudoinverse, square root algorithm, vertical Bell Laboratories layered space time (VBLAST), VLSI.
I. INTRODUCTION
Multiple-input multiple-output (MIMO) wireless communication promises to significantly improve the range, reliability, and data speed of existing wireless systems [1], [2]. Because of its advantages, MIMO is being studied for use in almost every wireless network such as OFDM, CDMA2000, and WCDMA. However, due to complex signal processing, MIMO is highly expensive with regard to area and power consumption if implemented in hardware.
Manuscript received March 9, 2006; revised June 8, 2006. This work was supported by the Ministry of Science and Technology, Government of Pakistan, under Grant TROSS/35.
The authors are with the School of Engineering and Electronics, the University of Edinburgh, EH9 3JL, U.K. (e-mail: firstname.lastname@example.org; tughrul.ar-
Digital Object Identifier 10.1109/TVLSI.2006.886403
TABLE I. COST COMPARISON OF MIMO ALGORITHMS FOR $M = N$
VBLAST is a detection algorithm [4] that provides a good tradeoff between bit error rate (BER) performance and computational complexity compared to its counterparts. A comparison of computational complexity for different algorithms is given in Table I [5], which shows that zero forcing (ZF) and minimum mean square error (MMSE) detectors are computationally less expensive than VBLAST; however, they provide poorer BER performance than VBLAST. Maximum likelihood (ML) detection provides the optimal solution; however, its computational complexity is exponential in the number of transmit antennas ($M$) and is prohibitively high for more than four antennas. Therefore, the ML algorithm cannot usually be implemented on mobile platforms due to its high overhead of area and power. VBLAST can provide BER performance close to ML at a computational cost much lower than ML. In VBLAST itself, the bottleneck is the repeated pseudoinverse calculation required to compute the optimal ordering and nulling vectors. The pseudoinverse can be computed using the complex singular value decomposition (SVD) method [6]. However, such computation is expensive in both silicon area and power consumption. For an equal number of transmit and receive antennas ($M = N$), the complexity of the pseudoinverse through SVD in MMSE-VBLAST is $O(M^4)$, which is quite large. The square root algorithm [7] not only computes the pseudoinverse but also avoids repeated pseudoinverse computations and reduces the complexity of VBLAST to $O(M^3)$ without degrading performance.
This paper proposes two VLSI architectures for computing the pseudoinverse of the channel matrix used in a MIMO wireless system. The first is based on mixed CORDICs and multipliers while the second uses only multipliers. Both architectures are compared with regard to area, power, and throughput. They are also compared with the architectures available in the literature [8], [9]. The architectures are first modeled in MatLab to validate their correctness and determine their BER performance. They are then implemented in hardware and simulated for power and performance evaluation.
II. MIMO MATLAB MODELING
In the MIMO system model, symbols are transmitted from $M$ antennas and received by an array of $N$ antennas via a rich scattering environment. A matrix of complex coefficients is used to represent the Rayleigh flat fading channel in a baseband equivalent model. The received symbol vector is represented in (1)
$$\mathbf{r} = \mathbf{H}\mathbf{s} + \mathbf{v} \qquad (1)$$
where $\mathbf{r}$ is the $N \times 1$ received symbol vector, $\mathbf{H}$ is the $N \times M$ channel matrix, $\mathbf{s}$ is the $M \times 1$ information symbol vector, and $\mathbf{v}$ is the $N \times 1$ vector of complex additive white Gaussian noise (AWGN). The average power of the components in $\mathbf{s}$ [...] at each receiving antenna and is independent of the number of transmit antennas.
Fig. 1. Modeling MIMO-VBLAST algorithm.
Fig. 2. Jacobi rotation. (a) Serial. (b) Parallel.
Both transmitter and receiver are modeled as shown in Fig. 1. The space-time (ST) transmitter demultiplexes the data among the $M$ transmit antennas. Each substream is individually modulated using quadrature phase shift keying (QPSK) modulation. The outputs of the antennas are mixed by the Rayleigh flat fading channels. AWGN is then added to each of the $N$ signals at the receive antennas. The corrupted signals are collected in the ST receiver to form the received signal vector. The VBLAST detector [4] then iteratively detects the symbols. The detected symbols are then demodulated and the recovered binary data is compared with the corresponding transmitted data in the BER module to estimate the BER performance.
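The chain of Fig. 1 can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the authors' MatLab model: the function names, the Gray bit mapping, the SNR definition (total received signal power per antenna over noise power), and the choice of the MMSE regularization `alpha` equal to the noise variance are all assumptions made for the example. The detector orders streams by the smallest diagonal entry of the MMSE error covariance, then nulls, slices, and cancels.

```python
import numpy as np

def qpsk_mod(bits):
    """Map bit pairs to unit-energy QPSK symbols (assumed Gray mapping)."""
    b = bits.reshape(-1, 2)
    return ((1 - 2 * b[:, 0]) + 1j * (1 - 2 * b[:, 1])) / np.sqrt(2)

def qpsk_demod(symbols):
    """Hard-decision inverse of qpsk_mod."""
    return np.column_stack((symbols.real < 0, symbols.imag < 0)).astype(int).ravel()

def qpsk_slice(y):
    """Nearest QPSK constellation point to the soft estimate y."""
    re = np.where(y.real >= 0, 1.0, -1.0)
    im = np.where(y.imag >= 0, 1.0, -1.0)
    return (re + 1j * im) / np.sqrt(2)

def vblast_mmse_detect(r, H, alpha):
    """Ordered nulling and cancellation for one received vector r = H s + v."""
    M = H.shape[1]
    r = r.astype(complex).copy()
    remaining = list(range(M))
    s_hat = np.zeros(M, dtype=complex)
    while remaining:
        Hk = H[:, remaining]
        P = np.linalg.inv(Hk.conj().T @ Hk + alpha * np.eye(len(remaining)))
        G = P @ Hk.conj().T                      # MMSE pseudoinverse of the deflated channel
        k = int(np.argmin(np.real(np.diag(P))))  # most reliable remaining stream
        idx = remaining.pop(k)
        s_hat[idx] = qpsk_slice(G[k] @ r)        # null and slice
        r -= H[:, idx] * s_hat[idx]              # cancel the detected stream
    return s_hat

def simulate_burst(M, N, burst_len, snr_db, rng):
    """Simulate one block-fading burst and return the number of bit errors."""
    bits = rng.integers(0, 2, size=(burst_len, 2 * M))
    H = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
    noise_var = M / 10 ** (snr_db / 10)          # assumed SNR definition
    errors = 0
    for t in range(burst_len):
        s = qpsk_mod(bits[t])
        v = np.sqrt(noise_var / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
        s_hat = vblast_mmse_detect(H @ s + v, H, alpha=noise_var)
        errors += int(np.sum(qpsk_demod(s_hat) != bits[t]))
    return errors
```

Calling `simulate_burst(4, 4, 120, snr_db, np.random.default_rng(0))` for each SNR point and dividing the error count by the number of transmitted bits would give one burst's contribution to a BER curve.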
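A. Pseudoinverse Computation
In MMSE-VBLAST modeling, the pseudoinverse is first calculated using standard MatLab functions and then using the square root algorithm [7], which finds the QR decomposition of the augmented channel matrix given in (2)
$$\begin{bmatrix} \mathbf{H} \\ \sqrt{\alpha}\,\mathbf{I}_M \end{bmatrix} = \mathbf{Q}\mathbf{R} = \begin{bmatrix} \mathbf{Q}_\alpha \\ \times \end{bmatrix} \mathbf{R} \qquad (2)$$
where $\times$ denotes the entries that are not relevant. After QR decomposition, the algorithm computes $\mathbf{P}^{1/2} = \mathbf{R}^{-1}$. Once $\mathbf{Q}_\alpha$ and $\mathbf{P}^{1/2}$ are computed, repeated pseudoinverse calculations can be avoided. Both $\mathbf{Q}_\alpha$ and $\mathbf{P}^{1/2}$ are computed together in a series of unitary transformations given as follows [7]:
$$\begin{bmatrix} 1 & \mathbf{H}_i \mathbf{P}_{|i-1}^{1/2} \\ \mathbf{0} & \mathbf{P}_{|i-1}^{1/2} \\ -\mathbf{e}_i & \mathbf{B}_{i-1} \end{bmatrix} \boldsymbol{\Theta}_i = \begin{bmatrix} \times & \mathbf{0} \\ \times & \mathbf{P}_{|i}^{1/2} \\ \times & \mathbf{B}_i \end{bmatrix}, \qquad \mathbf{P}_{|0}^{1/2} = \frac{1}{\sqrt{\alpha}} \mathbf{I}_M, \quad \mathbf{B}_0 = \mathbf{0}_{N \times M}$$
where $\mathbf{H}_i$ is the $i$th row of $\mathbf{H}$, $\mathbf{e}_i$ is the $i$th unit vector of dimension $N$, and $\boldsymbol{\Theta}_i$ is any unitary transformation that zeroes the first row of the prearray except for its first entry. After $N$ steps, $\mathbf{P}^{1/2} = \mathbf{P}_{|N}^{1/2}$ and $\mathbf{Q}_\alpha = \mathbf{B}_N$.
[Equations (3) to (5): complex Givens rotation and its two rotation angles]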
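The relation between the QR factors of the augmented channel matrix and the MMSE pseudoinverse can be checked numerically. The sketch below is an illustration under assumptions: the function names are hypothetical, `alpha` stands for the regularization term of the augmented matrix, and NumPy's `numpy.linalg.qr` replaces the sequence of rotations used in hardware. It shows that $\mathbf{R}^{-1}\mathbf{Q}_\alpha^H$ equals $(\mathbf{H}^H\mathbf{H} + \alpha\mathbf{I})^{-1}\mathbf{H}^H$.

```python
import numpy as np

def mmse_pinv_via_qr(H, alpha):
    """MMSE pseudoinverse from the QR factors of [H; sqrt(alpha) I]."""
    N, M = H.shape
    augmented = np.vstack([H, np.sqrt(alpha) * np.eye(M)])
    Q, R = np.linalg.qr(augmented)      # reduced QR: Q is (N+M) x M, R is M x M
    Q_alpha = Q[:N, :]                  # top N rows of Q
    P_half = np.linalg.inv(R)           # square root of the error covariance
    return P_half @ Q_alpha.conj().T, P_half, Q_alpha

def mmse_pinv_direct(H, alpha):
    """Reference computation: (H^H H + alpha I)^{-1} H^H."""
    M = H.shape[1]
    return np.linalg.solve(H.conj().T @ H + alpha * np.eye(M), H.conj().T)
```

Because $\mathbf{H} = \mathbf{Q}_\alpha\mathbf{R}$ and $\mathbf{R}^H\mathbf{R} = \mathbf{H}^H\mathbf{H} + \alpha\mathbf{I}$, the two functions agree up to rounding for any full-rank augmented matrix.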
The first step in pseudoinverse calculation through the square root algorithm is formation of the prearray matrix. For a 4 × 4 system, the size of the prearray is 9 × 5. The first row of the prearray (as shown in Fig. 2) consists of a constant value 1 and a row of the channel matrix weighted by the matrix $\mathbf{P}^{1/2}$. The other elements of the prearray are formed according to [7, Step 1), Sec. III-D]. After prearray formation, each complex weighted channel element in the first row of the prearray is zeroed by rotating it against the element in the first row and first column. The complex rotation is carried out using (3). Rotation angles are calculated using (4) and (5). The sequential method of applying complex Givens rotations in each iteration of the inner loop is given in Fig. 2(a) and ordered as (1,2), (1,3), (1,4), (1,5), where the pairs represent row 1 and columns 1–5 of the prearray. The vertical axis in Fig. 2 shows inner steps inside the outer-loop iteration and the horizontal axis represents the first row of the prearray.
Fig. 3. Performance of MMSE-VBLAST.
The complex Givens rotation can also be implemented in parallel as shown in Fig. 2(b). By including parallelism, the number of inner-loop iterations (also called inner steps) can be reduced from 4 to 3.
The Jacobi rotation of (3) would result in a very complex hardware implementation due to the simultaneous use of two rotation angles. This complexity problem can be avoided by adopting the methodology presented in [10], where the complex Jacobi rotation is divided into two successive rotations. In the first rotation, the imaginary parts of the complex weighted channel elements are rotated to zero and the real parts take the values of the Euclidean norms of the respective complex weighted channel elements. In the second rotation, the real numbers obtained from the first rotation are rotated to zero. The idea is the same and there is no structural change to the Jacobi rotation. In fact, in the first rotation, the real part is taken as one column and the imaginary part as the second column. This technique has also been applied in the literature to find the QR decomposition of the complex channel matrix.
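The prearray iteration and the two-step rotation can be put together in a short software model. The following sketch is an assumption-laden illustration rather than the hardware algorithm: the function names are hypothetical, `alpha` is the regularization term of the augmented matrix in [7], and the rotations are applied serially as in Fig. 2(a). Each weighted element in the first row is first made real by a phase rotation (its value becomes its Euclidean norm) and is then zeroed against the real pivot by a real Givens rotation.

```python
import numpy as np

def zero_against_pivot(X, col):
    """Zero X[0, col] against the real pivot X[0, 0] with two successive rotations."""
    a = X[0, col]
    mag = abs(a)
    if mag == 0.0:
        return
    # First rotation: remove the imaginary part; the element becomes |a|.
    X[:, col] *= np.conj(a) / mag
    # Second rotation: real Givens rotation between columns 0 and col.
    p = X[0, 0].real
    r = np.hypot(p, mag)
    c, s = p / r, mag / r
    col0 = X[:, 0].copy()
    X[:, 0] = c * col0 + s * X[:, col]
    X[:, col] = -s * col0 + c * X[:, col]

def sqrt_pinv(H, alpha):
    """MMSE pseudoinverse through the prearray recursion of the square root algorithm."""
    N, M = H.shape
    P = np.eye(M, dtype=complex) / np.sqrt(alpha)   # initial P^(1/2)
    B = np.zeros((N, M), dtype=complex)             # becomes Q_alpha after N steps
    for i in range(N):
        X = np.zeros((1 + M + N, 1 + M), dtype=complex)
        X[0, 0] = 1.0
        X[0, 1:] = H[i] @ P          # row i of H weighted by P^(1/2)
        X[1:1 + M, 1:] = P
        X[1 + M + i, 0] = -1.0       # minus the i-th unit vector
        X[1 + M:, 1:] = B
        for col in range(1, M + 1):  # serial ordering (1,2), (1,3), ...
            zero_against_pivot(X, col)
        P = X[1:1 + M, 1:].copy()
        B = X[1 + M:, 1:].copy()
    return P @ B.conj().T            # P^(1/2) Q_alpha^H
```

For a 4 × 4 channel the prearray `X` built here has the 9 × 5 size quoted in the text, and the returned matrix should match $(\mathbf{H}^H\mathbf{H} + \alpha\mathbf{I})^{-1}\mathbf{H}^H$ up to rounding.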
B. Model Simulation
The algorithm is simulated for SNR from −6 to 29 dB in steps of 5 dB. The burst length from each antenna is 120 symbols. The channel assumed is block flat fading. These symbols are transmitted from each antenna and then recovered at the receiver. After demodulation, the BER is calculated. The process is repeated until all 120000 symbols are transmitted and recovered at the receiver. The net BER is the average of the BER of all 1000 (120000/120) bursts. Fig. 3 shows the BER versus SNR performance of MMSE-VBLAST systems for 4 × 4 [...]
In VBLAST, the computational bottlenecks are the pseudoinverse, sorting, and nulling modules. The pseudoinverse has a complexity of $O(M^3)$. The computational complexity is directly related to area and power consumption. Therefore, in this paper, the pseudoinverse module is selected for area efficient and low-power hardware implementation, while the sorting and nulling modules are left for future work.
III. VLSI ARCHITECTURES FOR PSEUDOINVERSE MODULE
This section presents two VLSI architectures for computing the pseudoinverse of the channel matrix. The first uses only multipliers while the second uses both CORDICs and multipliers.
A. Multiplier-Based Pipelined VLSI Architecture
This architecture, as shown in Fig. 4, uses one divider and eight multipliers to compute the pseudoinverse using the square root algorithm. Since this architecture is more suited for division and multiplication, [6, Algorithm 3.4-1] is chosen, which involves only multiplication, division, and square rooting to calculate the coefficients "c" and "s" of Jacobi rotation.
Fig. 4. Multiplier-based pseudoinverse module.
Given $x \in \mathbb{R}^n$ and indices $i$ and $k$ that satisfy $1 \le i < k \le n$, the algorithm that follows computes $c = \cos(\theta)$ and $s = \sin(\theta)$ such that the $k$th component of $J(i,k,\theta)x$ is zero:
if $x_k = 0$ then $c = 1$ and $s = 0$
else if $|x_k| \ge |x_i|$ then
$$t = x_i/x_k, \quad s = 1/\sqrt{1+t^2}, \quad c = st \qquad (6)$$
else
$$t = x_k/x_i, \quad c = 1/\sqrt{1+t^2}, \quad s = ct \qquad (7)$$
In channel estimation or pseudoinverse computation, it is highly unlikely that $|t| = 1$; $t$ will have a magnitude less than 1. This implies that $c$ or $s$ can be approximated using the Taylor series representation and can be implemented using adders, shifters, and multipliers as shown in (8)
$$c \text{ or } s = \frac{1}{\sqrt{1+t^2}} \approx 1 - \frac{1}{2}t^2 + \frac{3}{8}t^4 - \frac{5}{16}t^6 \qquad (8)$$
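As a quick software check of the coefficient computation, the sketch below implements the two branches of the algorithm and a truncated Taylor series for $1/\sqrt{1+t^2}$. It is an illustration under assumptions: the function names are hypothetical, the rotation convention is $y_i = c x_i + s x_k$, $y_k = -s x_i + c x_k$, and the series is cut after the $t^6$ term, which may differ from the number of terms used in the actual hardware.

```python
import math

def givens_coeffs(x_i, x_k):
    """Return (c, s) such that -s*x_i + c*x_k == 0, using one division and one square root."""
    if x_k == 0.0:
        return 1.0, 0.0
    if abs(x_k) >= abs(x_i):
        t = x_i / x_k
        s = 1.0 / math.sqrt(1.0 + t * t)
        c = s * t
    else:
        t = x_k / x_i
        c = 1.0 / math.sqrt(1.0 + t * t)
        s = c * t
    return c, s

def inv_sqrt_taylor(t):
    """Approximate 1/sqrt(1 + t^2) for |t| < 1 with multiplications, shifts, and adds."""
    t2 = t * t        # first multiplier
    t4 = t2 * t2      # second multiplier
    t6 = t4 * t2      # third multiplier
    return 1.0 - t2 / 2.0 + 3.0 * t4 / 8.0 - 5.0 * t6 / 16.0
```

The truncated series is accurate for small $|t|$ and degrades as $|t|$ approaches 1, which is why the bound on $|t|$ discussed above matters for the approximation.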
1) Sequence of Operations: The module shown in Fig. 4 has been designed for a 4 × 4 antenna system. The flow graph for the sequence of operations is shown in Fig. 5. The 16 complex channel elements from the channel estimation module are fed to the pseudoinverse module and stored in two dual-port RAMs, duram1 and duram2, as shown in Fig. 6. Duram1 stores the first and second elements of each row of the channel matrix, while duram2 stores the third and fourth elements. Two RAMs are used in order to facilitate the parallel Jacobi rotation. Fig. 6 shows the storage distribution of channel elements as well [...] in Fig. 5.
Fig. 5. Sequence of operations.
Fig. 6. Prearray matrix in duram1 and duram2.
Step 1) In prearray matrix update, two operations are performed.
a) Compute the weighted channel elements [an example for the first row of the prearray is given in (9)]. This is carried out using both block1 and block2 in Fig. 4. Block1 computes two of the four weighted channel elements, while block2 computes the other two. Each block consists of four real multipliers and an accumulator to carry out one complex multiplication and accumulation. It takes each block nine cycles to compute two weighted channel elements, which are then stored in both durams (Fig. 6).
b) Store [...] in the third column of duram1 and [...] at appropriate locations in both durams.
Step 2) In this step, the four complex weighted channel elements in the first row of both durams are converted to real.
a) Calculate "t" using the divider for each of the four weighted channel elements.
b) Use "t" to compute "c or s" by applying "t" to the multipliers and adders in block1. Store "c or s" in [...] in block1. Compute "s or c" using mult4 and then store it in the second [...] in block1.
c) Use the values of "c" and "s" to rotate the imaginary parts of the corresponding channel elements to zero, using block1 for two of the weighted channel elements and block2 for the other two.
[...] memory and applying it to the divider, the output of which is the corresponding "t" value, which is not only stored but also applied to the multipliers to compute either "c" or "s." For example, if "t" is calculated using (6), then "s" is computed by applying "t" to mult1 to form "$t^2$," which is then applied not only to mult2 to compute "$t^4$" but also to a shift module to compute "$t^2/2$." The value "$t^4$" is applied to mult3 and a shifter to compute the higher-order terms of (8). These are then added in the adder to compute "s." The "s" is then applied to mult4 with the corresponding "t" to compute "c." This is done in pipeline for [...] are rotated to zero by applying both the real and imaginary parts of the weighted channel elements together with their corresponding "c" and "s" values to the two blocks of multipliers in Jacobi rotation.
Step 3) Rotate the weighted elements in the first row of the prearray to zero. (The four weighted elements used in this step carry different values from Step 2, and they are all real.)
a) Calculate "t" for each of the two pairs of weighted elements.
b) Compute "c" and "s" for the corresponding "t."
c) Rotate one element of each pair to zero.
d) Calculate "t," "c," and "s" values for "1" and the first remaining weighted element.
e) Rotate this element to zero.
f) Calculate "t," "c," and "s" for the last remaining weighted element and the modified "1."
g) Rotate this element to zero.
Fig. 7. Scalable architecture for pseudoinverse computation.
For this, within each pair one weighted element is rotated against the other. The calculation of "t," "c," and "s" was explained previously. These "c" and "s" values are then applied to the two sets of four multipliers to rotate the entire column of one element against the entire column of the other. The eight multipliers are needed for the rotation of two complex numbers. This can be explained by assuming $a = a_r + ja_i$ and $b = b_r + jb_i$, then
$$a' = (c\,a_r + s\,b_r) + j(c\,a_i + s\,b_i), \qquad b' = (c\,b_r - s\,a_r) + j(c\,b_i - s\,a_i)$$
From the previous expression, it is clear that eight multipliers are used to compute the real and imaginary values of the two columns that are involved in the rotation. Although the two elements in the first row are real, the elements that follow them in their columns have real and imaginary parts. After the first of these elements has been rotated to zero, the values for the second pair are applied, which rotate the second element to zero.
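The multiplier count can be made explicit with a small sketch. The function below is a hypothetical illustration (its name and return convention are not from the paper): it rotates two complex numbers with real coefficients `c` and `s` and lists the eight real products involved, four per output.

```python
def rotate_pair(a, b, c, s):
    """Rotate complex a and b with real c, s; returns (a', b', number of real multiplications)."""
    a_r, a_i, b_r, b_i = a.real, a.imag, b.real, b.imag
    products = [
        c * a_r, s * b_r,   # real part of a'
        c * a_i, s * b_i,   # imaginary part of a'
        c * b_r, s * a_r,   # real part of b'
        c * b_i, s * a_i,   # imaginary part of b'
    ]
    new_a = complex(products[0] + products[1], products[2] + products[3])
    new_b = complex(products[4] - products[5], products[6] - products[7])
    return new_a, new_b, len(products)
```

With `a = 3`, `b = 4`, `c = 0.6`, and `s = 0.8`, the second output is rotated to zero and the first takes the value 5, up to floating-point rounding.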
In the second part of this step, the remaining weighted element is rotated to zero against the first element in the third column of duram1, which is located at 20 [...] same value, which after the third rotation has been modified from "1" to another constant value. This completes the first iteration. After four such iterations, the prearray holds $\mathbf{P}^{1/2}$ and the unitary matrix $\mathbf{Q}_\alpha$.
This architecture is extended as shown in Fig. 7 for computing the pseudoinverse of matrices for 4 × 4 to 10 × 10 antenna configurations. The sequence of operations is similar to the pseudoinverse computation for the 4 × 4 antenna system. For example, for a 10 × 10 antenna system, the size of the prearray matrix is 21 × 11. Blocks 1 and 3 (in Fig. 7) are used first to compute the coefficients "c" and "s." Block1 computes coefficients for the elements stored in duram1 and duram2, whereas block3 computes coefficients for the elements stored in duram3 and duram4. Computation consists of ten iterations and each iteration consists of three steps, which are: 1) prearray matrix update; 2) rotating the imaginary parts of all the weighted channel elements to zero; and 3) rotating the weighted channel elements to zero.
In the first step, i.e., prearray matrix update, blocks 1, 2, 3, and 4 compute the ten weighted elements. This step takes a maximum of 30 clock cycles. The second step reduces all ten complex weighted elements to their corresponding real counterparts. This step takes 71 cycles. The third step involves zeroing these ten weighted elements. This zeroing is done through rotation using the same Jacobi algorithm that is used in the fixed 4 × 4 architecture. The third step takes 137 cycles.
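The cycle counts above give a simple upper bound on the latency of the 10 × 10 configuration. The sketch below only performs that arithmetic; the function names are hypothetical and the clock frequency is a free parameter, since the text does not state the operating frequency of the scalable design.

```python
def cycles_per_iteration(update=30, to_real=71, zeroing=137):
    """Upper bound on clock cycles for one iteration (the update step is quoted as a maximum)."""
    return update + to_real + zeroing

def total_cycles(iterations=10):
    """Upper bound on clock cycles for the whole pseudoinverse computation."""
    return iterations * cycles_per_iteration()

def latency_us(cycles, clock_mhz):
    """Latency in microseconds for a given (assumed) clock frequency in MHz."""
    return cycles / clock_mhz
```

With the quoted figures, one iteration takes at most 238 cycles and ten iterations at most 2380 cycles; at an assumed 50 MHz clock this would correspond to 47.6 microseconds.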
B. Pipelined Mixed CORDIC and Multiplier-Based VLSI Architecture
A pipelined VLSI architecture based on the CORDIC algorithm has been developed in our earlier work [9]. In this paper, selective clock gating has been extended to the multiply-accumulate (MAC) module of the architecture in [9] to reduce the power consumption further. The MAC unit in [9] is idle during most of the processing time. To avoid wasting power during these idle periods, the clock to the MAC unit is gated. The MAC module performs a useful function for only 7% of the entire computation time of the pseudoinverse calculation. The gating signal is deasserted whenever the MAC unit is required to compute the weighted channel elements necessary for the prearray matrix update. During other times, it is asserted to stop the clock going into the MAC unit. The logic overhead is minimal compared to the size of the MAC unit.
TABLE II. POWER RESULTS AT 100 MHz FOR THE PSEUDOINVERSE MODULES
TABLE III. AREA RESULTS FOR THE PSEUDOINVERSE MODULES
IV. SIMULATION AND SYNTHESIS RESULTS
The modified CORDIC-based pseudoinverse module of Section III-B and the multiplier-based module of Section III-A have been synthesized and mapped to 0.18-μm CMOS technology using Synopsys Design Compiler. The netlist of the module in Section III-B has been simulated at 100 MHz for power comparison, and the results are recorded in Table II. From Table II, it is clear that the CORDIC and the multipliers (in the MAC unit) are the major power consuming units. Using the modified MAC module together with clock gating has reduced power consumption further. The power consumption of the MAC module drops from 20.449 to 3.5218 mW, which is equivalent to an 83% reduction. This produces a 48% reduction in the overall power consumption of the pseudoinverse module compared to the architecture in [8] and 12% compared to the architecture in [9].
The area of the multiplier-based module is 698825 μm² (Table III). The module in [8] was developed and synthesized as well, and its area is 1314071 μm², so the multiplier-based module is about 47% smaller. The area occupied by the pseudoinverse module suggested in [9] is 1016274 μm², so the multiplier-based module is about 31% smaller. (In [9], the area occupied by the RAM was not included because the memory was to be shared with other processing elements. However, for throughput improvement it was decided later to reserve it for the pseudoinverse module.) The area reduction is due to sequencing the operations to utilize the same hardware resources. For example, the eight multipliers are used not only to compute the coefficients ($c$ and $s$) but also to perform the parallel Jacobi rotation. In addition, these multipliers also perform the function of the MAC module used in the architectures in [8], [9]. Area can be reduced further by using only four multipliers instead of eight. However, this will result in an appreciable increase in latency. At present, all three modules take 460 clock cycles to compute the pseudoinverse; with four multipliers the latency would be approximately 676 cycles (about 47% increase).
TABLE IV. POWER COMPARISON AT 50 MHz
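The percentages quoted in this section follow directly from the reported power, area, and cycle figures. The helper names in the sketch below are hypothetical; the numbers are the ones given in the text.

```python
def percent_reduction(reference, new):
    """Reduction of `new` relative to `reference`, in percent."""
    return 100.0 * (reference - new) / reference

def percent_increase(reference, new):
    """Increase of `new` relative to `reference`, in percent."""
    return 100.0 * (new - reference) / reference

# MAC module power with clock gating (mW)
mac_power_saving = percent_reduction(20.449, 3.5218)

# Area of the multiplier-based module against the two other modules (square micrometres)
area_saving_vs_larger = percent_reduction(1314071, 698825)
area_saving_vs_smaller = percent_reduction(1016274, 698825)

# Latency penalty of using four multipliers instead of eight (clock cycles)
latency_penalty = percent_increase(460, 676)
```

Evaluated this way, the MAC power saving is about 82.8%, the area savings are about 46.8% and 31.2%, and the latency penalty is about 47.0%.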
The pseudoinverse of a matrix can also be computed through SVD using systolic arrays [10], [11]. Systolic arrays provide a modular approach to design; however, because of area and power overheads, systolic arrays [...] application specified integrated circuit (ASIC) solution.
The maximum frequency of operation achieved with the multiplier-based module [...] can be improved by using a pipelined divider at the cost of latency in [...] reduced by half and the maximum frequency of operation can be raised to 100 MHz. The latency cost is just four clock cycles per iteration, or a total of 16 clock cycles. However, the introduction of the pipelined divider is not considered further in this paper. The netlist has been simulated at 50 MHz using Design Power from Synopsys for power comparison, and the results are recorded in Table IV. For a fair comparison, the power figures in Table II are scaled down to 50 MHz. It is clear from Tables II and IV that the power consumption of the multiplier-based pseudoinverse module is equivalent to the power consumption of the optimized two CORDIC-based module. The multiplier-based module is quite power efficient compared to the CORDIC-based modules described in [8] and [9].
Since the divider is used for only 6% of the time to calculate "t" values, its operation will not result in any instability as mentioned in [7]. The multiplier-based module is not only area efficient but also power efficient compared to our previous architecture in [9] as well as the architecture in [8].
V. CONCLUSION
The authors have presented an analysis of the VBLAST receiver and identified the major power and area consuming blocks inside the receiver. The VBLAST receiver is first simulated at a high level and its BER is recorded. After identifying the major area and power consuming blocks, [...] architectures are compared with each other as well as with the architecture available [...] merits. The multiplier-based architecture has the merits of low power and low area, while the improved CORDIC-based architecture has the advantages of slightly reduced area and of being able to operate at a higher frequency of 160 MHz. From the analysis, we conclude that the multiplier-based module is best suited for low power and area efficient implementation of the pseudoinverse computation.
REFERENCES
[1] G. J. Foschini, “Layered space-time architecture for wireless communication in fading environments when using multiple antennas,” Bell Labs Tech. J., vol. 2, Autumn, 1996.
[2] A. Adjoudani, E. C. Beck, A. P. Burg, G. M. Djuknic, T. G. Gvoth, D. Haessig, S. Manji, M. A. Milbrodt, M. Rupp, and D. Samardzija, “Prototype experience for MIMO BLAST over third-generation wireless system,” IEEE J. Selected Areas Commun., vol. 21, no. 3, pp. 440–451,
[3] D. Garrett, L. Davis, S. Brink, B. Hochwald, and G. Knagge, “Silicon complexity for maximum likelihood MIMO detection using spherical decoding,” IEEE J. Solid-State Circuits, vol. 39, no. 9, pp. 1544–1552,
[4] P. W. Wolniansky, G. J. Foschini, G. D. Golden, and R. A. Valenzuela, “V-BLAST: An architecture for realizing very high data rates over the rich-scattering wireless channel,” in Proc. ISSSE, 1998, pp. 295–300.
[5] J. Wang and B. Daneshrad, “A comparative study of MIMO detection algorithms for wideband spatial multiplexing systems,” IEEE Wireless Commun. Netw. Conf., vol. 1, no. 2, pp. 408–413, Mar. 2005.
[6] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins Univ. Press, 1996.
[7] B. Hassibi, “An efficient square-root algorithm for BLAST,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2000, pp.
[8] Z. Guo and P. Nilsson, “A VLSI implementation of MIMO detection for future wireless communications,” in Proc. IEEE 14th Personal, Indoor Mobile Radio Commun., 2003, pp. 29–49.
[9] Z. Khan, T. Arslan, J. S. Thompson, and A. T. Erdogan, “Dual strategy based VLSI architecture for computing pseudo inverse of channel matrix in a MIMO wireless system,” in Proc. IEEE Int. Symp. VLSI, 2006,
[10] N. D. Hemkumar and J. R. Cavallaro, “A systolic VLSI architecture for complex SVD,” in Proc. IEEE Int. Symp. Circuits Syst., 1992, pp.
[11] C. M. Rader, “VLSI systolic arrays for adaptive nulling,” IEEE Signal Process. Mag., vol. 13, no. 4, pp. 29–49, Jul. 1996.
Architectures for Dynamic Data Scaling in
2/4/8K Pipeline FFT Cores
Thomas Lenart and Viktor Öwall
Abstract—This paper presents architectures for supporting dynamic data scaling in pipeline fast Fourier transforms (FFTs), suitable when implementing large size FFTs in applications such as digital video broadcasting and digital holographic imaging. In a pipeline FFT, data is continuously streaming and must, hence, be scaled without stalling the dataflow. We propose a hybrid floating-point scheme with tailored exponent datapath, and a co-optimized architecture between hybrid floating point and block floating point (BFP) to reduce memory requirements for 2-D signal processing. The presented co-optimization generates a higher signal-to-quantization-noise ratio and requires less memory than, for instance, convergent BFP. A 2048-point pipeline FFT has been fabricated in a standard-CMOS process from AMI Semiconductor (Lenart and Öwall, 2003), and a field-programmable gate array prototype integrating a 2-D FFT core in a larger design shows that the architecture is suitable for image reconstruction in digital holographic imaging.
Index Terms—Block floating point (BFP), convergent BFP (CBFP), digital holography, digital video broadcasting (DVB), dynamic data scaling, fast Fourier transform (FFT), hybrid floating point, orthogonal frequency-division multiplexing (OFDM).
Manuscript received December 19, 2005; revised March 29, 2006.
The authors are with the Lund Institute of Technology, Department of
Electroscience, SE-221 00 Lund, Sweden (e-mail: Thomas.email@example.com).
Digital Object Identifier 10.1109/TVLSI.2006.886407
I. INTRODUCTION
The fast Fourier transform (FFT) is one of the most commonly used
operations in digital signal processing, and demand is currently
growing for larger and multidimensional transforms. Larger
transforms require more processing on each data sample, which
increases the total quantization noise. This can be avoided by gradually
increasing the wordlength inside the pipeline, but doing so affects memory
requirements as well as the critical path in arithmetic components. For
large size FFTs, dynamic scaling is, therefore, a suitable tradeoff
between arithmetic complexity and memory requirements. The following
architectures have been evaluated and compared with related work.
a) A hybrid floating-point pipeline with fixed-point input and tailored
exponent datapath for 1-D FFT computation.
b) A hybrid floating-point pipeline for 2-D FFT computation, which
also requires the input format to be hybrid floating point. Hence,
the hardware cost is slightly higher than in (a).
c) A co-optimized design based on a hybrid floating-point pipeline
combined with block floating point (BFP) for 2-D FFT computation.
This architecture has the processing abilities of (b) with
hardware requirements comparable to (a).
The primary target application for the implemented FFT core is a
microscope based on digital holography, where visible images are
to be digitally reconstructed from an interference pattern. The pattern
is recorded on a large digital image sensor with a resolution of 2048 × […]
Fourier transformation. Hence, the architectures outlined in (b) and (c)
are suitable for this application. Another area of interest is wireless
communication systems based on orthogonal frequency division
multiplexing (OFDM). The OFDM scheme is used in, for example,
digital video broadcasting (DVB), including DVB-T with 2/8-K FFT
modes and DVB-H with an additional 4-K FFT mode. The architecture
described in (a) is suitable for this field of application.
Section II gives a brief introduction to the FFT and Section III
presents different dynamic data scaling alternatives for pipeline
FFTs, with additional architectural features described in Section IV.
Section V shows software simulation results in terms of precision
and memory requirements. Finally, Section VI presents the VLSI
implementation and measurements on the fabricated application-specific
integrated circuit (ASIC) prototype, and a conclusion is given in
Section VII.
II. FFT ARCHITECTURE
The fast Fourier transform is a decomposition of an N-point discrete
Fourier transform (DFT) into successively smaller DFT transforms.
This paper describes pipeline FFT architectures, constructed from a
number of cascaded Radix-r butterfly blocks and complex multipliers,
each dividing the sequence into r smaller FFTs. Another common
approach is parallel FFT architectures, placing the computational blocks
in parallel instead of cascaded. Simulations and implementations
presented in this paper are all based on the Radix-2² single path delay
feedback (R2²SDF) algorithm. Input data is supplied in linear
sample order, hence requiring the largest delay feedback buffer, of size
N/2, at the input. The R2²SDF architecture
has low memory requirements (N − 1 words), a simple butterfly
architecture, and requires only log4(N) − 1 complex multipliers. The
Radix-2² butterfly is constructed from two Radix-2 butterflies separated
by a trivial multiplication, as shown in Fig. 1.
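As an illustration of the decomposition described above, the following Python sketch computes an FFT in software by applying, at each recursion level, two radix-2 butterflies with a trivial multiplication by −j between them, followed by the nontrivial twiddle factors and N/4-point sub-transforms. It is not the single path delay feedback hardware pipeline discussed in this paper. The function names dft and fft_radix22 are hypothetical, and the recursive formulation, the list-based data layout, and the double-precision arithmetic are assumptions made for clarity.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used only as a reference for checking results."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def fft_radix22(x):
    """Recursive decimation-in-frequency FFT built from two radix-2
    butterflies per level (illustrative sketch, natural-order output)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    if n == 2:  # leftover radix-2 stage when N is not a power of 4
        return [x[0] + x[1], x[0] - x[1]]
    h, q = n // 2, n // 4
    # First radix-2 butterfly: combine samples that are N/2 apart.
    a = [x[i] + x[i + h] for i in range(h)]
    b = [x[i] - x[i + h] for i in range(h)]
    # Trivial multiplication: rotate the upper half of the difference
    # branch by -j (a swap of real/imaginary parts and a sign change).
    b = b[:q] + [-1j * v for v in b[q:]]
    # Second radix-2 butterfly: combine samples that are N/4 apart.
    c = [
        [a[i] + a[i + q] for i in range(q)],  # feeds outputs X[4k]
        [b[i] + b[i + q] for i in range(q)],  # feeds outputs X[4k + 1]
        [a[i] - a[i + q] for i in range(q)],  # feeds outputs X[4k + 2]
        [b[i] - b[i + q] for i in range(q)],  # feeds outputs X[4k + 3]
    ]
    out = [0j] * n
    for r in range(4):
        # Nontrivial twiddle factors W_N^(r*i), then an N/4-point transform.
        tw = [c[r][i] * cmath.exp(-2j * cmath.pi * r * i / n)
              for i in range(q)]
        out[r::4] = fft_radix22(tw)
    return out
```

For a power-of-two input length, fft_radix22(x) should agree with dft(x) to within floating-point rounding error. Lengths that are not a power of 4 (for example 2048) end in a single radix-2 stage in this sketch.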
III. DYNAMIC DATA SCALING
Fixed-point is a widely used format in real-time and low-power
applications due to the simple implementation of arithmetic units.
In fixed-point arithmetic, a result from a multiplication is usually
rounded or truncated to avoid a significantly increased wordlength,
hence generating a quantization error. The quantization energy caused
by rounding is relatively constant due to the fixed location of the binary
1063-8210/$20.00 © 2006 IEEE