Algorithm and VLSI architecture for linear MMSE detection in MIMO-OFDM systems.
ABSTRACT The paper describes an algorithm and a corresponding VLSI architecture for the implementation of linear MMSE detection in packet-based MIMO-OFDM communication systems. The advantages of the presented receiver architecture are low latency, high throughput, and efficient resource utilization, since the hardware required for the computation of the MMSE estimators is reused for the detection. The algorithm also supports the extraction of soft information for channel decoding.

Algorithm and VLSI Architecture for Linear
MMSE Detection in MIMO-OFDM Systems
A. Burg, S. Haene, D. Perels, P. Luethi, N. Felber and W. Fichtner
Integrated Systems Laboratory, ETH Zurich, Switzerland
{ apburg,haene,perels,luethi,felber,fw } @iis.ee.ethz.ch
Abstract: The paper describes an algorithm and a corresponding VLSI architecture for the implementation of linear MMSE detection in packet-based MIMO-OFDM communication systems. The advantages of the presented receiver architecture are low latency, high throughput, and efficient resource utilization, since the hardware required for the computation of the MMSE estimators is reused for the detection. The algorithm also supports the extraction of soft information for channel decoding.
Fig. 1. Timing diagram of the MIMO detection process in packet-based MIMO-OFDM systems. (Diagram labels: data frame, idle, MIMO detection, detection latency.)
I. INTRODUCTION
Multiple-input multiple-output (MIMO) wireless communication systems [1] employ multiple antennas at the transmitter and at the receiver to increase system capacity and to achieve better quality of service. In spatial multiplexing mode, MIMO systems reach higher peak data rates without increasing the bandwidth of the system by transmitting multiple data streams in parallel in the same frequency band. Orthogonal frequency division multiplexing (OFDM) is a modulation scheme that is robust against interference arising from multipath propagation. Consequently, many upcoming standards for high-throughput wireless communication such as IEEE 802.11n and IEEE 802.16 rely on a combination of MIMO with OFDM. Unfortunately, the performance improvements of MIMO technology also entail a considerable increase in signal processing complexity, in particular for the separation of the parallel data streams. Hence, a major challenge associated with the implementation of future wireless communication systems lies in the design of low-complexity MIMO detection algorithms and corresponding VLSI architectures.
In this work, we consider the VLSI implementation of linear MMSE detection for wideband MIMO-OFDM systems. A suboptimal linear detection scheme is considered since implementations of algorithms with better performance (e.g., [2], [3], [4]) either do not meet the high throughput requirements for MIMO-WLAN (especially not on FPGAs) or lack the ability to provide soft information for channel decoding with low hardware complexity.
0-7803-9390-2/06/$20.00 ©2006 IEEE, ISCAS 2006, p. 4102
Authorized licensed use limited to: Texas A M University. Downloaded on March 24, 2009 at 03:08 from IEEE Xplore. Restrictions apply.
A. System Model and Requirements
The system under consideration is a packet-based MIMO-OFDM system with $M_T$ transmit and $M_R$ receive antennas. The $M_T$-dimensional vector $s[k,t]$ denotes the signal transmitted at time index $t$ on the $k$th tone of the OFDM signal. After proper OFDM modulation at the transmitter and demodulation at the receiver, the corresponding received vector $y[k,t]$ is given by
$$y[k,t] = H[k]\,s[k,t] + n[k,t], \qquad (1)$$
where the $M_R \times M_T$-dimensional matrix $H[k]$ describes the effective MIMO channel for the $k$th tone and the vector $n[k,t]$ models the thermal noise in the system as i.i.d. proper complex Gaussian with variance $\sigma^2$ per complex dimension. Assuming knowledge of the channel matrices, the linear MMSE estimator for each tone is given by
$$G[k] = \left(H^H[k]\,H[k] + M_T\sigma^2 I\right)^{-1} H^H[k] \qquad (2)$$
and linear MIMO detection corresponds to a straightforward matrix-vector multiplication according to
$$\hat{s}[k,t] = G[k]\,y[k,t] \qquad (3)$$
followed by quantization of the entries of $\hat{s}[k,t]$ to the nearest constellation point.
The difficulty in the implementation of linear receivers for packet-based MIMO-OFDM systems arises from the frame structure because the initial training phase, during which the receiver obtains knowledge of $H[k]$, is immediately followed by data. Since the detection of the data according to (3) can only start once the MMSE estimators for all $K$ data-carrying tones have been computed, the delay incurred by the preprocessing according to (2) translates directly into detection latency as illustrated in Fig. 1. In MIMO-OFDM receiver implementations [5], this latency is responsible for considerable memory requirements to buffer the received vectors and can cause problems [...] of packet-based MIMO-OFDM receivers. However, it is also noted that the corresponding operation is only performed once at the start of the frame so that, without special provisions, the potentially costly hardware for the preprocessing will be idle most of the time.
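As a concrete illustration of (2) and (3), the following Python sketch computes the MMSE estimator for one tone with a generic Gauss-Jordan inversion and then detects a noise-free QPSK vector. The function names, the 2 x 2 channel values, the noise variance and the unnormalized QPSK alphabet are illustrative assumptions and are not taken from the paper. The inversion routine is a plain software reference, not the hardware algorithm of Section II.

```python
def hermitian(a):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[a[r][c].conjugate() for r in range(len(a))]
            for c in range(len(a[0]))]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse(a):
    """Gauss-Jordan inversion with partial pivoting (software reference)."""
    n = len(a)
    aug = [[complex(v) for v in row] + [complex(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mmse_estimator(h, sigma2):
    """G = (H^H H + M_T sigma^2 I)^-1 H^H, following (2)."""
    mt = len(h[0])
    hh = hermitian(h)
    gram = matmul(hh, h)
    for i in range(mt):
        gram[i][i] += mt * sigma2
    return matmul(inverse(gram), hh)


def detect(g, y, constellation):
    """s_hat = G y as in (3), then slice to the nearest constellation point."""
    s_hat = [sum(gij * yj for gij, yj in zip(row, y)) for row in g]
    return [min(constellation, key=lambda p: abs(p - est)) for est in s_hat]


# Assumed example data: QPSK symbols and an arbitrary 2 x 2 channel.
QPSK = [complex(a, b) for a in (-1, 1) for b in (-1, 1)]
H = [[1 + 0.5j, 0.2 - 0.1j],
     [0.3 + 0.2j, 0.9 - 0.4j]]
s = [1 + 1j, -1 + 1j]
y = [sum(hij * sj for hij, sj in zip(row, s)) for row in H]  # noise-free
G = mmse_estimator(H, sigma2=0.01)
detected = detect(G, y, QPSK)
```

In a packet-based receiver, `mmse_estimator` would have to run once per tone after training, while `detect` runs for every received vector, which is the asymmetry discussed above.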
Contribution: In this paper, an algorithm for efficient tone-by-tone linear preprocessing of channel state information in MIMO-OFDM systems is presented, together with a hardware-efficient VLSI architecture for its realization. The described receiver constitutes the basis for the soft-output demapper described in [6], which yields a 5-6 dB gain in terms of signal-to-noise ratio (SNR) over a hard-decision MMSE decoder. The reported ASIC and FPGA area and performance figures provide a reference for the true silicon complexity of linear MMSE receivers for MIMO-OFDM systems.
Outline: The next section introduces the algorithm for the computation of the linear MMSE detectors. Section III describes a scalable VLSI architecture for the proposed algorithm. Area and performance figures for ASIC and FPGA implementations are provided in Section IV. Section V concludes the paper.
II. PREPROCESSING ALGORITHM
Algorithm choices for the implementation of (2) are either based on QR decomposition [7] using unitary transformations or on direct matrix inversion algorithms with conventional arithmetic. The main advantages of the QR approach lie in its favorable numerical properties in fixed-point implementations and in the availability of a wide range of regular array architectures [8], [9] for its implementation. The main arguments for direct matrix inversion are the lower number of operations compared to QR decomposition and the fact that the matrix $(H^H[k]\,H[k] + M_T\sigma^2 I)^{-1}$ is produced as an intermediate result. In fact, the diagonal entries of this matrix are required for the computation of soft outputs [10], [6].
The implementation that is described in this paper relies on direct matrix inversion. The corresponding algorithm is borrowed from the updating procedure of the Kalman gain in Kalman filtering applications. The basic idea is to start from the trivial inverse of $M_T\sigma^2 I$ and to obtain $(H^H H + M_T\sigma^2 I)^{-1}$ through a series of $M_R$ rank-one updates by using the matrix inversion lemma. The iteration is initialized by setting
$$P^{(0)} = \frac{1}{M_T\sigma^2}\, I \qquad (4)$$
and proceeds by computing
$$P^{(j)} = P^{(j-1)} - \frac{P^{(j-1)} H_j^H H_j P^{(j-1)}}{1 + H_j P^{(j-1)} H_j^H}, \qquad (5)$$
where $H_j$ denotes the $j$th row of $H$. After $M_R$ iterations, $P^{(M_R)} = (H^H H + M_T\sigma^2 I)^{-1}$, where the index of the OFDM tone has been omitted for brevity. The complexity of the above described algorithm in terms of the number of multiplications² and divisions is given by
$$C_{Mult} = \tfrac{5}{2} M_R M_T^2 + \tfrac{5}{2} M_R M_T - M_T^2 + M_T, \qquad C_{Div} = M_R. \qquad (6)$$
² In terms of complex-valued multiplications. The few real-valued multiplications are counted as complex-valued, assuming a dedicated VLSI architecture with multipliers optimized for complex-valued coefficients.
In order to map recursion (5) to hardware, its compact mathematical description is expanded as shown in Alg. 1. The operation sequence is designed to reduce the dynamic range of intermediate results and to minimize the number of costly divisions, while keeping the number of multiplications low.

Algorithm 1 Algorithm for computing the MMSE estimator
1: $P^{(0)} = \frac{1}{M_T\sigma^2} I$
2: for $j = 1 \ldots M_R$ do
3:   $g = P^{(j-1)} H_j^H$
4:   $S = 1 + H_j g$   (note that $S$ is strictly positive)
5:   $S_e = \lfloor \log_2 S \rfloor$, $S_m = 2^{S_e}/S$
6:   $\tilde{g} = S_m\, g$
7:   $P^{(j)} = P^{(j-1)} - \tilde{g}\, g^H\, 2^{-S_e}$
8: end for
9: $G = P^{(M_R)} H^H$

III. VLSI ARCHITECTURE
The choice of a suitable hardware architecture for the implementation of Alg. 1 depends on the system specifications and on the available area: The most area-efficient solution is a fully decomposed, processor-like architecture. However, such a minimum-area solution cannot meet the low-latency requirements of MIMO-OFDM systems. A highly parallel architecture achieves higher throughput but suffers significantly from the fact that data dependencies and the desire for a regular data flow mandate a sequential execution of the individual steps in Alg. 1. Since these steps differ significantly in the number of required operations, a massively parallel architecture would result in a poor utilization of processing resources. In a moderately parallel VLSI architecture the number of processing resources is chosen so that their average utilization is high. Most of the steps in Alg. 1 require either $M_T$ or a multiple of $M_T$ multiplications. Hence, choosing an $M_T$-fold degree of parallelism leads to a high hardware utilization.
A. Moderately Parallel Architecture
The high-level block diagram of the proposed moderately parallel architecture is shown in Fig. 2. The circuit employs $M_T$ identical processing elements (PEs) arranged in a circular array and a common 1/S block that computes the additions in step 4) and the pseudo floating-point division in step 5). The connections in the array are local, meaning that only neighboring PEs are connected with each other. Each PE mainly contains a complex-valued multiplier, an adder and some local storage registers as shown in Fig. 3. All intermediate variables are stored locally, equally distributed over the PEs.
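Before looking at the hardware mapping in more detail, the arithmetic of Alg. 1 can be cross-checked in software. The Python sketch below follows the step numbering of Alg. 1 in plain floating point; the function name, the 3 x 2 example channel and the noise variance are assumptions made for illustration, and the use of the floor function for the exponent in step 5) is one possible reading of the pseudo floating-point division. It then verifies numerically that the resulting P is the inverse of H^H H + M_T sigma^2 I.

```python
import math


def mmse_estimator_alg1(h, sigma2):
    """Return (G, P) computed with the rank-one updates of Alg. 1.

    h is an M_R x M_T channel matrix given as a list of rows of complex
    numbers. Floating point is used here; the hardware datapath described
    in the paper is fixed point.
    """
    mr, mt = len(h), len(h[0])
    # Step 1: P(0) = I / (M_T * sigma^2)
    p = [[complex(1.0 / (mt * sigma2)) if r == c else 0j for c in range(mt)]
         for r in range(mt)]
    for j in range(mr):                                   # Step 2
        hj = h[j]
        # Step 3: g = P * H_j^H
        g = [sum(p[r][c] * hj[c].conjugate() for c in range(mt))
             for r in range(mt)]
        # Step 4: S = 1 + H_j * g (real and strictly positive)
        s = 1.0 + sum(hj[c] * g[c] for c in range(mt)).real
        # Step 5: pseudo floating-point reciprocal, 1/S = S_m * 2^(-S_e)
        s_e = math.floor(math.log2(s))
        s_m = 2.0 ** s_e / s
        # Step 6: g~ = S_m * g
        g_t = [s_m * gi for gi in g]
        # Step 7: P = P - g~ g^H 2^(-S_e)
        scale = 2.0 ** (-s_e)
        p = [[p[r][c] - g_t[r] * g[c].conjugate() * scale
              for c in range(mt)] for r in range(mt)]
    # Step 9: G = P * H^H (G has M_T rows and M_R columns)
    g_mat = [[sum(p[r][k] * h[c][k].conjugate() for k in range(mt))
              for c in range(mr)] for r in range(mt)]
    return g_mat, p


# Assumed example: M_R = 3 receive and M_T = 2 transmit antennas.
H = [[1 + 0.5j, 0.2 - 0.1j],
     [0.3 + 0.2j, 0.9 - 0.4j],
     [-0.4 + 0.1j, 0.6 + 0.7j]]
SIGMA2 = 0.1
G, P = mmse_estimator_alg1(H, SIGMA2)

# Check: P * (H^H H + M_T sigma^2 I) should be the identity matrix.
MT = len(H[0])
A = [[sum(H[k][r].conjugate() * H[k][c] for k in range(len(H)))
      + (MT * SIGMA2 if r == c else 0.0) for c in range(MT)]
     for r in range(MT)]
residual = max(abs(sum(P[r][k] * A[k][c] for k in range(MT)) - (r == c))
               for r in range(MT) for c in range(MT))
```

Splitting 1/S into a mantissa `s_m` and a power of two keeps the multiplier operand in a bounded range, which is the dynamic-range argument made for the operation sequence of Alg. 1.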
For the Hermitian matrix $P$, only the main diagonal and the lower triangular part are stored.
Fig. 2. High-level block diagram of a moderately parallel architecture for the direct matrix inversion. The same hardware is reused for the linear detection.
Fig. 3. Schematic of a single PE. The main components are the complex multiplier [...].
Fig. 4. Procedure to compute step 3) in Alg. 1 for $M_T = 4$.
B. MMSE Estimator Computation
The computation of the MMSE estimators starts with the loop between step 2) and 8) in Alg. 1. During the $j$th iteration of this loop, the entries of the $j$th row of $H$ must be presented to the inputs of the PEs as shown in Fig. 2. The computation of the matrix-vector multiplication in step 3) is illustrated in Fig. 4. In the first cycle, the first PE uses the upper ring to broadcast $H_{j,1}$ to all other PEs, which multiply $H_{j,1}$ with their respective entry of the first column of $P^{(j-1)}$ and store the result in the $g$ register of the neighboring PE. In the second cycle, the second PE broadcasts $H_{j,2}$ to the other PEs, which multiply it with the respective entries of the second column of $P^{(j-1)}$, add the content of their $g$ register and forward the result again into the $g$ register of the neighboring PE. This procedure is repeated for $M_T$ cycles with the exception of the first iteration ($j = 1$), in which a single cycle suffices to compute $g$, since $P^{(0)}$ is a diagonal matrix. The multiplications in step 4) can be carried out on all PEs in parallel in a single cycle. The summation of the results that yields $S$ is absorbed into the 1/S block, which performs the addition and the division in step 5) in $C_{Div}$ cycles. The computation of $\tilde{g}$ from $g$ in step 6) is again trivial and can be performed in a single cycle on the individual PEs. Step 7) is carried out in $M_T$ cycles, in each of which one entry of $g$ is broadcasted to all PEs through the upper ring, while the other operand circulates through the lower ring.
Before step 9) is executed, the diagonal entries of $P$ can be extracted for the computation of soft bit-metrics, as described in [6]. The matrix multiplication of $P^{(M_R)}$ with $H^H$ is computed in a series of $M_R$ matrix-vector multiplications, each of which is identical to step 3). The entries of $H$ are again applied to the inputs of the PEs row by row so that $G$ is output column-wise, as shown in Fig. 2. The $j$th row of $H$ can be replaced with the $j$th column of $G$, so that no extra memory is required to store the MMSE estimators. This is particularly important when the number of tones is large. The overall number of cycles that is required to compute Alg. 1 with the presented architecture is given by
$$t_{cpd} = M_R(3M_T + 2) - M_T + 1 + M_R C_{Div}. \qquad (7)$$
C. Detection
A significant advantage of the described conventional-arithmetic-based computation of the MMSE estimator is that once preprocessing is complete, the same hardware can be reused for the detection according to (3). To this end, $G$ is read back from the memory one column at a time. The entries of the $j$th column are presented to the PEs together with the $j$th entry of the received vector $y$, which is broadcasted to all PEs. The results of the multiplications of $G_{i,j}$ with $y_j$ in the $i$th PE are accumulated in a register of the same PE, and $\hat{s}$ is available after $M_R$ cycles.
D. Pipelining
Despite the recursive nature of the applied algorithm, pipelining can be introduced to allow for higher clock frequencies. An additional register is added to the original architecture as shown in Fig. 3. The actual increase in clock speed depends on the quality of the placement of this pipeline register in the logic. Implementation results show that a factor of almost 1.7 can easily be reached with manual retiming. Unfortunately, pipelining of the recursive matrix inversion algorithm also mandates the insertion of additional cycles to flush the pipeline after the operations associated with steps 3), 4), 6), 7) and 9) of Alg. 1. As a result, the number of cycles increases to
$$t_{cpd} = M_R(3M_T + 6) - M_T + 2 + M_R C_{Div} \qquad (8)$$
and the number of cycles for the division must also be increased to match the higher clock rate.
IV. IMPLEMENTATION RESULTS
ASIC Implementation. A critical design parameter is the wordlength of the complex-valued datapath. The corresponding tradeoffs between silicon area, bit error rate (BER) performance, and the computation time for a single MMSE estimator are illustrated in Fig. 5 for a system with $M_T = M_R = 4$. The VLSI implementation results are based on a 0.25 µm technology and for the BER simulations the entries of $H[k]$ are assumed i.i.d. Rayleigh fading with variance one, so that the received SNR is given by $1/\sigma^2$. For the computation of the MMSE estimator, $H[k]$ is represented in a block floating-point format and WW denotes the wordwidth of the real-valued multipliers which constitute the complex-valued multipliers in the PEs. The clock rates of the unpipelined designs are between 93 MHz and 101 MHz, depending on the wordlength. The pipelined implementations achieve between 167 MHz and 176 MHz. For the computation of the MMSE estimators, the gain from the higher clock frequency remains small due to the increase in the number of cycles. However, a significant performance improvement is achieved from pipelining when the circuit operates in detection mode, because during this operation no pipeline bubbles need to be inserted. Without pipelining, 23-25 million (received) vectors per second (Mvps) can be processed, while with pipelining throughput increases to 42-44 Mvps.
FPGA Implementation. For the implementation of the design (for $M_T = M_R = 4$) on a XILINX XC2V6000-6 FPGA, WW = 18 was chosen as the device contains hardwired multipliers of that size. The pipelined version operates with a clock rate of 40 MHz and requires 2.2 µs to compute the MMSE estimator of one tone. Hence, for example, the detection latency in a system with $K = 64$ tones adds up to 141 µs. The throughput in detection mode is 10 Mvps. In terms of area, the design consumes 16 out of 144 multipliers and 3'416 logic slices out of a total of 33'792.
Fig. 5. Fixed-point BER simulation and VLSI implementation results for MMSE detection in a system with $M_T = M_R = 4$ and with 16-QAM modulation. (Axes: BER versus SNR; curves for WW = 18, 19, 20, 21 and floating point. Legible inset table entries: non-pipelined designs, area 69k, 75k, 85k and Time/Inv. 0.68 µs, 0.72 µs, 0.74 µs; pipelined designs with $C_{Div} = 8$, area 78k GE, 82k GE, 89k GE and Time/Inv. 0.58 µs, 0.6 µs, 0.6 µs, 0.61 µs.)
V. CONCLUSIONS
In packet-based MIMO-OFDM systems even allegedly low-complexity linear detectors pose a considerable implementation challenge. The presented algorithm and the scalable VLSI architecture for the computation of the MMSE estimators partially solve this problem for MIMO-OFDM systems with a small number of tones ($K < 64$). A first important advantage of the presented approach is that it reduces silicon area by reusing the same hardware for the preprocessing and for the detection. The second advantage is the ability to easily extract soft bit-metrics for a subsequent channel decoder [6]. The main drawbacks are the considerable numerical requirements. Moreover, it is noted that for systems with a large number of tones, preprocessing latency is still too high. A possible solution to this problem has recently been proposed in [11].
ACKNOWLEDGEMENT
This work is supported by the STREP project No. IST-026905 (MASCOT) within the sixth framework programme of the European Commission.
REFERENCES
[1] G. Foschini and M. Gans, "On limits of wireless communications in a fading environment when using multiple antennas," Wireless Personal Communications, vol. 6, no. 3, pp. 311-334, 1998.
[2] Z. Guo and P. Nilsson, "A 53.3 Mb/s 4 x 4 16-QAM MIMO decoder in 0.35 µm CMOS," in Proc. IEEE ISCAS, May 2005, pp. 4947-4950.
[3] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bölcskei, "VLSI implementation of MIMO detection using the sphere decoder algorithm," IEEE Journal of Solid-State Circuits, 2005.
[4] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, "K-Best MIMO detection VLSI architectures achieving up to 424 Mbps," in Proc. IEEE Int. Symp. on Circuits and Systems, May 2006.
[5] D. Perels, S. Haene, P. Luethi, A. Burg, N. Felber, W. Fichtner, and H. Bölcskei, "ASIC implementation of a MIMO-OFDM transceiver for 192 Mbps WLANs," in Proc. IEEE ESSCIRC, Sept. 2005, pp. 215-218.
[6] S. Haene, A. Burg, D. Perels, P. Luethi, N. Felber, and W. Fichtner, "Silicon implementation of an MMSE-based soft demapper for MIMO-BICM," in Proc. IEEE Int. Symp. on Circuits and Systems, May 2006.
[7] Z. Khan, T. Arslan, J. S. Thompson, and A. T. Erdogan, "Area & power efficient VLSI architecture for computing pseudo inverse of channel matrix in a MIMO wireless system," in Proc. IEEE Int. Conf. on VLSI Design (VLSID), Jan. 2006, pp. 734-737.
[8] G. Lightbody, R. Woods, and R. Walke, "Design of a parameterized silicon intellectual property core for QR-based RLS filtering," IEEE Trans. on VLSI Systems, vol. 11, pp. 659-678, 2003.
[9] F. Edman and V. Owall, "A scalable pipelined complex valued matrix inversion architecture," in Proc. IEEE ISCAS, 2005, pp. 4489-4492.
[10] I. B. Collings, M. R. G. Butler, and M. McKay, "Low complexity receiver design for MIMO bit-interleaved coded modulation," in Proc. 8th IEEE Int. Symposium on Spread Spectrum Techniques and Applications, 2004, pp. 12-16.
[11] D. Cescato, M. Borgmann, H. Bölcskei, J. C. Hansen, and A. Burg, "Interpolation-based QR decomposition in MIMO-OFDM systems," in Proc. IEEE Workshop on Signal Processing Advances in Wireless Communications (SPAWC), June 2005, pp. 945-949.