
Algorithm and VLSI Architecture for Linear MMSE Detection in MIMO-OFDM Systems

A. Burg, S. Haene, D. Perels, P. Luethi, N. Felber and W. Fichtner

Integrated Systems Laboratory, ETH Zurich, Switzerland

{apburg,haene,perels,luethi,felber,fw}@iis.ee.ethz.ch


Abstract- The paper describes an algorithm and a correspond-

ing VLSI architecture for the implementation of linear MMSE

detection in packet-based MIMO-OFDM communication sys-

tems. The advantages of the presented receiver architecture are

low latency, high throughput, and efficient resource utilization,

since the hardware required for the computation of the MMSE

estimators is reused for the detection. The algorithm also supports

the extraction of soft information for channel decoding.

Fig. 1. Timing diagram of MIMO detection process in packet-based MIMO-OFDM systems.

I. INTRODUCTION

Multiple-input multiple-output (MIMO) wireless communi-

cation systems [1] employ multiple antennas at the transmitter

and at the receiver to increase system capacity and to achieve

better quality of service. In spatial multiplexing mode, MIMO

systems reach higher peak data rates without increasing the

bandwidth of the system by transmitting multiple data streams

in parallel in the same frequency band. Orthogonal frequency

division multiplexing (OFDM) is a modulation scheme that is

robust against interference arising from multipath propagation.

Consequently, many upcoming standards for high throughput

wireless communication such as IEEE 802.11n and IEEE

802.16 rely on a combination of MIMO with OFDM. Unfor-

tunately, the performance improvements of MIMO technol-

ogy also entail a considerable increase in signal processing

complexity, in particular for the separation of the parallel

data streams. Hence, a major challenge associated with the

implementation of future wireless communication systems is

in the design of low-complexity MIMO detection algorithms

and corresponding VLSI architectures.

In this work, we consider the VLSI implementation of

linear MMSE detection for wideband MIMO-OFDM systems.

A suboptimal linear detection scheme is considered because implementations of algorithms with better performance (e.g., [2], [3], [4]) either do not meet the high throughput requirements of MIMO-WLAN (especially not on FPGAs) or cannot provide soft information for channel decoding at low hardware complexity.

0-7803-9390-2/06/$20.00 ©2006 IEEE. ISCAS 2006.

Authorized licensed use limited to: Texas A M University. Downloaded on March 24, 2009 at 03:08 from IEEE Xplore. Restrictions apply.

A. System Model and Requirements

The system under consideration is a packet-based MIMO-OFDM system with $M_T$ transmit and $M_R$ receive antennas. The transmitted symbol vector is denoted by $\mathbf{s}[k,t]$, with time index $t$ on the $k$th tone of the OFDM signal. After proper OFDM modulation at the transmitter and demodulation at the receiver, the corresponding received vector $\mathbf{y}[k,t]$ is given by

$$\mathbf{y}[k,t] = \mathbf{H}[k]\,\mathbf{s}[k,t] + \mathbf{n}[k,t], \qquad (1)$$

where the $M_R \times M_T$-dimensional matrix $\mathbf{H}[k]$ describes the effective MIMO channel for the $k$th tone and the vector $\mathbf{n}[k,t]$ models the thermal noise in the system as i.i.d. proper complex Gaussian with variance $\sigma^2$ per complex dimension. Assuming knowledge of the channel matrices, the linear MMSE estimator for each tone is given by

$$\mathbf{G}[k] = \left(\mathbf{H}^H[k]\mathbf{H}[k] + M_T\sigma^2\mathbf{I}\right)^{-1}\mathbf{H}^H[k] \qquad (2)$$

and linear MIMO detection corresponds to a straightforward matrix-vector multiplication according to

$$\hat{\mathbf{s}}[k,t] = \mathbf{G}[k]\,\mathbf{y}[k,t] \qquad (3)$$

followed by quantization of the entries of $\hat{\mathbf{s}}[k,t]$ to the nearest constellation point.

The difficulty in the implementation of linear receivers for packet-based MIMO-OFDM systems arises from the frame structure because the initial training phase, during which the receiver obtains knowledge of $\mathbf{H}[k]$, is immediately followed by data. Since the detection of the data according to (3) only starts when the MMSE estimators for all $K$ data carrying tones have been computed, the delay incurred by the preprocessing according to (2) translates directly into detection latency as illustrated in Fig. 1. In MIMO-OFDM receiver implementations [5], this latency is responsible for considerable memory requirements to buffer the received vectors and can cause problems [...] of packet-based MIMO-OFDM receivers. However, it is also noted that the corresponding operation is only performed once at the start of the frame so that, without special provisions, the potentially costly hardware for the preprocessing will be idle most of the time.
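As a point of reference for (2) and (3), the following minimal sketch evaluates the linear MMSE estimator and the subsequent quantization for a single tone with $M_T = M_R = 2$. It is an illustration only and not the algorithm proposed in this paper: the channel matrix, noise samples and noise variance are arbitrary example values, QPSK is assumed as the constellation, the closed-form 2x2 inverse replaces any hardware-oriented inversion, and all function names are hypothetical.

```python
"""Linear MMSE detection per (1)-(3) for one tone, 2x2 example (illustrative)."""


def hermitian(a):
    """Conjugate transpose of a matrix stored as a list of rows."""
    return [[a[r][c].conjugate() for r in range(len(a))]
            for c in range(len(a[0]))]


def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inv2(m):
    """Closed-form inverse of a 2x2 matrix (assumption: M_T = 2)."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def mmse_estimator(h, sigma2):
    """G = (H^H H + M_T sigma^2 I)^-1 H^H, eq. (2)."""
    m_t = len(h[0])
    hh = hermitian(h)
    a = matmul(hh, h)
    for i in range(m_t):
        a[i][i] += m_t * sigma2
    return matmul(inv2(a), hh)


def slice_qpsk(x):
    """Quantize to the nearest QPSK point (+-1 +-1j); assumed constellation."""
    return complex(1 if x.real >= 0 else -1, 1 if x.imag >= 0 else -1)


def detect(g, y):
    """Eq. (3): s_hat = G y, followed by entry-wise quantization."""
    s_hat = [sum(g[i][j] * y[j] for j in range(len(y))) for i in range(len(g))]
    return [slice_qpsk(x) for x in s_hat]


# Example values (arbitrary, for illustration only).
H = [[1 + 0j, 0.5j], [0.2 + 0j, 1 + 0j]]
s = [1 + 1j, -1 + 1j]
n = [0.05 - 0.02j, -0.03 + 0.04j]
y = [sum(H[i][j] * s[j] for j in range(2)) + n[i] for i in range(2)]  # eq. (1)
G = mmse_estimator(H, sigma2=0.01)
detected = detect(G, y)
```

In a packet-based receiver, `mmse_estimator` would run once per tone after training, while `detect` would run once per received vector, which is the asymmetry discussed above.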

Contribution: In this paper an algorithm for efficient tone-

by-tone linear preprocessing of channel state information in

MIMO-OFDM systems is presented, together with a hardware-

efficient VLSI architecture for its realization. The described

receiver constitutes the basis for the soft-output demapper

described in [6], which yields a 5-6 dB gain in terms of signal-to-noise ratio (SNR) over a hard-decision MMSE decoder. The reported ASIC and FPGA area and performance figures provide a reference for the true silicon complexity of linear

MMSE receivers for MIMO-OFDM systems.

Outline: The next section introduces the algorithm for

the computation of the linear MMSE detectors. Section III

describes a scalable VLSI architecture for the proposed al-

gorithm. Area and performance figures for ASIC and FPGA

implementations are provided in Section IV. Section V con-

cludes the paper.

II. PREPROCESSING ALGORITHM

Algorithm choices for the implementation of (2) are either based on QR decomposition [7] using unitary transformations or on direct matrix inversion algorithms with conventional arithmetic. The main advantages of the QR approach lie in its favorable numerical properties in fixed-point implementations and in the availability of a wide range of regular array architectures [8], [9] for its implementation. The main arguments for direct matrix inversion are the lower number of operations compared to QR decomposition and the fact that the matrix $\left(\mathbf{H}^H[k]\mathbf{H}[k] + M_T\sigma^2\mathbf{I}\right)^{-1}$ is produced as an intermediate result. In fact, the diagonal entries of this matrix are required for the computation of soft outputs [10], [6].

The implementation that is described in this paper relies on direct matrix inversion. The corresponding algorithm is borrowed from the updating procedure of the Kalman gain in Kalman filtering applications. The basic idea is to start from the trivial inverse of $M_T\sigma^2\mathbf{I}$ and to obtain $\left(\mathbf{H}^H\mathbf{H} + M_T\sigma^2\mathbf{I}\right)^{-1}$ through a series of $M_R$ rank-one updates by using the matrix inversion lemma. The iteration is initialized by setting

$$\mathbf{P}^{(0)} = \frac{1}{M_T\sigma^2}\,\mathbf{I} \qquad (4)$$

and proceeds by computing

$$\mathbf{P}^{(j)} = \mathbf{P}^{(j-1)} - \frac{\mathbf{P}^{(j-1)}\mathbf{H}_j^H\mathbf{H}_j\mathbf{P}^{(j-1)}}{1 + \mathbf{H}_j\mathbf{P}^{(j-1)}\mathbf{H}_j^H}, \qquad (5)$$

where $\mathbf{H}_j$ denotes the $j$th row of $\mathbf{H}$. After $M_R$ iterations, $\mathbf{P}^{(M_R)} = \left(\mathbf{H}^H\mathbf{H} + M_T\sigma^2\mathbf{I}\right)^{-1}$. (The index of the OFDM tone has been omitted for brevity.) The complexity of the above described algorithm in terms of the number of multiplications² and divisions is given by

$$C_{Mult} = \tfrac{5}{2}M_RM_T^2 + \tfrac{5}{2}M_RM_T - M_T^2 + M_T, \qquad C_{Div} = M_R. \qquad (6)$$

In order to map recursion (5) to hardware, its compact mathematical description is expanded as shown in Alg. 1. The operation sequence is designed to reduce the dynamic range of intermediate results and to minimize the number of costly divisions, while keeping the number of multiplications low.

² In terms of complex-valued multiplications. The few real-valued multiplications are counted as complex-valued, assuming a dedicated VLSI architecture with multipliers optimized for complex-valued coefficients.

Algorithm 1 Algorithm for computing the MMSE estimator

1: $\mathbf{P}^{(0)} = \frac{1}{M_T\sigma^2}\mathbf{I}$
2: for $j = 1 \ldots M_R$ do
3: $\mathbf{g} = \mathbf{P}^{(j-1)}\mathbf{H}_j^H$
4: $S = 1 + \mathbf{H}_j\mathbf{g}$ (note that $S$ is strictly positive)
5: $S_e = \lfloor\log_2 S\rfloor$, $S_m = 2^{S_e}/S$
6: $\tilde{\mathbf{g}} = S_m\,\mathbf{g}$
7: $\mathbf{P}^{(j)} = \mathbf{P}^{(j-1)} - \tilde{\mathbf{g}}\,\mathbf{g}^H\,2^{-S_e}$
8: end for
9: $\mathbf{G} = \mathbf{P}^{(M_R)}\mathbf{H}^H$

III. VLSI ARCHITECTURE

The choice of a suitable hardware architecture for the implementation of Alg. 1 depends on the system specifications and on the available area: The most area efficient solution is a fully decomposed, processor-like architecture. However, such a minimum-area solution cannot meet the low-latency requirements of MIMO-OFDM systems. A highly parallel architecture achieves higher throughput but suffers significantly from the fact that data dependencies and the desire for a regular data flow mandate a sequential execution of the individual steps in Alg. 1. Since these steps differ significantly in the number of required operations, a massively parallel architecture would result in a poor utilization of processing resources. In a moderately parallel VLSI architecture the number of processing resources is chosen so that their average utilization is high. Most of the steps in Alg. 1 require either $M_T$ or a multiple of $M_T$ multiplications. Hence, choosing an $M_T$-fold degree of parallelism leads to a high hardware utilization.

A. Moderately Parallel Architecture

The high-level block diagram of the proposed moderately parallel architecture is shown in Fig. 2. The circuit employs $M_T$ identical processing elements (PEs) arranged in a circular array and a common $1/S$ block that computes the additions in step 4) and the pseudo floating-point division in step 5). The connections in the array are local, meaning that only neighboring PEs are connected with each other. Each PE mainly contains a complex-valued multiplier, an adder and some local storage registers as shown in Fig. 3. All intermediate variables are stored locally, equally distributed over the PEs. For the


Hermitian matrix $\mathbf{P}$, only the main diagonal and the lower triangular part are stored.

Fig. 2. High-level block diagram of a moderately parallel architecture for the direct matrix inversion. The same hardware is reused for the linear detection.

Fig. 3. Schematic of a single PE. The main components are the complex-valued multiplier [...]

B. MMSE Estimator Computation

The computation of the MMSE estimators starts with the loop between step 2) and 8) in Alg. 1. During the $j$th iteration of this loop, the entries of the $j$th row of $\mathbf{H}$ must be presented to the inputs of the PEs as shown in Fig. 2. The computation of the matrix-vector multiplication in step 3) is illustrated in Fig. 4. In the first cycle, the first PE uses the upper ring to broadcast $H^H_{j,1}$ to all other PEs, which multiply $H^H_{j,1}$ with their respective entry of the first column of $\mathbf{P}^{(j-1)}$ and store the result in the $g$ register of the neighboring PE. In the second cycle, the second PE broadcasts $H^H_{j,2}$ to the other PEs, which multiply it with the respective entries of the second column of $\mathbf{P}^{(j-1)}$, add the content of their $g$ register and forward the result again into the $g$ register of the neighboring PE. This procedure is repeated for $M_T$ cycles with the exception of the first iteration ($j = 1$), in which a single cycle suffices to compute $\mathbf{g}$, since $\mathbf{P}^{(0)}$ is a diagonal matrix. The multiplications in step 4) can be carried out on all PEs in parallel in a single cycle. The summation of the results that yields $S$ is absorbed into the $1/S$ block, which performs the addition and the division in step 5) in $C_{Div}$ cycles. The computation of $\tilde{\mathbf{g}}$ from $\mathbf{g}$ in step 6) is again trivial and can be performed in a single cycle on the individual PEs. Step 7) is carried out in $M_T$ cycles, in each of which one entry of $\mathbf{g}$ is broadcast to all PEs through the upper ring, while $\tilde{\mathbf{g}}$ circulates through the lower ring.

Fig. 4. Procedure to compute step 3) in Alg. 1 for $M_T = 4$.

Before step 9) is executed, the diagonal entries of $\mathbf{P}$ can be extracted for the computation of soft bit-metrics, as described in [6]. The matrix multiplication of $\mathbf{P}^{(M_R)}$ with $\mathbf{H}^H$ is computed in a series of $M_R$ matrix-vector multiplications, each of which is identical to step 3). The entries of $\mathbf{H}$ are again applied to the inputs of the PEs row-by-row so that $\mathbf{G}$ is output column-wise, as shown in Fig. 2. The $j$th row of $\mathbf{H}$ can be replaced with the $j$th column of $\mathbf{G}$, so that no extra memory is required to store the MMSE estimators. This [...] is particularly important when the number of [...] is large. The overall number of cycles that is required to compute Alg. 1 with the presented architecture is given by

$$t_{cpd} = M_R(3M_T + 2) - M_T + 1 + M_RC_{Div}. \qquad (7)$$

C. Detection

A significant advantage of the described conventional arithmetic based computation of the MMSE estimator is that once preprocessing is complete, the same hardware can be reused for the detection according to (3). To this end, $\mathbf{G}$ is read back from the memory one column at a time. The entries of the $j$th column are presented to the PEs together with the $j$th entry of the received vector $\mathbf{y}$, which is broadcast to all PEs. The results of the multiplications of $G_{i,j}$ with $y_j$ in the $i$th PE are accumulated in the $g$ register of the same PE and $\hat{\mathbf{s}}$ is available after $M_R$ cycles.
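The preprocessing recursion of Alg. 1 and the column-wise accumulation used for detection can be cross-checked with a small functional model. The sketch below is not cycle-accurate and does not model the ring structure, the fixed-point wordlengths or the block floating-point format; it uses double-precision arithmetic, and the function names `mmse_preprocess` and `detect` as well as the example channel are assumptions made for illustration. The rounding direction chosen for $S_e$ is also an assumption; the update in step 7) is exact for any integer $S_e$, because $S_m 2^{-S_e} = 1/S$.

```python
"""Functional (not cycle-accurate) model of Alg. 1 and of detection per (3)."""
import math


def mmse_preprocess(H, sigma2):
    """Return (P, G) with P = (H^H H + M_T sigma^2 I)^-1 and G = P H^H."""
    m_r, m_t = len(H), len(H[0])
    # Step 1: trivial inverse of M_T * sigma^2 * I.
    P = [[(1.0 / (m_t * sigma2) if r == c else 0.0) + 0j
          for c in range(m_t)] for r in range(m_t)]
    for j in range(m_r):                                  # step 2
        h = H[j]                                          # j-th row of H
        # Step 3: g = P H_j^H.
        g = [sum(P[r][c] * h[c].conjugate() for c in range(m_t))
             for r in range(m_t)]
        # Step 4: S = 1 + H_j g (real and strictly positive).
        S = 1.0 + sum(h[c] * g[c] for c in range(m_t)).real
        # Step 5: pseudo floating-point reciprocal (rounding is an assumption).
        S_e = math.floor(math.log2(S))
        S_m = 2.0 ** S_e / S
        # Step 6: scale g by the mantissa of 1/S.
        g_t = [S_m * x for x in g]
        # Step 7: rank-one update, exponent applied as a shift.
        shift = 2.0 ** (-S_e)
        P = [[P[r][c] - g_t[r] * g[c].conjugate() * shift
              for c in range(m_t)] for r in range(m_t)]
    # Step 9: G = P H^H, produced one column (one row of H) at a time.
    G = [[sum(P[r][c] * H[j][c].conjugate() for c in range(m_t))
          for j in range(m_r)] for r in range(m_t)]
    return P, G


def detect(G, y):
    """Accumulate G[:, j] * y[j] over j, one column of G per cycle."""
    acc = [0j] * len(G)
    for j, y_j in enumerate(y):
        for i in range(len(G)):      # the i-th PE updates its accumulator
            acc[i] += G[i][j] * y_j
    return acc


# Example with M_R = 3, M_T = 2 (arbitrary values, noise-free reception).
H = [[1 + 0j, 0.5j], [0.2 + 0j, 1 + 0j], [0.3 - 0.4j, -0.6 + 0j]]
sigma2 = 0.05
s = [1 + 1j, -1 + 1j]
y = [sum(H[i][c] * s[c] for c in range(2)) for i in range(3)]
P, G = mmse_preprocess(H, sigma2)
s_hat = detect(G, y)
```

Multiplying `P` with $\mathbf{H}^H\mathbf{H} + M_T\sigma^2\mathbf{I}$ should return the identity up to rounding, which gives a simple check of the recursion without calling a separate matrix inversion routine.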

D. Pipelining

Despite the recursive nature of the applied algorithm,

pipelining can be introduced to allow for higher clock frequen-

cies. An additional register is added to the original architecture

as shown in Fig. 3. The actual increase in clock speed depends

on the quality of the placement of this pipeline register in the

logic. Implementation results show that a factor of almost 1.7

can easily be reached with manual retiming. Unfortunately,

pipelining of the recursive matrix inversion algorithm also

mandates the insertion of additional cycles to flush the pipeline


after the operations associated with steps 3), 4), 6), 7) and 9) of Alg. 1. As a result, the number of cycles increases to

$$t_{cpd} = M_R(3M_T + 6) - M_T + 2 + M_RC_{Div} \qquad (8)$$

and the number of cycles for the division must also be increased to match the higher clock rate.

IV. IMPLEMENTATION RESULTS

ASIC Implementation. A critical design parameter is the wordlength of the complex-valued datapath. The corresponding trade-offs between silicon area, bit error rate (BER) performance, and the computation time for a single MMSE estimator are illustrated in Fig. 5 for a system with $M_T = M_R = 4$. The VLSI implementation results are based on a 0.25 µm technology and for the BER simulations the entries of $\mathbf{H}[k]$ are assumed i.i.d. Rayleigh fading with variance one, so that the received SNR is given by $1/\sigma^2$. For the computation of the MMSE estimator, $\mathbf{H}[k]$ is represented in a block floating-point format and WW denotes the wordwidth of the real-valued multipliers which constitute the complex-valued multipliers in the PEs. The clock rates of the unpipelined designs are between 93 MHz and 101 MHz, depending on the wordlength. The pipelined implementations achieve between 167 MHz and 176 MHz. For the computation of the MMSE estimators, the gain from the higher clock frequency remains small due to the increase in the number of cycles. However, a significant performance improvement is achieved from pipelining when the circuit operates in detection mode, because during this operation no pipeline bubbles need to be inserted. Without pipelining, 23-25 million (received) vectors per second (Mvps) can be processed, while with pipelining throughput increases to 42-44 Mvps.

FPGA Implementation. For the implementation of the design (for $M_T = M_R = 4$) on a Xilinx XC2V6000-6 FPGA, WW = 18 was chosen as the device contains hardwired multipliers of that size. The pipelined version operates with a clock rate of 40 MHz and requires 2.2 µs to compute the MMSE estimator of one tone. Hence, for example, the detection latency in a system with $K = 64$ tones adds up to 141 µs. The throughput in detection mode is 10 Mvps. In terms of area, the design consumes 16 out of 144 multipliers and 3'416 logic slices out of a total of 33'792.

Fig. 5. Fixed-point BER simulation and VLSI implementation results for MMSE detection in a system with $M_T = M_R = 4$ and with 16-QAM modulation.

[Fig. 5 plots BER versus SNR for WW = 18, 19, 20, 21 and for floating-point arithmetic. The embedded table lists area and time per inversion (Time/Inv.) for the non-pipelined and the pipelined ($C_{Div} = 8$) designs. Legible entries: non-pipelined 0.68 µs, 0.72 µs, 0.74 µs and 69k, 75k, 85k; pipelined 0.58 µs, 0.6 µs, 0.6 µs, 0.61 µs and 78k GE, 82k GE, 89k GE.]

V. CONCLUSIONS

In packet-based MIMO-OFDM systems even allegedly low-complexity linear detectors pose a considerable implementation challenge. The presented algorithm and the scalable VLSI architecture for the computation of the MMSE estimators partially solve this problem for MIMO-OFDM systems with a small number of tones ($K < 64$). A first important advantage of the presented approach is that it reduces silicon area by reusing the same hardware for the preprocessing and for the detection. The second advantage is the ability to easily extract soft bit-metrics for a subsequent channel decoder [6]. The main drawback lies in the considerable numerical requirements. Moreover, it is noted that for systems with a large number of tones, preprocessing latency is still too high. A possible solution to this problem has recently been proposed in [11].

ACKNOWLEDGEMENT

This work is supported by the STREP project No. IST-026905 (MASCOT) within the sixth framework programme of the European Commission.

REFERENCES

[1] G. Foschini and M. Gans, "On limits of wireless communications in a fading environment when using multiple antennas," Wireless Personal Communications, vol. 6, no. 3, pp. 311-334, 1998.

[2] Z. Guo and P. Nilsson, "A 53.3 Mb/s 4 x 4 16-QAM MIMO decoder in 0.35 µm CMOS," in Proc. IEEE ISCAS, May 2005, pp. 4947-4950.

[3] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bölcskei, "VLSI implementation of MIMO detection using the sphere decoder algorithm," IEEE Journal of Solid-State Circuits, 2005.

[4] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, "K-Best MIMO detection VLSI architectures achieving up to 424 Mbps," in Proc. IEEE Int. Symp. on Circuits and Systems, May 2006.

[5] D. Perels, S. Haene, P. Luethi, A. Burg, N. Felber, W. Fichtner, and H. Bölcskei, "ASIC implementation of a MIMO-OFDM transceiver for 192 Mbps WLANs," in Proc. IEEE ESSCIRC, Sept. 2005, pp. 215-218.

[6] S. Haene, A. Burg, D. Perels, P. Luethi, N. Felber, and W. Fichtner, "Silicon implementation of an MMSE-based soft demapper for MIMO-BICM," in Proc. IEEE Int. Symp. on Circuits and Systems, May 2006.

[7] Z. Khan, T. Arslan, J. S. Thompson, and A. T. Erdogan, "Area & power efficient VLSI architecture for computing pseudo inverse of channel matrix in a MIMO wireless system," in Proc. IEEE Int. Conf. on VLSI Design (VLSID), Jan. 2006, pp. 734-737.

[8] G. Lightbody, R. Woods, and R. Walke, "Design of a parameterized silicon intellectual property core for QR-based RLS filtering," IEEE Trans. on VLSI Systems, vol. 11, pp. 659-678, 2003.

[9] F. Edman and V. Owall, "A scalable pipelined complex valued matrix inversion architecture," in Proc. IEEE ISCAS, 2005, pp. 4489-4492.

[10] I. B. Collings, M. R. G. Butler, and M. McKay, "Low complexity receiver design for MIMO bit-interleaved coded modulation," in Proc. 8th IEEE Int. Symposium on Spread Spectrum Techniques and Applications, 2004, pp. 12-16.

[11] D. Cescato, M. Borgmann, H. Bölcskei, J. C. Hansen, and A. Burg, "Interpolation-based QR decomposition in MIMO-OFDM systems," in Proc. IEEE Workshop on Signal Processing Advances in Wireless Communications (SPAWC), June 2005, pp. 945-949.
