# 4k-point FFT algorithms based on optimized twiddle factor multiplication for FPGAs

**ABSTRACT** In this paper, we propose higher point FFT (fast Fourier transform) algorithms for a single delay feedback pipelined FFT architecture considering the 4096-point FFT. These algorithms are different from each other in terms of twiddle factor multiplication. Twiddle factor multiplication complexity comparison is presented when implemented on Field-Programmable Gate Arrays (FPGAs) for all proposed algorithms. We also discuss the design criteria of the twiddle factor multiplication. Finally it is shown that there is a trade-off between twiddle factor memory complexity and switching activity in the introduced algorithms.



Fahad Qureshi, Syed Asad Alam and Oscar Gustafsson

Department of Electrical Engineering, Linköping University

SE-581 83 Linköping, Sweden

E-mail: {fahadq, asad, oscarg}@isy.liu.se


I. INTRODUCTION

The discrete Fourier transform (DFT) and its inverse are used in, e.g., orthogonal frequency-division multiplexing (OFDM) communication systems, Digital Video Broadcasting (DVB), and spectrometers. A few of these systems require large FFTs, usually of more than 1K points.

An N-point DFT can be expressed as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1 \qquad (1)$$

where $W_N = e^{-j2\pi/N}$ is the twiddle factor, the N:th primitive root of unity with its exponent being evaluated modulo N, n is the time index, and k is the frequency index. Various methods for efficiently computing (1) have been the subject of a large body of published literature. They are commonly referred to as fast Fourier transform (FFT) algorithms. Also, many different architectures to efficiently map the FFT algorithm to hardware have been proposed [1].

A commonly used architecture for transforms of length $N = b^r$ is the pipelined FFT [2]. The pipeline architecture is characterized by continuous processing of input data. In addition, the pipeline architecture is highly regular, making it straightforward to automatically generate FFTs of various lengths. Especially for large FFTs, it reduces the computational complexity as well as the hardware complexity.

Figure 1 outlines a radix-$2^i$ single-path delay feedback (SDF) decimation in frequency (DIF) pipeline FFT architecture of length N = 32. This architecture is generic, while the required range of each complex twiddle factor multiplier is outlined in Table I for varying values of i.

TABLE I

MULTIPLICATION RESOLUTION AT DIFFERENT STAGES FOR VARIOUS FFT ALGORITHMS (N = 256).

| Radix / Stage number | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| $2$ | $W_{256}$ | $W_{128}$ | $W_{64}$ | $W_{32}$ | $W_{16}$ | $W_{8}$ | $W_{4}$ |
| $2^2$ [3] | $W_{4}$ | $W_{256}$ | $W_{4}$ | $W_{64}$ | $W_{4}$ | $W_{16}$ | $W_{4}$ |
| $2^3$ [4] | $W_{4}$ | $W_{8}$ | $W_{256}$ | $W_{4}$ | $W_{8}$ | $W_{32}$ | $W_{4}$ |
| $2^4$ [5] | $W_{4}$ | $W_{8}$ | $W_{16}$ | $W_{256}$ | $W_{4}$ | $W_{8}$ | $W_{16}$ |
| $2^5$ [6] | $W_{4}$ | $W_{8}$ | $W_{16}$ | $W_{32}$ | $W_{256}$ | $W_{4}$ | $W_{8}$ |
| $2^6$ [6] | $W_{4}$ | $W_{8}$ | $W_{16}$ | $W_{32}$ | $W_{64}$ | $W_{256}$ | $W_{4}$ |

For twiddle factor multipliers with small ranges, special methods have been proposed. In particular, for a $W_4$ multiplier the possible coefficients are $\{\pm 1, \pm j\}$; hence, the multiplication can be realized by optionally interchanging the real and imaginary parts and possibly negating (or replacing the addition with a subtraction in the subsequent stage). In [5], [8], twiddle factor multiplication for $W_8$, $W_{16}$, and $W_{32}$ using constant multiplication was proposed. Another way to realize the twiddle factor multiplication is to use a general complex multiplier and to pre-compute the twiddle factors and store them in a memory.
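The trivial $W_4$ multiplication described above can be illustrated with a minimal Python sketch. The function name `mul_w4` and the tuple representation of a complex sample are illustrative assumptions; they only model the behavior and say nothing about the hardware realization.

```python
def mul_w4(re, im, k):
    """Multiply re + j*im by W_4^k = (-j)^k using only swaps and negations."""
    k %= 4
    if k == 0:
        return re, im        # multiplication by 1
    if k == 1:
        return im, -re       # multiplication by -j: swap, negate new imaginary part
    if k == 2:
        return -re, -im      # multiplication by -1
    return -im, re           # multiplication by +j: swap, negate new real part


# Example: rotate the sample 3 + 5j through all four W_4 coefficients.
for k in range(4):
    print(k, mul_w4(3, 5, k))
```

No multiplier is involved: every case is a selection between the two input words and their negations.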

Fig. 1. Generalized Radix-2 single-path delay feedback (SDF) decimation in frequency (DIF) pipeline FFT architecture (N = 32) with twiddle factor stages as used in Table I.

In digital CMOS circuits, dynamic power is the dominating part of the total power consumption and can be approximated by [9]

$$P_{dyn} = \frac{1}{2} V_{DD}^2\, f_c\, C_L\, \alpha \qquad (2)$$

where $V_{DD}$ is the supply voltage, $f_c$ is the clock frequency, $C_L$ is the load capacitance, and $\alpha$ is the switching activity. Low complexity and low power architecture designs are always desirable. Low power can be achieved by reducing either the switching activity or the resource utilization. In [10]–[13], methods for reducing the size of the coefficient memory have been proposed. In [7], the authors proposed a balanced binary tree decomposition and claim an optimal twiddle factor memory requirement.

In this work, we propose algorithms to implement the 4096-point FFT. The butterfly structure of the proposed architectures is the same, but the twiddle factor multiplications differ. We also discuss the design criteria for the proposed algorithms on the basis of the implementation of the twiddle factor multiplication.

The rest of the paper is organized as follows. The next section describes the binary tree representation of the Cooley-Tukey algorithm. In Section III, we discuss the design criteria of the algorithms. In Section IV, we introduce the proposed architectures derived from radix-$2^i$; then, in Section V, some results are presented. Finally, some conclusions are presented.

II. BINARY TREE REPRESENTATION OF COOLEY-TUKEY ALGORITHM

The Cooley-Tukey FFT algorithm can be expressed as

$$X[Qk_1 + k_2] = \sum_{n_1=0}^{P-1} \left[ \left( \sum_{n_2=0}^{Q-1} x[n_1 + Pn_2]\, W_Q^{n_2 k_2} \right) W_N^{n_1 k_2} \right] W_P^{n_1 k_1}, \quad 0 \le n_1, k_1 \le P-1;\; 0 \le n_2, k_2 \le Q-1 \qquad (3)$$

where N, P, and Q are powers of 2, i.e., $N = 2^{p+q}$, $P = 2^p$, and $Q = 2^q$, where p and q are positive integers. Here, the N-point DFT is decomposed into P Q-point DFTs and Q P-point DFTs, referred to as the inner and outer DFTs, respectively. Between these DFTs we have twiddle factor multiplications. Typically, the P-point and Q-point DFTs are again divided into smaller DFTs. An efficient representation of algorithms of this type is the binary tree representation [7]. An example of a binary tree corresponding to (3) is shown in Fig. 2. The left branch corresponds to the $P = 2^p$-point DFT and the right branch to the $Q = 2^q$-point DFT. The resolution of the interconnecting twiddle factor is $N = 2^{p+q}$, i.e., a $W_N$ multiplier is required.

Fig. 2. Illustration of binary tree corresponding to (3): a root node p+q with branches p and q.
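The index mapping in (3) can be checked numerically. The following Python sketch evaluates one decomposition step literally and compares it against the direct DFT of (1). The function names `twiddle`, `dft`, and `dft_decomposed` are illustrative assumptions, and the code is written for clarity of the index mapping rather than efficiency (the inner DFT is recomputed for every output).

```python
import cmath


def twiddle(n, e):
    """W_n^e = exp(-j*2*pi*e/n), with the exponent reduced modulo n."""
    return cmath.exp(-2j * cmath.pi * (e % n) / n)


def dft(x):
    """Direct N-point DFT according to (1)."""
    n = len(x)
    return [sum(x[i] * twiddle(n, i * k) for i in range(n)) for k in range(n)]


def dft_decomposed(x, p, q):
    """One Cooley-Tukey step according to (3), with P = 2**p and Q = 2**q."""
    P, Q = 2 ** p, 2 ** q
    N = P * Q
    assert len(x) == N
    X = [0j] * N
    for k1 in range(P):
        for k2 in range(Q):
            acc = 0j
            for n1 in range(P):
                # Inner Q-point DFT over n2.
                inner = sum(x[n1 + P * n2] * twiddle(Q, n2 * k2) for n2 in range(Q))
                # Interconnecting W_N twiddle factor, then the outer P-point DFT term.
                acc += inner * twiddle(N, n1 * k2) * twiddle(P, n1 * k1)
            X[Q * k1 + k2] = acc
    return X


# Compare against the direct DFT for N = 32 split as P = 4, Q = 8.
x = [complex(i % 5, (3 * i) % 7) for i in range(32)]
error = max(abs(a - b) for a, b in zip(dft(x), dft_decomposed(x, 2, 3)))
print("max deviation:", error)
```

The only multiplication whose resolution depends on the full length N is the interconnecting factor `twiddle(N, n1 * k2)`, which is the $W_N$ multiplier attached to the root node of the tree.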

FFT algorithms are categorized by the way the Cooley-Tukey recursive decomposition is applied. The decompositions finally reach butterfly operations, which greatly influence the FFT architecture. A small radix is desirable because it has a simple butterfly operation, but a higher radix has fewer twiddle factor multiplications. The radix-$2^i$ algorithms have simple radix-2 butterfly operations, and their twiddle factor multiplications depend on the value of i. The generalized radix-2 (N = 32) signal flow graph is shown in Fig. 3. The multiplication after each butterfly operation is labeled with its stage (column) and row. A radix-$2^i$ algorithm can be obtained by applying the balanced decomposition to the small point FFTs.

Fig. 3. Generalized Radix-2 32-point FFT signal flow graph (inputs x(0) to x(31); twiddle factor multiplications labeled $W_{s,r}$ with stage s = 0, ..., 3 and row r = 0, ..., 31).

III. CRITERIA FOR ALGORITHM SELECTION

Selecting the algorithm is the most important step in designing a low power FFT. Twiddle factor multiplication is one of the major contributors to the power consumption of the single delay feedback pipelined FFT architecture, since it requires both memory and a complex multiplier, which consume power and area.

A. Complexity of the $W_N$ Multiplier

The simplest approach is to use a large look-up table (LUT) to store the twiddle factors. For a $W_N$ multiplier, N words need to be stored. The twiddle factor multiplication is then implemented with one complex multiplier and LUTs that store the precomputed coefficients. It should also be noted that this scheme possibly stores the same twiddle factor in several positions, as the mapping is from row to twiddle factor and, for radix-$2^i$ algorithms with $i \geq 2$, some twiddle factors appear more than once. The complexity of the LUTs depends on the size of the FFT and the resolution of the twiddle factor. It is also possible to use the well-known octave symmetry to store only the twiddle factors with angles $0 \leq \alpha \leq \pi/4$, at the additional cost of an address mapping circuit [13].

For low resolutions, $N \leq 16$, the complex multiplier can be implemented with dedicated constant multipliers [5], [8].
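The octave symmetry mentioned above can be sketched in Python as follows. This is a floating-point illustration under stated assumptions, not the address mapping circuit of [13]: the names `octant_table` and `twiddle_from_octant` are hypothetical, N is assumed to be a multiple of 8, and the table holds the N/8 + 1 angles from 0 to π/4 inclusive.

```python
import math


def octant_table(n):
    """(cos, sin) pairs for the angles 2*pi*m/n, m = 0 .. n/8, i.e. 0 <= angle <= pi/4."""
    return [(math.cos(2 * math.pi * m / n), math.sin(2 * math.pi * m / n))
            for m in range(n // 8 + 1)]


def twiddle_from_octant(table, n, k):
    """Return W_n^k = cos(a) - j*sin(a), a = 2*pi*k/n, using only the octant table."""
    k %= n
    eighth, quarter = n // 8, n // 4
    quadrant, r = divmod(k, quarter)   # a = quadrant*pi/2 + 2*pi*r/n
    if r <= eighth:
        c, s = table[r]                # angle already inside the first octant
    else:
        s, c = table[quarter - r]      # mirror about pi/4: cos and sin swap roles
    for _ in range(quadrant):
        c, s = -s, c                   # rotate the angle by pi/2
    return complex(c, -s)


table = octant_table(64)
print(len(table), "stored words instead of", 64)
```

The address mapping consists of the octant decision, the mirrored index `quarter - r`, and the swaps and negations, which corresponds to the additional cost noted in the text.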

1) $W_8$ Multiplier: A $W_8$ multiplier only requires multiplication by either 1 or $\sin\frac{\pi}{4}$ ($\cos\frac{\pi}{4}$). This can easily be realized using a multiplexer selecting between the input and the output of a constant multiplier with coefficient $\sin\frac{\pi}{4}$. The constant multiplier can be realized using a minimum number of adders using the method in [14].

2) $W_{16}$ Multiplier: A $W_{16}$ multiplier is a low resolution multiplier. This twiddle factor multiplication can be implemented with dedicated constant multipliers for $\sin\frac{\pi}{8}$, $\cos\frac{\pi}{8}$, and $\sin\frac{\pi}{4}$ together with some control logic. In [5], a $W_{16}$ multiplier based on trigonometric identities was proposed, implemented with the constant coefficients $\sin\frac{\pi}{8}$ and $\cos\frac{\pi}{8}$. In [15], the authors proposed an addition-aware quantization method that gives low adder complexity with minimum error. In the proposed architectures, we implement dedicated constant multipliers for the $W_{16}$ twiddle factor multiplication.

Fig. 4. Decomposed algorithms for 64-point FFT (binary tree decompositions, cases I to V).
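The $W_8$ multiplier described above can be modeled behaviorally in Python. The sketch below is a floating-point illustration and an assumption about one possible arrangement; it is not the adder-level constant multiplier of [14], and the name `mul_w8` is hypothetical. It relies on the identity $W_8^k = W_8^{k \bmod 2} \cdot W_4^{\lfloor k/2 \rfloor}$, so that only the single constant $\sin\frac{\pi}{4}$ is needed.

```python
import math

C = math.sin(math.pi / 4)  # the single constant, sin(pi/4) = cos(pi/4)


def mul_w8(re, im, k):
    """Multiply re + j*im by W_8^k = exp(-j*pi*k/4)."""
    k %= 8
    if k % 2:
        # Odd k: one factor W_8 = C*(1 - j). This is the path through the
        # constant multiplier; for even k the input is passed on unchanged.
        re, im = C * (re + im), C * (im - re)
    # Remaining factor W_4^(k // 2): swaps and negations only.
    for _ in range(k // 2):
        re, im = im, -re
    return re, im


print(mul_w8(1.0, 0.0, 1))  # W_8 itself: approximately (0.7071, -0.7071)
```

The branch on `k % 2` plays the role of the multiplexer that selects between the input and the constant multiplier output.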

B. Switching activity

Switching activity between two successive coefficients fed to the complex multiplier affects the power consumption. A coefficient reordering technique was proposed in [16] to design low power architectures. Algorithmic level changes also affect the switching activity, depending on how the FFT decomposition is recursively applied to form small point FFTs. In [17], an equivalent radix-$2^2$ algorithm with low switching activity was proposed. Here, we discuss the switching activity of the $W_{64}$ multiplication. The different decompositions of the 64-point FFT block are shown in Fig. 4, and the switching activity is tabulated in Table II. The position of the twiddle factor affects the switching activity. Cases II and IV have the same twiddle factor complexity, but case II has lower switching activity (479 versus 587). Switching activity also depends on whether a particular twiddle factor is located on the left or the right branch of the tree. There is thus a trade-off between complex multiplier complexity and switching activity, both of which have an effect on power consumption.

TABLE II

SWITCHING ACTIVITY OF DECOMPOSED $W_{64}$ MULTIPLICATION (12 BITS)

| Twiddle factor | I | II | III | IV | V |
|---|---|---|---|---|---|
| $W_{64}$ | 301 | 479 | 665 | 587 | 733 |

IV. PROPOSED ARCHITECTURES BASED ON RADIX-$2^i$

Considering the 4096-point FFT, the proposed algorithms based on the radix-$2^i$ decomposition are shown in Fig. 5(b-d) as binary tree diagrams. Each node corresponds to a twiddle factor multiplication. Twiddle factors are indexed by n and k; the index is determined by the linear index map equations and the sequences of the required n and k.

Fig. 5. (a) Balanced binary tree decomposition [7] (b-d) Proposed algorithms.

The proposed architectures can be formulated with (3). Here, we formulate the first decomposition of Fig. 5(a), expressed as

$$X[64k_1 + k_2] = \sum_{n_1=0}^{64-1} \left[ \left( \sum_{n_2=0}^{64-1} x[n_1 + 64n_2]\, W_{64}^{n_2 k_2} \right) W_{4096}^{n_1 k_2} \right] W_{64}^{n_1 k_1} \qquad (4)$$

where $W_{4096}$ is the twiddle factor multiplication which connects the two decomposed DFTs. Similarly, we can apply the decomposition equation to each node of the binary tree representation of the FFT. A generalized index mapping for all stages of any radix-$2^i$ algorithm is presented in [18]. The twiddle factor resolutions of each algorithm are tabulated in Table III.
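The relation between a binary tree and the per-stage twiddle factor resolutions can be sketched in Python. The representation is an assumption made for illustration: a leaf `1` stands for a radix-2 butterfly stage, and a node is a pair `(first, second)` of the sub-DFTs in processing order; which of the two is drawn as the left branch in the figures is not assumed here. The tree shapes below are likewise assumptions, chosen because the resulting sequences agree with the stage resolutions listed in Tables I and III.

```python
def log2_size(tree):
    """log2 of the DFT length covered by a node; a leaf covers a 2-point DFT."""
    return 1 if tree == 1 else log2_size(tree[0]) + log2_size(tree[1])


def stage_resolutions(tree):
    """Twiddle resolutions N (of the W_N multipliers) in stage order."""
    if tree == 1:
        return []
    first, second = tree
    # Sub-DFT processed first, then this node's W_N, then the other sub-DFT.
    return stage_resolutions(first) + [2 ** log2_size(tree)] + stage_resolutions(second)


two = (1, 1)
three = (two, 1)
four = (two, two)
six = (three, three)
eight = (four, four)

balanced = (six, six)            # assumed shape for the balanced decomposition
first_proposed = (eight, four)   # assumed shape for the first proposed algorithm
radix22_256 = (two, (two, four)) # assumed shape for radix-2^2, N = 256

print(stage_resolutions(balanced))
# [4, 8, 64, 4, 8, 4096, 4, 8, 64, 4, 8]
print(stage_resolutions(first_proposed))
# [4, 16, 4, 256, 4, 16, 4, 4096, 4, 16, 4]
print(stage_resolutions(radix22_256))
# [4, 256, 4, 64, 4, 16, 4]
```

A tree with 12 leaves always yields 11 twiddle factor stages, one per internal node, so the algorithms differ only in which resolution appears at which stage.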

V. RESULTS

We have analyzed the complexity and switching activity of the twiddle factor multiplications. Both factors influence low power designs. The architectures of the twiddle factor multiplication have been coded in VHDL. For the higher resolution twiddle factor multiplications, we used LUTs to store the precomputed twiddle factors together with a complex multiplier; for the others, dedicated constant multipliers are used. The twiddle factor memory and complex multipliers were synthesized targeting a Virtex-4 FPGA. The twiddle factors are represented using 12 bits each for the real and imaginary parts, using two's complement representation. The resulting complexity for each twiddle factor multiplier is shown in Table IV.

The switching activity between successive coefficients fed to the complex multiplier is defined in terms of the Hamming distance for each coefficient transition. The Hamming distance is the number of 1's in the result of the XOR operation between two successive binary coefficients. Twiddle factors can be pre-computed and stored in look-up tables instead of being calculated in real time. In the pipelined SDF architecture, these stored coefficients are fed to the complex multiplier in each cycle. The sequence of the stored coefficients affects the switching activity. The reading sequence is then simulated to obtain the resulting switching activity. The results for the different algorithms are shown in Table V. These results show that there are more options for implementing the 4096-point FFT.
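The Hamming distance measure defined above can be sketched in Python. The quantization rule (scaling by $2^{11} - 1$ and rounding), the summation over both the real and imaginary words, and the function names are assumptions for illustration; the reading sequences of the actual architectures are not reproduced here, so the sketch is not expected to return the values reported in the tables.

```python
import cmath


def quantize(value, bits=12):
    """Two's complement code of a value in [-1, 1], as an unsigned bit pattern."""
    scale = 2 ** (bits - 1) - 1          # assumption: +1.0 maps to the largest positive code
    return round(value * scale) & (2 ** bits - 1)


def hamming(a, b):
    """Number of 1's in the XOR of two bit patterns."""
    return bin(a ^ b).count("1")


def switching_activity(exponents, n, bits=12):
    """Sum of Hamming distances between successive W_n^e coefficients."""
    total = 0
    prev = None
    for e in exponents:
        w = cmath.exp(-2j * cmath.pi * (e % n) / n)
        word = (quantize(w.real, bits), quantize(w.imag, bits))
        if prev is not None:
            total += hamming(prev[0], word[0]) + hamming(prev[1], word[1])
        prev = word
    return total


# Two hypothetical reading orders of the same W_64 coefficients.
natural = list(range(64))
interleaved = [e for pair in zip(range(32), range(32, 64)) for e in pair]
print(switching_activity(natural, 64), switching_activity(interleaved, 64))
```

Since both sequences contain the same set of coefficients, any difference in the two printed values comes only from the order in which the words are read.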


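TABLE III

MULTIPLICATION RESOLUTION AT DIFFERENT STAGES FOR BALANCED BINARY TREE DECOMPOSITION AND PROPOSED ALGORITHMS.

| Case / Stage number | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Balanced binary tree decomposition [7] | $W_{4}$ | $W_{8}$ | $W_{64}$ | $W_{4}$ | $W_{8}$ | $W_{4096}$ | $W_{4}$ | $W_{8}$ | $W_{64}$ | $W_{4}$ | $W_{8}$ |
| Proposed 1st | $W_{4}$ | $W_{16}$ | $W_{4}$ | $W_{256}$ | $W_{4}$ | $W_{16}$ | $W_{4}$ | $W_{4096}$ | $W_{4}$ | $W_{16}$ | $W_{4}$ |
| Proposed 2nd | $W_{4}$ | $W_{64}$ | $W_{4}$ | $W_{16}$ | $W_{4}$ | $W_{4096}$ | $W_{4}$ | $W_{64}$ | $W_{4}$ | $W_{16}$ | $W_{4}$ |
| Proposed 3rd | $W_{4}$ | $W_{16}$ | $W_{4}$ | $W_{128}$ | $W_{4}$ | $W_{8}$ | $W_{4096}$ | $W_{4}$ | $W_{8}$ | $W_{32}$ | $W_{4}$ |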

The first proposed architecture requires two complex multipliers, while the other architectures need three. The hardware complexity of its dedicated multipliers and twiddle factor memory is higher than that of the others, but its switching activity is lower. From the first to the third proposed architecture, the complexity of the dedicated constant multipliers and twiddle factor memory decreases while the switching activity increases.

Low power design is a trade-off between these parameters. The proposed architectures offer better options for selecting a low power design than the balanced binary tree algorithm.

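TABLE IV

TWIDDLE FACTOR MULTIPLICATION COMPLEXITY (NUMBER OF 4-INPUT LUTS)

| Twiddle factor | Balanced binary decomposition [7] | Proposed 1st | Proposed 2nd | Proposed 3rd |
|---|---|---|---|---|
| $W_{8}$ | $4 \times 215$ | – | – | $2 \times 215$ |
| $W_{16}$ | – | $419 \times 3$ | $419 \times 2$ | 419 |
| $W_{32}$ | – | – | – | 48 |
| $W_{64}$ | 136+430 | – | 126+401 | – |
| $W_{128}$ | – | – | – | 136 |
| $W_{256}$ | – | 575 | – | – |
| $W_{4096}$ | 5967 | 6058 | 5967 | 6102 |
| Total | 7393 | 7890 | 7332 | 7135 |
| Complex multiplier | 3 | 2 | 3 | 3 |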

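TABLE V

SWITCHING ACTIVITY OF TWIDDLE FACTOR

| Twiddle factor | Balanced binary decomposition [7] | Proposed 1st | Proposed 2nd | Proposed 3rd |
|---|---|---|---|---|
| $W_{32}$ | – | – | – | 40437 |
| $W_{64}$ | 587+38639 | – | 479+31475 | – |
| $W_{128}$ | – | – | – | 1310 |
| $W_{256}$ | – | 2388 | – | – |
| $W_{4096}$ | 34061 | 40726 | 34061 | 37481 |
| Total | 73287 | 43114 | 66015 | 79228 |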
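The totals in Tables IV and V can be recomputed from the individual entries, which also makes the trade-off between the algorithms visible side by side. The short Python sketch below only restates the tabulated numbers; the assignment of the fused entries to the first and second proposed algorithm is an assumption, chosen because it reproduces the printed totals and the stated number of complex multipliers.

```python
# Individual entries of Table IV (4-input LUTs) per algorithm.
lut_terms = {
    "balanced [7]": [4 * 215, 136 + 430, 5967],
    "proposed 1st": [419 * 3, 575, 6058],
    "proposed 2nd": [419 * 2, 126 + 401, 5967],
    "proposed 3rd": [2 * 215, 419, 48, 136, 6102],
}

# Individual entries of Table V (switching activity) per algorithm.
switching_terms = {
    "balanced [7]": [587 + 38639, 34061],
    "proposed 1st": [2388, 40726],
    "proposed 2nd": [479 + 31475, 34061],
    "proposed 3rd": [40437, 1310, 37481],
}

for name in lut_terms:
    print(f"{name}: {sum(lut_terms[name])} LUTs, "
          f"switching activity {sum(switching_terms[name])}")
# balanced [7]: 7393 LUTs, switching activity 73287
# proposed 1st: 7890 LUTs, switching activity 43114
# proposed 2nd: 7332 LUTs, switching activity 66015
# proposed 3rd: 7135 LUTs, switching activity 79228
```

From the first to the third proposed algorithm the LUT count falls while the switching activity rises, which is the trade-off discussed in the text.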

VI. CONCLUSIONS

In this work, we proposed different algorithms for the single delay feedback architecture for higher radices, considering the 4096-point FFT. The twiddle factor multiplications at each stage differ between the proposed algorithms. The low power design of each algorithm depends on a few twiddle factor multiplication design parameters, and the design of the twiddle factor multiplication is a trade-off between these parameters. The proposed algorithms offer better choices for selecting a low power architecture for the 4096-point FFT.

REFERENCES

[1] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.

[2] E. H. Wold and A. M. Despain, “Pipeline and parallel-pipeline FFT processors for VLSI implementations,” IEEE Trans. Comp., vol. 33, no. 5, pp. 414–426, May 1984.

[3] S. He and M. Torkelson, “A new approach to pipeline FFT processor,” in Proc. IEEE Parallel Processing Symp., 1996, pp. 766–770.

[4] S. He and M. Torkelson, “Designing pipeline FFT processor for OFDM (de)modulation,” in Proc. IEEE URSI Int. Symp. Sig. Elect., 1998, pp. 257–262.

[5] J.-E. Oh and M.-S. Lim, “New radix-2 to the 4th power pipeline FFT processor,” IEICE Trans. Electron., vol. E88-C, no. 8, pp. 694–697, Aug. 2005.

[6] A. Cortes, I. Velez, and J. F. Sevillano, “Radix $r^k$ FFTs: matricial representation and SDC/SDF pipeline implementation,” IEEE Trans. on Signal Processing, vol. 57, no. 7, pp. 2824–2839, July 2009.

[7] H.-Y. Lee and I.-C. Park, “Balanced binary-tree decomposition for area-efficient pipelined FFT processing,” IEEE Trans. on Circuits and Systems-I, vol. 54, no. 4, pp. 889–900, April 2009.

[8] F. Qureshi and O. Gustafsson, “Low-complexity reconfigurable complex constant multiplication for FFTs,” in Proc. IEEE Int. Symp. Circuits Syst., Taipei, Taiwan, May 24–27, 2009.

[9] K. Johansson, O. Gustafsson, and L. Wanhammar, “Switching activity estimation for shift-and-add based constant multipliers,” in Proc. IEEE Int. Symp. Circuits Syst., Seattle, WA, USA, May 18–21, 2008.

[10] S. Lee, D. Kim, and S.-C. Park, “Power-efficient design of memory based FFT processor with new addressing scheme,” in Proc. Int. Symp. Communications and Information Technology, Oct. 26–29, 2004, pp. 678–681.

[11] F. Qureshi and O. Gustafsson, “Analysis of twiddle factor memory complexity of radix-$2^i$ pipelined FFTs,” in Proc. Asilomar Conf. Signals Syst. Comp., Pacific Grove, CA, Nov. 1–4, 2009.

[12] H. Cho, M. Kim, D. Kim, and J. Kim, “R$2^2$SDF FFT implementation with coefficient memory reduction scheme,” in Proc. Vehicular Technology Conf., 2006.

[13] M. Hasan and T. Arslan, “Scheme for reducing size of coefficient memory in FFT processor,” Electronics Letters, vol. 38, no. 4, pp. 163–164, Feb. 2007.

[14] O. Gustafsson, A. G. Dempster, K. Johansson, M. D. Macleod, and L. Wanhammar, “Simplified design of constant coefficient multipliers,” Circuits, Systems and Signal Processing, vol. 25, no. 2, pp. 225–251, Apr. 2006.

[15] O. Gustafsson and F. Qureshi, “Addition aware quantization for low complexity and high precision constant multiplication,” IEEE Signal Processing Letters, vol. 17, no. 2, pp. 173–176, Feb. 2010.

[16] J. Ming Wu and Y. Chun Fan, “Coefficient ordering based pipelined FFT/IFFT with minimum switching activity for low power WiMAX communication system,” in Proc. IEEE Tenth Int. Symp. Consumer Electronics, 2006, pp. 1–4.

[17] F. Qureshi and O. Gustafsson, “Twiddle factor memory switching activity analysis of radix-$2^2$ and equivalent FFT algorithms,” in Proc. IEEE Int. Symp. Circuits Syst., Paris, France, 2010.

[18] F. Qureshi and O. Gustafsson, “Generalized twiddle factor index-mapping of radix-2 FFT algorithm,” in preparation.