
A Two Level Architecture for high throughput DCT-Processor and Implementing

on FPGA

Azad Fakhari

Computer faculty

Iran University of Science and Technology

Tehran, Iran

azad_1364@yahoo.com

Dr.Mahmood Fathy

Computer faculty

Iran University of Science and Technology

Tehran, Iran

mahfathy@iust.ac.ir

ABSTRACT- Frequency analysis using the discrete cosine transform (DCT) is used in a wide variety of algorithms, such as image processing algorithms. This paper proposes a new high-throughput architecture for a DCT processor. The system has a two-level architecture that uses parallelism and pipelining, and it has been synthesized on a Xilinx Virtex5 FPGA. Synthesis results show that the system works at 150 MHz. Applying the DCT to each 8x8 matrix of the image takes 67 clock pulses. In other words, applying the DCT to each pixel takes approximately one clock pulse.

I. INTRODUCTION

Nowadays, with the high demand for video and image data transmission and storage, compression is a must in order to achieve lower transmission times for high-quality images and to use less storage space. The DCT plays a key role in many compression standards, such as JPEG for still image compression, ITU H.261 and ITU H.263 for teleconferencing, and ISO MPEG1 and MPEG2 for moving pictures and home video.

Image compression using the 2D-DCT includes several stages: pre-processing, applying the discrete cosine transform, quantization and coding. The 2D-DCT is the most computationally intensive phase of the encoding process, and accelerating it would dramatically reduce the overall processing time.

Several algorithms and implementation methods have been proposed for the DCT, ranging from software implementations on DSPs to hardware implementations in ASICs. The best implementation option depends on the final objective and the application. Generally, when speed is the priority, a hardware implementation is the best option.

Many implementation methods for 2D-DCT algorithms have been proposed to reduce the computational complexity and thus increase the operational speed and throughput. The 2D-DCT of an NxN block can be broken down into two groups of N 1D-DCTs, which is equivalent to processing a data block by rows and then by columns, or vice versa. This is called row/column decomposition.

Algorithms and architectures for the 2D-DCT can be divided into two categories:

• Row/column decomposition methods

• Non row/column decomposition methods

The DCT has been implemented in many different architectures, but two are the most common: systolic architecture (SA) and distributed arithmetic (DA) [1], [2], [3], [4].

FPGAs facilitate accessibility and configurability in implementing hardware on a chip. The FPGA chip can be placed in an actual circuit to be evaluated in real situations. Several different DCT architectures have been presented so far.

Some architectures are systolic, such as [5], which is a reconfigurable 2D-DCT architecture, [6], which is an array-based architecture with high scalability, and [7]. The Residue Number System (RNS) is used as a row/column decomposition technique in [8], [9]. A polynomial transformation algorithm on two dimensions using the 1D-DCT is implemented on FPGA in [3]. Several architectures use distributed arithmetic (DA): for example, [1] presents a fully parallel architecture based on the method of Chen et al., and in [10] group distributed arithmetic (GDA) combines the good features of cyclic convolution and DA computation using shared ROM modules, barrel shifters and accumulators. In [11] a compressed distributed arithmetic architecture for the 2D 8x8 DCT is presented using a distributed arithmetic 1D-DCT architecture. In [12] the look-up table method is combined with row-column overlapped operations, using look-up tables and shift registers to avoid the transpose operation; because this design uses neither multipliers nor transpose RAMs, it can achieve high speed with less latency for real-time applications. [13] uses a recursive algorithm, and [14] and [15] are based on a frame-recursive approach using two 1-D DCT arrays. [16] is based on partial sums, and [4] is based on the Lee algorithm with a combined pipeline. There are many other architectures, such as [2] and [17].

This paper proposes a new architecture that is efficient in speed because of its parallelism and pipelining. Section II covers the discrete cosine transform and its properties. The proposed architecture is presented in full in Section III, where the two-level architecture is defined, each level is described individually, and the system function is shown as a state machine. Section IV shows how this architecture is implemented in hardware; as in the architecture section, the implementation is described for each level individually. Section V covers simulation of the system, which uses sample data to verify the results. Section VI covers synthesis of the design on FPGA and the resulting reports. Finally, Section VII analyzes the system and how fast it is.

2010 International Conference on Reconfigurable Computing

978-0-7695-4314-7/10 $26.00 © 2010 IEEE

DOI 10.1109/ReConFig.2010.67

II. DISCRETE COSINE TRANSFORM

The DCT is one of the major transforms used in many compression algorithms. It is a frequency transform that is closely related to the real part of the discrete Fourier transform (DFT). The 2D-DCT is very common in image processing and especially in image compression. It operates on a two-dimensional matrix of natural-number values. The discrete cosine transform receives an image matrix that is divided into smaller image blocks (4x4, 8x8, 16x16, ...), and each block is transformed from the spatial domain to the frequency domain. The DCT decomposes the signal into spatial frequency components. The lower-frequency parts appear toward the first row/first column of the DCT matrix, and the higher-frequency parts appear toward the last row/last column. This transformation includes multiplying the values of a small block of the image by specific cosine coefficients and summing the products, which results in a new value for each element of the block [18], [19], [4].

The 2D-DCT lends itself very well to parallelization. It is common to use 8x8 matrices to simplify the calculations. The DCT is applied to each 8x8 matrix of the image using the following formula:

$$F(u,v) = \frac{C(u)\,C(v)}{4} \sum_{i=0}^{7} \sum_{j=0}^{7} f(i,j)\, \cos\!\left(\frac{(2i+1)u\pi}{16}\right) \cos\!\left(\frac{(2j+1)v\pi}{16}\right), \qquad 0 \le u \le 7,\; 0 \le v \le 7$$

$$C(\beta) = \begin{cases} \dfrac{\sqrt{2}}{2}, & \beta = 0 \\ 1, & \text{otherwise} \end{cases} \qquad (1)$$

Where

F(u,v): values in transform domain

f(i,j): values in pixel domain

i, j: spatial coordinate in the pixel domain

u, v: coordinate in transform domain
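As an illustration of Eq. (1), the following Python sketch evaluates the formula directly for a single 8x8 block. It is only a floating-point reference model, not the hardware design described later; the function names `c` and `dct2_8x8` are hypothetical and chosen for this example.

```python
import math

def c(beta):
    """Normalization factor C(beta) from Eq. (1)."""
    return math.sqrt(2) / 2 if beta == 0 else 1.0

def dct2_8x8(f):
    """Direct evaluation of Eq. (1) for one 8x8 block f[i][j]."""
    F = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            acc = 0.0
            for i in range(8):
                for j in range(8):
                    acc += (f[i][j]
                            * math.cos((2 * i + 1) * u * math.pi / 16)
                            * math.cos((2 * j + 1) * v * math.pi / 16))
            F[u][v] = c(u) * c(v) / 4 * acc
    return F

# A constant block has only a DC component: F(0,0) = 8 * value,
# and every other coefficient is zero up to rounding error.
flat = [[100] * 8 for _ in range(8)]
F_flat = dct2_8x8(flat)
```

For the constant block above, F(0,0) equals 8 × 100 = 800, which matches the statement that low-frequency content concentrates toward the first row and first column.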

III. SYSTEM ARCHITECTURE

Unlike many other architectures, this paper proposes a non row/column decomposition method which implements the 2D-DCT directly. Each 8x8 matrix has 64 elements, and for each element (a,b) (0 ≤ a ≤ 7, 0 ≤ b ≤ 7) the procedure that yields F(a,b) must be applied once. The procedure for each element is independent of the other 63 elements. Moreover, the 64 cosine coefficients and the factors C(a) and C(b) of each element (a,b) are constants, independent of the input values. The calculation of F(a,b) can therefore be carried out for all elements of the 8x8 matrix simultaneously. In other words, there can be 64 parallel procedures for the 64 elements of the 8x8 matrix.

Therefore, calculating the 64 values of F(a,b) simultaneously requires 64 similar calculation units, each with its own specific coefficients that depend on the values of a and b for that unit. When one operational cycle finishes, the final values of F(a,b) in all 64 calculation units are ready. To apply the DCT to the whole image matrix, the image must be divided into non-overlapping 8x8 matrices, and the DCT operations must then be applied to each 8x8 matrix individually.

The system that implements these operations is in charge of the following tasks:

• Reading the 8x8 matrix values from the corresponding addresses of memory

• Controlling the 64 calculation units that apply the DCT to the 8x8 matrix data

• Writing the calculated data F(a,b) to the corresponding addresses of memory

• Shifting the 8x8 window to the next part of the image when an operational cycle is finished

• Declaring the end of the whole operation (applying the DCT to the whole image)

Designing this system requires a DCT processor, a source memory that stores the original values of the image, and a destination memory in which the calculated data are stored.

A. TWO LEVEL ARCHITECTURE

The DCT processor has been designed in two levels. The high level unit is in charge of reading data from the source memory, transferring the data to the low level unit, reading the calculated data from the low level unit and writing them to the destination memory. The low level unit, on the other hand, is in charge only of applying the DCT calculations to the data received from the high level unit.

B. HIGH LEVEL UNIT ARCHITECTURE

The high level unit includes several sub-units:

1) A counting unit, which acts as a sequence counter and is used for defining the state of the system and driving signals based on that state.

2) A reading register file, which includes 64 registers for data read from the source memory.

3) A calculation register file, which includes 64 registers for the data used by the low level unit.

4) A writing register file, which includes 64 registers for the data that must be written to the destination memory.

5) An image size registration unit, which includes 2 registers that store the height and width of the image.

6) A current-coordinate registration unit for the 8x8 matrix, which includes 2 counters that hold the coordinates of the first element of the current 8x8 matrix in the image.

7) An address generator unit, which uses the values of the counting unit, the image size registration unit and the current-coordinate registration unit to generate these addresses:


Fig1: high level unit block diagram

• The source memory address to read data from

• The destination memory address to write data to

• The current addresses of the 3 register files

• The current coefficient address in the low level unit

8) A control unit, which generates control signals for the memories, register files, high level sub-units and low level sub-units. The control signals include enable and reset signals.
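The address generation described in item 7 can be sketched in Python as follows. The paper does not specify the memory layout or the order in which the 8x8 window moves, so this sketch assumes a row-major image with one pixel per address, a width and height that are multiples of 8, and a left-to-right, top-to-bottom window order. The function names are hypothetical.

```python
def block_origins(width, height):
    """Yield (row, col) of the first element of each 8x8 window.

    Assumed order: left to right, then top to bottom.
    """
    for row in range(0, height, 8):
        for col in range(0, width, 8):
            yield row, col

def element_address(width, row0, col0, count):
    """Address of element `count` (0..63) of the window at (row0, col0).

    Assumes a row-major image with one pixel per memory address.
    """
    i, j = divmod(count, 8)
    return (row0 + i) * width + (col0 + j)

# Example: 16x16 image, window at (0, 8), count 9 is element (i=1, j=1).
origins = list(block_origins(16, 16))
addr = element_address(16, 0, 8, 9)
```

In this model the count value plays the role of the counting unit, the window origin plays the role of the current-coordinate registration unit, and the image width comes from the image size registration unit.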

C. LOW LEVEL UNIT ARCHITECTURE

The low level unit includes 64 similar sub-units which differ only in their coefficients. Each sub-unit includes these parts:

1) A coefficient memory with 64 rows, where each row holds the coefficient value that corresponds to one combination of i and j in the DCT formula. To simplify the calculations, every coefficient has been multiplied by 2^16 = 65536.

2) A multiplication unit, which includes a fast clock-free (combinational) multiplier that multiplies the input data from the high level unit by the corresponding coefficient from the coefficient memory.

3) An accumulation unit, which initially holds the value 0 and on each clock pulse adds the new product to its current value. After 64 clock pulses it therefore contains the final value of F(a,b), scaled by 2^16. The 16 least significant bits are then discarded in order to divide the result by 2^16 = 65536.
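The behavior of one sub-unit can be modeled with integer arithmetic in Python. This is a minimal sketch under stated assumptions: coefficients are rounded to the nearest integer after scaling by 2^16 (the rounding mode is not given in the paper), elements are addressed as count = 8i + j, and register widths and overflow are not modeled. The names `coefficient_rom` and `sub_unit` are hypothetical.

```python
import math

SCALE = 1 << 16  # coefficients are stored multiplied by 2^16 = 65536

def coefficient_rom(a, b):
    """64 scaled integer coefficients for sub-unit (a, b), indexed by count = 8*i + j."""
    ca = math.sqrt(2) / 2 if a == 0 else 1.0
    cb = math.sqrt(2) / 2 if b == 0 else 1.0
    rom = []
    for count in range(64):
        i, j = divmod(count, 8)
        k = (ca * cb / 4
             * math.cos((2 * i + 1) * a * math.pi / 16)
             * math.cos((2 * j + 1) * b * math.pi / 16))
        rom.append(round(k * SCALE))  # rounding to nearest is an assumption
    return rom

def sub_unit(rom, data):
    """Multiply-accumulate over 64 clock pulses, then drop the 16 low bits."""
    acc = 0  # the accumulator starts at 0
    for count in range(64):
        acc += rom[count] * data[count]
    return acc >> 16  # arithmetic shift: divides by 2^16, rounding toward minus infinity

data = [100] * 64
dc = sub_unit(coefficient_rom(0, 0), data)
```

For the constant input above, the sub-unit at (0,0) holds the coefficient 0.125 × 65536 = 8192 in every row, so the accumulator reaches 8192 × 100 × 64 and the shifted result `dc` is 800.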

D. SYSTEM STATE MACHINE

In the start state, the system receives the width and height of the image via two dedicated input pins, and these values are stored in the image size registration unit. The system also resets all other registers in this state. An input signal selects whether the system is in the configuration state (the start state) or in the process state (which begins with state 1); changing this signal moves the system to state 1.

Fig2: low level unit block diagram

Fig3: system state machine

In state 1, the counting unit counts up from 0 and stops when it reaches 64. During this counting period, the following actions take place:

• 64 new data values are read from the source memory and stored in the reading register file.

• The 64 data values that were read during the previous cycle are sent to the 64 low level sub-units to be used in the DCT calculations. The coefficient memory in each sub-unit is addressed by the count value from 0 to 63.

• The 64 data values that were calculated during the previous cycle are written to the destination memory.

When the counting unit stops, the system goes from state 1 to state 2.

In state 2, on the first rising edge of the clock, the data in the reading register file are moved simultaneously to the calculation register file, each value to its corresponding register. The final value in each calculation sub-unit is also placed on that sub-unit's output.

The system goes from state 2 to state 3 immediately after this clock edge.

In state 3, on the first rising edge of the clock, the 64 final values calculated in the sub-units are moved simultaneously to the writing register file, each value to its corresponding register.

The system goes from state 3 to state 4 immediately after this clock edge.

In state 4, one of two things can happen:


• If, in state 1 of the current cycle, the last block of calculated data in the writing register file was written to the destination memory, the system goes to the final state on the first rising edge of the clock.

• Otherwise, the counting unit is reset and the system starts a new cycle by going to state 1.

In the final state, the counting unit stops and an output signal changes to declare the end of the process.
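The three-stage overlap implied by the state machine (reading, calculating and writing happen in the same cycle for three different 8x8 matrices) can be tabulated with a short Python sketch. The function name `pipeline_schedule` and the block indexing are assumptions made for this illustration.

```python
def pipeline_schedule(num_blocks):
    """For each cycle, the block index being read, calculated and written (None = idle)."""
    schedule = []
    for cycle in range(num_blocks + 2):
        read = cycle if cycle < num_blocks else None
        calc = cycle - 1 if 0 <= cycle - 1 < num_blocks else None
        write = cycle - 2 if 0 <= cycle - 2 < num_blocks else None
        schedule.append((read, calc, write))
    return schedule

# Three 8x8 matrices need 3 + 2 = 5 cycles.
for cycle, stages in enumerate(pipeline_schedule(3)):
    print(cycle, stages)
```

The last matrix is read in cycle 2, calculated in cycle 3 and written in cycle 4, so the pipeline needs two cycles beyond the number of matrices before the final state is reached.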

IV. SYSTEM IMPLEMENTATION

Designing this system in two levels facilitates implementation, simulation, analysis and testing in two separate parts which are completely independent of each other. The high level unit acts as a commander which is in charge of controlling the system sequence, addressing the different parts, deciding whether to continue with or change the current state of the system, and controlling every single element in both the high level unit and the low level unit. The low level unit acts as a calculation core which receives commands, data and the required addresses from the high level unit in order to perform the DCT calculations, that is, summing the products that make up the result of applying the DCT to the data.

A. HIGH LEVEL UNIT IMPLEMENTATION

In the start state, the 7-bit counter and the 3 flip-flops in the counting unit are reset. After the transition to state 1, the 7-bit counter starts counting up. When it reaches 64 (the 7th bit becomes '1'), it stops counting and, at the same time, the enable signal of the first flip-flop changes to '1'. The first flip-flop then goes to '1' on the next rising clock edge. This change returns the enable signal of the first flip-flop to '0' and, at the same time, changes the enable signal of the second flip-flop to '1'. The second flip-flop then goes to '1' on the next rising clock edge. This change returns the enable signal of the second flip-flop to '0' and changes the enable signal of the third flip-flop to '1'. After that, the third flip-flop goes to '1' on the next rising clock edge. This change resets the 7-bit counter. The 0 value of the counter resets the first flip-flop, the '0' value of the first flip-flop resets the second flip-flop, and the '0' value of the second flip-flop resets the third flip-flop. The '0' value of the third flip-flop ends the reset state of the 7-bit counter. In other words, the 7-bit counter can start counting again right after the third flip-flop changes from '1' to '0'. Only if the current cycle is the last cycle of the system does the whole counting unit stop instead of starting to count again. In that case the system goes to the final state and an output signal declares the end of the process.
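The counting unit can be modeled behaviorally to check the length of one operational cycle. The Python sketch below treats the reset chain (counter, then the three flip-flops) as instantaneous, which is an assumption about the asynchronous resets; the function name is hypothetical.

```python
def clocks_per_cycle():
    """Count rising clock edges in one operational cycle of the counting unit."""
    counter, ff1, ff2, ff3 = 0, 0, 0, 0
    clocks = 0
    while True:
        clocks += 1                 # one rising clock edge
        if counter < 64:            # state 1: the 7-bit counter counts up to 64
            counter += 1
        elif not ff1:               # state 2: the first flip-flop is set
            ff1 = 1
        elif not ff2:               # state 3: the second flip-flop is set
            ff2 = 1
        else:                       # state 4: the third flip-flop is set...
            ff3 = 1
            # ...which resets the counter and then, in a chain, the flip-flops
            counter, ff1, ff2, ff3 = 0, 0, 0, 0
            return clocks

cycle_length = clocks_per_cycle()
```

In this model the counter needs 64 edges to go from 0 to 64 and each flip-flop needs one more edge, giving 64 + 3 = 67 clock pulses per cycle.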

B. LOW LEVEL UNIT IMPLEMENTATION

This unit includes 64 independent calculation sub-units which work concurrently on 16-bit input data. All sub-units have the same structure; the only difference between them is the coefficient values stored in their coefficient memories. As the counting unit counts from 0 to 63, each sub-unit multiplies the corresponding coefficient by the corresponding 16-bit data value from the calculation register file. On each iteration, the current product is added to the sum of all the previous products. When the counting unit reaches 64, the sum of the products of the 64 data values and the 64 coefficients is available in a register in every sub-unit. Since the 64 sub-units work concurrently, these 64 stored values are the result of applying the DCT to the 64 data values held in the calculation register file, that is, the final values of the 8x8 matrix after the DCT operations. The registers that contain the final data in the sub-units are connected one-to-one to the 64 registers of the writing register file. On the next rising edge of the clock, the data are copied into the corresponding registers of the writing register file.

The coefficient memory in each sub-unit contains 64 signed values which have been calculated by a MATLAB function. The values for each sub-unit depend on the sub-unit's coordinates in the 8x8 matrix of sub-units, according to the DCT formula.

V. SIMULATION AND VERIFICATION

The system has been modeled in VHDL with two separate modules for the high level unit and the low level unit, which are integrated into a larger module called the DCT processor that contains the high level unit and the 64 sub-units of the low level unit. The whole model has been simulated in ModelSim SE 6.0 for verification. In addition, a DCT function has been written in MATLAB 7.8 to apply the mathematical DCT to the same data in order to confirm the results. The results of the simulated system match the MATLAB results for the same data, apart from a small, negligible error. By comparing the results of the DCT processor with those of the MATLAB function, the design has been verified.

VI. SYSTEM SYNTHESIS

After verification, the whole system has been synthesized on FPGA using Xilinx ISE. The device properties are presented in Table I.

TABLE I. Device properties

Brand     Family     Device        Package
Xilinx    Virtex5    xc5vlx155t    2ff1738


Table II shows the important timing information.

TABLE II. Timing Information

Minimum Period    Maximum Frequency    Setup Time    Hold Time
6.630ns           150.824MHz           1.863ns       13.591ns

After synthesis, all the components can be extracted from

the synthesis report. Tables III and IV show the

components of high level unit and low level unit

individually.

TABLE III. Components in High Level unit

Comp    Type        #      Reference
Cntr    7b up       1      Counter
Cntr    2b up       2      Start Counter, End Counter
Reg     16b         131    Rd Reg File, Calc Reg File, W, H, W Pos
Reg     18b         64     Wr Reg File
Reg     1b          4      FF1, FF2, FF3, Done
Acc     16b up      1      H Pos
Mult    17x16b      1      Src Mem Addr
Mult    18x18b      1      Src Mem Addr
Add     16b         1      W Pos
Add     16b cout    1      Src Mem Addr
Add     17b cout    1      Dest Mem Addr
Add     31b         1      Dest Mem Addr
Add     32b         5      Dest Mem Addr, Src Mem Addr
Sub     16b         2      H Pos, W Pos
Sub     17b         1      Dest Mem Addr
Sub     31b         1      Dest Mem Addr
Sub     32b         2      Dest Mem Addr
Comp    16b         2      Height Pos, Width Pos
Mux     16b 64-1    1      Data
Mux     18b 64-1    1      Dest mem data out

TABLE IV. Components in Low Level unit

Comp    Type       #     Reference
ROM     64x2048    1     Rom
Acc     18b up     64    Sum Reg
Reg     18b        64    Data Reg
Mult    17x17      64    Data Reg Input

In addition, the utilization percentage of logic part is 2%,

memory is 3% and DSP slices are 49%.

VII. SYSTEM ANALYSIS

This method for the architecture and implementation of a DCT processor is highly time-efficient because of its parallelism and its pipelining of reading from memory, calculation and writing to memory. Consider the same DCT processor with a single calculation unit (no parallelism) and without pipelining. It would take 64 clock pulses to read an 8x8 matrix of data from memory, 64×64=4096 clock pulses to calculate the results of applying the DCT to the 8x8 matrix, and 64 clock pulses to write the 8x8 matrix of final data to memory. This means it would take 4224 clock pulses for every 8x8 matrix of the image, or 66n clock pulses for an image with n pixels. For the system described in this paper, on the other hand, each cycle takes 67 clock pulses. Applying the DCT to an image containing n pixels takes n/64 + 2 cycles, because each cycle processes one 8x8 matrix and 2 more cycles are needed to calculate the last 8x8 matrix and write it to the destination memory. Since each cycle takes 67 clock pulses, the whole image takes 67 × (n/64 + 2) clock pulses, in other words approximately 1.05n + 134 clock pulses for an image with n pixels. For large images the system processes approximately 1 pixel per clock pulse.

Thus, this DCT processor, which has been synthesized on a Virtex5 FPGA and runs at 150 MHz, can work in real time on NTSC (30 fps) video with a frame size of 4.7 megapixels.
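The clock counts in this analysis can be reproduced with a few lines of Python. The sketch assumes that n is a multiple of 64 and uses a rounded 150 MHz clock; the function names are hypothetical.

```python
def clocks_sequential(n_pixels):
    """One calculation unit, no pipelining: 64 + 64*64 + 64 clocks per 8x8 matrix."""
    return (n_pixels // 64) * (64 + 64 * 64 + 64)

def clocks_proposed(n_pixels):
    """Proposed design: 67 clocks per cycle and n/64 + 2 cycles per image."""
    return 67 * (n_pixels // 64 + 2)

def max_pixels_per_frame(clock_hz, fps):
    """Largest frame, in whole 8x8 matrices, that fits in one frame period."""
    clocks_per_frame = clock_hz // fps
    blocks = clocks_per_frame // 67 - 2
    return blocks * 64

n = 640 * 480
print(clocks_sequential(n), clocks_proposed(n))
print(max_pixels_per_frame(150_000_000, 30))
```

At 150 MHz and 30 frames per second there are 5,000,000 clock pulses per frame, which in this model is enough for about 4.78 million pixels, in line with the 4.7 megapixel figure quoted above.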

VIII. CONCLUSION

In this paper we have proposed a new high-throughput design for the discrete cosine transform which can be used in real-time systems. Calculating 64 different elements in parallel and pipelining the reading, calculation and writing of data greatly speeds up applying the DCT to an image. In addition, precalculating the coefficients for each calculation unit and storing them in a ROM saves a lot of time and hardware complexity. Implementing this design on an ASIC would decrease delays and increase the system frequency, so that it could work on larger images and higher-quality video.

This system could, for instance, function in a digital camera as a co-processor in charge of compression, reducing the burden on the main processor.

This system can also be used as an IDCT processor just by adding another ROM for the new coefficients, because the algorithm is exactly the same as for the DCT. Furthermore, by adding a quantizer, encoder, decoder and inverse quantizer to the DCT/IDCT processor, a complete compressor/decompressor system can be designed. Such a compressor/decompressor system can work as a co-processor in an image processing system.

REFERENCES

[1] Reza Ebrahimi Atani, Mehdi Baboli, Sattar Mirzakuchaki,

Shahabaddin Ebrahimi Atani, Babak Zamanlooy, "Design and

Implementation of a 118 MHz 2D DCT Processor", Industrial
