A Two Level Architecture for high throughput DCT-Processor and Implementing
Iran University of Science and Technology
Iran University of Science and Technology
ABSTRACT- Frequency analysis using the discrete cosine
transform (DCT) is used in a large variety of algorithms, such
as image processing algorithms. This paper proposes a new
high-throughput architecture for a DCT processor. The
system has a two-level architecture that uses parallelism
and pipelining, and it has been synthesized on a Xilinx
Virtex-5 FPGA. Synthesis results show that the system works
at 150 MHz. Applying the DCT to each 8x8 matrix of the image
takes 67 clock pulses. In other words, applying the DCT to
each pixel takes approximately one clock pulse.
Nowadays, with high demand for video and image data
transmission and storage, compression is a must, both to
achieve lower transmission time for high-quality images and
to use less storage space. The DCT plays a key role in
many compression standards, such as JPEG for still
image compression, ITU H.261 and ITU H.263 for
teleconferencing, and ISO MPEG-1 and MPEG-2 for moving
pictures and home video.
Image compression using the 2D-DCT includes several stages:
transform, quantization and coding. 2D-DCT is the most
computationally intensive phase of the encoding process
and accelerating it would dramatically reduce the whole
Several algorithms and implementation methods have been
proposed for the DCT, ranging from implementations in DSPs
to hardware implementations in ASICs. The best
implementation option depends on the final objective and
the application. Generally, when speed is the priority, a
hardware implementation is the best option.
Many implementation methods for 2D-DCT algorithms have
been proposed to reduce computational complexity and
thereby increase operational speed and throughput. The
2D-DCT can be broken down into two groups of N 1D-DCTs,
which is equivalent to processing a data block by rows
followed by columns, or vice versa; this is called
row/column decomposition.
Algorithms and architectures for the 2D DCT can be
divided into two categories:
• row/column decomposition methods
• Non row/column decomposition methods
The DCT has been implemented in different architectures,
of which two are most common: systolic architecture (SA)
and distributed arithmetic (DA).
FPGAs facilitate accessibility and configurability in
implementing hardware on a chip. The FPGA chip can be
placed in an actual circuit to be evaluated in real situations.
For the DCT, different architectures have been
presented so far.
Some architectures are systolic such as  which is a
reconfigurable 2D-DCT architecture and  which is
array-based architecture with high scalability and also .
Residue Number System (RNS) is used as a Row/column
transformation algorithm on two dimensions using 1D-
DCT is implemented on FPGA in . Several architectures
proposed distributed arithmetic (DA) for example 
which represents a fully parallel architecture based on Chen
et al’s method or in  group distributed arithmetic
(GDA) combines the good features of cyclic convolution
and DA computation using shared ROM modules, barrel
shifters, and accumulators. In  a compressed distributed
arithmetic architecture for 2D 8x8 DCT is presented using
distributed arithmetic 1D-DCT architecture and in 
they combine the methods of the look-up table and row-
column overlapped operations using look-up tables and
shift registers to avoid the transpose operation. That
design uses neither multipliers nor transposition RAMs and
can therefore achieve high speed with low latency for
real-time applications.  uses a recursive algorithm, and
 and  are based on a frame-recursive approach
using two 1-D DCT arrays.  is based on partial sums
and  is based on the Lee algorithm with a combined pipeline.
There are many other architectures, such as  and .
This paper proposes a new architecture that is efficient in
speed because of its parallelism and pipelining. Section 2
covers the discrete cosine transform and its aspects. The
proposed architecture is presented in full in Section 3:
the two-level architecture is defined, each level is
described individually, and the system function is then
shown as a state machine. Section 4 shows how this
architecture is implemented in hardware; as in the
architecture section, the implementation is described for
each level individually. Section 5 covers simulation of the
system, which uses sample data to verify the results. Section
6 covers synthesis of the design on an FPGA and the resulting
reports. Finally, Section 7 analyzes the system and how fast
it is.
2010 International Conference on Reconfigurable Computing
978-0-7695-4314-7/10 $26.00 © 2010 IEEE
II. DISCRETE COSINE TRANSFORM
The DCT is one of the major transforms used in many
compression algorithms. It is a frequency transform that is
closely related to the real part of the discrete
Fourier transform (DFT). The 2D-DCT is very common in
image processing, especially image compression. It
operates on a two-dimensional matrix of natural-number
values. The image matrix is divided into smaller image
blocks (4x4, 8x8, 16x16, ...), and each block is
transformed from the spatial domain to the frequency
domain. The DCT decomposes the signal into spatial
frequency components. The lower-frequency components appear
toward the first row/first column of the DCT matrix, and
the higher-frequency components toward the last row/last
column. Multiplying the values of a small image block by
specific cosine coefficients and summing the products
yields a new value for each pixel position.
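The 2D-DCT parallelizes very well. It is common to
use 8x8 matrices to simplify the calculations. The DCT is
applied to each 8x8 matrix of the image, and the formula
is:
$$F(u,v) = \frac{2}{N}\, C(u)\, C(v) \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} f(i,j) \cos\left[\frac{(2i+1)u\pi}{2N}\right] \cos\left[\frac{(2j+1)v\pi}{2N}\right]$$
where $N = 8$, $C(x) = 1/\sqrt{2}$ for $x = 0$ and $C(x) = 1$ otherwise.
F(u,v): values in transform domain
f(i,j): values in pixel domain
i, j: spatial coordinates in the pixel domain
u, v: coordinates in transform domain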
Unlike many other architectures, this paper proposes a
non-row/column decomposition method that implements
the 2D-DCT directly. Each 8x8 matrix has 64 elements,
and for each element (a,b) the procedure that yields
F(a,b) must be applied once. The procedure for each
element is independent of the other 63 elements, and the
64 cosine coefficients, C(a) and C(b) for each element
(a,b) are constants that do not depend on the values of f.
The calculation of F(a,b) can therefore be carried out for
every element of the 8x8 matrix simultaneously. In other
words, there can be 64 parallel procedures for the 64
elements of the 8x8 matrix. To calculate the 64 values of
F(a,b) simultaneously, 64 similar calculation units are
needed, each with its own coefficients, which depend on the
unit's values of a and b. When one operational
cycle finishes, the final values of F(a,b) are ready in all
64 calculation units. To apply the DCT to the whole
image, the image must be divided into non-overlapping 8x8
matrices, and the DCT operations must then be applied to
each 8x8 matrix individually.
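The independence of the 64 outputs can be illustrated with a small software sketch. It is not the hardware design itself: it assumes the standard 2D-DCT definition with the 2/N normalisation and C(0) = 1/sqrt(2), and the function names are hypothetical.

```python
import math

N = 8  # block size used in the paper


def c(k):
    """Normalisation factor C(k) of the assumed DCT definition."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct_element(block, a, b):
    """Compute F(a,b) for one 8x8 block; it needs no other output element."""
    total = 0.0
    for i in range(N):
        for j in range(N):
            total += (block[i][j]
                      * math.cos((2 * i + 1) * a * math.pi / (2 * N))
                      * math.cos((2 * j + 1) * b * math.pi / (2 * N)))
    return (2 / N) * c(a) * c(b) * total


def dct_block(block):
    """All 64 outputs; in hardware these 64 calls would run in parallel."""
    return [[dct_element(block, a, b) for b in range(N)] for a in range(N)]
```

Because `dct_element` reads only the input block and its own (a,b), the 64 calls can be evaluated in any order or all at once, which is what the 64 calculation units exploit.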
The system which implements these operations must be in
charge of all these tasks:
• Reading 8x8 matrix values from the related addresses of
the source memory
• Controlling 64 calculation units for applying DCT on the
8x8 matrix
• Writing calculated data F(a,b) to the related addresses of
the destination memory
• Shifting the 8x8 window to the next part of the image when
an operational cycle is finished
• Declaring the end of the whole operation (applying DCT to
the whole image)
A DCT processor, a source memory that stores the original
image values, and a destination memory in which the
calculated data will be stored are needed to design such a
system.
A. 2-LEVEL ARCHITECTURE
The DCT processor is designed in two levels. The high
level unit is in charge of reading data from the source
memory, transferring data to the low level unit, reading
calculated data from the low level unit and writing them to
the destination memory. The low level unit, on the other
hand, is only in charge of applying the DCT calculations to
the data supplied by the high level unit.
B. HIGH LEVEL UNIT ARCHITECTURE
The high level unit includes several sub-units.
1) A counting unit, a sequence counter used for
defining the state of the system and driving signals based
on that state.
2) A reading register file, which includes 64 registers for
data read from the source memory.
3) A calculation register file, which includes 64 registers
for data used by the low level unit.
4) A writing register file, which includes 64 registers for
data that must be written to the destination memory.
5) An image size registration unit, which includes 2
registers that save the height and width of the image.
6) A current coordinate of 8x8 matrix registration unit,
which includes 2 counters that hold the coordinates of the
first element of the current 8x8 matrix in the image.
7) An address generator unit, which uses the values of the
counting unit, the image size registration unit and the
current coordinate of 8x8 matrix registration unit to
generate these addresses:
• Source memory address for reading data
• Destination memory address for writing data
• Current addresses of the 3 register files
• Current coefficient address in the low level unit
8) A control unit, which generates control signals for the
memories, register files, high level sub-units and low level
sub-units. Control signals include enable and reset signals.
Fig1: high level unit block diagram
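One possible way to combine the counter value, the image width and the window coordinate into a memory address is sketched below. The paper does not give the address formula, so the row-major image layout, the left-to-right then top-to-bottom window order, and the names `source_address` and `next_window` are all assumptions made for illustration.

```python
def source_address(count, x0, y0, width):
    """Address of element `count` (0..63) of the 8x8 window whose first
    element is at column x0, row y0 (assumed row-major image layout)."""
    row, col = divmod(count, 8)
    return (y0 + row) * width + (x0 + col)


def next_window(x0, y0, width, height):
    """Shift the 8x8 window; returns the new (x0, y0) and a done flag."""
    x0 += 8
    if x0 >= width:          # end of a row of blocks: wrap to the next one
        x0, y0 = 0, y0 + 8
    done = y0 >= height      # no block rows left in the image
    return x0, y0, done
```

Under the same assumptions, the destination address could use the same function, since each result F(a,b) is written back to the position of the block it came from.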
C. LOW LEVEL UNIT ARCHITECTURE
The low level unit includes 64 similar sub-units which differ
only in their coefficients. Each sub-unit includes these
parts:
1) A coefficient memory with 64 rows; each row holds the
coefficient value related to one of the
combinations of i and j in the DCT formula. In order to
simplify the calculations, every coefficient has been
multiplied by 2^16 = 65536.
2) A multiplication unit, which includes a fast clock-free
(combinational) multiplier that multiplies input data from
the high level unit by the related coefficient from the
coefficient memory.
3) An accumulation unit, which holds the value 0 at first
and on each clock pulse adds the input value to its current
result. Therefore, after 64 clock pulses it contains
the final value of F(a,b). The lowest 16 bits must then be
discarded in order to divide the result by 2^16 = 65536.
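The multiply-accumulate behaviour of one sub-unit can be modelled in a few lines of integer arithmetic. This is a behavioural sketch only: rounding the scaled coefficients to the nearest integer is an assumption, and `to_fixed` and `mac_subunit` are hypothetical names.

```python
SHIFT = 16  # coefficients are stored multiplied by 2^16 = 65536


def to_fixed(coeff):
    """Scale a real coefficient to a signed integer (rounding is assumed)."""
    return round(coeff * (1 << SHIFT))


def mac_subunit(samples, fixed_coeffs):
    """Accumulate 64 products, one per clock pulse, then drop the low 16 bits."""
    acc = 0  # the accumulator starts at 0
    for x, k in zip(samples, fixed_coeffs):
        acc += x * k      # multiplier output is added to the running sum
    return acc >> SHIFT   # arithmetic right shift: floor division by 2^16
```

For example, 64 samples of value 100 with a constant coefficient of 0.125 accumulate to 64 × 100 × 8192, and the shift returns 800, the same as the real-valued product sum.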
D. SYSTEM STATE MACHINE
In the start state, the system receives the width and height
of the image via two dedicated input pins and stores these
values in the image size registration unit. The system also
resets all other registers in this state. An input signal
decides whether the system is in the configuration state
(start state) or the process state (which starts with
state 1); changing this signal moves the system to state 1.
Fig2: low level unit block diagram
Fig3: system state machine
In state 1, the counting unit counts from 0 to 63 and stops
when it reaches 64. During this counting period,
the following actions take place:
• 64 new data values are read from the source memory and
stored in the reading register file.
• The 64 data values read during the previous
cycle are sent to the 64 low level sub-units to be
used in the DCT calculations. The coefficient memory in each
sub-unit is addressed by this count from 0 to 63.
• The 64 data values calculated during the
previous cycle are written to the destination memory.
When the counting unit stops, the system goes from state 1
to state 2.
In state 2, on the first rising clock edge, the data in the
reading register file are moved to the calculation register
file, all at once and register by register. The final value
in each calculation sub-unit is also latched in that
sub-unit's output register.
The system goes from state 2 to state 3 immediately after
this clock edge.
In state 3, on the first rising clock edge, the 64
calculated final values in all sub-units are moved to the
writing register file, all at once and register by register.
The system goes from state 3 to state 4 immediately after
this clock edge.
In state 4, one of two actions can happen.
• If the last part of the calculated data in the writing
register file was written to the destination memory during
state 1 of the current cycle, the system goes to the final
state on the first rising clock edge.
• Otherwise, the counting unit is reset and the system
starts a new cycle by going to state 1.
In the final state, the counting unit stops and an output
signal changes in order to declare the end of the process.
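The state sequence and its three-stage overlap (read, calculate, write) can be checked with a small behavioural model. It is a sketch under stated assumptions: state 1 is taken to last 64 clock pulses and states 2 to 4 one pulse each, which matches the 67 pulses per cycle reported in the paper, and `simulate` is a hypothetical name.

```python
CLOCKS_STATE_1 = 64  # the counting unit runs from 0 to 63


def simulate(num_blocks):
    """Step through the cycles; return (clock pulses, order of written blocks)."""
    clocks = 0
    read_rf = calc_rf = write_rf = None  # block index held by each register file
    next_block = 0
    written = []
    while True:
        # State 1: reading, calculating and writing overlap.
        result = calc_rf                   # sub-units accumulate this block
        if write_rf is not None:
            written.append(write_rf)       # previous results go to memory
        read_rf = next_block if next_block < num_blocks else None
        if read_rf is not None:
            next_block += 1
        clocks += CLOCKS_STATE_1
        # State 2: reading register file -> calculation register file.
        calc_rf = read_rf
        clocks += 1
        # State 3: sub-unit results -> writing register file.
        write_rf = result
        clocks += 1
        # State 4: finish if the last block was written in this cycle.
        clocks += 1
        if len(written) == num_blocks:
            return clocks, written
```

For four blocks the model needs six cycles, two more than the number of blocks, because the first block is read in cycle 1, calculated in cycle 2 and written in cycle 3.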
Designing this system in two levels facilitates
implementation, simulation, analysis and testing in two
separate parts which are completely independent of each
other. The high level unit acts as a commander: it is in
charge of controlling the system sequence, addressing
the different parts, deciding whether to continue with or
change the current state of the system, and controlling
every single element in both the high level unit and the low
level unit. The low level unit acts as a calculation core:
it receives commands, data and the required addresses from
the high level unit and performs the DCT calculations, that
is, it computes the sums of products that make up the result
of applying the DCT to the data.
A. HIGH LEVEL UNIT IMPLEMENTATION
In the start state, the 7-bit counter and the 3 flip-flops in
the counting unit are reset. After the transition to state 1,
the 7-bit counter starts counting up. When it reaches 64 (the
7th bit becomes '1'), it stops counting and, at the same
time, the enable signal of the first flip-flop changes to
'1'. The first flip-flop goes to '1' on the next rising
clock edge. This change returns the enable signal of the
first flip-flop to '0' and simultaneously sets the enable
signal of the second flip-flop to '1'. The second flip-flop
then goes to '1' on the next rising clock edge, which
returns its enable signal to '0' and sets the enable signal
of the third flip-flop to '1'. The third flip-flop goes to
'1' on the following rising clock edge, and this change
resets the 7-bit counter. The '0' value of the counter
resets the first flip-flop, the '0' value of the first
flip-flop resets the second, and the '0' value of the second
resets the third. The '0' value of the third flip-flop
releases the 7-bit counter from reset. In other
words, the 7-bit counter can start counting again
right after the third flip-flop changes from '1' to '0'.
Only if the current cycle is the last cycle of the system
does the whole counting unit stop instead of counting again.
In this case the system goes to the final state and an
output signal declares the end of the process.
B. LOW LEVEL UNIT IMPLEMENTATION
This unit includes 64 independent calculation sub-units
which work concurrently on 16-bit input data. All sub-units
have the same structure; the only difference between them is
the coefficient values stored in their coefficient
memories. As the counting unit counts from 0 to 63, the
related coefficient in each sub-unit is multiplied by the
related 16-bit data value from the calculation register
file. On each iteration, the current product is added to the
sum of all the previous products. When the counting unit
reaches 64, the sum of the products of the 64 data values
and the 64 coefficients is available in a register in every
sub-unit. These 64 values are the final values of the 8x8
matrix after the DCT operations. In other words, since the
64 sub-units calculate the final data concurrently, when the
counting unit reaches 64 the 64 values stored in the 64
calculation sub-units are the result of the DCT on the 64
data values stored in the calculation register file. The
registers that hold the final data in the sub-units are
connected one-to-one to the 64 registers in the writing
register file. On the next rising clock edge, the data are
copied into the related registers in the writing register
file.
The coefficient memory in each sub-unit holds 64 signed
values which have been calculated by a MATLAB function. The
values for each sub-unit depend on the sub-unit's coordinate
in the 8x8 matrix of sub-units, according to the DCT
formula.
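The paper generates the coefficient memories with a MATLAB function that is not listed. A comparable table can be sketched in Python as follows. The sketch assumes the standard 2D-DCT definition, assumes that the normalisation factor (2/N)·C(a)·C(b) is folded into each stored value, and assumes the row order i*8 + j; `coefficient_rom` is a hypothetical name.

```python
import math


def c(k):
    """Normalisation factor C(k) of the assumed DCT definition."""
    return 1 / math.sqrt(2) if k == 0 else 1.0


def coefficient_rom(a, b, n=8, scale=1 << 16):
    """64 signed integer coefficients for the sub-unit at coordinate (a,b)."""
    norm = (2 / n) * c(a) * c(b)  # assumed to be folded into the table
    return [
        round(scale * norm
              * math.cos((2 * i + 1) * a * math.pi / (2 * n))
              * math.cos((2 * j + 1) * b * math.pi / (2 * n)))
        for i in range(n) for j in range(n)  # assumed row order: i*8 + j
    ]
```

Under these assumptions the sub-unit at (0,0) holds the same value, 8192, in every row, while every other sub-unit holds a mix of positive and negative values, which is why the memories store signed numbers.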
V. SIMULATION AND VERIFICATION
The system has been implemented as a VHDL model with two
separate modules for the high level unit and the low level
unit, integrated into a larger module called the DCT
processor, which contains the high level unit and the 64
sub-units of the low level unit. The whole model has then
been simulated in ModelSim SE 6.0 for verification. In
addition, a DCT function has been written in MATLAB 7.8 to
apply the mathematical DCT to the same data and confirm the
results. The results of the simulated system match the
MATLAB results for the same data, apart from a small,
negligible error. Comparing the results of the DCT processor
with those of the MATLAB function verifies the design.
After verification, the whole system has been synthesized
on an FPGA using Xilinx ISE. The device properties are
presented in Table I.
The important timing information is shown in Table II.
After synthesis, all the components can be extracted from
the synthesis report. Tables III and IV show the
components of the high level unit and the low level unit.
Count  Used in
131    Rd Reg File, Calc Reg File, W, H, W Pos
64     Wr Reg File
4      FF1, FF2, FF3,
1      H Pos
1      Src Mem Addr
1      Src Mem Addr
1      W Pos
1      Src Mem Addr
1      Dest Mem Addr
1      Dest Mem Addr
5      Dest Mem Addr, Src Mem Addr
2      H Pos, W Pos
1      Dest Mem Addr
1      Dest Mem Addr
2      Dest Mem Addr
2      Height Pos,
1      Dest mem data
Table III: Components in High Level unit
Data Reg Input
Table IV: Components in Low Level unit
In addition, the utilization percentage of logic part is 2%,
memory is 3% and DSP slices are 49%.
This architecture and implementation of a DCT processor is
very time-efficient because of its parallelism and its
pipelining of memory reads, calculation and memory writes.
Consider the same DCT processor with a single calculation
unit (no parallelism) and without pipelining: it would take
64 clock pulses to read an 8x8 matrix of data from memory,
64×64=4096 clock pulses to calculate the DCT of the 8x8
matrix, and 64 clock pulses to write the 8x8 matrix of final
data to memory. That is 4224 clock pulses for every 8x8
matrix of the image, or 66n clock pulses for an image of n
pixels. In the system described in this paper, by contrast,
each cycle takes 67 clock pulses. Applying the DCT to an
image of n pixels takes n/64 + 2 cycles, because each cycle
processes one 8x8 matrix and 2 more cycles are needed to
calculate the last 8x8 matrix and write it to the
destination memory. With 67 clock pulses per cycle, the
whole image takes 67 × (n/64 + 2) clock pulses, in other
words about 1.05n + 134 clock pulses for an image of n
pixels. For large images the system processes approximately
1 pixel per clock pulse.
Thus, this DCT processor, synthesized on a Virtex-5 FPGA and
running at 150 MHz, can work in real time on NTSC (30 fps)
video with a frame size of 4.7 megapixels.
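The clock counts above can be reproduced with a short calculation. The sketch assumes that the pixel count n is a multiple of 64; the function names are hypothetical.

```python
def serial_clocks(n_pixels):
    """Single calculation unit, no pipelining: read + calculate + write."""
    blocks = n_pixels // 64
    return blocks * (64 + 64 * 64 + 64)  # 4224 per block, i.e. 66 per pixel


def pipelined_clocks(n_pixels):
    """Proposed system: 67 pulses per cycle, n/64 + 2 cycles per image."""
    return 67 * (n_pixels // 64 + 2)


def max_pixels_per_frame(clock_hz=150e6, fps=30):
    """Largest frame that fits into one frame period at the given clock."""
    budget = clock_hz / fps               # clock pulses available per frame
    return int((budget / 67 - 2) * 64)
```

With the default arguments, `max_pixels_per_frame()` gives a little under 4.8 million pixels, consistent with the 4.7 megapixel figure quoted for 30 fps at 150 MHz.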
In this paper we have proposed a new high-throughput
design for the discrete cosine transform which can be used
in real-time systems. Using parallelism across 64
calculation units and a pipeline for reading,
calculating and writing data provides a dramatic
speed-up in applying the DCT to an image. In addition,
precalculating the coefficients for each calculation unit
and storing them in a ROM saves a lot of time and hardware
complexity. Implementing this design as an ASIC
would decrease delays and increase the system frequency.
It could therefore work on larger images and videos.
This system can, for instance, function in a digital camera
as a co-processor in charge of compression, reducing
the burden on the main processor.
The system can also be used as an IDCT processor simply by
adding another ROM for the new coefficients, because the
algorithm is exactly the same as for the DCT. Furthermore,
by adding a quantizer, encoder, decoder and inverse
quantizer to the DCT/IDCT processor, a complete
compressor/decompressor system can be designed.
Such a compressor/decompressor system can work as a
co-processor in an image processing system.
 Reza Ebrahimi Atani, Mehdi Baboli, Sattar Mirzakuchaki,
Shahabaddin Ebrahimi Atani, Babak Zamanlooy, "Design and
Implementation of a 118 MHz 2D DCT Processor", Industrial