Implementation of Three-Dimensional FPGA-Based FDTD Solvers: An
Architectural Overview
James P. Durbano
EM Photonics, Inc.
durbano@emphotonics.com
Fernando E. Ortiz
John R. Humphrey
Dennis W. Prather
University of Delaware
{ortiz, humphrey, dprather}@ee.udel.edu
Mark S. Mirotznik
The Catholic University of America
mirotznik@cua.edu
No longer relegated to radio frequency (RF) engineers,
antenna designers, and military applications, electromagnetic
analysis has become a key factor in many areas of advanced
technology. From 3 GHz PCs and wireless computer networks,
to PDAs with Internet capabilities and the seemingly ubiquitous
cell phone, it seems that electronic designs increasingly require
electromagnetic characterization. To facilitate such analysis,
numerical techniques have been developed that allow computers
to easily solve Maxwell’s equations.
Maxwell’s equations, which govern electromagnetic
propagation, are a system of coupled partial differential equations. As
such, they can be represented in difference form, thus allowing
their numerical solution. By implementing both the temporal
and spatial derivatives of Maxwell’s equations in difference
form, we arrive at one of the most common computational
electromagnetic algorithms, the Finite-Difference Time-Domain
(FDTD) method [1]. In this technique, the region of interest is
sampled to generate a grid of points, hereafter referred to as a
mesh. The discretized form of Maxwell’s equations is then
solved at each point in the mesh to determine the associated
electromagnetic fields.
Although FDTD methods are accurate and well defined,
current computer-system technology limits the speed at which
these operations can be performed. Run times on the order of
hours, weeks, months, or longer are common when solving
problems of realistic size. Some problems are even too large to
be effectively solved due to practical time and memory
constraints. The slow nature of the algorithm primarily results
from the nested for-loops that are required to iterate over the
three spatial dimensions and time.
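The cost structure described above can be sketched in software (a minimal Python illustration of the loop nesting, not the paper's implementation):

```python
def fdtd_node_updates(nx, ny, nz, timesteps):
    """Skeleton of the nested for-loops that dominate FDTD run time.

    The work per timestep grows with the product nx * ny * nz, so
    realistic meshes (millions of nodes) over thousands of timesteps
    quickly become prohibitively slow on a sequential processor.
    """
    updates = 0
    for _ in range(timesteps):      # time loop
        for i in range(nx):         # three nested spatial loops
            for j in range(ny):
                for k in range(nz):
                    # a real solver applies the difference-form
                    # field update at node (i, j, k) here
                    updates += 1
    return updates
```

The total node-update count, timesteps x nx x ny x nz, is the quantity that hardware parallelism attacks.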
To shorten the computational time, users acquire faster
computers, lease time on supercomputers, or build clusters of
computers to gain a parallel-processing speedup [2], [3]. These
solutions can be prohibitively expensive and are frequently
impractical. As a result, an approach that accelerates the FDTD
method in a relatively inexpensive and practical way is required.
To this end, researchers have proposed FDTD accelerators, i.e.,
special-purpose hardware implementations of the FDTD method,
to speed up the computations [4]-[8]. However, none have
yielded a practical implementation or a full three-dimensional
solver.
In this extended abstract, we present an architecture that
overcomes the previous limitations. We begin with a high-level
description of the computational flow of this architecture.
The computational datapath begins with the Counting and
Control Unit (CCU). In addition to containing global system
data, the CCU produces the coordinates and type (electric or
magnetic) of the next field to be computed. These coordinates
are then passed to the Data Dependence Unit (DDU). The DDU
is responsible for determining all values necessary to update the
individual field components at this node (i.e., which surrounding
field values are required).
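As a software sketch (a hypothetical model; the actual DDU is a hardware unit), the dependence computation for a z-directed electric field on a standard Yee grid looks like:

```python
def ez_dependencies(i, j, k):
    """Neighbouring H-field samples needed to update Ez(i, j, k).

    On a Yee grid, the discretized curl of H that drives the Ez update
    reads two Hy samples (for dHy/dx) and two Hx samples (for dHx/dy).
    """
    return {
        "Hy": [(i, j, k), (i - 1, j, k)],  # finite difference along x
        "Hx": [(i, j, k), (i, j - 1, k)],  # finite difference along y
    }
```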
The coordinates output by the DDU are then passed into a
RAM Address Decoder (RAD). This unit takes a given field
component (e.g., Ex(i,j,k)) and determines its location in
memory. By including multiple DDU and RAD units in the
design, several field components can be updated simultaneously.
Because this generates numerous read requests, the Memory
Switching Unit (MSU) was developed to coordinate all memory
transactions.
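The address computation itself can be as simple as a row-major linearization (a sketch assuming a flat layout; the paper does not specify the actual mapping):

```python
def ram_address(i, j, k, nx, ny):
    """Map a field component's (i, j, k) coordinates to a linear RAM
    address, assuming a row-major layout with i varying fastest."""
    return (k * ny + j) * nx + i
```

For the 43x43x43 benchmark mesh used later in the paper, such addresses would run from 0 to 43^3 - 1 = 79,506 per field array.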
The majority of the problem data are stored in three RAM
banks. Each RAM contains x, y, or z-directed fields and the
material type of each node (e.g., air, water, silicon). As data are
fetched from RAM, they are stored in register banks until all
necessary data have been retrieved and the system is ready to
update the field.
Before the field-update computation occurs, however, several
material coefficients must be determined. These coefficients are
used in the computation of the field-update equation and take
into account the material properties of the medium (e.g.,
permittivity, permeability, conductivity). To determine the
coefficients, the material types are passed to the Material
Lookup Table (MLUT). The MLUT reads in bit vectors
representing a given material and returns the various coefficients
corresponding to those materials.
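The idea can be sketched as a small table of precomputed update coefficients indexed by material id, using the standard Ca/Cb form of the lossy FDTD E-field update (the material values and timestep below are illustrative, not taken from the paper):

```python
EPS0 = 8.854e-12  # vacuum permittivity (F/m)

def update_coefficients(eps_r, sigma, dt):
    """Precompute the Ca/Cb pair of the standard lossy E-field update:
    E_new = Ca * E_old + Cb * curl(H)."""
    eps = eps_r * EPS0
    loss = sigma * dt / (2.0 * eps)
    ca = (1.0 - loss) / (1.0 + loss)
    cb = (dt / eps) / (1.0 + loss)
    return ca, cb

# Illustrative material table: id -> (relative permittivity, conductivity)
MATERIALS = {0: (1.0, 0.0),    # air
             1: (80.0, 0.0),   # water (lossless approximation)
             2: (11.7, 0.0)}   # silicon
MLUT = {mid: update_coefficients(er, sg, dt=1e-12)
        for mid, (er, sg) in MATERIALS.items()}
```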
The surrounding field values and material coefficients must
then be routed to the appropriate Computation Engine (CE),
which updates the given field component based on the
discretized forms of Maxwell’s equations. Several CEs are
included in the design, allowing the system to update multiple
field components in parallel. The updated values are then
passed back to the MSU for storage in RAM.
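A minimal software model of one CE, shown here for a z-directed electric field (the spatial step is assumed folded into Cb; the argument names are illustrative):

```python
def update_ez(ez_old, hy, hy_xm1, hx, hx_ym1, ca, cb):
    """One Computation Engine step: the discretized Ez update, given
    the four surrounding H samples gathered by the DDU/MSU and the
    Ca/Cb coefficients supplied by the MLUT."""
    curl_h = (hy - hy_xm1) - (hx - hx_ym1)
    return ca * ez_old + cb * curl_h
```

Because each such update is independent of the others within a half-timestep, several CEs can run side by side, which is exactly the parallelism the architecture exploits.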
In order to test our architectural ideas, a prototyping board
with a Xilinx Virtex-II 6000 FPGA, several RAM banks, and a
PCI interface was acquired. The user describes the design to
analyze by means of a CAD front end developed by EM
Photonics, Inc. The front-end software then sends the
appropriate data, such as the mesh size and the number of
timesteps to execute, to the hardware via the PCI bus. The
FDTD accelerator proceeds to update the fields, periodically
sending the results back to the host computer for post-processing
and visualization.
Proceedings of the 11th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’03)
1082-3409/03 $17.00 © 2003 IEEE
The benchmark problem was an air-filled cavity surrounded
by perfect electric conductor (PEC) walls. The cavity was
excited by a z-directed, sinusoidal point source (of unity
amplitude) located at the center of the resonator. The
simulation was run for 5,000 timesteps with a mesh size of
43x43x43. In order to analyze the error, a point detector was
placed in the corner of the cavity.
The hardware results were then compared with the results
obtained from C and MATLAB 6.1 programs solving the same
problem on 1.13 and 2.0 GHz PCs. A C implementation was
chosen to perform the speed analysis, whereas a MATLAB
implementation was chosen for error analysis. This allowed us
to optimize the C program for speed and the MATLAB program
for error measurements.
The average absolute error was on the order of 10⁻⁷, with the
average percentage error around 0.13%. This numerical error is
a result of two primary factors. First, MATLAB is a double-
precision (64-bit) language, whereas the hardware
implementation supports only single-precision (32-bit)
arithmetic units. If precision is of the utmost importance,
however, double-precision arithmetic units can easily be
implemented. The second factor that contributes to the error is
the computation of the source field. For simplicity, the sine
function was implemented as a lookup table (LUT) with only
16K entries. In future implementations, more entries will be
included in the LUT to increase resolution or other techniques,
such as the CORDIC algorithm, will be used to generate the
source [9].
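A 16K-entry sine LUT of the kind described can be sketched as follows (the phase handling is an assumption; the hardware's exact indexing scheme is not given in the paper):

```python
import math

LUT_SIZE = 16384  # 16K entries, as in the prototype
SINE_LUT = [math.sin(2.0 * math.pi * n / LUT_SIZE) for n in range(LUT_SIZE)]

def lut_sine(phase_turns):
    """Return sin(2*pi*phase) by rounding the phase (in turns, 0..1)
    to the nearest table entry; the quantization step of 1/16384 of a
    period is the source-generation error discussed above."""
    index = int(round(phase_turns * LUT_SIZE)) % LUT_SIZE
    return SINE_LUT[index]
```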
In terms of processing power, the 14 MHz hardware had an
average throughput of approximately 150,000 nodes per second
(150 Knps¹). This is 5.66 times slower than the processing
power of C running on a 1.13 GHz PC (849 Knps) and 8.55
times slower than C on a 2.0 GHz machine (1,282 Knps). Note
that although the PC is clocked over 142 times faster than the
hardware, the hardware is less than 9 times slower.
Certainly a design that is slower than existing solutions is not
desired! However, this was a proof-of-concept design that
served not only to implement our basic architectural ideas, but
also to achieve the first three-dimensional FDTD accelerator
implementation in physical hardware. As such, these results
were obtained on a preliminary, non-optimized design. A
detailed analysis indicates that overlapping the computations of
different nodes will result in a threefold increase in speed. Also,
because the throughput of our accelerator increases linearly with
clock frequency, raising the clock from 14 MHz to 100 MHz, a
common FPGA system speed, makes a sevenfold increase in
throughput possible. Although the current design
is almost nine times slower than a 2.0 GHz PC running
optimized C code, after the above design modifications are
made, the hardware throughput will be almost two and a half
times that of a 2.0 GHz PC. Note that modifying the design to
work at 100 MHz is not an unreasonable goal, as the most
complex units in the design are the floating-point arithmetic
units, which are already capable of speeds in excess of 100
MHz. The overall clock frequency had to be reduced to
accommodate some non-optimized units related to data routing.
These units can be pipelined, thus permitting increased clock
frequencies.
¹ Knps = thousands of nodes processed per second, where the
time to process a node is the time to update all of the fields at
that node.
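The projection above reduces to simple arithmetic (a check of the paper's numbers, assuming the stated linear clock scaling and threefold overlap gain):

```python
def projected_throughput_knps(base_knps=150.0, overlap_speedup=3.0,
                              base_clock_mhz=14.0, target_clock_mhz=100.0):
    """Throughput after overlapping node computations (3x) and scaling
    the clock from 14 MHz to 100 MHz (linear in clock frequency)."""
    return base_knps * overlap_speedup * (target_clock_mhz / base_clock_mhz)

projected = projected_throughput_knps()   # roughly 3,214 Knps
speedup_vs_pc = projected / 1282.0        # roughly 2.5x the 2.0 GHz PC
```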
Finally, it should be noted that these results are from a
commercial, off-the-shelf prototyping board. Because the
design had to be mapped into this general-purpose board, several
architectural optimizations (such as increased parallelism) could
not be implemented. Initial calculations indicate that a
customized board can provide speed increases of at least two
orders of magnitude through the addition of multiple RAM
banks, increased clock frequencies, and a more efficient use of
memory.
To the best of our knowledge, this work represents the first
successful three-dimensional FDTD algorithm in hardware. We
are currently working on an optimized version of this
architecture and a custom printed circuit board to support our
design. This will provide increased computational speeds,
which we predict will easily surpass desktop computers, and will
ultimately rival the performance of computer clusters.
References
[1] K. S. Yee, "Numerical solution of initial boundary value
problems involving Maxwell's equations in isotropic
media," IEEE Transactions on Antennas and Propagation,
vol. 14, pp. 302-307, 1966.
[2] H. Jordan, S. Bokhari, S. Staker, J. Sauer, M. ElHelbawy,
and M. Piket-May, "Experience with ADI-FDTD
techniques on the Cray MTA supercomputer," in Proc. of
the SPIE - Commercial Applications for High-Performance
Computing, vol. 4528, pp. 68-76, 2001.
[3] G. A. Schiavone, I. Codreanu, R. Palaniappan, and P.
Wahid, "FDTD speedups obtained in distributed computing
on a Linux workstation cluster," IEEE Antennas and
Propagation Society, AP-S International Symposium
(Digest), vol. 3, pp. 1336-1339, 2000.
[4] J. R. Marek, M. A. Mehalic, and A. J. Terzuoli, Jr., "A
dedicated VLSI architecture for Finite-Difference Time
Domain calculations," in Proc. of The 8th Annual Review of
Progress in Applied Computational Electromagnetics,
Naval Postgraduate School, Monterey, CA, 1992.
[5] R. N. Schneider, L. E. Turner, and M. M. Okoniewski,
"Application of FPGA technology to accelerate the Finite-
Difference Time-Domain (FDTD) method," in Proc. of The
Tenth ACM International Symposium on Field-Programmable
Gate Arrays, Monterey, CA, 2002.
[6] P. Placidi, L. Verducci, G. Matrella, L. Roselli, and P.
Ciampolini, "A custom VLSI architecture for the solution
of FDTD equations," IEICE Transactions on Electronics,
vol. E85-C, pp. 572-577, 2002.
[7] L. Verducci, P. Placidi, G. Matrella, L. Roselli, F.
Alimenti, P. Ciampolini, and A. Scorzoni, "A feasibility
study about a custom hardware implementation of the FDTD
algorithm," in Proc. of The 27th General Assembly of the
URSI, Maastricht, Netherlands, 2002.
[8] J. P. Durbano, "Hardware implementation of a 1-
dimensional Finite-Difference Time-Domain algorithm for
the analysis of electromagnetic propagation," M.E.E.
Thesis, Department of Electrical and Computer Engineering,
University of Delaware, Newark, USA, 2002.
[9] R. Andraka, "A survey of CORDIC algorithms for FPGA
based computers," in Proc. of The ACM/SIGDA
International Symposium on Field Programmable Gate
Arrays, Monterey, CA, USA, 1998.