Maxwell - a 64 FPGA Supercomputer
ABSTRACT We present the initial results from the FHPCA Supercomputer project at the University of Edinburgh. The project has successfully built a general-purpose 64 FPGA computer and ported to it three demonstration applications from the oil, medical and finance sectors. This paper describes in brief the machine itself - Maxwell - its hardware and software environment and presents very early benchmark results from runs of the demonstrators.
Abstract—We describe the FPGA-based supercomputer
Maxwell built by the FPGA High-Performance Computing
Alliance at the University of Edinburgh. Winner of the silver
medal in the BT Flagship Award for Innovation at the 2007
British Computer Society Awards, Maxwell is a
general-purpose 64 FPGA computer designed for
high-performance parallel computing. This paper describes the
machine itself, its hardware and software environment and
presents benchmark results from runs of three commercial
demonstration applications from the oil, medical and finance
sectors.
Index Terms—FPGAs, High-Performance Reconfigurable
Computing
I. INTRODUCTION
Field-programmable gate arrays (FPGAs) are not new, at
least not by today’s measure of technology novelty. Ross
Freeman, co-founder of Xilinx Inc., invented this new form
of programmable logic device in 1984. Although FPGAs rapidly
became a key technology in embedded systems, it was not until
the beginning of the 21st century that they came to be viewed
as something more – a potential contender for the place of
numerical workhorse in high-performance computing (HPC)
systems. In 2002
Compton and Hauck concluded that “Reconfigurable
computing is becoming an important part of research in
computer architectures and software systems” [1].
Since then high-performance reconfigurable computing
(HPRC) has developed both a name and an ecosystem of
researchers and computer vendors pushing the envelope of
what it is possible to make programmable logic do in the
context of HPC applications. Baxter et al [2] and Craven &
Athanas [3] provide good analyses of the current and
potential state of FPGA-based supercomputing.
Most recently a number of groups have begun to examine
the potential of large-scale FPGA-based systems where the
FPGAs are used as main processors rather than the more
traditional co-processor. Cathey et al [4] describe in
excellent detail the design choices and tradeoffs involved in
building such a large-scale system – a system with many
similarities to the one described here – while Sass et al take
the possibilities a step further and consider a purely FPGA
design involving no host CPUs at all [5].
Manuscript received February 15, 2008. This work was supported in part
by Scottish Enterprise.
R.M. Baxter is with EPCC at the University of Edinburgh, Edinburgh, UK
(phone: +44 131 651 3579; fax: +44 131 650 6555; e-mail:
r.baxter@epcc.ed.ac.uk).
Against this background of potential in the emerging area
of high-performance reconfigurable computing the FPGA
High Performance Computing Alliance (FHPCA [6]) was
founded in early 2005 to take forward the ideas of an
FPGA-based supercomputer. The alliance partners are
Algotronix, Alpha Data, EPCC at the University of
Edinburgh, the Institute for System Level Integration,
Nallatech and Xilinx. The project was facilitated and part
funded by the Scottish Enterprise Industries team and had
two main goals:
• design and build a 64-FPGA supercomputer from
commodity parts and “plug-in” FPGA cards;
• demonstrate its effectiveness (or otherwise) for
real-world HPC applications.
The machine itself – Maxwell – was completed in the first
part of 2007 and subsequently went on to win the silver
medal in the BT Flagship Award for Innovation at the 2007
British Computer Society Project Excellence Awards [7].
This paper is structured as follows. Section II describes
the motivation behind Maxwell. Section III delves into the
details of the machine’s hardware while Section IV describes
the software environment and programming methodology
used in porting a number of demonstration applications.
Section V discusses three key demonstration applications
from the fields of financial services, medical imaging and oil
and gas exploration, and Section VI presents some
performance results from these applications on Maxwell.
Finally Section VII offers thoughts for the future.
II. MOTIVATION
Maxwell is designed as a proof-of-concept
general-purpose FPGA supercomputer. Given the
specialized nature of hardware acceleration the very concept
of ‘general-purpose’ for high-performance reconfigurable
computing (HPRC) is worth investigating in its own right.
Can a machine built to be as broadly applicable as possible
deliver enough FPGA performance to be worth the cost?
Our real interest in building Maxwell was not to test
whether FPGA hardware can be used to accelerate segments
of a standard HPC application, but to explore whether
standard HPC applications can be run almost entirely on
FPGA hardware, parallel communications and all. We take
the same view as Bennett et al [8] in regarding the FPGAs as
the primary compute platform rather than co-processors to a
CPU. To this end we constructed a machine capable in
principle of parallel operation across a network of large
FPGAs linked directly together.
Maxwell – a 64 FPGA Supercomputer
Rob Baxter, Stephen Booth, Mark Bull, Geoff Cawood, James Perry, Mark Parsons, Alan Simpson,
Arthur Trew, EPCC and FHPCA; Andrew McCormick, Graham Smart, Ronnie Smart, Alpha Data ltd
and FHPCA; Allan Cantle, Richard Chamberlain, Gildas Genest, Nallatech ltd and FHPCA
Engineering Letters, 16:3, EL_16_3_23
______________________________________________________________________________________
(Advance online publication: 20 August 2008)
III. HARDWARE
Maxwell is essentially an IBM BladeCentre Cluster with
FPGA acceleration. Altogether it comprises 32 blade servers
each with one Intel Xeon CPU and two Xilinx Virtex-4
FPGAs. The CPUs are connected to the FPGAs with a
standard IBM PCI-X Expansion Module.
A. BladeCentre chassis
Physically Maxwell comprises two 19-inch racks and five
IBM BladeCentres. Four of the BladeCentres have seven
IBM Intel Xeon blades and the fifth has four. Each blade is a
diskless 2.8 GHz Intel Xeon with 1 GB main memory. The
blades are connected over gigabit Ethernet through a single
48-way Netgear switch with 40 Gb/s throughput. The blades
are booted over the network from the headnode, a plain old
Dell Precision 670 with 4 GB main memory and over 1 TB of
local SATA disk.
The chassis is thus a fairly standard commodity setup –
deliberately so since our intention was to investigate the
viability of building a high-performance cluster from
standard parts and plug-in FPGA cards from two independent
vendors – Nallatech and Alpha Data.
Logically we regard Maxwell as a collection of 64 nodes,
where a node is defined as a software process running on a
host CPU, together with some FPGA acceleration hardware.
In full operation each blade CPU thus hosts two software
processes, each of which ‘manages’ one FPGA during
runtime.
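The logical view above can be sketched as a simple mapping from node number to hardware. This is a minimal illustration only: the names `Node` and `node_for_rank` are hypothetical, and the assumption that consecutive node numbers share a blade is ours, not a statement of how processes are actually placed on Maxwell.

```python
from dataclasses import dataclass

BLADES = 32           # blade servers in Maxwell (from the text)
FPGAS_PER_BLADE = 2   # each blade CPU hosts two FPGAs (from the text)


@dataclass(frozen=True)
class Node:
    rank: int       # logical node number, 0..63
    blade: int      # blade whose CPU runs the software process
    fpga_slot: int  # which of that blade's two FPGAs the process manages


def node_for_rank(rank: int) -> Node:
    """Map a logical node number to a (blade, FPGA) pair.

    Assumption: consecutive ranks share a blade. The real placement is
    decided by the job launcher and may differ.
    """
    total = BLADES * FPGAS_PER_BLADE
    if not 0 <= rank < total:
        raise ValueError(f"rank must be in 0..{total - 1}")
    blade, fpga_slot = divmod(rank, FPGAS_PER_BLADE)
    return Node(rank=rank, blade=blade, fpga_slot=fpga_slot)


# All 64 logical nodes: one software process plus one FPGA each.
nodes = [node_for_rank(r) for r in range(BLADES * FPGAS_PER_BLADE)]
```

Under this assumed numbering, node 5 is the second process on the third blade.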
One obvious drawback to Maxwell’s architecture is the
diskless nature of the blades; all disk i/o traffic routes through
the Ethernet switch, offering an instant performance
bottleneck for data intensive applications. However,
Maxwell is intended as a demonstration and exploration
platform for FPGA-to-FPGA computing rather than a
production system, so this is a design compromise we can
live with just now.
B. FPGAs
The FPGAs in Maxwell are Xilinx Virtex-4 devices in two
flavours. Those on the Alpha Data cards are XC4VFX100
parts, while those on the Nallatech cards are XC4VLX160.
The reasons for this are more logistical and political than
technical.
The LX flavours of Xilinx’s Virtex range offer the greatest
number of logic cells, while the FX flavours include
embedded PowerPC cores and multigigabit serial
transceivers (MGTs) (“RocketIO”) for off-chip
communications [9]. The V4LX160s each have 152,064
logic cells against the V4FX100s’ 94,896. This makes them
pretty large by current FPGA device standards, allowing
room to implement significant pieces of HPC code on them.
Given the mixed nature of devices in the machine we have so
far not used the PowerPC cores on the FX100s.
These two flavours of Virtex-4 are built into two flavours
of plug-in PCI card: the Nallatech H101 and the Alpha Data
ADM-XRC-4FX.
Both types of card connect using a PCI/PCI-X bridge,
capable of 64-bit, 133 MHz operation in PCI-X mode – a peak
bandwidth of 600 MB/s. PCI-X is by no means an ideal
connection technology to use – PCI Express is capable of 8
GB/s on a 32-lane connection – and is another potential
performance bottleneck on Maxwell – indeed on many
PCI/PCI-X based FPGA cards. However, our approach to
programming Maxwell aims to remove the CPU-FPGA
connection from critical code paths by performing the bulk of
calculations purely on the FPGA, so again we regard this as
an acceptable design tradeoff.
C. Nallatech H101
The cards in one half of Maxwell are a slight variant on
Nallatech’s off-the-shelf H101-PCIXM [10]. The standard
card uses V4LX100 devices; Maxwell uses the 60% bigger
V4LX160 versions, with a peak clock speed of 200 MHz.
There are 32 of these cards, occupying PCI slots in 16 of the
blades.
Along with the V4LX160 the H101 has 16 MB of DDR-II
SRAM in four banks, and one 512 MB bank of DDR-II
SDRAM. The four SRAM banks deliver a peak bandwidth
of 6.4 GB/s, while the SDRAM delivers 3.2 GB/s.
Observant readers will have noticed that the V4LX devices
in the H101s have no serial RocketIO MGTs and thus no
means of accessing the outside world. Quite so.
Communication links from the Nallatech cards are achieved
through a separate comms chip, a small Virtex-II Pro FX
device with an embedded router core. Each H101 card thus
has four MGT links capable of running at 2.5 Gb/s.
D. Alpha Data ADM-XRC-4FX
The other 16 blades in Maxwell host 32 Alpha Data
ADM-XRC-4FX cards [11]. The ADM-XRC-4FX is a high
performance reconfigurable PMC/PMC-X/XMC (PCI
Mezzanine Card) based on the Xilinx Virtex-4-FX range.
The Maxwell cards are the V4FX100 variant.
As with the Nallatech cards the ADM-XRC-4FX cards have 16
MB of SRAM but double the SDRAM at 1,024 MB of
DDR-II in four banks. This gives a peak memory bandwidth
of 8.4 GB/s to the SDRAM.
Off-chip communication is direct from the V4FX devices
– again, four RocketIO MGTs each with a maximum possible
bandwidth of 3.125 Gb/s.
E. Communications networks
As suggested thus far, Maxwell has two independent
communications networks. The blade CPUs are networked
over standard gigabit Ethernet via a single switch; the CPUs
thus have an all-to-all connectivity. This is contrasted with
the FPGA network which consists of point-to-point links
between the MGT connectors of adjacent FPGAs, as
illustrated in Figure 1. The FPGA pairs hosted on a single
CPU form “east-west” pairs in the network.
The MGTs are connected with standard HSSDC2 1x-1x
InfiniBand cables of 50 cm and 100 cm lengths, kept as short as
possible.
Figure 1. FPGA connectivity in Maxwell
The FPGA connections are purely point-to-point – we do
not implement routing logic in the FPGA devices. Maxwell’s
FPGAs thus form a two-dimensional torus, making the
RocketIO network highly suitable for nearest-neighbour
communication patterns but less than ideal for reduction
operations such as global sums. For these, applications call
back to the host CPUs for MPI reduction operations to be
performed over Ethernet.
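The nearest-neighbour structure of a two-dimensional torus can be sketched as follows. The 8 x 8 layout and the row-major numbering are assumptions made for illustration (the text states only that the 64 FPGAs form a two-dimensional torus), and the function names are hypothetical.

```python
ROWS, COLS = 8, 8  # assumed layout of the 64 FPGAs


def torus_neighbours(node: int, rows: int = ROWS, cols: int = COLS) -> dict:
    """Return the four directly linked neighbours of a node on the torus."""
    if not 0 <= node < rows * cols:
        raise ValueError("node out of range")
    r, c = divmod(node, cols)
    return {
        "north": ((r - 1) % rows) * cols + c,  # wraps from top row to bottom
        "south": ((r + 1) % rows) * cols + c,
        "west": r * cols + (c - 1) % cols,     # wraps from first column to last
        "east": r * cols + (c + 1) % cols,
    }


def torus_hops(a: int, b: int, rows: int = ROWS, cols: int = COLS) -> int:
    """Minimum number of point-to-point links between two nodes."""
    ra, ca = divmod(a, cols)
    rb, cb = divmod(b, cols)
    dr, dc = abs(ra - rb), abs(ca - cb)
    return min(dr, rows - dr) + min(dc, cols - dc)
```

Each node has exactly four links, matching the four MGTs per card. In this assumed layout the most distant pair of nodes is eight links apart, and with no routing logic in the FPGAs every intermediate hop would have to be handled by the application itself, which is one way to see why global reductions are left to MPI on the host CPUs.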
Our general approach is to use the Ethernet network purely
as a control network and to perform parallel communications
over RocketIO. The Ethernet is also used for any explicit
MPI calls that remain in the application – for instance for
start-up data distribution or finalizing data marshalling on
completion.
The hard-wired nature of the FPGA communications
network contrasted with the implicit all-to-all CPU network
means that care is required in constructing application
configuration and acceleration components in software. This
is one thing we have tried to address in the Parallel Toolkit
software environment (Section IV below).
IV. SOFTWARE ENVIRONMENT
Our aim with the environment on Maxwell was to make it
as ‘HPC-system-like’ as possible. Uptake of FPGA and
related technologies in HPC will be hindered if the machines
require a whole different approach to that of ‘mainstream’
HPC.
In the early days of parallel computing every vendor had
their own way of doing things and progress for application
developers was slow. One vendor’s communication
protocols did not map onto another’s and machine
environments were very different. Today, parallel
environments are much more standardized and there is a lot
of ‘legacy’ software built using libraries like MPI, BLAS and
ScaLAPACK. Machines requiring significantly different
approaches will find it difficult to gain footholds in the HPC
market.
Thus Maxwell looks very much like any other parallel
cluster. It runs the Linux variant CentOS and all standard
GNU/Linux tools. It offers Sun Grid Engine as a batch
scheduling system and MPI for inter-process communication.
A. A Parallel Toolkit for HPRC
The one innovation is the Parallel Toolkit (PTK).
The PTK has been developed as part of the overall design of
Maxwell and attempts to enforce top-down
standardization on application codes across the different
‘flavours’ of hardware.
Maintaining the portability and maintainability of HPC
codes on new architectures is these days essential. Many
years of effort have been invested since the early 1990s on
standardizing HPC codes to run cleanly and portably across a
wide range of parallel hardware. Standards such as the
Message Passing Interface [12] and OpenMP [13] are used
almost exclusively; gone are the days of vendor-specific
codes and proprietary languages.
High-level FPGA programming faces the same challenges
now as parallel computing did 15 years ago. Addressing
these challenges requires a consensus among tool providers
and a standardization activity by the users of
high-performance reconfigurable computing – a process
organizations like OpenFPGA [14] hope to catalyse. Indeed
the PTK was developed contemporaneously with
OpenFPGA’s GenAPI [15] and shares many of the same
design goals.
The PTK is a potential step along the road towards
standardizing both high-level APIs and job and machine
configuration methods for HPRC. It is a set of practices and
infrastructure intended to address key HPRC acceleration
issues such as: how to associate processes with FPGA
resources; how to associate FPGAs with bitstreams; how to
manage contention for FPGA resources within a process; and
how to manage code dependencies to facilitate re-use. To
these ends the PTK comprises a library of C++ classes
providing abstract interfaces to FPGA hardware components;
classes providing standard ways to configure arbitrary FPGA
hardware; and a standard way of launching parallel FPGA
jobs.
The PTK is described in more detail in [16].
B. FPGA Programming
While providing a useful, common way of configuring
FPGAs from software the PTK does not address the deeper
portability issues for running applications on different
flavours of FPGA hardware. Maxwell still requires
developers to build their FPGA bitstreams against either
Nallatech H101 or Alpha Data ADM-XRC-4FX cards offline
and copy the bitfiles across to the system. We do not
mandate any particular tool approach for FPGA developers –
any tool capable of targeting the two card types will produce
a bitstream that can be run on Maxwell.
V. DEMONSTRATION APPLICATIONS
As part of the overall project we have ported three
demonstration applications to run on Maxwell. Each of these
has been produced in two hardware ‘flavours’ for the two
halves of the machine, with a common high-level interface to
software captured in the Parallel Toolkit.
Two criteria were used to select these applications. Firstly
they were chosen from the application areas of financial
engineering, medical imaging and oil and gas, three areas
judged generally to have most to gain from hardware
acceleration solutions of one form or another. Secondly they
were chosen to illustrate progressively more complex parallel
application features, from trivial parallelism and simple data
requirements to full-scale distributed-memory parallelism.
In all three cases we adopted the methodology of the PTK:
identify the application hotspot; define an abstract
object-oriented interface that encapsulates the hotspot;
refactor the code against this interface; generate accelerated
versions of the hotspot code underneath the interface. In each
case we also produced a ‘pure software’ version of the
hotspot against the same PTK interface, providing a direct
point of comparison for both testing and benchmarking
purposes.
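The four-step methodology can be illustrated with a small sketch. It is written in Python for brevity, whereas the PTK itself is a C++ library, and every class and function name here is hypothetical rather than part of the PTK. The hotspot chosen for the example, a dot product, is also an assumption.

```python
from abc import ABC, abstractmethod
from typing import Sequence


class DotProductKernel(ABC):
    """Abstract interface that encapsulates the application hotspot."""

    @abstractmethod
    def dot(self, a: Sequence[float], b: Sequence[float]) -> float:
        ...


class SoftwareDotProduct(DotProductKernel):
    """Pure-software version, kept for testing and benchmarking."""

    def dot(self, a, b):
        if len(a) != len(b):
            raise ValueError("length mismatch")
        return sum(x * y for x, y in zip(a, b))


class BlockedDotProduct(DotProductKernel):
    """Stand-in for an accelerated version.

    A real implementation would pass each block to FPGA hardware; here
    the blocks are summed in software so that the sketch runs anywhere.
    """

    def __init__(self, block: int = 4) -> None:
        if block < 1:
            raise ValueError("block must be positive")
        self.block = block

    def dot(self, a, b):
        if len(a) != len(b):
            raise ValueError("length mismatch")
        total = 0.0
        for start in range(0, len(a), self.block):
            stop = start + self.block
            total += sum(x * y for x, y in zip(a[start:stop], b[start:stop]))
        return total


def squared_norm(kernel: DotProductKernel, v: Sequence[float]) -> float:
    """Application code refactored against the interface only."""
    return kernel.dot(v, v)
```

Because `squared_norm` depends only on the interface, the same application code can be run against either implementation, which is what makes a like-for-like comparison between software and accelerated versions possible.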
A. MCopt – Monte Carlo option pricing
Financial engineering is a mathematical branch of
economics that deals with the modeling of asset prices and
their associated derivatives. One of the cornerstones of
financial engineering is the Black-Scholes model of prices
[17], essentially a recasting of the equations of physical heat
diffusion and Brownian motion.
The assumptions of the Black-Scholes model imply that
for a given stock price at time t, simulated changes in the
stock price at a future time t + dt can be generated by the
following formula:
dS = S rc dt + S σ ε √dt
where S is the current stock price, dS is the change in the
stock price, rc is the continuously compounded risk-free
interest rate, σ is the volatility of the stock, dt is the length of
the time interval over which the stock price change occurs
and ε is a random number generated from a standard
Gaussian probability distribution.
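Written with explicit time-step indices, the formula gives the per-step update used when a price path is simulated. The subscript n and the symbol S_n for the price after n steps are notation introduced here for illustration; each step draws a fresh random number ε_n.

```latex
\begin{align}
  dS  &= S\, r_c\, dt + S\, \sigma\, \varepsilon \sqrt{dt}, \\
  S_n &= S_{n-1} + dS
       = S_{n-1}\left(1 + r_c\, dt + \sigma\, \varepsilon_n \sqrt{dt}\right).
\end{align}
```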
The pricing of stock options follows the Black-Scholes
model, and simple stock options (so-called ‘European’
options) can be priced with a simple closed-form formula
called, unsurprisingly, the Black-Scholes formula. More
complex options such as those whose final price depends on a
time-average or other path-dependent price calculations have
no closed form and are typically priced using stochastic or
Monte Carlo modeling.
Our first demonstration application prices an ‘Asian’
option, an option in which the final stock
price is replaced with the average price of the asset over a
period of time, computed by collecting the daily closing price
over the life of the option. The price can be modeled as a
series of dSs over the option’s lifetime (say Ntimesteps). The
formula for each dS is based on the previous day’s closing
price, and the average of the Ntimesteps stock prices would
determine the value of the option at expiration.
The above gives you one possible future for the stock
price; repeating the model Nruns times allows the process to
converge on the ‘right’ option price. Nruns here is of order
10,000 – 50,000.
Based on the above model, a serial code would be as
follows:
for i = 1, Nruns
    for n = 1, Ntimesteps
        ε = gaussianRandomNumber()
        S[n][i] = S[n-1][i] * (1 + rc*dt + σ*ε*√dt)
    endfor n
    Sav[i] = (1/Ntimesteps) ∑n S[n][i]
    c[i] = max(Sav[i] - K, 0)
    p[i] = max(K - Sav[i], 0)
endfor i
Sbar = (1/Nruns) ∑i Sav[i]
cfinal = (1/Nruns) ∑i c[i]
pfinal = (1/Nruns) ∑i p[i]
K here is the strike price of the option, the price defined in
the option contract; rc and σ are as defined above.
Pricing a single Asian option thus requires Nruns × Ntimesteps
Gaussian random numbers (plus four multiplies and three
adds each).
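A runnable version of the serial algorithm might look like the sketch below. It follows the pseudocode step for step, including the absence of any discounting of the payoff, but keeps running sums instead of storing every S[n][i]. The function name, argument names and the use of Python's standard `random` module are our choices for illustration, not part of the demonstrator.

```python
import math
import random


def price_asian_option(s0, strike, rc, sigma, dt, n_timesteps, n_runs, seed=None):
    """Monte Carlo estimate of (Sbar, cfinal, pfinal) for an Asian option.

    s0 is the initial stock price S[0]; strike is K in the pseudocode.
    """
    rng = random.Random(seed)        # seeded generator for repeatable runs
    drift = rc * dt                  # rc * dt, constant for every step
    vol = sigma * math.sqrt(dt)      # sigma * sqrt(dt)
    sum_avg = sum_call = sum_put = 0.0
    for _ in range(n_runs):
        s = s0
        path_total = 0.0
        for _ in range(n_timesteps):
            eps = rng.gauss(0.0, 1.0)        # standard Gaussian sample
            s *= 1.0 + drift + vol * eps     # S[n] = S[n-1] * (1 + ...)
            path_total += s
        s_av = path_total / n_timesteps      # Sav[i]
        sum_avg += s_av
        sum_call += max(s_av - strike, 0.0)  # c[i]
        sum_put += max(strike - s_av, 0.0)   # p[i]
    return sum_avg / n_runs, sum_call / n_runs, sum_put / n_runs
```

A call such as `price_asian_option(100.0, 95.0, 0.05, 0.2, 1 / 252, 252, 10_000, seed=1)` prices one parameter set. The inner loop body is the compact core that the operation count above refers to.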
Our demonstration captures this whole core,
parameterized, on FPGA, and batches similar pricing
calculations for different stocks/assets across the whole 64
FPGAs. In fact the demonstrator core is so compact that it
can be replicated 10 times or more across a single FPGA
device, providing an additional order of magnitude in
possible speedup.
This demonstrator – MCopt – is the simplest of the three
applications, having a simple, compact computational core
and very limited data requirements.
B. DI3D – three and four-D facial imaging
The second demonstration application was produced in
collaboration with Dimensional Imaging 3D ltd, a firm
specializing in three and four-dimensional facial imaging for
medical applications such as maxillofacial surgery [18].
The principle here is that a digital camera rig is used to
capture pairs of still (for 3D) or video (for 4D) images. Each
stereo pair is then combined to produce a depth map which
contains full three-dimensional information and is used to
create a 3D software model, or a 3D video in the case of 4D
capture.
Image combination is an expensive business and is an ideal
application for FPGA acceleration, playing well to the
devices’ strengths in image processing. Our demonstrator
thus takes a key part of Dimensional Imaging’s own software
and accelerates it using FPGA hardware. Two versions of the
demonstrator were produced – an embedded version which
can connect to a live camera rig and provide on-the-fly image
combination, and a batch version designed to process large
numbers of image pairs from video frames.
This latter version is designed to run across all 64 FPGAs
of Maxwell and provides the next step-up in complexity.
While as trivially parallel as the MCopt application this
demonstrator has real data requirements – large digital
images must be managed and streamed through the FPGAs in
an efficient manner to ensure overall performance.
C. OHM3D – CSEM modeling
Our final demonstrator is another commercial code, this
time in the area of oil and gas exploration. OHM plc are an
Aberdeen-based consultancy offering services to the oil and
gas industry [19]. OHM specializes in a form of simulation
called controlled source electromagnetic modeling, a
technique which uses the conductive properties of materials
to analyse pieces of the seabed in the search for oil or gas
reserves [20].
OHM’s three-dimensional CSEM code provides the basis
for our final demonstrator. Already parallelized using MPI,
this is a ‘classic’ HPC application involving large data sets
representing physical spaces and fields, double-precision
arithmetic and iterative numerical methods for performing
linear algebra operations on large matrices and vectors.
VI. PERFORMANCE RESULTS
This section presents our final results from benchmark
tests of the three demonstrator applications on Maxwell. All
results quoted in this section were run on Maxwell; the
quoted CPU results are thus for the 2.8 GHz Intel Xeon
processors in the IBM blades. In caption legends we use the
label ‘AD’ to refer to the Alpha Data hosted FPGAs, and
‘NT’ to refer to the Nallatech hosted devices.
A. MCopt
MCopt is the simplest of the demonstrators, a
trivially-parallel engine to explore the parameter space of a
typical option pricing calculation.
Our tests for MCopt aim to explore the five-dimensional
parameter space defined by the variable input parameters to
the Monte Carlo version of the Black-Scholes model (S, K, rc,
σ, Ntimesteps). Our test draws 100,000 data samples from this
parameter space; we fix the number of Monte Carlo
iterations, Nruns, to 10,000 for each sample.
The single-node performance is shown in Figure 2, and
Figure 3 shows MCopt run across 1 to 16 nodes, both CPU
and FPGA.
[Figure 2: bar chart of single-node run time in seconds (vertical axis 0 to 20,000). CPU: 15,810; AD: 49; NT: 145.]
Figure 2. Single-node MCopt performance (s).
In Figure 3 the logarithmic scale belies the extreme
scalability of the FPGA versions here: a batch of 100,000
parameters (prices, rates or some combination of these) that
would take nearly 4 ½ hours on a Xeon blade runs in less than a
minute on one FPGA, or less than 3 seconds on 16. As might
be expected with such a simple calculation the FPGAs
outperform the CPU by over two orders of magnitude – a
factor of 300 in the Alpha Data case.
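The single-node speedups follow directly from the Figure 2 timings. The short calculation below is a sketch that assumes the three values in the figure correspond, in order, to the CPU, Alpha Data and Nallatech runs; the variable and function names are ours.

```python
# Single-node MCopt wallclock times in seconds, as read from Figure 2.
times_s = {"CPU": 15810.0, "AD": 49.0, "NT": 145.0}


def speedup(baseline_s: float, accelerated_s: float) -> float:
    """Ratio of baseline run time to accelerated run time."""
    return baseline_s / accelerated_s


# Speedup of each FPGA flavour relative to one Xeon CPU.
speedups = {
    name: speedup(times_s["CPU"], t)
    for name, t in times_s.items()
    if name != "CPU"
}

for name, value in speedups.items():
    print(f"{name}: about {value:.0f} times faster than one CPU")
```

On these figures the Alpha Data FPGA is a little over 320 times faster than the CPU and the Nallatech FPGA a little under 110 times faster, both consistent with the two orders of magnitude quoted above.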
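[Figure 3: run time in seconds (logarithmic vertical axis, 1 to 100,000) against number of nodes (1, 2, 4, 8, 16) for the CPU, AD and NT versions.]
Figure 3. MCopt scaling performance on Maxwell (times in s). Note the
scale is logarithmic.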
B. DI3D – facial imaging
The facial imaging demonstrator, while still trivially
parallel in execution, is more challenging than the MCopt
application because of its data requirements.
Our tests here involve the batch processing of 32 pairs of
video still images – 64 in all – each around 150 kB in size.
This represents a little over a second of three-dimensional
video. The images are read in from a networked disk, so this
is also an interesting test of the network and i/o overheads of
the parallel system.
Figure 4 shows the performance for these tests running on
the Nallatech side of the machine. The tests were run on two
nodes each (two CPUs versus two FPGAs) due to the
memory requirements of the initialization phase.
[Figure 4: bar chart of wallclock time in seconds (vertical axis 0 to 900). CPU: 834.0; H101/V4LX160: 330.5.]
Figure 4. Two-node DI3D facial imaging performance (times in s),
comparison between software and Nallatech hardware versions.
Figure 5 plots the runtime for the same test against
increasing numbers of nodes. In both Figures 4 and 5 timings
were made using the standard MPI timer function
MPI_Wtime() [21] across the full batch-processing version
of the application. This includes software-only components