NOVEL ARCHITECTURES
Editors: Volodymyr Kindratenko, kindr@ncsa.uiuc.edu; Pedro Trancoso, pedro@cs.ucy.ac.cy
Novo-G: At the Forefront of Scalable Reconfigurable Supercomputing

By Alan George, Herman Lam, and Greg Stitt

The Novo-G supercomputer's architecture can adapt to match each application's unique needs and thereby attain more performance with less energy than conventional machines.
Throughout computing's long history and within the many forms of computers existing today, from hand-held smartphones to mammoth supercomputers, the one common denominator is the fixed-logic processor. In conventional computing, each application must be adapted to match the fixed structures, parallelism, functionality, and precision of the target processor—such as the CPU, digital signal processor (DSP), or graphics processing unit (GPU)—as dictated by the device vendor.

Although advantageous for uniformity, this "one size fits all" approach can create dramatic inefficiencies in speed, area, and energy when the application fails to conform to the device's ideal case. By contrast, a relatively new computing paradigm known as reconfigurable computing (RC) takes the opposite approach, wherein the architecture adapts to match each application's unique needs, and thus approaches the speed and energy advantages of application-specific integrated circuits (ASICs) while offering the versatility of CPUs.
Many RC systems have emerged in the research community and marketplace, addressing an increasingly broad application range, from sensor processing for space science to proteomics for cancer diagnosis. Most of these systems are relatively modest in scale, featuring from one to several reconfigurable processors, such as field-programmable gate arrays (FPGAs). At the extreme of RC scale, however, is the Novo-G machine at the US National Science Foundation's Center for High-Performance Reconfigurable Computing (NSF CHREC Center) at the University of Florida. Initial Novo-G studies show that, for some important applications, a scalable system with 192 reconfigurable processors can rival the speed of the world's largest supercomputers at a tiny fraction of their cost, size, power, and cooling. Here, we offer an overview of this novel system, along with its initial applications and performance breakthroughs.
Reconfigurable Computing
Demands for innovation in computing are growing rapidly. Technological advances are transforming many data-starved science domains into data-rich ones. For example, in genomics research in the health and life sciences, contemporary DNA sequencing instruments can determine 150 to 200 billion nucleotide bases per run, routinely resulting in output files in excess of 1 terabyte per instrument run.

In the near future, DNA sequence output from a single instrument run will easily exceed the size of the human genome by more than 100-fold. Thus, it's increasingly clear that the growing divergence between data production and the capacity for timely analysis threatens to impede new scientific discoveries and progress in many scientific domains—not because we can't generate the data, but rather because we can't analyze it.
To address these growing demands with a computing infrastructure that's sustainable in terms of power, cooling, size, weight, and cost, adaptive systems that can be dynamically tailored to the unique needs of each application are coming to the forefront. At the heart of these systems are reconfigurable-logic devices, processors such as FPGAs that, under software control, can adapt their hardware structures to reflect the unique operations, precision, and parallelism associated with compute-intensive, data-driven applications in fields such as health and life sciences, signal and image processing, and cryptology.
The benefit of using RC with modern FPGAs for such applications lies in their reconfigurable structure. Unlike fixed-logic processors, which require applications to conform to
a fixed structure (for better or for worse), with RC the architecture conforms to each application's unique needs. This adaptive nature lets FPGA devices exploit higher degrees of parallelism while running at lower clock rates, thus often achieving better execution speed while consuming less energy.
Figure 1 shows a comparative suite of device metrics,1,2 which illustrates performance (in terms of computational density) per watt of some of the latest reconfigurable- and fixed-logic processing devices for 16-bit integer (Int16) or single-precision floating-point (SPFP) operations, assuming an equal number of add and multiply operations. The number above each bar indicates the peak number of sustainable parallel operations.
Generally, FPGAs achieve more speed per unit of power than CPUs, DSPs, and GPUs. For example, the study's leading FPGA for Int16 operations (Altera Stratix-IV EP4SE530) can support more than 50 billion operations per second (GOPS) per watt, while the leading fixed-logic processor (TI OMAP-L137 DSP) attains less than 8 GOPS/watt. With SPFP, the gap is narrower, but FPGAs continue to enable more GOPS/watt. Similarly, although not shown in the figure, the gap widens further for simpler (byte or bit) operations. With RC, the simpler the task, the less chip area required for each processing element (PE), and thus the more PEs that can fit and operate concurrently in hardware.
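
To make the computational-density metric concrete, the following C sketch evaluates GOPS/watt from three device parameters. This is a back-of-envelope illustration only: the operation count, clock rate, and power draw below are assumptions chosen to be plausible for a leading FPGA, not vendor specifications.

#include <stdio.h>

int main(void)
{
    /* Computational density per watt: (ops per cycle x clock rate) / power.
       All three values are illustrative assumptions, not measured data. */
    double ops_per_cycle = 2296.0;   /* assumed peak parallel Int16 ops */
    double clock_hz      = 200e6;    /* assumed achievable clock rate   */
    double power_w       = 9.0;      /* assumed device power draw       */

    double gops_per_watt = (ops_per_cycle * clock_hz) / 1e9 / power_w;
    printf("%.1f GOPS/watt\n", gops_per_watt);   /* prints 51.0 */
    return 0;
}

With these assumed inputs the result lands just above 50 GOPS/watt, in line with the range the article reports for the leading FPGA. At a fixed clock and power budget, the metric scales linearly with the number of PEs that fit on the chip, which is why simpler operations widen the FPGA's advantage.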
However encouraging, this promising approach has thus far been limited primarily to small systems, studies, and datasets. To move beyond these limits, some key challenges must be overcome. Chief among them is the parallelization, evaluation, and optimization of critical applications in data-intensive fields in a way that's transparent, flexible, portable, and performed at a much larger scale commensurate with the massive needs of emerging, real-world datasets. When successful, however, the impact will be a dramatic speedup in execution time concomitant with savings in energy and cooling.
As we describe later, our initial studies show critical applications executing at scale on Novo-G, achieving speeds rivaling the largest conventional supercomputers in existence—yet at a fraction of their size, energy, and cost. While processing speed and energy efficiency are important, the principal impact of a reconfigurable supercomputer like Novo-G is the freedom that its innovative approach can give to scientists to conduct more types of analysis, examine larger datasets, ask more questions, and find better answers.
Novo-G Reconfigurable Supercomputer
The Novo-G experimental research testbed has been operating since July 2009 at the NSF CHREC Center, supporting various research projects on scalable RC challenges.
Figure 1. Computational density (in giga operations per second) per watt of modern fixed- and reconfigurable-logic devices.2 As the figure shows, reconfigurable-logic devices often achieve significantly more operations per watt than fixed-logic devices. (Bar charts of GOPS/watt for 16-bit integer and single-precision floating-point operations; the number above each bar is the peak number of sustainable parallel operations. Reconfigurable-logic devices include Altera EP3SE260, EP3SL340, and EP4SE530 and Xilinx Virtex-5/6 parts; fixed-logic devices include various CPUs, DSPs, GPUs, and Cell processors.)
Novo-G's primary emphases are performance (device, subsystem, system), productivity (concepts, languages, tools), and impact (scalable applications).
Figure 2 shows the Novo-G machine and one of its quad-FPGA boards. The current Novo-G configuration consists of 24 compute nodes, each a standard 4U Linux server with an Intel quad-core Xeon (E5520) processor, memory, disk, and so on, housed in three racks. A single 1U server with twin quad-core Xeons functions as the head node. Compute nodes communicate and synchronize via Gigabit Ethernet and a nonblocking fabric of 20-Gbits/s InfiniBand. Each compute node houses two PROCStar-III boards from GiDEL in its PCIe slots. Novo-G's novel computing power derives from these boards, each containing four Stratix-III E260 FPGAs from Altera, resulting in a system of 48 boards and 192 FPGAs. (An impending upgrade will soon add 72 Stratix-IV E530 FPGAs to Novo-G, each with twice the reconfigurable logic of a Stratix-III E260—yet roughly the same power consumption—thereby expanding Novo-G's total reconfigurable logic by nearly 80 percent.) Even so, when fully loaded, the entire Novo-G system's power consumption peaks at approximately 8,000 watts.
While this set of FPGAs can theoretically provide the system with massive computing power, memory capacity, throughput, and latency often limit performance if unbalanced. As Figure 2b shows, 4.25 Gbytes of dedicated memory is attached to each FPGA in three banks. Data transfer between adjacent FPGAs can occur directly through a wide, bidirectional bus at rates up to 25.6 Gbits/s and latencies of a single clock cycle at up to 300 MHz, and transfer between FPGAs across two boards in the same server is also supported via a high-speed cable. By supplying each FPGA with large, dedicated memory banks, as well as high bandwidth and low latency for inter-FPGA data transfer, the system strongly supports RC-centric applications. Processing engines on the FPGAs can execute with minimal involvement by the host CPU cores, enabling maximum FPGA utilization.
Alongside the architecture, equally important are the design tools available and upcoming for Novo-G. RC's very nature empowers application developers with far more capability, control, and influence over the architecture. Instead of having all architecture decisions stipulated by the device vendor—as with CPUs and GPUs—in RC the application developer specifies a custom architecture configuration, such as the quantity and types of operations, numerical precision, and breadth and depth of parallelism. Consequently, RC is a more challenging environment for application development, and productivity concepts and tools are thus vital.
Novo-G offers a broad and growing range of academic and commercial tools in areas such as

• strategic design and performance-prediction tools for parallel algorithm and mapping studies;
• message-passing interface (MPI), Unified Parallel C (UPC), and shared-memory (SHMEM) libraries for system-level programming with C (see the sketch after this list);
• Very High-Speed Integrated Circuit Hardware Description Language (VHDL), Verilog, and an expanding list of high-level synthesis tools for FPGA-level programming;
• an assortment of core libraries;
• middleware and APIs for design abstraction, platform virtualization, and portability of apps and tools; and
• verification and performance-optimization tools.
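
As one example of the system-level layer, the hedged C/MPI sketch below block-partitions a large batch of independent sequence comparisons across compute nodes. The run_on_local_fpgas function is a hypothetical stand-in for a node's FPGA offload code, which the real designs implement with board-vendor APIs not shown here; the batch size is likewise illustrative.

#include <mpi.h>
#include <stdio.h>

/* Hypothetical stand-in for this node's FPGA offload: configure the
   local FPGAs, stream the assigned comparisons through the systolic
   arrays, and return the number completed. */
static long run_on_local_fpgas(long first, long count)
{
    (void)first;
    return count;   /* pretend every assigned comparison completed */
}

int main(int argc, char **argv)
{
    long total = 32L * 1024 * 1024;   /* illustrative batch of comparisons */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Block-partition the batch; the last rank absorbs the remainder. */
    long share = total / size;
    long first = (long)rank * share;
    if (rank == size - 1)
        share = total - first;

    long done = run_on_local_fpgas(first, share);

    /* Gather a completion count on rank 0 for a simple progress report. */
    long all = 0;
    MPI_Reduce(&done, &all, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("completed %ld of %ld comparisons\n", all, total);

    MPI_Finalize();
    return 0;
}

Because the comparisons are independent, this embarrassingly parallel split keeps inter-node communication to a single reduction, matching the minimal-communication character of the applications described later.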
To help expand the applications and tools available on Novo-G and to establish and showcase RC's advantages at scale, the Novo-G Forum was formed in 2010.
Figure 2. Novo-G and a processor board. (a) The Novo-G supercomputer and (b) one of its quad-FPGA reconfigurable processor boards. The current configuration includes 24 compute nodes housed in three racks; the head node is a single 1U server with twin quad-core Xeons. (Board callouts: JTAG port for SignalTap debug; PCIe x8 interface, 4 Gbytes/s; 2 x 2 Gbytes = 4 Gbytes of DDR2 RAM per FPGA.)
This forum is an international group of academic researchers and technology providers working collaboratively with a common goal: to realize the promise of reconfigurable supercomputing by demonstrating unprecedented levels of performance, productivity, and sustainability. Faculty and students in each academic research team are committed to contributing innovative applications and tools research on the Novo-G machine based upon their unique expertise and interests. Among the forum participants are Boston University, Clemson University, University of Florida, George Washington University, University of Glasgow (UK), Imperial College (UK), Northeastern University, Federal University of Pernambuco (Brazil), University of South Carolina, University of Tennessee, and Washington University in St. Louis. Each academic team has one or more Novo-G boards for local experiments and has remote access to the large Novo-G machine at Florida for scalability studies.
Initial Applications Studies
Of Novo-G’s three principal empha-
ses, impact is undoubtedly the most
important. What good is a new and
innovative high-performance system if
the resulting applications have little
impact in science and society? We
now offer an overview of Novo-G’s
initial performance breakt hroughs
on a set of bioinformatics applications
for genomics, which we developed in
collaboration with the University of
Florida’s Interdisciplinary Center for
Biotechnology Research (ICBR). Re-
sults of such breakthroughs can po-
tentially revolutionize the processing
of massive genomics datasets, which
in turn might enable revolutionary
discoveries for a broad range of chal-
lenges in the health, life, and agricul-
tural sciences.
Although more than a dozen challenging Novo-G application designs are underway in several scientific domains, our focus here is on our first case studies. These include two popular genomics applications for optimal sequence alignment based upon wavefront algorithms:

• Needleman-Wunsch (NW) and
• Smith-Waterman (SW) without traceback,

and a metagenomics application—Needle-Distance (ND)—which is an augmentation of NW with distance calculations. We're nearing completion on an extended version of SW with the traceback option—SW+TB—by augmenting our SW hardware design to collect and feed traceback data to the hosts, so that the FPGAs perform SW while the CPU cores perform TB. Initial results indicate that, after adding TB, execution times increase less on Novo-G than on the C/Opteron baseline, and thus Novo-G speedups with SW+TB exceed those of SW.
Each of the applications features massive data parallelism with minimal communication and synchronization among FPGAs, and a highly optimized systolic array of processing elements (PEs) within each FPGA (and optionally spanning multiple FPGAs). Using a novel method for in-stream control,3 we optimized each of the three designs to fit up to 850 PEs per FPGA for NW, 650 for SW, and 450 for ND, all operating at 125 MHz.
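
To illustrate the wavefront scheme these systolic arrays implement, the C sketch below computes a Smith-Waterman score (without traceback) in anti-diagonal order: every cell on a given anti-diagonal is independent, so in hardware each PE updates its cell of that diagonal in the same clock cycle. This is a minimal software model under assumed scoring constants (match +2, mismatch -1, gap -1), not the published Novo-G hardware design.

#include <stdio.h>
#include <string.h>

#define MATCH     2
#define MISMATCH (-1)
#define GAP      (-1)
#define MAXLEN   1024

static int imax(int a, int b) { return a > b ? a : b; }

/* Smith-Waterman local-alignment score, swept one anti-diagonal at a
   time. The inner loop visits the cells that a chain of hardware PEs
   would update in parallel during a single clock tick. */
static int sw_score(const char *query, const char *subject)
{
    static int H[MAXLEN + 1][MAXLEN + 1];   /* score matrix, zeroed edges */
    int m = (int)strlen(query);
    int n = (int)strlen(subject);
    int best = 0;

    memset(H, 0, sizeof H);
    for (int d = 2; d <= m + n; d++) {          /* one diagonal per tick */
        int lo = imax(1, d - n);
        int hi = (m < d - 1) ? m : d - 1;
        for (int i = lo; i <= hi; i++) {        /* independent cells: PEs */
            int j = d - i;
            int sub = (query[i - 1] == subject[j - 1]) ? MATCH : MISMATCH;
            int h = imax(0, H[i - 1][j - 1] + sub);  /* match/mismatch */
            h = imax(h, H[i - 1][j] + GAP);          /* gap in subject */
            h = imax(h, H[i][j - 1] + GAP);          /* gap in query   */
            H[i][j] = h;
            if (h > best)
                best = h;           /* SW keeps the global maximum only */
        }
    }
    return best;
}

int main(void)
{
    printf("score = %d\n", sw_score("ACACACTA", "AGCACACA"));
    return 0;
}

The hardware versions pack hundreds of such PEs per FPGA (850 for NW, 650 for SW), so each 125-MHz clock tick advances an entire anti-diagonal per chip rather than one cell at a time.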
Figure 3 shows a contour plot for each application, illustrating relative design performance on one FPGA under varying input conditions. The corresponding tables show how the three designs scale when executed on multiple FPGAs in Novo-G.
Figure 3. Performance results on Novo-G for three bioinformatics applications: (a) Needleman-Wunsch (NW), (b) Smith-Waterman (SW) without traceback, and (c) Needle-Distance (ND). Each chart illustrates the speedup of a single FPGA under varying input conditions (number of sequence comparisons versus sequence length in nucleotides; for SW, database length versus sequence length). Each table shows performance with varying numbers of FPGAs under optimal input conditions.3

Baselines (optimized C code on a 2.4-GHz Opteron):
NW: 192 x 2^25 comparisons of length-850 sequences; software runtime 11,026 CPU hours.
SW: human X chromosome versus 19,200 length-650 sequences; software runtime 5,481 CPU hours.
ND: 192 x 2^24 length-450 distance calculations; software runtime 11,673 CPU hours.

                  NW                      SW                      ND
# FPGAs     Runtime (s)  Speedup    Runtime (s)  Speedup    Runtime (s)  Speedup
1           47,616       833        23,846       827        13,522       3,108
4           12,014       3,304      5,966        3,307      3,429        12,255
96          503          78,914     250          78,926     144          291,825
128         391          101,518    188          104,955    118          356,125
192 (est.)  270          147,013    127          155,366    77           545,751
In all cases, speedup is defined relative to an optimized C-code software baseline running on a 2.4-GHz Opteron core in our lab. More details on these algorithms, architectures, experiments, and results are provided elsewhere.3 All data except the tables' final rows came directly from measurements on Novo-G and reflect full execution time, including not just computation but also data transfers to and from the FPGAs.
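
As a worked check, the single-FPGA speedups follow directly from the figure's baseline data: for NW, 11,026 CPU hours x 3,600 s/hour is roughly 39.7 million seconds of Opteron runtime, which divided by Novo-G's measured 47,616 seconds gives about 833. The same ratio reproduces SW's 827 (5,481 hours versus 23,846 s) and ND's 3,108 (11,673 hours versus 13,522 s).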
Speedup with one FPGA on each of the three applications peaked at approximately 830 for NW and SW and more than 3,100 for ND. When ramping up from a single FPGA to a quad-FPGA board, we observed speedups growing almost linearly, to about 3,300 for NW and SW and more than 12,000 for ND. At the largest scale of our testbed experiments—32 boards, or 128 FPGAs—speedups for NW and SW exceeded 100,000, and ND exceeded 356,000.
Because not all 48 boards in Novo-G were operational during our study, we extrapolated these trends, estimating speedups on all 192 FPGAs of Novo-G of about 150,000 for NW and SW and almost 550,000 for ND. Putting these numbers in context, the latter implies that a conventional supercomputer would require more than a half-million Opteron cores operating optimally to match Novo-G's performance on the ND application.
By contrast, none of the world's largest supercomputing machines (as cited, for example, in the www.top500.org rankings) has this many cores, and thus none could achieve such performance on this application, despite being orders of magnitude larger in cost, size, weight, power, and cooling. Although Novo-G won't provide all applications with the same speedups as these examples, they do highlight RC's potential advantages, especially in solving problems where conventional, fixed-logic computing falls far short of optimal performance.
For a growing list of important applications from a broad range of science domains, underlying computations and data-driven demands are proving to be underserved by conventional "one size fits all" processing devices. By changing the mindset of computing—from processor-centric to application-centric—reconfigurable computing can provide solutions for domain scientists at a fraction of the time and cost of traditional servers or supercomputers. As we describe here, the Novo-G machine, applications, research forum, and preliminary results are helping to pave the way for scalable reconfigurable computing.
References
1. J. Williams et al., "Characterization of Fixed and Reconfigurable Multi-Core Devices for Application Acceleration," ACM Trans. Reconfigurable Technology and Systems, vol. 3, no. 4, 2011; to appear.
2. J. Richardson et al., "Comparative Analysis of HPC and Accelerator Devices: Computation, Memory, I/O, and Power," Proc. High-Performance Reconfigurable Computing Technology and Applications Workshop, ACM/IEEE Supercomputing Conf. (SC10), IEEE Press, to appear.
3. C. Pascoe et al., "Reconfigurable Supercomputing with Scalable Systolic Arrays and In-Stream Control for Wavefront Genomics Processing," Proc. Symp. Application Accelerators in High-Performance Computing, 2010; www.chrec.org/pubs/SAAHPC10_F1.pdf.
Alan George is director of the US National Science Foundation Center for High-Performance Reconfigurable Computing and a professor of electrical and computer engineering at the University of Florida. His research interests focus upon high-performance architectures, networks, systems, services, and applications for reconfigurable, parallel, distributed, and fault-tolerant computing. George has a PhD in computer science from Florida State University. He is a member of the IEEE Computer Society, the ACM, the Society for Computer Simulation, and the American Institute of Aeronautics and Astronautics. Contact him at ageorge@ufl.edu.

Herman Lam is an associate professor in the Department of Electrical and Computer Engineering at the University of Florida. His research interests include design methods and tools for RC application development, particularly as applied to large-scale reconfigurable supercomputing. Lam has a PhD in electrical and computer engineering from the University of Florida. He is a member of IEEE and the ACM and is a faculty member of the NSF Center for High-Performance Reconfigurable Computing. Contact him at hlam@ufl.edu.

Greg Stitt is an assistant professor in the Department of Electrical and Computer Engineering at the University of Florida and a faculty member of the US National Science Foundation Center for High-Performance Reconfigurable Computing. His research interests include design automation for reconfigurable computing and embedded systems. Stitt has a PhD in computer science from the University of California, Riverside. He is a member of IEEE and the ACM. Contact him at gstitt@ece.ufl.edu.
Selected articles and columns from IEEE Computer Society publications are also available for free at http://ComputingNow.computer.org.