Novel Architectures
Editors: Volodymyr Kindratenko, kindr@ncsa.uiuc.edu; Pedro Trancoso, pedro@cs.ucy.ac.cy
Novo-G: At the Forefront of Scalable Reconfigurable Supercomputing

By Alan George, Herman Lam, and Greg Stitt

The Novo-G supercomputer's architecture can adapt to match each application's unique needs and thereby attain more performance with less energy than conventional machines.
Throughout computing's long history and within the many forms of computers existing today, from hand-held smartphones to mammoth supercomputers, the one common denominator is the fixed-logic processor. In conventional computing, each application must be adapted to match the fixed structures, parallelism, functionality, and precision of the target processor—such as the CPU, digital signal processor (DSP), or graphics processing unit (GPU)—as dictated by the device vendor.

Although advantageous for uniformity, this "one size fits all" approach can create dramatic inefficiencies in speed, area, and energy when the application fails to conform to the device's ideal case. By contrast, a relatively new computing paradigm known as reconfigurable computing (RC) takes the opposite approach, wherein the architecture adapts to match each application's unique needs, and thus approaches the speed and energy advantages of application-specific integrated circuits (ASICs) while offering the versatility of CPUs.
Many RC systems have emerged in the research community and marketplace, addressing an increasingly broad application range, from sensor processing for space science to proteomics for cancer diagnosis. Most of these systems are relatively modest in scale, featuring from one to several reconfigurable processors, such as field-programmable gate arrays (FPGAs). At the extreme RC scale, however, is the Novo-G machine at the US National Science Foundation's Center for High-Performance Reconfigurable Computing (NSF CHREC Center) at the University of Florida. Initial Novo-G studies show that, for some important applications, a scalable system with 192 reconfigurable processors can rival the speed of the world's largest supercomputers at a tiny fraction of their cost, size, power, and cooling. Here, we offer an overview of this novel system, along with its initial applications and performance breakthroughs.
Recongurable Computing
Demands for innovation in computing are growing rapidly. Technological advances are transforming many data-starved science domains into data-rich ones. For example, in genomics research in the health and life sciences, contemporary DNA sequencing instruments can determine 150 to 200 billion nucleotide bases per run, routinely resulting in output files in excess of 1 terabyte per instrument run.

In the near future, DNA sequence output from a single instrument run will easily exceed the size of the human genome by more than 100-fold. Thus, it's increasingly clear that the discordant trajectories growing between data production and the capacity for timely analysis are threatening to impede new scientific discoveries and progress in many scientific domains—not because we can't generate the data, but rather because we can't analyze it.
To address these growing demands with a computing infrastructure that's sustainable in terms of power, cooling, size, weight, and cost, adaptive systems that can be dynamically tailored to the unique needs of each application are coming to the forefront. At the heart of these systems are reconfigurable-logic devices, processors such as FPGAs that under software control can adapt their hardware structures to reflect the unique operations, precision, and parallelism associated with compute-intensive, data-driven applications in fields such as health and life sciences, signal and image processing, and cryptology.
The benet of using RC with mod-
ern FPGAs for such applications lies
in their recongurable structure.
Unlike xed-logic processors, which
require applications to conform to
The Novo-G supercomputer’s architecture can adapt to match each application’s unique needs and thereby
attain more performance with less energy than conventional machines.
CISE-13-1-Novel.indd 82 14/12/10 10:55 AM
January/February 2011 83
a xed structure (for better or for
worse), with RC the architecture con-
forms to each application’s unique
needs. This adaptive nature lets
FPGA devices exploit higher degrees
of parallelism while running at lower
clock rates, thus often achieving bet-
ter execution speed while consuming
less energy.
Figure 1 shows a comparative suite of device metrics,1,2 which illustrates performance (in terms of computational density) per watt of some of the latest reconfigurable- and fixed-logic processing devices for 16-bit integer (Int16) or single-precision floating-point (SPFP), assuming an equal number of add and multiply operations. The number above each bar indicates the peak number of sustainable parallel operations.

Generally, FPGAs achieve more speed per power unit compared to CPUs, DSPs, and GPUs. For example, the study's leading FPGA for Int16 operations (Altera Stratix-IV EP4SE530) can support more than 50 billion operations per second (GOPS) per watt, while the leading fixed-logic processor (TI OMAP-L137 DSP) attains less than 8 GOPS/watt. With SPFP, the gap is narrower, but FPGAs continue to enable more GOPS/watt. Similarly, although not shown in the figure, the gap widens increasingly for simpler (byte or bit) operations. With RC, the simpler the task, the less chip area required for each processing element (PE), and thus the more PEs that can fit and operate concurrently in hardware.
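To make the computational-density metric concrete, here is a minimal sketch (not the methodology of the cited studies) that computes GOPS per watt from a device's sustained parallel operations, clock rate, and power draw; the figures passed in main() are hypothetical round numbers chosen only to illustrate why packing more, simpler PEs onto a chip raises GOPS/watt.

```c
/* Rough computational-density-per-watt estimate. All device figures below are
 * hypothetical round numbers, not measurements from the cited studies. */
#include <stdio.h>

static double gops_per_watt(double parallel_ops, /* sustained ops per cycle */
                            double clock_mhz,    /* achievable clock rate   */
                            double power_watts)  /* device power draw       */
{
    double gops = parallel_ops * clock_mhz * 1e6 / 1e9; /* giga-ops per second */
    return gops / power_watts;
}

int main(void)
{
    /* Narrower operations need smaller processing elements, so more of them
     * fit on the chip and the sustained parallel-operation count rises. */
    printf("FPGA, Int16: %5.1f GOPS/W\n", gops_per_watt(2000.0, 250.0, 20.0));
    printf("FPGA, SPFP : %5.1f GOPS/W\n", gops_per_watt(500.0, 250.0, 20.0));
    printf("CPU,  SPFP : %5.1f GOPS/W\n", gops_per_watt(48.0, 3000.0, 100.0));
    return 0;
}
```

With a fixed fabric and power budget, shrinking each processing element leaves room for more of them, which is why the gap in Figure 1 widens for narrower data types.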
However encouraging, this promising approach has thus far been limited primarily to small systems, studies, and datasets. To move beyond these limits, some key challenges must be overcome. Chief among these challenges is parallelization, evaluation, and optimization of critical applications in data-intensive fields in a way that's transparent, flexible, portable, and performed at a much larger scale commensurate with the massive needs of emerging, real-world datasets. When successful, however, the impact will be a dramatic speedup in execution time concomitant with savings in energy and cooling.
As we describe later, our initial studies show critical applications executing at scale on Novo-G, achieving speeds rivaling the largest conventional supercomputers in existence—yet at a fraction of their size, energy, and cost. While processing speed and energy efficiency are important, the principal impact of a reconfigurable supercomputer like Novo-G is the freedom that its innovative approach can give to scientists to conduct more types of analysis, examine larger datasets, ask more questions, and find better answers.
Novo-G Recongurable
Supercomputer
The Novo-G experimental research
testbed has been operating since July
2009 at the NSF CHREC Center,
supporting various research projects
on scalable RC challenges. Novo-G’s
primary emphases are performance
Figure 1. Computational density (in giga operations per second) per watt of modern xed- and recongurable-logic devices.2
As the gure shows, recongurable-logic devices often achieve signicantly more operations per watt than xed-logic devices.
GOPS/Watt
0
EP3SE260 (65 nm)
EP3SL340 (65 nm)
EP4SE530 (40 nm)
FPOA (90 nm)
PACTXPP-3c
Recongurable-logic devices Fixed-logic devices
TILE64 (90 nm)
V6 SX475T (40 nm)
ADSP-TS203S (130 nm)
Athlon II X4 635
Cell (90 nm)
Intel Core i7-980X
Intel XeonX7560
Nvidia GTX 285
Nvidia GeForce GTX 480
Opteron 8439SE
Phenom II X6 1090T black
PowerXCell 8i (65 nm)
TI OMAP-L137 (65 nm)
Intel Itanium 9350 (Tukwila)
V6 LX7605T (40 nm)
V5 SX95T (65 nm)
V5 LX330T (65 nm)
10
20
30
40
50 1,944
2,296
221 292 551 320
13 0
348
320
48
648
324
488
180
306
1,440 2,128
324
2,632
60
70
80
Maximum number of operations Maximum number of operations
16-bit integer Single-precision oating-point 16-bit integer Single-precision oating-point
GOPS/Watt
0
1
2
3
4
5
10
6
76
32
64
72
16
96
480
736
48 48
64
6
64
144
96
192 480
736
114 114
64
16
6
7
9
8
Our initial studies show critical applications executing
at scale on Novo-G, achieving speeds rivaling the largest
conventional supercomputers in existence—yet at a
fraction of their size, energy, and cost.
CISE-13-1-Novel.indd 83 14/12/10 10:55 AM
NO V E L AR C H I T E C T U R E S
84 Computi ng in SCienCe & engi neering
(device, subsystem, system), productiv-
ity (concepts, languages, tools), and
impact (scalable applications).
Figure 2 shows the Novo-G machine and one of its quad-FPGA boards. The current Novo-G configuration consists of 24 compute nodes, each a standard 4U Linux server with an Intel quad-core Xeon (E5520) processor, memory, disk, and so on, housed in three racks. A single 1U server with twin quad-core Xeons functions as head node. Compute nodes communicate and synchronize via Gigabit Ethernet and a nonblocking fabric of 20 Gbits/s InfiniBand. Each of the compute nodes houses two PROCStar-III boards from GiDEL in its PCIe slots. Novo-G's novel computing power is derived from these boards, each containing four Stratix-III E260 FPGAs from Altera, resulting in a system of 48 boards and 192 FPGAs. (An impending upgrade will soon add 72 Stratix-IV E530 FPGAs to Novo-G, each with twice the reconfigurable logic of a Stratix-III E260—yet roughly the same power consumption—thereby expanding Novo-G's total reconfigurable logic by nearly 80 percent.) Concomitantly, when fully loaded, the entire Novo-G system's power consumption peaks at approximately 8,000 watts.
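The configuration arithmetic is easy to sanity-check; the per-FPGA power figure in the sketch below is our own rough division of the quoted system peak, not a number reported for the devices themselves.

```c
/* Sanity-check the Novo-G configuration figures quoted in the text. */
#include <stdio.h>

int main(void)
{
    const int nodes           = 24;     /* 4U compute servers in three racks */
    const int boards_per_node = 2;      /* GiDEL PROCStar-III boards (PCIe)  */
    const int fpgas_per_board = 4;      /* Altera Stratix-III E260 FPGAs     */
    const double peak_watts   = 8000.0; /* fully loaded system, from text    */

    int boards = nodes * boards_per_node;
    int fpgas  = boards * fpgas_per_board;

    printf("boards: %d, FPGAs: %d\n", boards, fpgas); /* expect 48 and 192 */
    /* Rough system-wide figure: includes hosts, memory, and interconnect. */
    printf("approx. watts per FPGA: %.1f\n", peak_watts / fpgas);
    return 0;
}
```

Even this crude estimate (roughly 42 watts per FPGA, hosts and interconnect included) hints at why the system fits comfortably in three racks.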
While this set of FPGAs can theoretically provide the system with massive computing power, the memory capacity, throughput, and latency often limit performance if unbalanced. As Figure 2b shows, 4.25 Gbytes of dedicated memory is attached to each FPGA in three banks. Data transfer between adjacent FPGAs can be made directly through a wide, bidirectional bus at rates up to 25.6 Gbps and latencies of a single clock cycle up to 300 MHz, and transfer between FPGAs across two boards in the same server is also supported via a high-speed cable. By supplying each FPGA with large, dedicated memory banks, as well as high bandwidth and low latency for inter-FPGA data transfer, the system strongly supports RC-centric applications. Processing engines on the FPGAs can execute with minimal involvement by the host CPU cores, enabling maximum FPGA utilization.
Alongside the architecture, equally important are the design tools available and upcoming for Novo-G. RC's very nature empowers application developers with far more capability, control, and influence over the architecture. Instead of leaving all architecture decisions to the device vendors—as with CPUs and GPUs—in RC, the application developer specifies a custom architecture configuration, such as quantity and types of operations, numerical precision, and breadth and depth of parallelism. Consequently, RC is a more challenging environment for application development, and productivity concepts and tools are thus vital.
Novo-G offers a broad and growing range of academic and commercial tools in areas such as

• strategic design and performance-prediction tools for parallel algorithm and mapping studies;
• message-passing interface (MPI), Unified Parallel C (UPC), and shared-memory (SHMEM) libraries for system-level programming with C (see the sketch after this list);
• Very High-Speed Integrated Circuit Hardware Description Language (VHDL), Verilog, and an expanding list of high-level synthesis tools for FPGA-level programming;
• an assortment of core libraries;
• middleware and APIs for design abstraction, platform virtualization, and portability of apps and tools; and
• verification and performance-optimization tools.
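To illustrate only the system-level programming layer (the FPGA designs themselves are written in VHDL or Verilog, or generated by high-level synthesis), here is a minimal MPI-in-C sketch that splits a large batch of pairwise sequence comparisons evenly across compute nodes. The function run_on_local_fpgas() is a hypothetical placeholder for the per-node offload call; it is not part of any Novo-G library or vendor API.

```c
/* Minimal MPI work-splitting sketch; run_on_local_fpgas() is a hypothetical
 * stand-in for the per-node FPGA offload, not an actual Novo-G API. */
#include <mpi.h>
#include <stdio.h>

/* Pretend accelerator call: score `count` sequence pairs starting at `first`.
 * Here it just returns the count so the example runs without hardware. */
static long run_on_local_fpgas(long first, long count)
{
    (void)first;
    return count;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long total_pairs = 32L * 1024 * 1024;         /* e.g., 32M comparisons */
    long chunk = (total_pairs + size - 1) / size;        /* even block per node   */
    long first = rank * chunk;
    long count = (first + chunk > total_pairs) ? total_pairs - first : chunk;
    if (count < 0) count = 0;

    long done_local = run_on_local_fpgas(first, count);  /* offload to local FPGAs */

    long done_total = 0;
    MPI_Reduce(&done_local, &done_total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("scored %ld of %ld sequence pairs\n", done_total, total_pairs);

    MPI_Finalize();
    return 0;
}
```

Because the applications described later are embarrassingly parallel across sequence pairs, even this simple block decomposition captures the essence of how work is spread over the cluster.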
[Figure 2. Novo-G and a processor board: (a) the Novo-G supercomputer and (b) one of its quad-FPGA reconfigurable processor boards. The current configuration includes 24 compute nodes housed in three racks. The head node is a single 1U server with twin quad-core Xeons. Board annotations: JTAG for SignalTap debug; 2 x 2 Gbytes = 4 Gbytes of DDR2 RAM per FPGA; PCIe x8 interface (4 Gbytes/s).]

To help expand the applications and tools available on Novo-G and establish and showcase RC's advantages at scale, the Novo-G Forum was formed in 2010. This forum is an international group of academic researchers and technology providers working collaboratively with a common goal: to realize the promise of reconfigurable supercomputing by demonstrating unprecedented levels of performance, productivity, and sustainability. Faculty and students in each academic research team are committed to contributing innovative applications and tools research on the Novo-G machine based upon their unique expertise and interests. Among the forum participants are Boston University, Clemson University, University of Florida, George Washington University, University of Glasgow (UK), Imperial College (UK), Northeastern University, Federal University of Pernambuco (Brazil), University of South Carolina, University of Tennessee, and Washington University in St. Louis. Each academic team has one or more Novo-G boards for local experiments and has remote access to the large Novo-G machine at Florida for scalability studies.
Initial Applications Studies
Of Novo-G's three principal emphases, impact is undoubtedly the most important. What good is a new and innovative high-performance system if the resulting applications have little impact in science and society? We now offer an overview of Novo-G's initial performance breakthroughs on a set of bioinformatics applications for genomics, which we developed in collaboration with the University of Florida's Interdisciplinary Center for Biotechnology Research (ICBR). Results of such breakthroughs can potentially transform the processing of massive genomics datasets, which in turn might enable revolutionary discoveries for a broad range of challenges in the health, life, and agricultural sciences.
Although more than a dozen challenging Novo-G application designs are underway in several scientific domains, our focus here is on our first case studies. These include two popular genomics applications for optimal sequence alignment based upon wavefront algorithms: Needleman-Wunsch (NW) and Smith-Waterman (SW) without traceback, and a metagenomics application—Needle-Distance (ND)—which is an augmentation of NW with distance calculations. We're nearing completion on an extended version of SW with the traceback option—SW+TB—by augmenting our SW hardware design to collect and feed data for traceback to the hosts so that FPGAs can perform SW while CPU cores perform TB. Initial results indicate that, after adding TB, execution times increase less on Novo-G than on the C/Opteron baseline, and thus Novo-G speedups with SW+TB exceed those of SW.
Each of the applications features massive data parallelism with minimal communication and synchronization among FPGAs, and a highly optimized systolic array of processing elements (PEs) within each FPGA (and optionally spanning multiple FPGAs). Using a novel method for in-stream control,3 we optimized each of the three designs to fit up to 850 PEs per FPGA for NW, 650 for SW, and 450 for ND, all operating at 125 MHz.
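For readers unfamiliar with wavefront alignment, the hedged C sketch below shows the Needleman-Wunsch scoring recurrence (without traceback) computed anti-diagonal by anti-diagonal. Every cell on an anti-diagonal depends only on the previous two anti-diagonals, so all of its cells can be computed concurrently, which is the property the systolic PE arrays exploit in hardware. The scoring constants are illustrative choices, not the parameters used in the Novo-G designs.

```c
/* Needleman-Wunsch scoring (no traceback), organized by anti-diagonals to
 * expose the wavefront parallelism that a systolic PE array exploits.
 * Match/mismatch/gap scores are illustrative choices only. */
#include <stdio.h>
#include <string.h>

#define MATCH     2
#define MISMATCH -1
#define GAP      -1

static int max3(int a, int b, int c) { return a > b ? (a > c ? a : c) : (b > c ? b : c); }

/* Returns the optimal global-alignment score of a (length m) vs. b (length n). */
static int nw_score(const char *a, int m, const char *b, int n)
{
    static int H[1024 + 1][1024 + 1];            /* assumes m, n <= 1024 */

    for (int i = 0; i <= m; i++) H[i][0] = i * GAP;
    for (int j = 0; j <= n; j++) H[0][j] = j * GAP;

    /* Cells on anti-diagonal d = i + j depend only on diagonals d-1 and d-2,
     * so in hardware every PE on a diagonal can update in the same clock cycle. */
    for (int d = 2; d <= m + n; d++) {
        int i_lo = d - n > 1 ? d - n : 1;
        int i_hi = d - 1 < m ? d - 1 : m;
        for (int i = i_lo; i <= i_hi; i++) {     /* independent; parallel in HW */
            int j = d - i;
            int s = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
            H[i][j] = max3(H[i - 1][j - 1] + s, H[i - 1][j] + GAP, H[i][j - 1] + GAP);
        }
    }
    return H[m][n];
}

int main(void)
{
    const char *a = "GATTACA", *b = "GCATGCU";
    printf("NW score: %d\n", nw_score(a, (int)strlen(a), b, (int)strlen(b)));
    return 0;
}
```

In a typical systolic mapping of this recurrence, each PE handles one column while the other sequence streams past, so the array evaluates one full anti-diagonal per clock cycle.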
Figure 3 shows a contour plot for each application that illustrates relative design performance on one FPGA under varying input conditions. The corresponding tables, reproduced below, show how the three designs scale when executed on multiple FPGAs in Novo-G.

[Figure 3. Performance results on Novo-G for three bioinformatics applications: (a) Needleman-Wunsch (NW), (b) Smith-Waterman (SW) without traceback, and (c) Needle-Distance (ND). Each chart illustrates the performance of a single FPGA under varying input conditions (contour plots of speedup versus sequence length and number of sequence comparisons for NW and ND, and versus sequence length and database length for SW). Each table shows performance with a varying number of FPGAs under optimal input conditions.3]

Needleman-Wunsch (NW)
Baseline: 192 · 2^25 length-850 sequence comparisons; software runtime: 11,026 CPU hours on 2.4 GHz Opteron.
# FPGAs      Runtime (sec)    Speedup
1            47,616           833
4            12,014           3,304
96           503              78,914
128          391              101,518
192 (est.)   270              147,013

Smith-Waterman (SW) without traceback
Baseline: human X chromosome vs. 19,200 length-650 sequences; software runtime: 5,481 CPU hours on 2.4 GHz Opteron.
# FPGAs      Runtime (sec)    Speedup
1            23,846           827
4            5,966            3,307
96           250              78,926
128          188              104,955
192 (est.)   127              155,366

Needle-Distance (ND)
Baseline: 192 · 2^24 length-450 distance calculations; software runtime: 11,673 CPU hours on 2.4 GHz Opteron.
# FPGAs      Runtime (sec)    Speedup
1            13,522           3,108
4            3,429            12,255
96           144              291,825
128          118              356,125
192 (est.)   77               545,751
In all cases, speedup is defined in terms of an optimized C-code software baseline running on a 2.4 GHz Opteron core in our lab. More details on these algorithms, architectures, experiments, and results are provided elsewhere.3 All data except the tables' final rows came directly from measurements on Novo-G and reflect the full execution time, including not just computation but also data transfers to and from the FPGAs.
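As a concrete check of how the table entries relate: for ND on 128 FPGAs, the 11,673 CPU-hour baseline is 11,673 x 3,600 ≈ 42.0 million core-seconds, and dividing by the measured 118-second runtime gives the reported speedup of about 356,000; the NW and SW rows follow from their own baselines in the same way.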
Speedup with one FPGA on each of the three applications peaked at approximately 830 for NW and SW and more than 3,100 for ND. When ramping up from a single FPGA to a quad-FPGA board, we observed speedups grow almost linearly to about 3,300 for NW and SW and more than 12,000 for ND. At the largest scale of our testbed experiments—32 boards, or 128 FPGAs—speedups for NW and SW exceeded 100,000, and ND exceeded 356,000.
Because not all 48 boards in Novo-G were operational during our study, we extrapolated these trends, estimating speedups on all 192 FPGAs of Novo-G of about 150,000 for NW and SW and almost 550,000 for ND. Putting these numbers in context, the latter implies that a conventional supercomputer would require more than a half-million Opteron cores operating optimally to match Novo-G's performance on the ND application. By contrast, none of the world's largest supercomputing machines (as ranked, for example, at www.top500.org) has this many cores, and thus none could achieve such performance on this application despite being orders of magnitude larger in cost, size, weight, power, and cooling. Although Novo-G won't provide the same speedups for all applications, these examples highlight RC's potential advantages, especially in solving problems where conventional, fixed-logic computing falls far short of achieving optimal performance.
For a growing list of important applications from a broad range of science domains, underlying computations and data-driven demands are proving to be underserved by conventional "one size fits all" processing devices. By changing the mindset of computing—from processor-centric to application-centric—reconfigurable computing can provide solutions for domain scientists at a fraction of the time and cost of traditional servers or supercomputers. As we describe here, the Novo-G machine, applications, research forum, and preliminary results are helping to pave the way for scalable reconfigurable computing.
References
1. J. Williams et al., "Characterization of Fixed and Reconfigurable Multi-Core Devices for Application Acceleration," ACM Trans. Reconfigurable Technology and Systems, vol. 3, no. 4, 2011; to appear.
2. J. Richardson et al., "Comparative Analysis of HPC and Accelerator Devices: Computation, Memory, I/O, and Power," Proc. High-Performance Reconfigurable Computing Technology and Applications Workshop, ACM/IEEE Supercomputing Conf. (SC10), IEEE Press, to appear.
3. C. Pascoe et al., "Reconfigurable Supercomputing with Scalable Systolic Arrays and In-Stream Control for Wavefront Genomics Processing," Proc. Symp. Application Accelerators in High-Performance Computing, 2010; www.chrec.org/pubs/SAAHPC10_F1.pdf.
Alan George is director of the US National Science Foundation Center for High-Performance Reconfigurable Computing and a professor of electrical and computer engineering at the University of Florida. His research interests focus upon high-performance architectures, networks, systems, services, and applications for reconfigurable, parallel, distributed, and fault-tolerant computing. George has a PhD in computer science from Florida State University. He is a member of the IEEE Computer Society, the ACM, the Society for Computer Simulation, and the American Institute of Aeronautics and Astronautics. Contact him at ageorge@ufl.edu.

Herman Lam is an associate professor in the Department of Electrical and Computer Engineering at the University of Florida. His research interests include design methods and tools for RC application development, particularly as applied to large-scale reconfigurable supercomputing. Lam has a PhD in electrical and computer engineering from the University of Florida. He is a member of IEEE and the ACM and is a faculty member of the NSF Center for High-Performance Reconfigurable Computing. Contact him at hlam@ufl.edu.

Greg Stitt is an assistant professor in the Department of Electrical and Computer Engineering at the University of Florida and a faculty member of the US National Science Foundation Center for High-Performance Reconfigurable Computing. His research interests include design automation for reconfigurable computing and embedded systems. Stitt has a PhD in computer science from the University of California, Riverside. He is a member of IEEE and the ACM. Contact him at gstitt@ece.ufl.edu.