Smart Technologies for Effective Reconfiguration:
The FASTER approach
M. D. Santambrogio, D. Pnevmatikatos, K. Papadimitriou, C. Pilato, G. Gaydadjiev§, D. Stroobandt, T. Davidson, T. Becker, T. Todman, W. Luk, A. Bonetto, A. Cazzaniga, G. C. Durelli, D. Sciuto
Dipartimento di Elettronica e Informazione, Politecnico di Milano
Foundation for Research and Technology - Hellas
Ghent University
§Chalmers University of Technology
Imperial College London
Abstract: Current and future computing systems increasingly
require that their functionality stays flexible after the system is
operational, in order to cope with changing user requirements
and improvements in system features, i.e. changing protocols and
data-coding standards, evolving demands for support of different
user applications, and newly emerging applications in communi-
cation, computing and consumer electronics. Therefore, extending
the functionality and the lifetime of products requires the
addition of new functionality to track and satisfy the customers'
needs and market and technology trends. Many contemporary
products incorporate, alongside the software part, hardware
accelerators for reasons of performance and power efficiency.
While adaptivity of software is straightforward, adaptation of
the hardware to changing requirements constitutes a challenging
problem requiring delicate solutions.
The FASTER (Facilitating Analysis and Synthesis Technologies
for Effective Reconfiguration) project aims at introducing a
complete methodology to allow designers to easily implement
a system specification on a platform which includes a general
purpose processor combined with multiple accelerators running
on an FPGA, taking as input a high-level description and fully
exploiting, both at design time and at run time, the capabilities
of partial dynamic reconfiguration. The goal is that for selected
application domains, the FASTER toolchain will be able to
reduce the design and verification time of complex reconfigurable
systems providing additional novel verification features that are
not available in existing tool flows.
I. INTRODUCTION
Altering the functionality of hardware so as to adapt to new
requirements can offer great advantages in a wide range of
application domains. For example, a Network Intrusion Detec-
tion System needs to scan all incoming network packets for
suspicious content. The scanning has to be carried out at line-
speed so that the communications are not slowed down, while
the list of threats to check for may be extended and updated
on a daily basis. Fixed hardware solutions can achieve high
performance, and software solutions can easily adapt to the
new set of threats, but neither can achieve adaptivity and high
performance at the same time. Reconfigurable logic allows new
functions to be defined in hardware units, combining hardware
speed and efficiency with the ability to adapt
and cope in a cost effective way with expanding functionality,
changing environmental requirements, and improvements in
system features. For the Intrusion Detection System the new
rules can be hardcoded into the reconfigurable logic, thus
retaining the high performance, while providing the necessary
adaptivity and extensibility to new threats.
However, designing, implementing and verifying evolving
hardware systems is harder compared to static ones. The
FASTER project (Facilitating Analysis and Synthesis Tech-
nologies for Effective Reconfiguration) [1] aims at introducing
a complete methodology to allow designers to easily imple-
ment and verify a system specification on a platform that
includes one or more general purpose processor(s) combined
with multiple acceleration modules implemented on one or
multiple reconfigurable devices. Our goal is that for selected
application domains, the envisioned toolchain will be able to
reduce the design and verification time of complex reconfig-
urable systems by at least 20%, providing additional novel
verification features that are not available in existing tool
flows. In terms of performance, for these application domains
the toolchain could be used to achieve the same performance
with up to 50% smaller cost compared to programmable SoC
based approaches, or exceed the performance by up to a factor
of 2 for a fixed power consumption envelope.
FASTER will support both region-based [2] and micro-
reconfiguration, a technique to reconfigure very small parts of
the device [3]. The ability to handle both types of reconfigura-
tion opens up a new range of application possibilities for run-
time reconfiguration, as a much broader time frame for the re-
configuration itself is available and the underlying concepts are
different for both types of reconfiguration. FASTER will also
develop techniques for verifying static and dynamic aspects
of a reconfigurable design at compile time using symbolic
simulation, a powerful verification approach for static designs,
which we extend to cover the dynamic aspects of reconfiguration
as well. We will also explore techniques
for verifying selected static and dynamic aspects of a recon-
figurable design at run time with a small impact on speed,
area and power consumption. Finally, FASTER will provide a
2
powerful runtime system that will be able to run on multiple
reconfigurable platforms and manage the various aspects of
parallelism and adaptivity with the least overhead. Within this
context, the novel contributions of the FASTER project can
be summarized in the following:
- Including reconfigurability as an explicit design concept in computing systems design,
- Providing methods and tools that support run-time reconfiguration in the entire design methodology,
- Providing seamless integration of parallelism in the specification that can then be applied to software or (reconfigurable) hardware components,
- Providing a framework for analysis, synthesis and verification of a reconfigurable system,
- Providing efficient and transparent runtime support for partial and dynamic reconfiguration.
In the remainder of this paper, Section II describes the
overall problem FASTER aims to solve; related work with
similar objectives is also presented in this section. Section III
describes the FASTER project, providing a general view of its
organization. Section IV focuses on two components of
FASTER: the offline analysis and the runtime support for
managing dynamically reconfigurable systems. Finally, Section V
presents the authors' conclusions and future research directions.
II. CONTEXT DEFINITION
Current and future computing systems increasingly need
to be flexible and extensible even after the system is op-
erational, in order to cope with changing requirements and
improvements in system features. The FASTER project [1]
aims at exploiting the capabilities of dynamic reconfiguration,
both at design-time and run-time, by taking as input a high-
level description of the application.
A. An Opportunity
In an ever-changing world there is an increasing demand
for embedded systems that will be able to adapt to their
environment or to meet new application demands. Adaptation
by means of changing the software running on processors
is not always adequate: many embedded applications require
hardware acceleration, and it is imperative that this hardware
can also adapt to application changes. Reconfigurable
hardware is the key enabler for these systems. Hardware
supported adaptation mechanisms provide a cost effective way
of coping with changing requirements. This is in addition to
providing the flexibility needed to allow functionalities to be
defined and easily added or substituted after a system has been
manufactured and is already deployed. However, the ability
to take relevant reconfiguration issues into account from the
initial system specification to the final system design and the
mechanisms required to support this additional functionality
at runtime are currently lacking.
B. FASTER in the EU Projects Context
Previous research and EU projects such as MORPHEUS
[4], hArtes [5], Reflect [6], S4, Acotes [7], and Andres [8]
focus on the necessary tool-chains and address issues similar
to those of FASTER, but they concentrate more on system-level
or architectural aspects of reconfiguration. The two projects closest to
FASTER are hArtes and MORPHEUS. hArtes focuses on the
creation of a toolchain for mapping sequential applications
in C onto a heterogeneous platform with different processors
and possibly also an FPGA (Molen architecture). No dynamic
reconfigurability is foreseen and the toolchain is geared to-
wards parallelizing, partitioning, mapping and scheduling C
code. The tasks marked for FPGA implementation follow the
toolchain for Molen [9]. MORPHEUS addresses solutions for
embedded computing based on a dynamically reconfigurable
platform and tools. The approach is to develop a global solu-
tion for a particular modular heterogeneous SoC platform that
provides a software-oriented design flow and toolset. These
soft hardware architectures enable better computing density
improvements, positioned between general purpose flexible
hardware and general purpose processors. Again, runtime
reconfiguration is possible with a coprocessor interface and
instruction extensions using the TU-Delft Molen approach.
While both these projects address issues similar to FASTER's,
there is no explicit emphasis on design and runtime aspects of
partial and dynamic reconfiguration.
This is exactly where FASTER intends to contribute: to
introduce partial and dynamic reconfiguration from the initial
design of the system all the way to its runtime use. We
therefore have to define the design concepts capturing both
parallelism and reconfigurability as essential system properties
and to provide efficient and user transparent runtime support
for dynamic and partial reconfiguration. As reconfigurability
increases both the runtime and design time complexity, an
adequate framework for verification and analysis is a necessity,
and the formulation of such a framework will therefore also
constitute one of the targets of the project. FASTER is focused
on exploiting reconfiguration at the hardware description level,
so the front-end of these approaches can be used to translate
languages such as C or SystemC into task graphs which can
then be processed with the FASTER analysis and scheduling.
Furthermore, we assume that applications are developed taking
into account the dynamic and partial reconfigurability from the
start rather than transforming code to take advantage of this
feature. However, the tool is not restricted to new applications
and existing code can be integrated in two possible ways:
(a) using a pre-processing step that will produce the task
dependency graphs to be analyzed by the FASTER tool chain
so that its results and decisions can be used later on for the
synthesis and mapping of the application, or (b) by employing
a compilation approach such as the one proposed by the
OSMOSIS Capacities project, where a high-level description
in C or SystemC is transformed into HDL, which can then be
processed by the FASTER tool-chain. Therefore, FASTER is
complementary to hArtes since it adds an additional degree of
flexibility and heterogeneity by adding partial dynamic recon-
figurability, both as region-based and as micro-reconfiguration. The
two tool-chains could be usefully integrated together. Within
this context, the FASTER tool-chain will provide the design
analysis tools to identify the reconfigurability characteristics
of the application, determine the best implementation options,
3
and verify the resulting implementation in order to take advan-
tage of these features from the start, rather than introducing
reconfiguration late in the mapping process, or merely by
transforming the code.
FASTER proposes a novel approach for computing system
design, focusing on dynamically and partially reconfigurable
FPGA-based architectures. This project binds classical com-
puting system design and hardware reconfiguration to focus
the design attention on the entire platform characterization.
FASTER envisions the development of new tools and the
identification and formalization of new models and design
methodologies that can represent and implement efficient
computing systems. The tools will cope with the application
requirements as well as the development of area allocation
and management algorithms for efficient runtime support of
dynamic reconfiguration. FASTER will also extend the recon-
figuration options above the traditional region-based design
supporting both micro-reconfiguration and region-based recon-
figuration in the developed tool set. This broadens the range
of reconfiguration time scales that can be effectively used by
target applications.
III. FASTER AT A GLANCE
The FASTER project aims to provide the ability to introduce
reconfigurability as an explicit design concept. Towards this
concept the following five objectives will be addressed:
- the ability to explore and specify reconfiguration at any granularity that best fits the designer's application,
- methods and tools for including run-time reconfiguration in every aspect of the design methodology,
- tools for analysing and verifying both static and reconfigurable parts of the system,
- a flexible run-time system, together with a specification of the hardware support it needs to efficiently implement the reconfiguration actions and optimize system operation under the desired criteria, and finally
- seamless integration of parallelism and reconfigurability in the specification, irrespective of hardware or software implementation.
Based on these objectives, a new model for addressing
reconfiguration in dynamically reconfigurable FPGAs and
new graph-theoretic algorithms for the temporal and spatial
partitioning of a specification on the same architectures will
be proposed. We aim to propose a novel approach to defining
the clustering of tasks of a system specification by detecting
recurrent structures in the specification itself. This will allow
us to identify modules that can be used more than once during
the application lifetime in order to save device resources and
reconfiguration time. The intermediate system representation
is then analysed in terms of its spatial-temporal constraints.
While most of the temporal constraints can be specified in the
initial design phase, a more detailed assessment of them will
take place in the second step.
This spatial-temporal profile is then verified against the ini-
tial application and design requirements. The next step consists
of dynamic scheduling where the runtime requirements of the
application are taken into account. By providing feedback to
the initial design stage certain decisions can be revised. The
outcome of this iterative process is a system description that
conforms to the application requirements and where dynamic
properties are also as much as possible taken into account.
The final architecture is then generated along with additional
information necessary for the runtime system.
A main innovation of the FASTER project is its support
for both micro-reconfiguration and region-based reconfigura-
tion, alongside static designs. In micro-reconfiguration, the parts being re-
configured are of fine granularity, such as LUTs (LookUp
Tables) or routing switches. Micro-reconfiguration can be fast
and configuration generation can take place at run-time, for
instance triggered by the change of some parameters [10].
Conceptually, micro-reconfiguration deals with the adaptation
of the implemented function by changing a small set of con-
figuration bits only. In contrast, region-based reconfiguration
deals with the instantiation of a new function in an entire re-
gion of the FPGA. The regions being reconfigured are of much
coarser granularity, with large blocks of logic being swapped
in and out. Region-based reconfiguration tends to be slow,
and traditionally configuration generation takes place at design
time. These pre-designed configurations program predefined
regions during runtime. The ability to handle both types of
reconfiguration opens up a new range of possibilities for run-
time reconfiguration, which can offer a versatile framework
for serving the design of applications targeting reconfigurable
hardware.
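As a rough illustration of why the two reconfiguration styles cover very different time scales, the following Python sketch estimates reconfiguration latency for both cases; the bitstream size, frame size and configuration-port bandwidth used here are illustrative assumptions, not measured FASTER figures.

    # Illustrative-only latency model for the two reconfiguration styles.
    # All constants below are assumptions made for the sake of the example.

    ICAP_BANDWIDTH_BYTES_PER_S = 100e6   # assumed configuration-port throughput
    FRAME_SIZE_BYTES = 164               # assumed size of one configuration frame

    def region_reconfig_latency(partial_bitstream_bytes):
        """Region-based: a whole partial bitstream is written to the device."""
        return partial_bitstream_bytes / ICAP_BANDWIDTH_BYTES_PER_S

    def micro_reconfig_latency(num_modified_frames):
        """Micro-reconfiguration: only the frames holding the tuned bits change."""
        return (num_modified_frames * FRAME_SIZE_BYTES) / ICAP_BANDWIDTH_BYTES_PER_S

    if __name__ == "__main__":
        # e.g. a 300 KB partial bitstream vs. a dozen touched frames
        print("region-based: %.3f ms" % (1e3 * region_reconfig_latency(300 * 1024)))
        print("micro       : %.6f ms" % (1e3 * micro_reconfig_latency(12)))

Under these assumed numbers the two latencies differ by more than two orders of magnitude, which is what makes the broader time frame mentioned above usable by applications.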
Figure 1 illustrates the overall FASTER tool chain. Starting
from the left side of the Figure, it begins with the description
of the application (in HLL, HDL or in other formats such
as task graph, etc.) plus the application requirements and
an abstract description of the reconfiguration capabilities of
the target platform. When starting from HLL (C/C++ with
OpenMP annotations, OpenCL, etc.), FASTER analysis (WP2)
can determine which portions of the application can be ac-
celerated in hardware and which will be executed in software. In
any case, the FASTER methodology then focuses on the
portions to be executed in hardware, where the analysis
on the corresponding HDLs determines which parts can be
reconfigured dynamically and which portions are fixed. Further
analysis will estimate the performance to offer feedback to the
user. The analysis will also determine the most appropriate
granularity for reconfiguration and the required parameters.
WP3 performs synthesis, floorplanning, placement and routing
for reconfigurable portions that have been selected for micro-
reconfiguration. It also performs the necessary check to verify
that the partitioned dynamically reconfigurable version of the
application is equivalent to the original source. The partitioned
application description is annotated with dependency informa-
tion that will drive the dynamic aspects of the reconfiguration.
The final bit files can be produced by vendor-specific tools, and
together with the run-time scheduler established in WP4 they
form the complete reconfigurable system. WP1 is an initial
step in the project and will establish the list of requirements
for the tools and determine the best evaluation criteria to assess
the results of the project, and WP5 drives the validation and
evaluation of the FASTER tools and flow.
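For concreteness, one way the partitioned and annotated application description mentioned above could be represented is sketched below in Python; the field names (mapping, granularity, region, depends_on, prefetch_hint) are hypothetical and merely illustrate the kind of dependency and reconfiguration annotations that the front-end (WP2/WP3) could hand over to the runtime system (WP4).

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class TaskNode:
        """One node of the partitioned application (field names are illustrative)."""
        name: str
        mapping: str                       # "sw" or "hw"
        granularity: Optional[str] = None  # None, "region" or "micro" for hw tasks
        region: Optional[str] = None       # reconfigurable region chosen at design time
        depends_on: List[str] = field(default_factory=list)
        prefetch_hint: bool = False        # static hint for the runtime prefetcher

    # A toy partitioned application: one software task feeding two hardware cores.
    app = [
        TaskNode("parse",  mapping="sw", depends_on=[]),
        TaskNode("filter", mapping="hw", granularity="region", region="RR1",
                 depends_on=["parse"], prefetch_hint=True),
        TaskNode("match",  mapping="hw", granularity="micro",  region="RR2",
                 depends_on=["filter"]),
    ]

    for t in app:
        print(t)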
Fig. 1. FASTER design flow broken down into distinct work packages: WP2 (front-end), WP3 (back-end) and WP4 (runtime), the vendor flows (with and without relocation, and combined with the UGent flow), and the target system with a GPP, a static area and reconfigurable regions RR1 and RR2.
IV. FASTER OFFLINE AND RUNTIME SUPPORT FOR
DYNAMICALLY RECONFIGURABLE SYSTEMS
A. High-level analysis and reconfigurable system definition
The main goal of this work package is to analyze each
application and define its components, estimate its execution
time on the target platform, identify the right level of recon-
figurability for it (none, region-based or micro-reconfigurable),
and determine the power constraints, the floorplanning constraints
(size and shape), the placement requirements (type of resources
and connectivity among modules) on the target platform, and the
baseline schedule for its execution. In this work package,
the definition of a methodology and workflow to partition a
system specification into a task graph (in which every task is
to be treated as a region-reconfigurable module or a micro-
reconfigurable module) and to schedule it on the same recon-
figurable architecture, will be carried out. Figure 2 provides
an overview of its structure, showing the different stages from
the original high-level specification to the partitioned system
(i.e., cores which are ready to be scheduled for configuration
and execution). This work package is composed of 5 tasks.
The first task involves the use of high-level analytical models
in guiding the optimization of the system. The second task
involves application profiling for identifying opportunities for
reconfiguration. The third task concerns application optimiza-
tion for implementations involving micro-reconfigurable cores.
Based on these three tasks, the fourth task concerns baseline
scheduling of applications on the target platform. The output
will be the baseline schedule and the description of the appli-
cation subdivided into components, ready for synthesis using
any commercially available tools for the targeted technology.
Finally, different implementations of the same design can be
verified as being equivalent to one another.
1) High-level analysis: The purpose of high-level analysis
is to provide an analytical model of a reconfigurable design
that relates its application attributes to possible implementation
parameters, from which metrics such as area, performance and
power consumption can be estimated. Based on the above
implementation parameters we will then develop a tool for
statistical prediction of metrics such as reconfiguration time,
area, performance and power consumption, with which each
component will be annotated. This process is intended to
guide the identification of promising components and reconfiguration
techniques at an early stage of the design, and to assess their
impact on the implementation.

Fig. 2. FASTER high-level analysis and reconfigurable system definition: the front-end takes the application description and an XML description of the architecture (via a designer GUI and a task-graph generator), and comprises T2.1 (high-level analysis), T2.2 (application task profiling and identification of reconfigurable cores), T2.3 (optimization of applications for micro-reconfigurable core implementation) and T2.4 (compile-time baseline scheduling and core mapping onto reconfigurable regions).

Examples of
application attributes include required throughput or execution
deadlines, data precision and data set size. Examples of
implementation parameters include available resources, their
area-delay-power estimates, and the associated reconfiguration
overheads. The output of high-level analysis is an estimate
of how a design with a given set of application attributes
and implementation parameters performs. However, it is not
the goal of this task to produce code or to analyze code
automatically. Rather, the estimation is intended to guide the
identification of features of promising designs, which could
benefit other tasks in WP2. Furthermore, the analysis process
continues in parallel throughout the implementation phase. It
interfaces with WP3 in that implementation parameters are
updated during synthesis and physical implementation. Such
updated values are fed back into the model and are used to
verify prior calculations. The analysis process also interfaces
with WP4 because characteristics of run-time management
techniques form part of the implementation parameters used in
the analysis. For instance, one would analyze how the response
time and reconfiguration time of run-time controllers influence
application performance. Likewise the analysis process can be
used to improve run-time management techniques by making
explicit the requirements for response times, context save and
restore procedures.
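A minimal sketch of such an analytical model is given below, assuming a throughput-driven design: from one application attribute (required throughput) and a few implementation parameters (per-core throughput, area and power, reconfiguration time) it derives the degree of parallelism and the resulting area and power estimates. The formulas and numbers are placeholders for illustration, not the FASTER model itself.

    import math

    def estimate(throughput_required, core_throughput, core_area_luts,
                 core_power_w, reconfig_time_s):
        """Toy analytical model: all inputs and outputs are illustrative estimates."""
        parallelism = math.ceil(throughput_required / core_throughput)
        return {
            "parallelism": parallelism,
            "area_luts": parallelism * core_area_luts,
            "power_w": parallelism * core_power_w,
            "reconfig_time_s": reconfig_time_s,  # carried along as an implementation parameter
        }

    # Example: a component must sustain 500 Mitems/s with cores of 120 Mitems/s each.
    print(estimate(500e6, 120e6, core_area_luts=1800,
                   core_power_w=0.15, reconfig_time_s=2e-3))

Feedback from synthesis (WP3) or from run-time measurements (WP4) would simply overwrite the implementation parameters and re-run the estimation, as described above.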
2) Profiling of applications for identifying region-based,
micro-reconfigurable, and static cores: The input of this
phase is the original specification, whose analysis results in
cores, i.e. groups of operations that compose configurable
modules, with optimal sizes. The identification of the region-
based reconfigurable cores is performed by analyzing the
Control Data Flow Graph of the input application, and trying
to identify isomorphic subgraphs in the graph. The choice
of finding isomorphic subgraphs is related to the possibility
of reusing these components without reconfiguration, thus
hiding/minimizing reconfiguration time. Within this context,
if distinct components, having the same implementation, are
mapped onto the same reconfigurable core, we can execute
them on the same physically implemented module with a
single initial reconfiguration. Obviously, at the end we have to
identify a number of components, isomorphic or not, that cover
the entire graph in order to restructure the entire application
into a set of interconnected components.
Our first step is therefore the core identification phase, which
analyzes the original specification and produces the cores, i.e.
groups of operations that, reconfigured together as configurable
modules, have optimal sizes. The second part of the task is the
partitioning phase: using the previously computed set of modules
as its input, it produces a set of feasible covers of the original
graph of the specification, following a given policy.
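The sketch below illustrates the idea of grouping candidate cores by isomorphism so that identical implementations can share one physical module; it uses networkx graph isomorphism with the operation label as the matching criterion. The example CDFG and the candidate core sets are made up; the actual FASTER clustering of course works on the full control data flow graph together with cost models.

    import networkx as nx
    from networkx.algorithms import isomorphism

    def group_isomorphic_cores(cdfg, candidate_cores):
        """Cluster candidate cores (node sets) whose subgraphs are isomorphic,
        matching nodes by their 'op' attribute; each cluster can share one module."""
        clusters = []  # list of (representative_subgraph, [core_node_sets])
        nm = isomorphism.categorical_node_match("op", None)
        for nodes in candidate_cores:
            sub = cdfg.subgraph(nodes)
            for rep, members in clusters:
                if nx.is_isomorphic(sub, rep, node_match=nm):
                    members.append(nodes)
                    break
            else:
                clusters.append((sub, [nodes]))
        return [members for _, members in clusters]

    # Tiny made-up CDFG: two identical multiply-add chains and one subtract.
    g = nx.DiGraph()
    g.add_nodes_from([(1, {"op": "mul"}), (2, {"op": "add"}),
                      (3, {"op": "mul"}), (4, {"op": "add"}),
                      (5, {"op": "sub"})])
    g.add_edges_from([(1, 2), (3, 4), (2, 5), (4, 5)])

    print(group_isomorphic_cores(g, [{1, 2}, {3, 4}, {5}]))
    # expected: [[{1, 2}, {3, 4}], [{5}]]  (the two mul/add cores map to one module)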
3) Optimization of applications for micro-reconfigurable
core implementation: Since every change of a parameter
value in the parameterizable micro-reconfiguration results in a
reconfiguration of (part of) the implementation, the number
of parameter changes should be kept as low as possible.
This calls for higher temporal locality of parameter values. The
consortium has extensive experience in loop transformations
that increase the locality of data in cache based architectures.
We will investigate if and how such loop transformations
can improve the locality of parameter values and apply these
transformations to optimize the applications for the micro-
reconfiguration implementation. In this task, we will also
research how current applications under consideration can be
altered to benefit more from micro-reconfiguration. This will
mainly be an investigation on the introduction of parameters in
such a way that the overall implementation can be optimized.
In this respect, we will also investigate multi-mode applica-
tions where the different modes are similar (but not exactly
the same) and investigate how such applications can best be
represented to benefit from micro-reconfiguration.
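The following sketch shows the intuition behind these transformations: if the iterations are grouped by parameter value, the number of parameter changes, and hence of micro-reconfigurations, drops. The workload and the cost counting are purely illustrative.

    def count_reconfigurations(parameter_trace):
        """Each change of the parameter value triggers one micro-reconfiguration."""
        changes = 0
        previous = object()        # sentinel: the first value always reconfigures
        for value in parameter_trace:
            if value != previous:
                changes += 1
                previous = value
        return changes

    # Illustrative workload: (parameter value, data item) pairs in arrival order.
    work = [(3, "a"), (7, "b"), (3, "c"), (7, "d"), (3, "e"), (7, "f")]

    original_order = [p for p, _ in work]
    # Loop-transformation analogue: process all items of one parameter value first.
    grouped_order = [p for p, _ in sorted(work, key=lambda item: item[0])]

    print("reconfigurations, original order:", count_reconfigurations(original_order))  # 6
    print("reconfigurations, grouped order :", count_reconfigurations(grouped_order))   # 2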
4) Compile-time baseline scheduling: Given the informa-
tion computed in the previous tasks, an integration between
a heuristic reconfiguration-aware scheduler and a floorplacer
algorithm will produce the baseline schedule of the tasks on
the target platform. This scheduler is a reconfiguration-aware
scheduler for dynamically partially reconfigurable architec-
tures that can also manage static reconfiguration and multiple
FPGAs. Its distinguishing features are the exploitation of con-
figuration prefetching, module reuse and anti-fragmentation
techniques. It schedules the modules according to the actual
hardware composition and availability: given the actual spatial
constraints defined by the floorplacer, it schedules the appli-
cations trying to reach the minimum scheduling time. The
function to be optimized can be chosen based on the user
constraints. The output of this task is the list of timings for each
module, i.e. start time and reconfiguration time. However, a
static schedule may not be possible for all of the targeted
applications. For example, when the execution times of the
modules are data-dependent the scheduler will only provide
guidelines to the runtime system and the HW scheduler that
will manage the system at runtime. A strict interaction with
WP4 and the runtime scheduler is therefore foreseen, since the
runtime scheduler will receive as input the results produced
during the generation phase.
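A much simplified, hypothetical version of such a reconfiguration-aware scheduler is sketched below: it walks the tasks in dependency order, reuses a module that is already configured in a region (skipping the reconfiguration), and otherwise charges the reconfiguration time before execution. Floorplanning, prefetching and anti-fragmentation are deliberately left out, and all data structures and costs are assumptions.

    def baseline_schedule(tasks, regions):
        """tasks: list of dicts with 'name', 'module', 'exec_time', 'deps' (names).
        regions: list of region names.  Returns per-task (start, end, region)."""
        RECONFIG_TIME = 1.0                       # assumed, identical for all modules
        region_free_at = {r: 0.0 for r in regions}
        region_module = {r: None for r in regions}
        finished_at = {}
        schedule = {}

        for task in tasks:                        # assumed already in topological order
            ready = max((finished_at[d] for d in task["deps"]), default=0.0)
            # Prefer a region that already holds this module (module reuse).
            reuse = [r for r in regions if region_module[r] == task["module"]]
            region = min(reuse or regions, key=lambda r: region_free_at[r])
            start = max(ready, region_free_at[region])
            if region_module[region] != task["module"]:
                start += RECONFIG_TIME            # pay the reconfiguration cost
                region_module[region] = task["module"]
            end = start + task["exec_time"]
            region_free_at[region] = end
            finished_at[task["name"]] = end
            schedule[task["name"]] = (start, end, region)
        return schedule

    tasks = [
        {"name": "t1", "module": "fir", "exec_time": 2.0, "deps": []},
        {"name": "t2", "module": "fft", "exec_time": 3.0, "deps": []},
        {"name": "t3", "module": "fir", "exec_time": 2.0, "deps": ["t1"]},
    ]
    print(baseline_schedule(tasks, ["RR1", "RR2"]))

In this toy run, t3 reuses the "fir" module left in RR1 by t1 and therefore starts without a second reconfiguration, which is the effect the module-reuse feature described above aims for.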
5) Verification: Verification answers the question: how do
we know the transformed design preserves the same be-
haviour? Traditionally, hardware designers have used exten-
sive simulation to verify that their designs implement the
desired behaviour, but as designs become increasingly large,
the number of test inputs required can grow prohibitively. Rather
than simulating the design logically or numerically, we use
symbolic simulation to verify that a transformed version of
the design preserves the same behaviour as the original. The
symbolic simulation can run at word level, which allows us to
verify larger designs, assuming that bit level operator imple-
mentations are correct. We combine symbolic simulation with
an equivalence checker, based on the Yices tool [11], to check
whether different symbolic outputs are actually equivalent or
not.
We further extend verification to check the dynamic aspects
of designs: the behaviour of reconfiguring designs. We use
the concept of virtual multiplexers [12] to represent a re-
configurable region by virtual multiplexer-demultiplexer pairs
enclosing all the reconfigurable parts that can be loaded into
that region. This allows us to represent reconfigurable regions
in our symbolic simulation with multiplexers whose control
signals are generated by software parts of the design. Our
approach can apply to both region-based reconfiguration and
microreconfiguration.
Finally, our approach can extend to dynamic verification:
verifying the design at run-time. This entails running our
verification flow on the device as it runs.
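To make the virtual-multiplexer idea concrete, the sketch below emits an SMT-LIB 2 script (a format accepted by SMT solvers, including recent versions of Yices) that checks a toy word-level equivalence: a reconfigurable region that may hold either a multiplier or an adder is modelled as an if-then-else on a configuration signal, and the transformed design is equivalent to the original exactly when the solver reports unsat. The designs, signal names and file name are invented for illustration.

    # Build a tiny SMT-LIB 2 equivalence query.  'cfg' selects which module is
    # currently loaded in the reconfigurable region (the "virtual multiplexer").
    query = """
    (set-logic QF_BV)
    (declare-fun a   () (_ BitVec 8))
    (declare-fun b   () (_ BitVec 8))
    (declare-fun c   () (_ BitVec 8))
    (declare-fun cfg () (_ BitVec 1))

    ; Original (static) design: out = a*b + c
    (define-fun out_orig () (_ BitVec 8) (bvadd (bvmul a b) c))

    ; Reconfigurable design: the region holds a*b when cfg=0, a+b when cfg=1.
    (define-fun region_out () (_ BitVec 8)
      (ite (= cfg #b0) (bvmul a b) (bvadd a b)))
    (define-fun out_reconf () (_ BitVec 8) (bvadd region_out c))

    ; Equivalence is claimed only for the configuration that loads the multiplier.
    (assert (= cfg #b0))
    (assert (distinct out_orig out_reconf))
    (check-sat)          ; 'unsat' here means the two designs are equivalent
    """

    with open("equiv_check.smt2", "w") as f:
        f.write(query)
    print("wrote equiv_check.smt2; feed it to an SMT-LIB 2 solver")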
B. Run-time reconfiguration management
The main goal of this work package is to provide the proper
extensions of the runtime system to efficiently handle the
online scheduling and placement of the envisioned dynamically re-
configurable system. Depending on the running application, the
appropriate techniques and module versions will be selected.
The online scheduling will consider minimization of reconfig-
uration overhead, minimization of fragmented area, reduction
in power consumption and temperature, and inter-module wire
delay optimization. The basic guidelines for online scheduling
are provided by the compilation stage of WP2. To support
the above, the architecture will expose some light-weight
system monitors and control hooks to the runtime system.
The approaches developed in this work package are applicable,
without loss of generality, to partially reconfigurable devices as
well as to multi-chip systems consisting of several devices that
support only complete reconfiguration. Our work on shaping
the runtime system consists of four main tasks.
1) Evaluation of existing runtime system support for recon-
figuration: Within the context of this task we will evaluate
existing runtime systems in terms of their capabilities and
limitations, and propose a set of features for the efficient
support for reconfiguration. So far, work of limited scope
has been conducted on runtime support for partially recon-
figurable systems with most efforts targeting Xilinx Virtex-II
FPGAs that support 1-D reconfiguration only. Some theoretical
approaches are addressing newer devices like Virtex-4 and
Virtex-5 supporting 2-D reconfiguration but they haven’t been
implemented, evaluated or verified yet. In either case, the
capabilities are limited by the specific architectural features
of specific devices with regard to the reconfiguration support.
Currently, the FPGA technology imposes the limitation that
specific areas for the partially reconfigurable modules should
be determined beforehand; then the reconfigurable modules are
loaded during execution. Research aiming to overcome this
limitation is already ongoing. We will evaluate the state
of the art looking into issues such as scheduling, fragmentation
and area allocation, first as they apply to existing platforms,
but also as we envision the operation of future reconfigurable
devices.
2) Propose architectural extensions for runtime system: Ef-
ficient reconfiguration is crucial for the success of dynamically
reconfigurable systems. Reconfiguration overhead is a main
shortcoming of the current FPGAs and therefore, the efficient
scheduling of the reconfigurations is critical to meet the timing
constraints of the applications. Moreover, awareness of the
device area constraints will assist decisions targeting the prob-
lem of fragmentation. This will allow keeping the number of
reconfigurations at a low level. In the same direction, we
will explore other functionalities to improve reconfiguration
time such as caching or predicting reconfigurations based on
online statistics and prefetching them before they are needed.
In fact we are planning to incorporate configuration caching
and prefetching [13] in our runtime system. To support this
functionality dependencies and restrictions are communicated
from the synthesis process to the runtime system, along with
possible static prefetch hints. In addition, the runtime system
should collect online statistics and determine dynamically
when to cache and prefetch configurations. A primary issue is
the configuration cache size. For example the configuration
cache of the internal configuration access port (ICAP) in
Xilinx FPGAs is implemented with one Block RAM and
is fixed, but the cache of an embedded processor such as
PowerPC or Microblaze is implemented with Block RAMs and
can be modified. A (double or triple) buffering scheme would
extend the ICAP cache and thus increase the reconfiguration
bandwidth. In this direction, a prefetching mechanism can
load to the cache, or even to the configuration memory, the
configuration data that correspond to the circuit that is more
likely to execute in the near future.
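As an illustration of the configuration caching and prefetching discussed above, the sketch below keeps recently used partial bitstreams in a fixed-size, software-managed cache with least-recently-used eviction and accepts prefetch hints; the cache size, bitstream identifiers and the loading callback are hypothetical placeholders for whatever storage and configuration-port interface the real system uses.

    from collections import OrderedDict

    class ConfigCache:
        """Software-managed cache of partial bitstreams with LRU eviction."""

        def __init__(self, capacity, load_bitstream):
            self.capacity = capacity              # number of bitstreams the cache holds
            self.load_bitstream = load_bitstream  # callback: fetch from slow storage
            self.entries = OrderedDict()          # id -> bitstream bytes (LRU order)

        def get(self, bitstream_id):
            """Return the bitstream, loading (and possibly evicting) on a miss."""
            if bitstream_id in self.entries:
                self.entries.move_to_end(bitstream_id)   # mark as most recently used
                return self.entries[bitstream_id]
            data = self.load_bitstream(bitstream_id)     # slow path (e.g. flash/DRAM)
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)         # evict the least recently used
            self.entries[bitstream_id] = data
            return data

        def prefetch(self, bitstream_id):
            """Honour a static or statistical prefetch hint ahead of the actual need."""
            self.get(bitstream_id)

    # Toy usage: pretend that loading just returns a labelled byte string.
    cache = ConfigCache(capacity=2, load_bitstream=lambda i: b"bits-" + i.encode())
    cache.prefetch("fir_rr1")        # hinted by the offline schedule
    cache.get("fft_rr2")
    cache.get("fir_rr1")             # hit: no slow load needed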
This task will also address possible light-weight hardware
support extensions for efficient FPGA reconfiguration. The
set of system monitors and hooks that need to be exposed
to the runtime system is one of our concerns. This allows a
SW/HW holistic solution to a variety of problems. One such
opportunity is bitstream relocation, according to which a single
bitstream file can reconfigure different FPGA regions with the
same functionality. Another line of research falling under this
task is to investigate the hardware support needed for online
verification and debug of the reconfigurable units.
Furthermore, information such as thermal data obtained from
temperature measurements can provide feedback to the runtime
system to deal with overheating. Then, re-scheduling can
be triggered to reduce temperature and consequently power
dissipation. This is of significant and increasing importance
mainly for the embedded domain; it is also a concern for the
desktop and high-performance computing domains.
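The thermal-feedback loop could look roughly like the sketch below, where a periodic monitor compares per-region temperatures against a threshold and asks the scheduler to migrate or throttle the module in the hottest region; the sensor read-out, the threshold and the scheduler interface are all assumptions for illustration.

    TEMP_THRESHOLD_C = 85.0     # assumed safe operating temperature

    def thermal_policy(region_temps, reschedule):
        """region_temps: dict region -> degrees C (from on-chip sensors, assumed).
        reschedule: callback(region) provided by the runtime scheduler (assumed)."""
        hottest = max(region_temps, key=region_temps.get)
        if region_temps[hottest] > TEMP_THRESHOLD_C:
            # Ask the scheduler to move or throttle the module in the hot region.
            reschedule(hottest)

    # Toy usage with fabricated sensor values and a stub scheduler hook.
    thermal_policy({"RR1": 91.2, "RR2": 72.4},
                   reschedule=lambda r: print("re-scheduling module in", r))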
3) Integration of micro-reconfiguration in the run-time
manager: In this task, parameterizable run-time micro-
reconfiguration [10] will be integrated into the runtime system
support. The reconfiguration procedure from WP3 should be
called by the reconfiguration manager when parameters change
value. Also, in scheduling the application tasks, the micro-
reconfigurations should be scheduled accordingly.
4) Runtime configuration scheduling and device manage-
ment: In this task we will investigate first how the run-
time system has to manage the directives provided by the
compilation/static-scheduling stage. Next, based on the appli-
cation behaviour the system will be dynamically optimized in
order to provide the best service. The runtime system can be
part of the OS itself, or software code running underneath the
OS and will undertake decisions such as
- the time slot in which the reconfiguration of a module will occur,
- the portion of the FPGA on which the module is going to be placed, and
- the time slot in which its execution will start.
To appropriately cope with reconfiguration, the runtime sched-
uler will be augmented with area management capabilities. The
input is provided from a dependency/communication graph
and based on a list of criteria a decision will be made. Such
criteria are reconfiguration time, execution time of a module,
device area constraints, precedence between the modules,
and level of fragmentation. A relocation mechanism would
allow the device to be defragmented and consequently trigger fewer
reconfigurations. In addition, we will address the problem of
scheduling with feedback from thermal data in order to re-
schedule the modules according to temperature and power
dissipation values. In the same direction, the scheduler will
address the problem of keeping modules that communicate
frequently through wide buses close together; this will also
allow for low-latency communication. To optimize the on-
line scheduling of components for 2-D reconfiguration and
minimize the reconfiguration time and improve the resource
utilization we will investigate runtime techniques for hierar-
chically constructing more complex components out of the
available simpler regular ones. The algorithms that will be
selected depend heavily on whether the scheduling will be
preemptive or non-preemptive. In the former case, an Earliest
Deadline First (EDF) algorithm has been proven to be a good
solution, however in the latter case using a pure EDF algorithm
is not the best choice, due to the inserted idle times. For
this problem, heuristics and the migration of techniques from the
multiprocessor field might offer an appropriate solution. Be-
fore this, we will evaluate simpler algorithms like the First-Fit
and Best-Fit. Furthermore, we will explore the use of Genetic
Algorithms augmented with the capability to terminate after a
non-fixed number of generations; an implementation
in software or hardware (or mixture of both) can be used.
We already have a hardware implementation of a GA, which
due to its inherently pipelined and parallel nature is two
orders of magnitude faster than its counterpart implemented
in C running on an embedded processor. Moreover, although
this implementation is complex in logic, it is parameterizable,
supporting different cost functions and genetic parameters
such as variable population size and member width, and
utilizes less than 8% of a Virtex-II FPGA [14]. This hardware-
based genetic algorithm can assist the runtime support on
scheduling decisions. Before proceeding with this, a study is
needed to weigh the benefits and drawbacks of using such
an assistant circuit outside the runtime scheduler: the I/O
latency incurred to start up the hardware-based genetic algorithm
and to pass its results back to the runtime software must be
balanced against the speed of finding the solution.
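For the non-preemptive case, a first cut at the online policy could combine earliest-deadline-first selection with first-fit region allocation, as in the sketch below; the task and region structures, the uniform reconfiguration cost and the tie-breaking are simplifying assumptions rather than the eventual FASTER runtime.

    def edf_first_fit_step(ready_tasks, regions, now, reconfig_time=1.0):
        """Pick the ready task with the earliest deadline (non-preemptive EDF) and
        place it in the first region that is free (first-fit), charging a
        reconfiguration unless the region already holds the needed module.
        ready_tasks: list of dicts {'name', 'module', 'exec_time', 'deadline'}.
        regions: list of dicts {'name', 'free_at', 'module'} (mutated in place)."""
        if not ready_tasks:
            return None
        task = min(ready_tasks, key=lambda t: t["deadline"])      # EDF choice
        for region in regions:                                    # first-fit placement
            if region["free_at"] <= now:
                start = now
                if region["module"] != task["module"]:
                    start += reconfig_time
                    region["module"] = task["module"]
                region["free_at"] = start + task["exec_time"]
                ready_tasks.remove(task)
                return task["name"], region["name"], start
        return None                                               # no free region yet

    regions = [{"name": "RR1", "free_at": 0.0, "module": None},
               {"name": "RR2", "free_at": 0.0, "module": "fft"}]
    ready = [{"name": "t1", "module": "fft", "exec_time": 2.0, "deadline": 10.0},
             {"name": "t2", "module": "fir", "exec_time": 1.0, "deadline": 4.0}]
    print(edf_first_fit_step(ready, regions, now=0.0))   # t2 first (earlier deadline)
    print(edf_first_fit_step(ready, regions, now=0.0))   # then t1, reusing fft in RR2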
Limitations due to the FPGA architectural features from
the reconfiguration perspective might constrain the above
solutions. Even the latest Xilinx FPGAs do not allow on-line
change of the regions wherein the reconfigurable modules are
to be placed. This can limit the ability to make decisions
on real-time region allocation. Finally, we are concerned
with examining the extent to which our mechanisms can be
applied in a transparent manner regardless of the architectural
constraints.
V. CONCLUSIONS
The FASTER project will enhance the following five aspects
of the design of computing systems. The first objective is to
include reconfigurability as an explicit design concept. Starting
from high-level descriptions of the system and its application,
our approach will provide a set of different metrics, such
as communication and dependency graphs, profiling informa-
tion, etc., oriented to support system-level synthesis onto a
hardware platform supporting dynamic reconfiguration. The
second objective is to provide the methods and tools for in-
cluding run-time reconfiguration in every aspect of the design
methodology. The methods and tools designed in FASTER will
enable the efficient and transparent use of
partial and dynamic reconfiguration at different time scales,
supporting also both explicit parallelism in the application
specification and platforms that combine multicore processors
with reconfigurable logic. This will lead to a scenario that will
enable efficient interfacing between the parallel software and
the (reconfigurable) hardware components. The third objective
of FASTER is to provide an effective framework for analysis,
synthesis, and verification to guarantee that the final imple-
mentation corresponds to the application requirements and
system specifications. Our focus is on developing an integrated
tool-chain that supports the verification of both static and
dynamic portions of the reconfigurable design. Objective four
is to provide efficient and developer/user transparent runtime
support for partial and dynamic reconfiguration. Assuming
a partially reconfigurable system (either with a single or
multiple FPGAs), we will develop a runtime system that can
efficiently handle the online scheduling and placement of re-
configurable system components, using dynamically adaptive
schemes to optimize the system operation based on different
functional and non-functional requirements defined by the user
or the earlier tool-chain. Finally the fifth objective is to provide
seamless integration of parallelism and reconfigurability in the
specification, irrespective of whether it applies to software
or hardware components. The flow will interface the parallel
software to the hardware components, and to the runtime
manager responsible for partial and dynamic reconfiguration.
ACKNOWLEDGMENT
This work was supported by the European Commission in
the context of the FP7 FASTER project (#287804).
REFERENCES
[1] http://www.fp7-faster.eu/, [Online; accessed May 2012].
[2] P. Lysaght, B. Blodget, J. Mason, J. Young, and B. Bridgford, “Enhanced
Architectures, Design Methodologies and CAD Tools for Dynamic
Reconfiguration of Xilinx FPGAs (Invited Paper),” in Proceedings of
the IEEE Conference on Field Programmable Logic and Applications
(FPL), August 2006, pp. 1–6.
[3] K. Bruneel, “Efficient Circuit Specialization for Dynamic Reconfigura-
tion of FPGAs,” PhD thesis, Ghent University, 2011.
[4] http://ce.et.tudelft.nl/DWB/, [Online; accessed May 2012].
[5] http://hartes.org/hArtes/, [Online; accessed March 2012].
[6] http://www.reflect-project.eu/, [Online; accessed March 2012].
[7] http://www.hitech-projects.com/euprojects/ACOTES/, [Online; accessed
March 2012].
[8] http://andres.offis.de/, [Online; accessed March 2012].
[9] http://ce.et.tudelft.nl/DWB/, [Online; accessed May 2012].
[10] K. Bruneel and D. Stroobandt, “Automatic generation of run-time
parameterizable configurations,” in Proceedings of the IEEE Conference
on Field Programmable Logic and Applications (FPL), August 2008, pp.
361–366.
[11] B. Dutertre and L. de Moura, “The YICES SMT Solver,” Computer
Science Laboratory, SRI International, 333 Ravenswood Avenue, Menlo
Park, CA 94025 - USA, Tech. Rep., 2006.
[12] W. Luk, N. Shirazi, and P. Y. K. Cheung, “Modelling and Optimising
Run-Time Reconfigurable Systems,” in Proceedings IEEE Symposium
on FPGAs for Custom Computing Machines (FCCM). IEEE Computer
Society Press, 1996, pp. 167–176.
[13] S. Hauck, “Configuration Prefetch for Single Context Reconfigurable
Coprocessors,” in Proceedings of the ACM International Symposium on
Field-Programmable Gate Arrays (FPGA), 1998, pp. 65–74.
[14] M. Vavouras, K. Papadimitriou, and I. Papaefstathiou, “High-speed
FPGA-based Implementations of a Genetic Algorithm,” in IEEE In-
ternational Conference on Embedded Computer Systems, Architectures,
Modeling and Simulation (SAMOS), July 2009, pp. 9–16.