2nd Mediterranean Conference on Embedded Computing MECO - 2013 & ECyPS’2013 Budva, Montenegro
Instruction-set Architecture Exploration Strategies
for Deeply Clustered VLIW ASIPs
Roel Jordans, Rosilde Corvino, Lech Jóźwiak, Henk Corporaal
Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
r.jordans@tue.nl www.asam-project.org
Abstract—Instruction-set architecture exploration for clustered VLIW processors is a very complex problem. Most of the existing exploration methods are hand-crafted and time consuming. This paper presents and compares several methods for automating this exploration. We propose and discuss a two-phase method which can quickly explore many different architectures and experimentally demonstrate that this method is capable of automatically achieving a 50% improvement on the energy-delay product cost of an automatically generated architecture for an ECG detection application and a 1% energy-delay product cost improvement compared to a hand-crafted design.
I. INTRODUCTION
More and more modern products in telecommunications,
multi-media, medical instrumentation, and several other im-
portant areas require programmability, high-performance,
and/or limited energy consumption. Highly customized appli-
cation specific instruction-set processors (ASIPs) and, more
specifically, very-long instruction word (VLIW) architectures,
are increasingly used in such products. Several industrial tool-
flows, e.g. [1]–[3], are available to specify, simulate, and/or
synthesize such processors. Some tool-flows exist for the au-
tomatic exploration of these architectures, but they commonly
focus on VLIW processors with a centralized register file [3].
However, as was shown in [4], VLIW architectures with a
centralized register file and more than 8 issue-slots quickly
become impractical, as the resulting processors slow down due
to their complex wiring and the many ports required on the
centralized register file.
Clustered VLIW architectures offer a solution by partition-
ing the centralized register file into several smaller register
files. This technique makes it possible to construct fast VLIW
architectures with many issue-slots. However, until now clus-
tered processor architecture exploration has been performed
by hand. Terechko et al. [4] defined a taxonomy for different
variations of clustered VLIW processor architectures. In a
generic clustered VLIW architecture, one cluster contains one
register file and one or more issue-slots. The so-called deeply
clustered VLIW architectures are a sub-set of the generic
clustered VLIW architectures in which a cluster contains only
a single issue-slot.
In this paper we focus on these deeply clustered VLIW
architectures and present and compare several methods to
automatically explore: 1) the number of issue-slots, 2) the operations available within each issue-slot, and 3) the sizes of the register files. This paper focuses on exploration methods that use a shrinking technique, which allows us to take an oversized architecture, produced as part of a high-level architecture exploration, and turn it into an efficient processor design. The presented exploration strategies are evaluated with respect to both the efficiency of the resulting VLIW ASIP architecture and the time required for performing the exploration.
This work was performed as a part of the European project ASAM, which has been partially funded by the ARTEMIS Joint Undertaking, grant no. 100265.
This paper is organized as follows. Section II explains the
target architecture, the general exploration method, its explo-
ration parameters, and introduces the considered exploration
strategies. Section III discusses the experimental results and
section IV concludes the paper.
II. ARCHITECTURE EXPLORATION METHODS
Figure 1 shows the general structure of a clustered VLIW
processor. The VLIW data-path is composed of a set of issue-slots (IS), each containing one or more function-units (FU).
The data-path is controlled by a sequencer which executes
instructions from the program memory. The function-units
in the issue-slots implement the operations and can require
pipelining. Each issue-slot is capable of starting a new oper-
ation per cycle. The inputs of the operations are taken from
the register file (RF) connected to the issue-slot. The output
of the issue-slot can be written to one or more register files
through the result select network (not shown in figure 1). Each
register file can have one or more input ports to allow parallel
writes to the register file. One or more local memories (not
shown in figure 1) can be present in the processor. These local
memories are accessed through a special load/store function-
unit (LSU). Only one LSU can be connected to a single local
memory.
Our ASIP architecture exploration method uses a shrinking
technique and assumes that an initial prototype is provided.
[Fig. 1: Architecture template of a VLIW ASIP data-path and sequencer]
This initial prototype consists of an oversized ASIP architec-
ture accompanied by a coarsely optimized version of the target
application. In our ASIP design flow, this initial prototype is
the product of a coarse, high-level, ASIP architecture explo-
ration [5] which aims at finding the best combination of several
possible application loop-transformations (e.g. loop fusion,
tiling, and vectorization) and corresponding ASIP architecture.
However, other sources of initial prototypes (e.g. a human
designer) can also be used.
The automatic ASIP instruction-set architecture exploration
method explores the following properties:
• The number of issue-slots
• The types of function-units available in each issue-slot
• The operations available in each function-unit
• The sizes of the register files
• The size of the program memory
Our automatic ASIP instruction-set architecture exploration
does not explore the number and sizes of local memories. In
our total ASIP design flow, these have been explored as part
of the application transformation (parallelization) and coarse
ASIP architecture synthesis performed at a higher level, and
are already decided by [5] when constructing the initial pro-
totype. Previous research (e.g. [6], [7]) has shown that a large
part (50–80%) of the architecture cost is due to the data storage
and transfer. The memory and communication exploration is
therefore commonly separated from and performed before the
precise data-path exploration.
A. Architecture cost evaluation model
The energy and area cost of each proposed processor
architecture is evaluated using our architecture cost model.
This model uses the scheduled assembler code from the
target compiler, which is configured to only use resources
available in the proposed processor, and combines it with
the application’s profile to provide activity counts for the
internal components of the proposed processor. These activity
counts are then used to estimate the energy consumption of
the proposed processor architecture.
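As an illustration of this activity-based estimation, the sketch below combines per-component activity counts with per-activation energy figures. The component names and energy values are hypothetical placeholders, not taken from our model; in the actual flow the activity counts are derived from the compiled schedule and the application profile rather than supplied by hand.

```python
# Minimal sketch of profile-based energy estimation: total dynamic
# energy is the sum over components of (activations x energy/activation).

def estimate_energy(activity_counts, energy_per_activation):
    """Combine per-component activity counts with per-activation costs."""
    return sum(activity_counts[c] * energy_per_activation[c]
               for c in activity_counts)

# Hypothetical candidate: two issue-slots and one register file,
# with made-up activation counts and per-activation energies in pJ.
counts = {"is1": 12000, "is2": 8000, "rf": 30000}
cost_pj = {"is1": 4.1, "is2": 2.3, "rf": 0.9}
total_pj = estimate_energy(counts, cost_pj)
```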
We have redirected some of the architecture optimization
work to the architecture model in order to keep the exploration
time within reasonable bounds. In our architecture model
we have decided to ignore any unused resources that are
still remaining from the platform instantiation of the initial
prototype. Any resource that is unused in a candidate prototype
will not be counted towards the area and energy estimation of
that candidate prototype. The architecture model recomputes
the required size of the program memory, based on the set
of the actually used resources and the length of the compiled
target application.
Currently, our architecture model is capable of ignoring the
following unused ASIP resources:
• Complete function-units, from the area and energy estimation and the program memory size computation.
• Partial function-units, from the program memory size computation, based on their unused operations.
• Partial and complete register files, from the area and energy estimation and the program memory size computation, based on the register file pressure of the compiled application.
The architecture model also takes changes in the intercon-
nect of the ASIP resulting from the removal of components
into account for both the area and energy estimation. This
allows us to quickly see which resources are required and
which are not, and what the effect is of the resource’s removal
on the resulting ASIP. It also gives us a clear view on which
resources may form bottlenecks in the ASIP, and which are
good candidates for further exploration.
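A minimal sketch of the effect of this pruning on the program memory size, under the simplifying assumption that each used component contributes a fixed number of encoding bits to the instruction word. The component names and bit widths below are hypothetical:

```python
# Sketch: drop unused components, then recompute the program memory
# size as instruction width (bits of the kept components) times the
# number of instructions in the compiled application.

def prune_and_size(components, used, code_length):
    """Return the kept components and the program memory size in bits."""
    kept = {name: c for name, c in components.items() if name in used}
    instruction_width = sum(c["encoding_bits"] for c in kept.values())
    return kept, instruction_width * code_length

# Hypothetical prototype: three function-units, one left unused
# by the compiled application and therefore removed from the count.
proto = {"fu_alu": {"encoding_bits": 6},
         "fu_mul": {"encoding_bits": 5},
         "fu_div": {"encoding_bits": 7}}  # unused in the compiled code
kept, pm_bits = prune_and_size(proto, {"fu_alu", "fu_mul"}, 200)
```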
B. Exploration style
Using our intelligent architecture model, we are capable of exploring the ASIP architecture using three different exploration styles. We can perform a coarse exploration, removing complete issue-slots and relying on the architecture model to remove any remaining redundant resources. A more fine-grained exploration is possible by removing function-units directly and only removing issue-slots once they are empty. Finally, we can take a two-phase approach combining the above two strategies.
In this two-phase approach we first perform a coarse explo-
ration, where we investigate which issue-slots can be removed
from the initial prototype. This provides us with an initial
set of promising candidate architectures. We then use the
architecture model to determine which resources were actually
used by each candidate prototype, and explore one or more of
the candidate architectures in more detail by fine-tuning the
available resources.
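The two-phase flow can be sketched as a pair of greedy removal loops, one per granularity. The sketch below restarts the scan after each accepted removal; it only illustrates the control flow, not our actual implementation:

```python
def two_phase_explore(prototype, evaluate, issue_slot_candidates, fu_candidates):
    """Phase 1: greedily remove whole issue-slots while the cost improves.
    Phase 2: fine-tune the survivor by removing individual function-units.
    `evaluate` maps an architecture to its cost (lower is better); each
    candidate generator yields all single-removal variants of an architecture."""
    best, best_cost = prototype, evaluate(prototype)
    for candidates_of in (issue_slot_candidates, fu_candidates):
        improved = True
        while improved:
            improved = False
            for candidate in candidates_of(best):
                cost = evaluate(candidate)
                if cost < best_cost:  # accept the removal, restart the scan
                    best, best_cost, improved = candidate, cost, True
                    break
    return best, best_cost
```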
For example, when the first exploration stage finds a register
file with a register pressure of 33, the second exploration
stage will attempt to resize that register file to 32 elements.
Resizing the register files to fit within a smaller power of
2 both results in a physically smaller register file and in a
shorter encoding for the register file entry. This encoding is
especially interesting as it has a significant impact on the size
of the program memory. For each bit in the register file index
encoding, the program memory requires a bit for each input-
and each output-port of the register file. With 2-input, 3-output register files, a port configuration commonly used in our ASIP template, this saves 5 bits in the instruction word.
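The arithmetic behind this example is easy to check: a register file index needs ⌈log2(size)⌉ bits, and each saved index bit is paid once per register file port encoded in the instruction word. A small sketch:

```python
from math import ceil, log2

def rf_index_bits(size):
    """Bits needed to address one of `size` registers."""
    return ceil(log2(size)) if size > 1 else 1

def instruction_bits_saved(old_size, new_size, in_ports, out_ports):
    """Each saved index bit is multiplied by the number of register
    file ports encoded in the instruction word."""
    saved_per_port = rf_index_bits(old_size) - rf_index_bits(new_size)
    return saved_per_port * (in_ports + out_ports)

# The example from the text: shrinking a 33-entry register file to
# 32 entries, with 2 input and 3 output ports, saves 5 instruction bits.
instruction_bits_saved(33, 32, 2, 3)  # → 5
```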
C. Exploration strategies
Similarly to [8], we provide two exploration strategies, best
match and first match, both based on the same cost metrics.
1) Best match: The best match exploration considers the
removal of each separate component and selects the com-
ponent that shows the best improvement of the cost metric.
For example, when removing an issue-slot from an n issue-slot initial prototype during the first exploration phase, the best match exploration strategy will construct a set of candidate prototypes with n−1 issue-slots each, one for each possible subset of n−1 issue-slots from the initial prototype.
2) First match: The first match exploration considers dif-
ferent alternative components sequentially and selects the first
component that shows an improvement of the cost metric.
The first match strategy results in a much smaller number of
considered design points. This is not a problem when choices
are symmetric, i.e. there is no difference between two choices
(e.g. removing one of two equal issue-slots), but may result in
suboptimal solutions when the design choices are asymmetric
(e.g. register file sizing).
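The two strategies differ only in how they pick from a set of removal candidates. A sketch of both selection rules, assuming `evaluate` returns the cost metric of a candidate architecture:

```python
def best_match(candidates, evaluate, current_cost):
    """Evaluate every candidate; pick the largest improvement, if any."""
    scored = [(evaluate(c), c) for c in candidates]
    cost, cand = min(scored, key=lambda t: t[0])
    return (cand, cost) if cost < current_cost else (None, current_cost)

def first_match(candidates, evaluate, current_cost):
    """Scan sequentially; pick the first candidate that improves the cost."""
    for cand in candidates:
        cost = evaluate(cand)
        if cost < current_cost:
            return cand, cost
    return None, current_cost
```

Note how `best_match` evaluates every candidate while `first_match` stops at the first improvement, which is what produces the smaller design-point counts but also the risk of locally optimal choices.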
D. Cost metrics
As briefly mentioned above, both the best match and first
match strategy depend on the ability to identify one candidate
to be ‘better’ than another one. For this purpose, our ASIP
exploration framework offers two different, commonly used,
cost metrics: energy-delay (ED) product and energy-delay-
squared (EDD) product. Where the energy-delay product puts
more emphasis on the energy consumption, the energy-delay-
squared product puts more emphasis on the delay. In both
cases, a design is better when it has a lower score. Our ASIP
exploration framework automatically computes, and displays,
both scores normalized relatively to the initial prototype,
making it very easy to select the final architecture from a
set of constructed promising candidate architectures.
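Both metrics are straightforward to compute from the estimated energy and delay. A sketch of the normalized scores, using the standard ED and ED² products with normalization relative to the initial prototype:

```python
def normalized_costs(energy, delay, ref_energy, ref_delay):
    """ED and ED^2 (EDD) products, normalized to a reference design.
    Lower is better; the reference design itself scores 1.0 on both."""
    ed = (energy * delay) / (ref_energy * ref_delay)
    edd = (energy * delay ** 2) / (ref_energy * ref_delay ** 2)
    return ed, edd
```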
III. EXPERIMENTS
To demonstrate the value and benefits of our new ASIP
instruction-set architecture exploration method on a real-life
application, we selected the Pan-Tompkins QRS detection
algorithm [9] for an electro-cardiogram (ECG) monitor, per-
formed the architecture exploration for this application, and
mapped this application onto the selected architecture. For
this application, both the power consumption and real-time
behaviour of the created architecture are critical.
A. The algorithm
The goal of the Pan-Tompkins QRS detection algorithm is
to detect the QRS peak (illustrated in figure 2) in an ECG
signal. The distance between two consecutive QRS peaks, the RR interval, is commonly used to measure the heart rate of a patient.
The Pan-Tompkins QRS detection algorithm is implemented
as a set of five filters and a detection function.
[Fig. 2: Example ECG signal showing the QRS complex]
Firstly, the input signal is filtered by a band-pass filter (constructed using a low-pass followed by a high-pass filter). Thereafter, the derivative of the signal is taken; this derivative is then squared and, finally, the last 30 squared results are integrated using a moving-window integrator. The resulting filtered signal is presented to a detection function which implements a set of thresholds that are dynamically adjusted based on the previously measured inputs. Please refer to [9] for more information on the algorithm.
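The stage structure described above can be sketched as follows. The filter coefficients and detection thresholds of the real algorithm [9] are omitted, the low-pass and high-pass filters are passed in as plain functions, and the derivative is simplified to a first difference rather than the five-point derivative of the original:

```python
def moving_window_integrate(xs, window=30):
    """Average of the last `window` squared samples; the leading
    partial windows are still divided by the full window length."""
    out = []
    for i in range(len(xs)):
        recent = xs[max(0, i - window + 1): i + 1]
        out.append(sum(recent) / window)
    return out

def pan_tompkins_stages(signal, lowpass, highpass, window=30):
    """Band-pass (LP then HP), derivative, squaring, integration."""
    filtered = highpass(lowpass(signal))
    deriv = [b - a for a, b in zip(filtered, filtered[1:])]
    squared = [d * d for d in deriv]
    return moving_window_integrate(squared, window)
```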
B. Constructing the initial prototype
A C-code application specification mapped on a PC plat-
form was used as an initial version of the algorithm. This
specification is first analyzed in order to estimate the maximal
parallelism of the algorithm using the method proposed in
[10]. From the analysis, it follows that the main bottleneck
of the application has its maximum performance with an
instruction-level parallelism of 6. Further application analysis
shows that this application needs only small buffers between
the filters which could probably be mapped into register files.
Therefore, a VLIW ASIP processor with 6 issue-slots and a
single data memory is constructed as the initial prototype.
The initial C implementation is then ported to our ini-
tial processor (platform instantiation) to complete the initial
prototype. Subsequently, the constructed initial HW/SW prototype is passed through the simulator to verify its functionality and to create the application profile used for the profile-based estimation during exploration. We used 10000 samples taken from the PhysioNet ECG signal database [11] as a reference input set for the simulations.
C. Comparing exploration strategies
Table I shows an overview of our experiments with different
exploration strategies, their total exploration time, and their
final cost results. From this table, we can see that the coarse
issue-slot exploration can be performed very quickly and that
the result for both the best match and the first match strategies
is exactly the same. This is due to the symmetry of the
available issue-slots in our initial architecture. However, this
symmetry is not available when exploring the function-units
directly and there we clearly see that the first-match strategy
gets stuck in a locally optimal solution.
Figure 3 illustrates this by visualizing the cost of consec-
utive solutions found by the different approaches. It shows
TABLE I: Different search configurations and their final solution

  exploration style   strategy      designs considered   total time   normalized final cost
  issue-slot          best match            17              5m23s            0.503
  issue-slot          first match            8              2m38s            0.503
  function-unit       best match           835              4h17m            0.649
  function-unit       first match           83             24m54s            0.950
  two-phase           best match           320              1h32m            0.498
  two-phase           first match          186             48m34s            0.498
[Fig. 3: The cost of selected intermediate prototypes for various search strategies. Each trace starts from the initial prototype, which can be found in the top-left; points along the trace show improved energy-delay product cost solutions.]
how each strategy starts with the initial architecture at the top
left and successively finds reduced architectures with a lower
energy-delay cost. From table I we can see that the two-phase
methods produce the best results, although only 1% better than
the issue-slot methods. Again, both the best match and the first
match method produce the same result. However, the reason
why both exploration strategies produce the same result is less
evident in this case and it is quite likely that different results
can be obtained for both methods when a different application
is being explored.
Overall, we can see that our exploration method reduced
the energy consumption by over 50%, compared to the au-
tomatically created initial prototype, while keeping the total
cycle count of the application within 10% of its original value.
However, manually optimizing the initial prototype also focuses on optimizing the issue-width of the processor, and thus produces a similar result as the issue-slot style exploration.
Our automated two-phase exploration is capable of finding
a slightly better result (1% cost improvement) but greatly
improves on the design time of a hand-crafted processor.
IV. CONCLUSION
In this paper we have introduced and compared several
strategies for instruction-set architecture exploration of clus-
tered VLIW ASIPs. We have shown that our two-phase
approach gives the best results. Using this method, we are
able to find a highly optimized customized VLIW ASIP. It
allowed us to automatically reduce the energy-delay product of
an automatically generated architecture for the ECG test-case
application by 50%, thereby producing an architecture that
is on par with a hand-crafted design while greatly reducing
the effort required for the customization of VLIW ASIP
architectures. Future research will focus on more thorough
benchmarking, improving the initial prototype, and experi-
menting with different exploration strategies.
REFERENCES
[1] FlexASP project, “TTA-based co-design environment.” [Online].
Available: http://tce.cs.tut.fi/
[2] S. Aditya, B. Rau, and V. Kathail, “Automatic architecture synthesis and
compiler retargeting for VLIW and EPIC processors,” in ISSS 12 — The
12th International Symposium on System Synthesis, vol. 9, 1999.
[3] V. Kathail, S. Aditya, R. Schreiber, B. Ramakrishna Rau, D. Cronquist,
and M. Sivaraman, “PICO: automatically designing custom computers,” IEEE Computer, vol. 35, no. 9, pp. 39–47, 2002.
[4] A. Terechko, E. Le Thenaff, M. Garg, J. Van Eijndhoven, and H. Corporaal, “Inter-cluster communication models for clustered VLIW processors,”
in HPCA-9 — The 9th International Symposium on High-Performance
Computer Architecture. IEEE, 2003, pp. 354–364.
[5] R. Corvino, A. Gamatie, M. Geilen, and L. Jozwiak, “Design space ex-
ploration in application-specific hardware synthesis for multiple commu-
nicating nested loops,” in SAMOS XII — 12th International Conference
on Embedded Computer Systems, Samos, Greece, July 2012.
[6] P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer,
C. Kulkarni, A. Vandercappelle, and P. G. Kjeldsberg, “Data and mem-
ory optimization techniques for embedded systems,” ACM Transactions
on Design Automation of Electronic Systems, vol. 6, no. 2, pp. 149–206,
2001.
[7] K. Danckaert, K. Masselos, F. Catthoor, H. J. De Man, and C. Goutis, “Strategy for power-efficient design of parallel systems,” IEEE Transactions on Very Large Scale Integration Systems, vol. 7, no. 2, pp. 258–265, 1999.
[8] J. Hoogerbrugge and H. Corporaal, “Automatic synthesis of transport
triggered processors,” in ASCI — The First Annual Conference of the
Advanced School for Computing and Imaging, 1995, pp. 1–10.
[9] J. Pan and W. J. Tompkins, “A real-time QRS detection algorithm,” IEEE Transactions on Biomedical Engineering, no. 3, pp. 230–236, 1985.
[10] R. Jordans, R. Corvino, L. Jóźwiak, and H. Corporaal, “Exploring processor parallelism: Estimation methods and optimization strategies,” in DDECS 2013 — 16th International Symposium on Design and Diagnostics of Electronic Circuits and Systems. Karlovy Vary, Czech Republic: IEEE, April 2013, pp. 1–6.
[11] A. L. Goldberger, L. A. N. Amaral, L. Glass, J. M. Hausdorff, P. C.
Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E.
Stanley, “PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals,” Circulation, vol. 101, no. 23, pp. e215–e220, 2000 (June 13).
... We have implemented the BuildMaster framework and integrated it into our design space exploration framework [35]. This allowed us to test various cache configurations under different exploration runs. ...
... In our early research [35], we found that exploring either the issue-slots or the function-units in a single run has its limitations. Limiting the active exploration to the issue-slot level removes the ability of the exploration to try to combine costly operations executed in multiple issue-slots onto fewer issue-slots. ...
... While in the second stage, for the found issue width, we refine the composition of each of the remaining issue-slots through an exploration of their function-units. Our initial experiments [35] showed that this offers the benefits of both having a high degree of possible specialization in the resulting design and having a fairly limited number of design alternatives to reach the conclusion. ...
Thesis
Full-text available
The high energy efficiency and performance demands of image and signal processing components of modern mobile and autonomous applications have resulted in a situation where it is no longer feasible to only use general purpose processing systems to serve those applications. This has caused a strong shift to heterogeneous systems containing multiple highly specialized processors. While some tool support for the design process of such specialized processor architectures exists, key decisions are still made by human designers, usually based on incomplete and imprecise information. Combining this with the short interval between different product generations and limited design times strongly reduces the number of design alternatives that can be considered, and results in a sub-optimal design quality. Current state-of-the-art technologies offer design automation for several steps of the design process by automating key activities such as the construction of a processor architecture from a high level description, the evaluation of candidate designs through simulation or emulation, or proposing extensions to an existing processor architecture. While these tools already substantially improve the design times over a completely manual design, further significant improvements can still be obtained, specifically through automation of the design analysis and decision making process. This dissertation proposes several significant improvements of the design effectiveness and efficiency through automation of several stages of the design process. A three step approach to processor architecture design is presented which starts by using our new application analysis methods to obtain parallelism and performance estimates for the various compute intensive parts of the target application. 
These estimates are then used during an application restructuring phase which aims at improving the available parallelism and decides upon the mapping of application data into the processors internal memories. Taking this transformed version of the target application, its memory hierarchy as defined by the memory mapping, and the parallelism estimates allows us to propose an initial processor architecture which completes the second step. The third step is then to further refine the processor architecture and results in a highly specialized processor architecture description. The research presented in this dissertation focuses on improving the following steps in the design process. • A parallelism estimation method for estimating the instruction-level parallelism exposed by the application is presented. This method provides parallelism feedback used during the exploration of the application restructuring, but is also used for determining the appropriate number of issue-slots in the initial architecture. As a result, we are able to construct an initial processor architecture that both meets the performance requirements for the target application yet still is reasonably close to the final refined processor design. • A processor architecture refinement method which allows us to avoid the (time consuming) construction of intermediate candidate processor architectures. Our approach only needs to construct both the initial and defined designs, all other considered candidate architectures need not be constructed. • A rapid energy consumption methodology which combines the block execution profile of a simulation of the target application with its scheduled assembly listing. This makes our energy estimation method independent of the number of simulated processor clock cycles and enables the use of larger, more representative, input data sets, thus allowing for a both a faster and more realistic evaluation of the candidate designs. 
• An architecture exploration framework called BuildMaster, which simplifies the implementation of our architecture refinement exploration strategies. This framework automatically detects when compilation and simulation results obtained for previously considered candidates can be re-used for the evaluation of newly proposed candidate architectures. This intermediate result caching system allows us, for example, to avoid on average over 90% of the originally required simulation time by re-using previously obtained profile information for the energy estimation. • A set of exploration strategies which effectively refine the processor architecture and a comparison between these strategies on both the quality of the obtained result, as well as, the required exploration time. We show that the proposed exploration heuristics find results whose quality is comparable to the results found using a genetic algorithm while requiring an order of magnitude less exploration time. Combining the presented techniques results in a highly efficient and extensible instruction-set architecture exploration methodology. In our experiments we show that our framework is able to explore hundreds of processor architecture variations per hour while consistently producing compact results that meet the expected performance.
... Very-long instruction word (VLIW) processors constitute a highly energy efficient architecture base for creating application specific instruction-set processors (ASIPs). The design space exploration of such VLIW ASIP based systems is a complex optimization problem and several frameworks exist [1]- [4] that help to automate this exploration. However, automated tools are often not trusted yet and many companies still design VLIW ASIPs by hand. ...
... Section III presents the configuration of the genetic algorithm, as well as, some optimizations to the genetic algorithm which allowed us to reduce the total exploration time. Section IV presents an investigation of the effectiveness of the exploration through comparison of the presented genetic algorithm to the heuristic exploration presented in [4]. Section V concludes the paper. ...
... We improved upon this by inserting more knownto-work architectures. These architectures were obtained from a quick exploration of the VLIW issue-width using the firstmatch search strategy presented in [4]. This allowed us to get a much better initial population and greatly reduced the exploration time. ...
Conference Paper
Full-text available
Genetic algorithms are commonly used for automatically solving complex design problem because exploration using genetic algorithms can consistently deliver good results when the algorithm is given a long enough run-time. However, the exploration time for problems with huge design spaces can be very long, often making exploration using a genetic algorithm practically infeasible. In this work, we present a genetic algorithm for exploring the instruction-set architecture of VLIW ASIPs and demonstrate its effectiveness by comparing it to two heuristic algorithms. We present several optimizations to the genetic algorithm configuration, and demonstrate how caching of intermediate compilation and simulation results can reduce the exploration time by an order of magnitude.
... The ASIP architecture exploration process proposed in [1] and [13], evaluates and selects VLIW ASIP architectures according to a predefined template. Exploring the instructionset architecture of VLIW ASIPs is however much more complex than for simple sequential (e.g. ...
... The BuildMaster framework provides an easy interface to the compiler and simulator for our exploration routines [13] and automates the caching of intermediate results. This significantly simplifies the implementation of the architecture exploration strategies and allows the developer of such strategies to focus on the strategy itself and to ignore possible negative effects on the exploration time from considering equivalent architectures. ...
... We have implemented the BuildMaster framework and integrated it into our design space exploration framework [13]. This allowed us to test various cache configurations under different exploration runs. ...
Conference Paper
Full-text available
In this paper we introduce and discuss the BuildMaster framework. This framework supports the design space exploration of application specific VLIW processors and offers automated caching of intermediate compilation and simulation results. Both the compilation and the simulation cache can greatly help to shorten the exploration time and make it possible to use more realistic data for the evaluation of selected designs. In each of the experiments we performed, we were able to reduce the number of required simulations with over 90% and save up to 50% on the required compilation time.
... This processor architecture optimization step tries to improve the ASIP architecture both by the addition of application-specific instruction-set extensions as custom operations [42,43,44], and by the removal of unused or scarcely used processor components which were included as part of the processor building blocks used during the construction of the initial processor prototype. Several improvements to the processor area and energy models used in this step were previously presented in [45], the exploration algorithms of this step in [46,47], and techniques for reducing the exploration time 10 in [48]. The resulting refined ASIP design, together with more precise area, energy, and execution time estimates, are then returned to the macro-level architecture exploration 11 . ...
... Doing so will render them unused in the resulting application mapping, which allows the subsequent architecture shrinking step to remove them from the final design. The instruction-set exploration strategies implemented for this active reduction stage have been described in [46,47]. An appropriate combination of the above three architecture refinement techniques results in a fast and efficient method for investigation of instruction-set architecture variants of the initial prototype produced in the second stage of the micro-level architecture exploration and selection of the most preferred architecture. ...
Article
Full-text available
The design of high-performance application-specific multi-core processor systems still is a time consuming task which involves many manual steps and decisions that need to be performed by experienced design engineers. The ASAM project sought to change this by proposing an automatic architecture synthesis and mapping flow aimed at the design of such application specific instruction-set processor (ASIP) systems. The ASAM flow separated the design problem into two cooperating exploration levels, known as the macro-level and micro-level exploration. This paper presents an overview of the micro-level exploration level, which is concerned with the analysis and design of individual processors within the overall multi-core design starting at the initial exploration stages but continuing up to the selection of the final design of the individual processors within the system. The designed processors use a combination of very-long instruction-word (VLIW), single-instruction multiple-data (SIMD), and complex custom DSP-like operations in order to provide an area- and energy-efficient and high-performance execution of the program parts assigned to the processor node.
... These experiments were performed using the Pan-Tompkins QRS detection algorithm and based on the experiments presented in [18]. In [18] we presented six different search strategies, each considering between 8 and 835 design points. Figure 4 shows how the exploration time increases with the number of considered design points for each cost evaluation method. ...
Conference Paper
Full-text available
Design space exploration for ASIP instruction-set design is a very complex problem, involving a large set of architectural choices. Existing methods are usually handcrafted and time-consuming. In this paper, we propose and investigate a rapid method to estimate the energy consumption of candidate architectures for VLIW ASIP processors. The proposed method avoids the time-consuming simulation of the candidate prototypes, without any loss of accuracy in the predicted energy consumption. We experimentally show the effect of this fast cost evaluation method when used in an automated instruction-set architecture exploration. In our experiments, we compare three different methods for cost estimation and find that we can accurately predict the energy consumption of proposed architectures while avoiding simulation of the complete system.
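The abstract above describes replacing full-system simulation with a fast cost estimate. A minimal sketch of the idea, assuming a static model in which each candidate design point is scored by summing hypothetical per-operation energy figures and multiplying by a scheduled delay (the per-operation energies, function name, and all numbers below are illustrative placeholders, not the paper's actual model):

```python
# Illustrative sketch of simulation-free cost evaluation: score a candidate
# architecture from static operation counts and an assumed per-operation
# energy table, then combine with the scheduled delay into an
# energy-delay product (EDP). All figures are hypothetical.

OP_ENERGY_PJ = {"add": 0.5, "mul": 2.0, "load": 4.0, "store": 4.5}  # assumed values

def estimate_edp(op_counts, cycles, cycle_time_ns):
    """Return (energy in pJ, delay in ns, EDP) for one candidate design point."""
    energy = sum(OP_ENERGY_PJ[op] * n for op, n in op_counts.items())
    delay = cycles * cycle_time_ns
    return energy, delay, energy * delay

# Compare two candidate architectures executing the same kernel: a clustered
# variant may need fewer cycles but run at a slightly longer cycle time.
baseline = estimate_edp({"add": 1000, "mul": 200, "load": 400},
                        cycles=1600, cycle_time_ns=2.0)
clustered = estimate_edp({"add": 1000, "mul": 200, "load": 400},
                         cycles=900, cycle_time_ns=2.2)
```

Because such an estimate needs only compiled operation counts and a cycle count, it can rank many design points without simulating each candidate prototype.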
... Much research aims at exploiting the instruction-level parallelism of programs to maximize the utilization of all data-paths. In [9], the authors present an instruction set exploration method for clustered VLIW processors. In [10], the streaming programming model is applied on the NoC-based processor. ...
... A naive ad hoc instruction set customization may not result in the required performance improvement, leading to a waste of computing and energy resources. Therefore, when performing the custom instruction selection, complex tradeoffs between the processing speed, circuit area and power consumption must be closely observed [11]. ...
Conference Paper
Instruction Set Customization is a well-known technique to enhance the performance and efficiency of Application-Specific Processors (ASIPs). An extensive application profiling can indicate which parts of a given application, or class of applications, are most frequently executed, enabling the implementation of such frequently executed parts in hardware as custom instructions. However, a naive ad hoc instruction set customization process may identify and select poor instruction extension candidates, which may not result in a significantly improved performance with low circuit-area and energy footprints. In this paper we propose and discuss an efficient instruction set customization method and automatic tool, which exploit the maximal common subgraphs (common operation patterns) of the most frequently executed basic blocks of a given application. The speed results from our tool for a VLIW ASIP are provided for a set of benchmark applications. The average execution time reduction ranges from 30% to 40%, with only a few custom instructions.
... clustering topology and the number of FUs use a coarse FU allocation model [15]. Lapinskii et al. overprovision each issue slot with every possible FU, followed by a machine shrinking optimization that removes FUs [14]. This approach is inapplicable here due to the possibility of placing ports of the same FU in separate slots. ...
Conference Paper
Full-text available
Most power dissipation in Very Large Instruction Word (VLIW) processors occurs in their large, multi-port register files. Transport Triggered Architecture (TTA) is a VLIW variant whose exposed datapath reduces the need for RF accesses and ports. However, the comparative advantage of TTAs suffers in practice from a wide instruction word and complex interconnection network (IC). We argue that these issues are at least partly due to suboptimal design choices. The design space of possible TTA architectures is very large, and previous automated and ad-hoc design methods often produce inefficient architectures. We propose a reduced design space where efficient TTAs can be generated in a short time using execution trace-driven greedy exploration. The proposed approach is evaluated by optimizing the equivalent of a 4-issue VLIW architecture. The algorithm finishes quickly and produces a processor with 10% reduced core energy product compared to a fully connected TTA. Since the generated processor has low IC power and a shorter instruction word than a typical 4-issue VLIW, the results support the hypothesis that these drawbacks of TTA can be worked around with efficient IC design.
Article
This paper focuses on mastering the automatic architecture synthesis and application mapping for heterogeneous massively-parallel MPSoCs based on customizable application-specific instruction-set processors (ASIPs). It presents an overview of the research being currently performed in the scope of the European project ASAM of the ARTEMIS program. The paper briefly presents the results of our analysis of the main challenges to be faced in the design of such heterogeneous MPSoCs. It explains which system, design, and electronic design automation (EDA) concepts seem to be adequate to address the challenges and solve the problems. Finally, it discusses the ASAM design-flow, its main stages and tools and their application to a real-life case study.
Conference Paper
Full-text available
Application specific MPSoCs are often used to implement high-performance data-intensive applications. MPSoC design requires a rapid and efficient exploration of the hardware architecture possibilities to adequately orchestrate the data distribution and architecture of parallel MPSoC computing resources. Behavioral specifications of data-intensive applications are usually given in the form of a loop-based sequential code, which requires parallelization and task scheduling for an efficient MPSoC implementation. Existing approaches in application specific hardware synthesis use loop transformations to efficiently parallelize single nested loops and use Synchronous Data Flows to statically schedule and balance the data production and consumption of multiple communicating loops. This creates a separation between data and task parallelism analyses, which can reduce the possibilities for throughput optimization in high-performance data-intensive applications. This paper proposes a method for a concurrent exploration of data and task parallelism when using loop transformations to optimize data transfer and storage mechanisms for both single and multiple communicating nested loops. This method provides orchestrated application specific decisions on communication architecture, memory hierarchy and computing resource parallelism. It is computationally efficient and produces high-performance architectures.
Article
Full-text available
Automatic optimization of application-specific instruction-set processor (ASIP) architectures mostly focuses on the internal memory hierarchy design, or the extension of reduced instruction-set architectures with complex custom operations. This paper focuses on very long instruction word (VLIW) architectures and, more specifically, on automating the selection of an application specific VLIW issue-width. The issue-width selection strongly influences all the important processor properties (e.g. processing speed, silicon area, and power consumption). Therefore, an accurate and efficient issue-width estimation and optimization are some of the most important aspects of VLIW ASIP design. In this paper, we first compare different methods for the estimation of the required issue-width, and subsequently introduce a new force-based parallelism estimation method which is capable of estimating the required issue-width with only 3% error on average. Furthermore, we present and compare two techniques for estimating the required issue-width of software pipelined loop kernels and show that a simple utilization-based measure provides an error margin of less than 1% on average.
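The utilization-based measure mentioned for software-pipelined loop kernels can be sketched as follows: the issue-width is bounded below by the number of operations that must issue per cycle of the loop's initiation interval. This is a hedged illustration of the general idea only; the function name and interface are assumptions, not the paper's implementation:

```python
import math

# Illustrative utilization-based issue-width estimate for a software-pipelined
# loop: with `ops_in_kernel` operations and one kernel iteration started every
# `initiation_interval` cycles, at least ceil(ops / II) operations must issue
# per cycle, which lower-bounds the VLIW issue-width.

def required_issue_width(ops_in_kernel, initiation_interval):
    """Lower bound on the number of VLIW issue slots for a pipelined kernel."""
    return math.ceil(ops_in_kernel / initiation_interval)

# A kernel of 23 operations pipelined with an initiation interval of
# 4 cycles needs at least ceil(23 / 4) = 6 issue slots.
```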
Article
Full-text available
We present a survey of the state-of-the-art techniques used in performing data and memory-related optimizations in embedded systems. The optimizations are targeted directly or indirectly at the memory subsystem, and impact one or more out of three important cost metrics: area, performance, and power dissipation of the resulting implementation. We first examine architecture-independent optimizations in the form of code transformations. We next cover a broad spectrum of optimization techniques that address memory architectures at varying levels of granularity, ranging from register files to on-chip memory, data caches, and dynamic memory (DRAM). We end with memory addressing related issues.
Conference Paper
Full-text available
Clustering is a well-known technique to improve the implementation of single register file VLIW processors. Many previous studies in clustering adhere to an inter-cluster communication means in the form of copy operations. This paper, however, identifies and evaluates five different inter-cluster communication models, including copy operations, dedicated issue slots, extended operands, extended results, and broadcasting. Our study reveals that these models have a major impact on performance and implementation of the clustered VLIW. We found that copy operations executed in regular VLIW issue slots significantly constrain the scheduling freedom of regular operations. For example, in the dense code for our four cluster machine the total cycle count overhead reached 46.8% with respect to the unicluster architecture, 56% of which are caused by the copy operation constraint. Therefore, we propose to use other models (e.g. extended results or broadcasting), which deliver higher performance than the copy operation model at the same hardware cost.
Article
Full-text available
The PICO (program in, chip out) project is a long-range HP Labs research effort that aims to automate the design of optimized, application-specific computing systems, thus enabling the rapid and cost-effective design of custom chips when no adequately specialized, off-the-shelf design is available. PICO research takes a systematic approach to the hierarchical design of complex systems and advances technologies for automatically designing custom nonprogrammable accelerators and VLIW processors. While skeptics often assume that automated design must emulate human designers who invent new solutions to problems, PICO's approach is to automatically pick the most suitable designs from a well-engineered space of designs. Such automation of embedded computer design promises an era of yet more growth in the number and variety of innovative smart products by lowering the barriers of design time, designer availability, and design cost.
Article
Keywords: architecture synthesis, micro-architecture synthesis, VLIW processors, EPIC processors, automatic processor design, abstract architecture specification, datapath design, resource allocation, mdes extraction, compiler retargeting, controlpath design, instruction pipeline design, RTL generation.
This paper describes a mechanism for automatic design and synthesis of very long instruction word (VLIW), and its generalization, explicitly parallel instruction computing (EPIC) processor architectures starting from an abstract specification of their desired functionality. The process of architecture design makes concrete decisions regarding the number and types of functional units, number of read/write ports on register files, the datapath interconnect, the instruction format, its decoding hardware, and the instruction unit datapath. The processor design is then automatically synthesized into a detailed RTL-level structural model in VHDL along with an estimate of its area. The system also generates the corresponding detailed machine description and instruction format description that can be used to retarget a compiler and an assembler respectively. This process is part of an overall design system, called Program-In-Chip-Out (PICO), which has the ability to perform automatic exploration of the architectural design space while customizing the architecture to a given application and making intelligent, quantitative, cost-performance tradeoffs.
Article
The newly inaugurated Research Resource for Complex Physiologic Signals, which was created under the auspices of the National Center for Research Resources of the National Institutes of Health, is intended to stimulate current research and new investigations in the study of cardiovascular and other complex biomedical signals. The resource has 3 interdependent components. PhysioBank is a large and growing archive of well-characterized digital recordings of physiological signals and related data for use by the biomedical research community. It currently includes databases of multiparameter cardiopulmonary, neural, and other biomedical signals from healthy subjects and from patients with a variety of conditions with major public health implications, including life-threatening arrhythmias, congestive heart failure, sleep apnea, neurological disorders, and aging. PhysioToolkit is a library of open-source software for physiological signal processing and analysis, the detection of physiologically significant events using both classic techniques and novel methods based on statistical physics and nonlinear dynamics, the interactive display and characterization of signals, the creation of new databases, the simulation of physiological and other signals, the quantitative evaluation and comparison of analysis methods, and the analysis of nonstationary processes. PhysioNet is an on-line forum for the dissemination and exchange of recorded biomedical signals and open-source software for analyzing them. It provides facilities for the cooperative analysis of data and the evaluation of proposed new algorithms. In addition to providing free electronic access to PhysioBank data and PhysioToolkit software via the World Wide Web (http://www.physionet.org), PhysioNet offers services and training via on-line tutorials to assist users with varying levels of expertise.
Conference Paper
The paper describes a mechanism for automatic design and synthesis of very long instruction word (VLIW), and its generalization, explicitly parallel instruction computing (EPIC) processor architectures starting from an abstract specification of their desired functionality. The process of architecture design makes concrete decisions regarding the number and types of functional units, number of read/write ports on register files, the datapath interconnect, the instruction format, its decoding hardware, and the instruction unit datapath. The processor design is then automatically synthesized into a detailed RTL-level structural model in VHDL, along with an estimate of its area. The system also generates the corresponding detailed machine description and instruction format description that can be used to retarget a compiler and an assembler respectively. All this is part of an overall design system, called Program-In-Chip Out (PICO), which has the ability to perform automatic exploration of the architectural design space while customizing the architecture to a given application and making intelligent, quantitative, cost-performance tradeoffs