-
ABSTRACT: Coarse Grain Reconfigurable Array (CGRA) architectures have been extensively used for accelerating time consuming loops. The
design of such systems requires a good balance between the architecture's capabilities and the loops’ characteristics. A reliable
design is characterized by an optimized cost-performance trade-off. The main goal of this paper is to present an exploration
framework that automates the evaluation of CGRA architectures. Specifically, the framework helps the designer identify CGRA
architectures tuned toward a specific application domain. The whole process is assisted: (1) by an optimized retargetable
compiler based on modulo scheduling and (2) by the Synopsys Design Compiler that provides realization metrics such as the
area and clock frequency. Both target the description of a parametric CGRA architecture template, which is capable of instantiating
a large diversity of these architectures. Many studies to date suggest that clock frequency influences performance. However,
none of them examines the impact of architecture on clock frequency and performance. Our work studies, for the first time in a
unified way, the area, the clock frequency, the instructions per cycle, and performance. Hence, architectures with a good compromise
between cost and performance can be identified. Another objective of the paper is to present the advances made to the compiler
approach used by the exploration framework. Specifically, a new, more effective priority scheme is proposed, and the modulo
scheduler has been equipped with backtracking capability. The experiments outline the algorithm’s efficiency and scalability
for a given set of DSP benchmarks. Moreover, optimized architectures with respect to cost-performance trade-off have been
identified by an exploration over 72 CGRA architecture alternatives.
The Journal of Supercomputing 04/2012; 48(2):115-151. · 0.58 Impact Factor
-
Microprocessors and Microsystems - Embedded Hardware Design. 01/2009; 33:91-105.
-
Journal of Systems Architecture - Embedded Systems Design. 01/2008; 54:479-490.
-
Journal Comp. Netw. and Communic. 01/2008; 2008.
-
Signal Processing Systems. 01/2008; 50:179-200.
-
ABSTRACT: The speedups achieved in a generic microprocessor system by employing a high-performance data-path are presented. The data-path
acts as a coprocessor that accelerates time critical code segments, called kernels, thereby increasing the overall performance.
The data-path has been previously introduced by the authors and it is composed of Flexible Computational Components (FCCs)
that can realize any two-level template of primitive operations. A design flow, integrating the automated coprocessor synthesis
method, for executing applications on the system is presented. To evaluate the effectiveness of our coprocessor approach,
an analytical exploration with respect to the type of the custom data-path and to the microprocessor architecture is performed.
The kernel and the overall application speedups of six real-life applications, relative to the software execution on the microprocessor,
are estimated using the design flow. Kernel speedups of up to 155 are achieved, resulting in an average overall improvement
of 2.78 with a small overhead in circuit area. The design flow accelerated the applications close to the theoretical
bounds. A comparison with another high-performance data-path showed that the proposed coprocessor achieves better performance
while having smaller area-time products for the generated data-paths.
The Journal of Supercomputing 02/2007; 39(3):251-271. · 0.58 Impact Factor
-
21st International Parallel and Distributed Processing Symposium (IPDPS 2007), Proceedings, 26-30 March 2007, Long Beach, California, USA; 01/2007
-
Proceedings of the 17th ACM Great Lakes Symposium on VLSI 2007, Stresa, Lago Maggiore, Italy, March 11-13, 2007; 01/2007
-
IEEE Trans. VLSI Syst. 01/2007; 15:1362-1366.
-
Integrated Circuit and System Design. Power and Timing Modeling, Optimization and Simulation, 17th International Workshop, PATMOS 2007, Gothenburg, Sweden, September 3-5, 2007, Proceedings; 01/2007
-
Proceedings of the 17th ACM Great Lakes Symposium on VLSI 2007, Stresa, Lago Maggiore, Italy, March 11-13, 2007; 01/2007
-
Microprocessors and Microsystems. 01/2007; 31:1-14.
-
CoRR. 01/2007; abs/0710.4844.
-
The Journal of Supercomputing. 01/2007; 40:127-157.
-
ACM Trans. Design Autom. Electr. Syst. 01/2007; 12.
-
Proceedings of the 4th Conference on Computing Frontiers, 2007, Ischia, Italy, May 7-9, 2007; 01/2007
-
ABSTRACT: A partitioning methodology between the reconfigurable hardware blocks of different granularity, which are embedded in a generic
heterogeneous architecture, is presented. The fine-grain reconfigurable logic is realized by an FPGA unit, while the coarse-grain
reconfigurable hardware by a 2-Dimensional Array of Processing Elements. Critical parts, called kernels, are mapped on the
coarse-grain reconfigurable logic to improve performance. The partitioning method is composed mainly of three steps: the
analysis of the input code, the mapping onto the Coarse-Grain Reconfigurable Array and the mapping onto the FPGA. The partitioning
flow is implemented by a prototype software framework. Analytical partitioning experiments, using five real-world applications,
show that the execution time speedup relative to an all-FPGA solution ranges from 1.4 to 5.0.
The Journal of Supercomputing 01/2006; 38(1):17-34. · 0.58 Impact Factor
-
ABSTRACT: A hardware/software partitioning methodology for improving performance in single-chip systems composed of a processor and Field
Programmable Gate Array reconfigurable logic is presented. Speedups are achieved by executing critical software parts on the
reconfigurable logic. A hybrid System-on-Chip platform, which can model the majority of existing processor-FPGA systems, is
considered by the methodology. The partitioning method uses an automated kernel identification process at the basic-block
level for detecting critical kernels in applications. Three different instances of the generic platform and two sets of benchmarks
are used in the experimentation. The analysis of five real-life applications showed that these applications spend an average
of 69% of their instruction count in an average of 11% of their code. The extensive experiments illustrate that, for systems
composed of 32-bit processors, the improvements for five applications range from 1.3 to 3.7 relative to an all-software solution.
For a platform composed of an 8-bit processor, the performance gains of eight DSP algorithms are considerably greater, as
the average speedup equals 28.
The Journal of Supercomputing 01/2006; 35(2):185-199. · 0.58 Impact Factor
-
Proceedings of 2006 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (IC-SAMOS 2006), Samos, Greece, July 17-20, 2006; 01/2006
-
20th International Parallel and Distributed Processing Symposium (IPDPS 2006), Proceedings, 25-29 April 2006, Rhodes Island, Greece; 01/2006