Fig 1 - uploaded by Kyprianos Papadimitriou
Source publication
Current and future computing systems increasingly require that their functionality stay flexible after the system is operational, in order to cope with changing user requirements and improvements in system features, e.g. changing protocols and data-coding standards, evolving demands for support of different user applications, and newly emerging ap...
Similar publications
Performance antipatterns document bad design patterns that have a negative influence on system performance. In our previous work we formalized such antipatterns as logical predicates defined over four views: (i) the static view, which captures the software elements (e.g. classes, components) and the static relationships among them; (ii) the dynami...
Citations
... To enhance device utilization and energy efficiency in unpredictable scenarios, concepts like run-time remapping [4], [3] and dynamic parallelism [5], [6], [7] have been proposed. Run-time remapping changes the physical placement of an application to reduce communication [4], memory [8], and/or reconfiguration [9] costs. Dynamic parallelism parallelizes an application to achieve speedup and generate additional time slack that allows the platform to operate at a lower voltage/frequency. ...
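The time-slack idea behind dynamic parallelism can be sketched as follows: splitting the work across more lanes lowers the clock frequency needed to meet the same deadline, which in turn permits voltage/frequency scaling. This is a minimal illustrative sketch, assuming the work divides evenly across lanes; the function name and parameters are hypothetical, not from the cited works.

```python
import math

def min_frequency_hz(work_cycles, parallelism, deadline_s):
    """Lowest clock frequency (Hz) that still meets the deadline when
    the work is split evenly across `parallelism` lanes.

    Illustrative model only: ignores parallelization overheads and
    assumes a perfectly divisible workload.
    """
    per_lane_cycles = math.ceil(work_cycles / parallelism)
    return per_lane_cycles / deadline_s
```

For example, a 1,000,000-cycle task with a 10 ms deadline needs 100 MHz on one lane, but only 50 MHz on two lanes; the freed headroom is the "time slack" that dynamic parallelism trades for lower voltage/frequency.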
Today, Coarse-Grained Reconfigurable Architectures (CGRAs) host multiple applications with arbitrary communication and computation patterns. Compile-time mapping decisions are neither optimal nor desirable for efficiently supporting such diverse and unpredictable application requirements. As a solution to this problem, recently proposed architectures offer run-time remapping: the run-time remappers displace or expand (parallelize/serialize) an application to optimize different parameters (such as platform utilization). However, existing remappers support application displacement or expansion in either the horizontal or the vertical direction only. Moreover, most works address dynamic remapping only in packet-switched networks and are therefore not applicable to CGRAs that exploit circuit switching for low power and high predictability. To enhance the optimality of run-time remappers, this paper presents a design framework called Run-time Rotatable-expandable Partitions (RuRot). RuRot provides architectural support to dynamically remap or expand (i.e. parallelize) the hosted applications in CGRAs with circuit-switched interconnects. Compared to the state of the art, the proposed design supports application rotation (in clockwise and anticlockwise directions) and displacement (in horizontal and vertical directions) at run-time. Simulation results using a few applications reveal that the additional flexibility significantly enhances device utilization (by 50% on average for the tested applications). Synthesis results confirm that the proposed remapper has negligible silicon (0.2% of the platform) and timing (2 cycles per application) overheads.
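The benefit of rotation on top of plain displacement can be illustrated geometrically: an application footprint that does not fit the free region in its original orientation may fit after a 90-degree turn. This is a minimal sketch of that idea on an abstract cell grid, assuming a footprint is just a set of (row, column) cells; it is not RuRot's actual hardware mechanism.

```python
def rotate_cw(cells):
    """Rotate an application's footprint 90 degrees clockwise,
    renormalised so its bounding box starts at (0, 0)."""
    max_r = max(r for r, _ in cells)
    return {(c, max_r - r) for r, c in cells}

def translate(cells, dr, dc):
    """Horizontal/vertical displacement of the footprint."""
    return {(r + dr, c + dc) for r, c in cells}

def fits(cells, free):
    """True if every cell of the (possibly rotated/displaced)
    footprint lands on a free resource of the platform."""
    return cells <= free
```

For instance, a 3-row-tall L-shaped footprint cannot be placed on a platform region that is only 2 rows tall, but its clockwise rotation (2 rows tall, 3 columns wide) can; a displacement-only remapper would reject the request even though sufficient resources are free.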
In the era of platforms hosting multiple applications with arbitrary inter-application communication and computation patterns, compile-time mapping decisions are neither optimal nor desirable. As a solution to this problem, recently proposed architectures offer run-time remapping. The run-time remapping techniques displace or parallelize/serialize an application to optimize different parameters (e.g., utilization and energy). To implement dynamic remapping, reconfigurable architectures commonly store multiple (compile-time generated) implementations of an application, each representing a different platform location and/or degree of parallelism, and select the optimal implementation at run-time. However, this compile-time binding either incurs excessive configuration memory overheads or is unable to map/parallelize an application even when sufficient resources are available. As a solution to this problem, we present Transformation-based reMapping and parallelism (TransMap). TransMap stores only a single implementation and applies a series of transformations to the stored bitstream to remap or parallelize an application. Compared to the state of the art, in addition to simple relocation in the horizontal/vertical directions, TransMap also allows rotating an application to map or parallelize it in resource-constrained scenarios. By storing only a single implementation, TransMap offers significant reductions in configuration memory requirements (up to 73 percent for the tested applications) compared to state-of-the-art compaction techniques. Simulation results reveal that the additional flexibility reduces the energy requirements by 33 percent and enhances device utilization by 50 percent for the tested applications. Gate-level analysis reveals that TransMap incurs negligible silicon (0.2 percent of the platform) and timing (6 additional cycles per application) penalty.
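The memory-saving argument can be sketched abstractly: instead of storing one bitstream per candidate placement, store one bitstream plus a cheap address transformation applied at load time. This is an illustrative model only, assuming a bitstream is a list of (cell address, configuration word) pairs; TransMap's real bitstream format and transformation hardware are not shown here.

```python
def remap(bitstream, transform):
    """Apply an address transformation to a single stored bitstream,
    rather than storing one bitstream copy per placement."""
    return [(transform(addr), word) for addr, word in bitstream]

def shift(dr, dc):
    """Relocation in the vertical (dr) / horizontal (dc) direction."""
    return lambda addr: (addr[0] + dr, addr[1] + dc)
```

With this scheme, N candidate placements cost one stored bitstream plus N tiny transform descriptors, instead of N full bitstream copies; that asymmetry is the source of the configuration-memory reduction the abstract reports.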
For high-performance embedded hard-real-time systems, ASICs and FPGAs hold advantages over general-purpose processors and graphics accelerators (GPUs). However, developing signal processing architectures from scratch requires significant resources. Our design methodology is based on sets of configurable building blocks that provide storage, dataflow, computation, and control. From these building blocks, we generate hundreds of thousands of dynamic streaming engine processors, which we call DSEs. We store our DSEs in a repository that can be queried for (online) design space exploration. From this repository, DSEs can be downloaded and instantiated within milliseconds on FPGAs. If a loss of flexibility can be tolerated, then ASIC implementations are feasible as well. In this article we focus on FPGA implementations. Our DSEs vary in cores, computational lanes, bitwidths, power consumption, and frequency. To the best of our knowledge, we are the first to propose online design space exploration based on repositories of precompiled cores assembled from common building blocks. For demonstration purposes, we map algorithms for image processing and financial mathematics to DSEs and compare the performance to existing highly optimized signal and graphics accelerators.
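Querying a repository of precompiled cores can be sketched as constraint filtering followed by ranking. This is a minimal illustration of that online design-space-exploration step, assuming each repository entry is a record with hypothetical `power_mw`, `freq_mhz`, and `lanes` fields; the actual DSE repository schema and query interface are not described in the abstract.

```python
def query_repository(repo, max_power_mw, min_freq_mhz):
    """Online DSE over precompiled cores: keep entries that satisfy
    the power and frequency constraints, then return the one with
    the most computational lanes (or None if nothing qualifies)."""
    feasible = [d for d in repo
                if d["power_mw"] <= max_power_mw
                and d["freq_mhz"] >= min_freq_mhz]
    return max(feasible, key=lambda d: d["lanes"], default=None)
```

Because the cores are precompiled, a query like this only selects and downloads an existing bitstream, which is what makes millisecond-scale instantiation on the FPGA possible.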