Conference Paper

The microarchitecture of FPGA-based soft processors

Abstract

As more embedded systems are built using FPGA platforms, there is an increasing need to support processors in FPGAs. One option is the soft processor, a programmable instruction processor implemented in the reconfigurable logic of the FPGA. Commercial soft processors have been widely deployed, and hence we are motivated to understand their microarchitecture. We must re-evaluate microarchitecture in the soft processor context because an FPGA platform differs significantly from an ASIC platform: for example, the relative speed of memory and logic is quite different in the two platforms, as is the area cost. In this paper we present an infrastructure for rapidly generating RTL models of soft processors, as well as a methodology for measuring their area, performance, and power. Using our automatically generated soft processors we explore the microarchitecture trade-off space including: (i) hardware vs. software multiplication support; (ii) shifter implementations; and (iii) pipeline depth, organization, and forwarding. For example, we find that a 3-stage pipeline has better wall-clock-time performance than deeper pipelines, despite lower clock frequency. We also compare our designs to Altera's NiosII commercial soft processor variations and find that our automatically generated designs span the design space while remaining very competitive.
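
The wall-clock-time comparison mentioned in the abstract follows from the standard relation: execution time = instruction count × cycles per instruction / clock frequency. The sketch below shows how a design with a lower clock frequency can still finish sooner if it spends fewer cycles per instruction. The function name and every number are hypothetical, chosen only for illustration; none are measurements from the paper.

```python
def wall_clock_seconds(instructions: int, cpi: float, freq_hz: float) -> float:
    """Execution time = instruction count * cycles per instruction / clock frequency."""
    if freq_hz <= 0:
        raise ValueError("freq_hz must be positive")
    return instructions * cpi / freq_hz


# Hypothetical figures for illustration only, NOT results from the paper.
N = 1_000_000
# Shallower pipeline: slower clock, but fewer stall cycles per instruction.
shallow = wall_clock_seconds(N, cpi=1.2, freq_hz=80e6)
# Deeper pipeline: faster clock, but more cycles lost to hazards and branches.
deep = wall_clock_seconds(N, cpi=1.8, freq_hz=100e6)

print(f"shallow: {shallow * 1e3:.1f} ms, deep: {deep * 1e3:.1f} ms")
```

With these assumed values the shallower pipeline wins despite its 20% lower clock, which is the kind of outcome the abstract reports; the actual trade-off depends on the measured CPI and frequency of each design.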

... FPGAs are flexible devices because, among other features, they allow hardware to be described using Hardware Description Languages (HDLs) and support repeated hardware reconfiguration (custom hardware design). In this work specifically, a free soft-processor [31] provided by the FPGA manufacturer can be used for co-design development. A soft-processor is an Intellectual Property (IP) core which is 100% implemented using the logic primitives of the FPGA. ...
... A soft-processor is an Intellectual Property (IP) core which is 100% implemented using the logic primitives of the FPGA. Another definition is: a programmable instruction processor implemented in the reconfigurable logic of the FPGA [31]. Soft-processors have several advantages, and one of the most relevant for current designs is the possibility of implementing multi-core solutions by defining the exact number of soft-processors required by the application. ...
... Soft-processors have several advantages, and one of the most relevant for current designs is the possibility of implementing multi-core solutions by defining the exact number of soft-processors required by the application. Since it is implemented in configurable logic (FPGA), a soft-processor can be tuned by customizing its implementation and complexity, by changing its internal configuration or processor functional blocks, or even by adding new instructions to the processor, in order to match the exact requirements of an application [31]. The remainder of the paper is organized as follows: Section 2 presents some recent hardware development using NCC. ...
Chapter
Cross-correlation is an important image processing algorithm for template matching, widely used in computer-vision-based systems. This work follows a profile-based hardware/software co-design method to develop an architecture for calculating the normalized cross-correlation coefficient using the Nios II soft-processor. Results present comparisons between a general-purpose processor implementation and different customized soft-processor implementations, considering execution time and the influence of image and sub-image size. Nios II soft-processors configured with floating-point hardware acceleration achieved an 8.31 speedup. Keywords: Cross-Correlation, Hardware/Software Co-Design, Profile-Based Method
... In an SC overlay, a single operation node is mapped to an individual FU and data is shifted between FUs over a programmable, but temporally dedicated, point-to-point link. That is, the FU and interconnect configuration are fixed while the kernel executes. The benefit of an SC overlay is that kernel execution achieves an initiation interval (II) [54] of one, with throughput just determined by the operating frequency of the overlay. ... [Figure residue: timeline of overlays from 2006 to 2016, listing QUKU, IF, VDR, Heracles, ZUMA, Octavo, reMORPH, VCGRA, CARBON, MXP, SCGRA, TILT, DSP-based, Linear TM, DeCO, and GRVI Phalanx.]
... SPREE. The Soft Processor Rapid Exploration Environment (SPREE) was developed to automatically generate synthesizable HDL implementations of soft processor architectures from textual descriptions of the ISA and datapath [80], facilitating the microarchitectural exploration of soft processors. The SPREE processor with a 3-stage pipeline demonstrates 9% less area and 11% speedup in wall-clock-time compared to the Nios II family of commercial soft processors. ...
Article
This article presents a comprehensive survey of time-multiplexed (TM) FPGA overlays from the research literature. These overlays are categorized based on their implementation into two groups: processor-based overlays, whose implementation follows that of conventional silicon-based microprocessors, and CGRA-like overlays, with either an array of interconnected processor-based functional units or medium-grained arithmetic functional units. Time-multiplexing the overlay allows it to change its behavior with a cycle-by-cycle execution of the application kernel, thus allowing better sharing of the limited FPGA hardware resource. However, most TM overlays suffer from large resource overheads, due to either the underlying processor-like architecture (for processor-based overlays) or the routing array and instruction storage requirements (for CGRA-like overlays). Reducing the area overhead for CGRA-like overlays, specifically that required for the routing network, and better utilizing the hard macros in the target FPGA are active areas of research.
... The most related soft processors to our investigations are the PicaRISC processor from Altera [69], which has not been publicly documented yet, and the multithreaded UT-II processor [43], which we describe in Chapter 4. SPREE [161] gives an overview of the performance and area consumption of soft processors. As a reference point, on a Stratix EP1S40F780C5 device with the fastest speed grade, a platform that we use in Chapter 4, the NiosII-fast (the variation with the fastest clock rate [16]) reaches 135 MHz (but instructions are not retired on every cycle). ... [Footnote 1: Now owned by Netronome Systems Inc.]
... As well, we are interested in using a compiler to enable code transformations that can potentially save reconfigurable hardware resources. We perform our experiments on the processors generated by the SPREE infrastructure [161,162]. SPREE takes as input an architectural description of a processor and generates an RTL implementation of it, based on a library of pre-defined modules. The RTL currently targets hardware blocks of the Altera FPGAs so the Quartus tools are integrated in the SPREE infrastructure to characterize the area, frequency and power of the resulting processors. ...
Article
Packet processing is the enabling technology of networked information systems such as the Internet and is usually performed with fixed-function custom-made ASIC chips. As communication protocols evolve rapidly, there is increasing interest in adapting features of the processing over time and, since software is the preferred way of expressing complex computation, we are interested in finding a platform to execute packet processing software with the best possible throughput. Because FPGAs are widely used in network equipment and they can implement processors, we are motivated to investigate executing software directly on the FPGAs. Off-the-shelf soft processors on FPGA fabric are currently geared towards performing embedded sequential tasks and, in contrast, network processing is most often inherently parallel between packet flows, if not between each individual packet. Our goal is to allow multiple threads of execution in an FPGA to reach a higher aggregate throughput than commercially available shared-memory soft multi-processors via improvements to the underlying soft processor architecture. We study a number of processor pipeline organizations to identify which ones can scale to a larger number of execution threads and find that tuning multithreaded pipelines can provide compact cores with high throughput. We then perform a design space exploration of multicore soft systems, compare single-threaded and multithreaded designs to identify scalability limits and
... Another definition: a programmable instruction processor implemented in the reconfigurable logic of the FPGA [4]. Soft processors have several advantages, and one of the most relevant for current designs is the possibility of implementing the exact number of soft processors required by the application. ...
... Soft processors have several advantages, and one of the most relevant for current designs is the possibility of implementing the exact number of soft processors required by the application. Since it is implemented in configurable logic, a soft processor can be tuned by varying its implementation and complexity to match the exact requirements of an application [4]. ...
Article
Evolutionary algorithms are very common techniques in computational intelligence and robotics applications. Some algorithms need a large amount of memory and processing power, making them difficult to implement in embedded systems. In this work a profile-based approach is proposed and applied to an evolutionary algorithm with characteristics that allow its use in embedded systems and robotics: the micro-GA. The main goal is to implement a new hardware/software co-design architecture for this genetic algorithm with better execution time than algorithms implemented in software (using general-purpose hardware solutions). The presented results show a comparison between different co-design implementations and a discussion of the advantages of the new architecture.
... Hard processors, which are much faster than soft processors, are available in modern FPGAs. However, they have a few drawbacks [2]. Firstly, the number of built-in hard processors in the FPGA is limited and may be sub-optimal for the application. ...
... Hence, softcore designs offer several advantages: (1) the exact number and type of processors may be chosen for a given application; (2) customization may be performed at several granularity levels, resulting in an optimum design; (3) the silicon neutrality blurs the hardware/software line, increasing the hardware design productivity to software levels; and (4) softcores suffer a much smaller performance loss from off-chip memory latencies due to their reduced clock frequencies [3], if off-chip memory is used at all. If on-chip memory is used, there is no need to explore the memory hierarchy [4]. The two extremes of the softcore spectrum are general-purpose processors (e.g., OpenSPARC, MIPS) and ASIPs. ...
Conference Paper
The growth in embedded systems complexity has created the demand for novel tools which allow rapid systems development and facilitate the designer's management of complexity. Especially since systems must incorporate a variety of often contradictory characteristics, achieving design metrics in short development time is an increasing challenge. This paper presents RAPTOR-Design, a framework for System-on-Chip (SoC) design which incorporates a customizable processor architecture and allows rapid software-to-hardware migration, custom hardware integration in a tightly-coupled fashion and seamless Fault Tolerance (FT) capabilities for FPGA platforms. The impact of processor customization, FT capabilities and custom hardware integration on design metrics is presented, as well as an overview of the design process using RAPTOR-Design.
... Soft microprocessors for FPGA implementation are available as both open designs and commercial products (see Yiannacouras et al. (2005)). These are universal, but customizable processors that use various processor architectures. ...
Conference Paper
Computationally intensive algorithms for closed-loop control and similar systems today can be implemented using reconfigurable FPGA devices. One approach is using soft microprocessor designs and implementing the algorithms as programs. This paper reports on a case study within an ongoing project that investigates a multi-core soft microprocessor solution for a closed-loop control application. Requirements within the project lead to a specialized processor design. Some results from experimental work are presented, demonstrating feasibility and efficiency of the approach.
... FPGA-based processors are utilized more and more in embedded application-specific systems. For instance, Yiannacouras, Rose, and Steffan [4] propose an FPGA-based, automatically generated, application-specific vector processor and show the possibility of scaling its performance. In their paper [5], Coyne, Cyganski, and Duckworth present an FPGA-based co-processor for accelerating the SART method for RF source localization. ...
Article
The paper presents the results of investigations into the possibility of using programmable logic devices (FPGAs) for building virtual multi-core processors dedicated to a chosen application. The paper shows the designed architecture of a multi-core processor specialized for performing a particular task and discusses its computational efficiency depending on the number of cores being used. The evaluation of the results is also discussed.
... The deparser compiler (§4) is described in Python and generates synthesizable VHDL code for the proposed deparser architecture. The generated architecture leverages the inherent configurability of FPGAs to avoid hardware constructs that cannot be efficiently implemented on FPGAs, such as crossbars or barrel shifters [1,25]. The simulation environment is based on cocotb [11], which allows using several off-the-shelf Python packages, such as Scapy, to generate test cases. ...
Preprint
The P4 language has drastically changed the networking field, as it allows new networking applications to be described and implemented quickly. Although a large variety of applications can be described with the P4 language, current programmable switch architectures impose significant constraints on P4 programs. To address this shortcoming, FPGAs have been explored as potential targets for P4 applications. P4 applications are described using three abstractions: a packet parser, match-action tables, and a packet deparser, which reassembles the output packet with the result of the match-action tables. While implementations of packet parsers and match-action tables on FPGAs have been widely covered in the literature, no general design principles have been presented for the packet deparser. Indeed, implementing a high-speed and efficient deparser on FPGAs remains an open issue because it requires a large number of interconnections and the architecture must be tailored to a P4 program. As a result, in several works where a P4 application is implemented on FPGAs, the deparser consumes a significant proportion of chip resources. Hence, in this paper, we address this issue by presenting design principles for efficient and high-speed deparsers on FPGAs. As an artifact, we introduce a tool that generates an efficient vendor-agnostic deparser architecture from a P4 program. Our design has been validated and simulated with a cocotb-based framework. The resulting architecture is implemented on Xilinx Ultrascale+ FPGAs and supports a throughput of more than 200 Gbps while reducing resource usage by almost 10× compared to other solutions.
... FPGAs are reconfigurable hardware devices [11] that can be configured, reconfigured and physically tested before the real hardware production. In this case the FPGA manufacturer provides a soft-processor (a processor that can be fully configured in reconfigurable hardware [12]) that is able to run system software and provides communication between the software and the developed hardware. ...
Conference Paper
Artificial neural networks are a parallel, fault-tolerant, robust solution for computational tasks such as associative memories, pattern recognition and function approximation. There are many proposed implementations of artificial neural networks and of network learning algorithms, both in hardware and software. Hardware implementation of learning algorithms is a computational challenge because constraints such as the maximum number of neurons and layers, training time, precision, and data representation are difficult to optimize together. This paper describes a hardware/software co-design implementation of the error back-propagation algorithm on multi-layer perceptron networks. Different types of processors, with different hardware features and goals, were created and the results were analyzed considering the mentioned constraints. The results present a hardware/software co-design that allows a large number of neurons and layers and that maintains initial precision without restrictions on data representation. Platform limitations resulted in high execution times, but solutions to this problem are also proposed. The developed hardware thus proved to be a good alternative considering current hardware implementations of training algorithms and the mentioned requirements.
... Area can be utilized efficiently with these techniques, but reconfiguring the hardware takes more time and consumes more power. Slow memory access and routing are the major drawbacks of FPGA-based soft and hard processors, respectively [4]. Programmable RISC signal processors provide a reduced instruction set to handle signal processing functions such as multiply-accumulate (MAC) efficiently. ...
Conference Paper
Energy consumption is an issue in information and communication technologies (ICT) in general, and in cellular communications in particular. Therefore there is a need to assess the hardware implementation methods used for cellular systems from an energy consumption perspective. OFDM is a bandwidth-efficient multiple access method. It is being used by all forms of future wireless mobile communication systems, including Long Term Evolution (LTE). The RF front end, baseband algorithms and protocols of these systems, including 4G cellular, need implementations with low power consumption, adequate processing speed, and low cost. In view of its importance and computational complexity, in this paper three forms of implementing the baseband communication systems, namely DSP, ASIC and RISC DSP based solutions, are considered for the study of OFDM-based multiple access methods of LTE through the FFT algorithm.
... Instead of application-specific compute engines, soft processors implemented in FPGA logic may be used to perform computation. Soft processors have different design constraints than fixed processors [6]; they must rely on the FPGA's embedded block RAM for register files or caches and embedded DSP blocks for complex operations such as shifting and multiplication. Operations such as multiplexing are relatively more resource intensive on FPGAs, while addition is relatively cheap thanks to built-in carry chains. ...
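
One way shifting can be mapped onto an embedded multiplier, as the excerpt above suggests, is to treat a shift by n as a multiplication by a power of two. The following is a minimal software sketch of that idea, assuming a 32-bit datapath; the function names and the bit width are illustrative assumptions and do not come from any of the cited designs.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit datapath


def shift_left_via_multiply(value: int, amount: int) -> int:
    """Left shift computed as value * 2**amount, truncated to 32 bits."""
    if not 0 <= amount < 32:
        raise ValueError("amount must be in [0, 32)")
    power_of_two = 1 << amount  # in hardware this would be a small one-hot decoder
    return ((value & MASK32) * power_of_two) & MASK32


def shift_right_via_multiply(value: int, amount: int) -> int:
    """Logical right shift read from the upper half of a 32x32 -> 64-bit product."""
    if not 0 <= amount < 32:
        raise ValueError("amount must be in [0, 32)")
    if amount == 0:
        return value & MASK32  # 2**32 does not fit in a 32-bit operand
    product = (value & MASK32) * (1 << (32 - amount))  # 64-bit product
    return (product >> 32) & MASK32


print(hex(shift_left_via_multiply(0x80000001, 1)))
print(hex(shift_right_via_multiply(0x80000000, 31)))
```

The appeal of this mapping is that one multiplier can serve both multiply and shift instructions, avoiding the wide multiplexer tree of a barrel shifter; whether that is a net win depends on the target device.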
Conference Paper
VENICE is a new soft vector processor (SVP) for FPGA applications that is designed for maximum throughput with a small number (1 to 4) of ALUs. By increasing clock speed and eliminating bottlenecks in ALU utilization, VENICE achieves over 2x better performance-per-logic block than VEGAS, the previous best SVP. VENICE is also simpler to program, as its instructions use standard C pointers into a scratchpad memory rather than vector registers.
... Area can be utilized efficiently with these techniques, at the cost of more reconfiguration time and more power consumption while reconfiguring the hardware. Slow memory access and routing are the major drawbacks of FPGA-based soft and hard processors, respectively, making them less energy efficient [6]. Furthermore, depending on the granularity, reconfigurable architectures are available as fine-, medium-, and coarse-grain arrays. ...
Conference Paper
Recent wireless broadband cellular standards are aimed at supporting very high data rate applications in limited available bandwidth. The most sophisticated and computationally complex subsystem of a transceiver in any such system is the baseband processing part. A multi-core processor is typically needed to provide the computational power required to implement a complex baseband processing subsystem such as that of an LTE transceiver. The energy consumption of the baseband subsystem is a very significant component of the total energy expenditure of a cellular radio system, particularly when the system employs MIMO and state-of-the-art VLSI. This paper aims to reduce the energy consumption and to quantify the achievable energy savings by applying recent trends in VLSI, such as CMOS technology scaling and the use of new heterogeneous multi-core architectures specific to signal processing. To explore and apply these energy-efficient techniques, we first estimated the energy consumption of LTE baseband functions, and then explored the energy savings obtainable from technology scaling and an optimum heterogeneous combination of architectures for mapping baseband algorithms.
... The ever-increasing density and performance of FPGAs have increased the importance and popularity of soft processors, defined as processors that are implemented and used on the reconfigurable logic of an FPGA [6]. Soft processors benefit from caches to improve performance, and thus, energy efficiency of the system [7]. ...
... The ISA of NIOS soft processors is based on a MIPS instruction set architecture (ISA), while that of Microblaze is a proprietary reduced instruction set computer (RISC) ISA. SPREE [16] is a development tool for automatically generating custom soft processors from a given specification. These soft processor architectures are fairly simple single-threaded processors that do not exploit parallelism other than pipelining. ...
Article
We propose a soft processor programming model and architecture inspired by graphics processing units (GPUs) that are well-matched to the strengths of FPGAs, namely, highly parallel and pipelinable computation. In particular, our soft processor architecture exploits multithreading, vector operations, and predication to supply a floating-point pipeline of 64 stages via hardware support for up to 256 concurrent thread contexts. The key new contributions of our architecture are mechanisms for managing threads and register files that maximize data-level and instruction-level parallelism while overcoming the challenges of port limitations of FPGA block memories as well as memory and pipeline latency. Through simulation of a system that (i) is programmable via NVIDIA's high-level Cg language, (ii) supports AMD's CTM r5xx GPU ISA, and (iii) is realizable on an XtremeData XD1000 FPGA-based accelerator system, we demonstrate the potential for such a system to achieve 100% utilization of a deeply pipelined floating-point datapath.
... In this section we provide a brief description of the VESPA architecture (see our previous paper for full details [9]). The VESPA processor consists of a scalar MIPS processor which was automatically generated using the SPREE system [7, 8], coupled with a parameterized vector coprocessor based on the VIRAM [4] instruction set. Table 1 shows all the parameters for VESPA while Figure 1 shows a block diagram of the VESPA architecture. The scalar processor and vector coprocessor share the instruction stream and are both in-order pipelines, but can execute out-of-order with respect to each other except for memory operations which are serialized to maintain sequential consistency. ...
Article
Recently proposed vector processing extensions [9, 10] can significantly improve the performance of a conventional FPGA-based soft processor, but significantly increase the pressure on the memory system to keep pace. In this work we investigate methods of improving the memory system for soft vector processors via (i) tuning the data cache configuration, namely its depth and line size, and (ii) hardware prefetching mechanisms. We evaluate on our VESPA soft vector processor connected to DDR and executing hand-vectorized benchmarks from the EEMBC industry-standard benchmark suite. We find that cache configuration provides a significant area/performance trade-off for designers to wield, providing near 2x average performance for 1.8x the system area. We also demonstrate that proper prefetching can improve performance by 28% on average, and up to 2.2x in the best case.
... Soft processors are programmable processors implemented in the reconfigurable logic fabric of FPGAs [51]. Soft processors provide a real design test case. ...
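
The area/performance trade-off quoted in the abstract above can be summarized as a performance-per-area ratio. The sketch below is only an arithmetic illustration using the rounded figures from the abstract ("near 2x" performance for 1.8x area); the function name is hypothetical.

```python
def performance_per_area(speedup: float, area_ratio: float) -> float:
    """Relative performance per unit area compared to the baseline (baseline = 1.0)."""
    if area_ratio <= 0:
        raise ValueError("area_ratio must be positive")
    return speedup / area_ratio


# Rounded figures from the abstract: near 2x performance for 1.8x system area.
ratio = performance_per_area(2.0, 1.8)
print(f"performance per area relative to baseline: {ratio:.2f}")
```

A ratio above 1.0 means the larger cache configuration buys slightly more performance than it costs in area; since the abstract says "near 2x", the true ratio is somewhat lower than this rounded estimate.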
Article
Power and temperature are key design concerns in modern computing systems. Power minimization is essential for battery-operated devices and for large-scale data center facilities. The spatial and temporal allocation of within-die power consumption leads to thermal gradients and hot spots during operation. Temperature impacts key circuit metrics such as reliability, speed, and leakage power, and it is a major constraint towards improving the performance of high-end computing devices. Due to the enormous complexities and sheer number of modeling parameters of state-of-the-art designs, pre-silicon power and thermal models cannot be trusted blindly. It is necessary to complement pre-silicon analysis with post-silicon thermal and power characterization on the fabricated devices, and then to use the characterization results to improve the design during re-spins before ramp and production. In this paper, we describe new techniques for thermal and power characterization of real computing devices. We show how the measurements from infrared imaging, embedded thermal sensors, and current meters can be integrated to accurately characterize the temperatures and power of computing devices during operation. We describe the key algorithmic and experimental techniques required to overcome the challenges encountered when working with real devices. We present characterization results of a dual-core processor and a programmable logic device.
... The Quartus II computer-aided design tools work with both the chips on the DE2 and other Altera devices. NIOS II processors implement a 32-bit instruction set based on a general-purpose RISC architecture (Yiannacouras 2005, 2006). Because it is a soft-core configurable processor (Sheldon 2006), FPGA developers can choose from a myriad of system configurations, picking the systems. ...
Article
This paper describes a Field Programmable Gate Array (FPGA) paint program similar to the one found on the Windows operating system. This paint brush application uses a modern educational kit that has recently become available for Computer Science hardware courses. The educational package employs state-of-the-art technology in both hardware and software. This new technology is currently being used in many universities and many electronic commercial products. With an integrated design environment and its reconfigurable capabilities on the board, where hardware can be changed on the fly, we implement the system-on-chip application and immediately see execution results. Further development of this program can lead to a more complex and sophisticated application.
... In parallel, the growing need for flexibility and rapid product evolution requires the integration of programmable solutions, and therefore the embedding of processors. In fact, using these processors is a solution of choice when performance, cost, area, or power consumption constraints are stringent [1]. The choice of a processor for the SOPC can be made on various criteria: • a hardcore processor, which offers considerable performance but moderate flexibility. ...
Conference Paper
This article presents a comparative study of two softcore processors: NIOS II from ALTERA and LEON 3 from Gaisler Research. Our comparison is based on two criteria: the number of resources occupied by each of the two softcore processors, as given by synthesis results obtained with Quartus II, and performance, as given by results obtained with the STANFORD benchmark suite. This comparison showed that LEON 3 offers higher computing performance than NIOS II, but at the cost of higher resource usage than NIOS II. Keywords: Softcore processor, FPGA, NIOS II, LEON 3, Benchmark. I. INTRODUCTION Advances in integrated-circuit manufacturing technology make it possible to reach a density such that complete systems can now be integrated on a single chip (System On Programmable Chip, "SOPC"). Mobile telephony, digital television, and video telephony applications are examples of such integrated systems. In parallel, the growing need for flexibility and rapid product evolution requires the integration of programmable solutions, and therefore the embedding of processors.
In fact, using these processors is a solution of choice when performance, cost, area, or power consumption constraints are stringent [1]. The choice of a processor for the SOPC can be made on various criteria: • a hardcore processor, which offers considerable performance but moderate flexibility; • a softcore processor, which is much more flexible but offers lower performance.
... The ever-increasing density and performance of FPGAs have increased the importance and popularity of soft processors, defined as processors that are implemented and used on the reconfigurable logic of an FPGA [6]. Soft processors benefit from caches to improve performance, and thus, energy efficiency of the system [7]. ...
Conference Paper
A reconfigurable cache architecture for object-oriented application-specific instruction set processors (ASIP) is presented in this paper. The embedded ASIPs we follow in this research are specifically designed to suit object-oriented applications and are synthesized form an object-oriented high-level specification. The ASIPs are composed of a processor core along with a number of hardware functional units. In order to support concurrent execution of the functional units, we propose a cache architecture which is virtually divided into a number of partitions. The partition sizes can be dynamically changed depending on the run-time behavior of the application. Partitioning the cache not only provides the concurrent memory access for the functional units, but also reduces the number of tag comparisons per cache access. We also develop a simple and energy-efficient cache consistency mechanism among cache partitions. In this paper we evaluate the impact of the proposed cache architecture on the cache energy consumption, by implementing it in a number of our ASIPs. The results show that the proposed cache architecture reduces the number of tag comparisons per cache access by 39% on average
... Much research has been carried out to ease the specification of an ASIP. It has led to the emergence of ADL languages [46][70][80][78], dedicated to describing processor architectures. With such a tool, the processor (its architecture, its instruction set, etc.) can be specified with a high-level description and its characteristics evaluated by simulation. ...
Article
The design of a real-time embedded vision system imposes multiple and severe design constraints, which can conflict. Smart camera applications usually require integration of sophisticated processing near the transducer, with sufficient processing power to run the algorithms at the information flow rate, using a system of minimal size that consumes little power. Today, transistor integration density enables the main part, or even all, of a complete system to be concentrated on a single component (System on a Chip - SoC). A strongly automated design flow is proposed; it reduces design effort and design costs in order to enable fast implementation of complex algorithms in a SoC. Our overall hardware implementation method is based on meeting the algorithm's processing power requirements and communication needs by refining a generic parallel architecture model. Actual hardware implementation is done by choosing and parameterizing readily available reconfigurable hardware modules and customizable commercially available IPs. Software design is based on parallelization of the algorithms. A homogeneous and regular hardware architecture model is chosen, which enables the use of parallelization tools. Along with the design method, most of the work presented in this thesis focuses on enabling an automated hardware design environment. This includes tools enabling automated generation of a homogeneous network of communicating processors, and hardware components (IP) for the communication network, including a packet router. The presented approach is illustrated by embedding a real-time image stabilization algorithm on SoPC technology.
... Various open-source soft-core processors [7]-[12] have been widely used for academic and commercial purposes. Such soft-cores are developed based on full description of a processor from scratch using Hardware Description Language (HDL) [13], modification of pre-tested Intellectual Property (IP) cores (e.g., MicroBlaze and Nios II), or pre-characterized cores that are developed by higher abstraction level tools (e.g., SPREE [14] and Chisel [15]). ...
Article
Full-text available
Nowadays, embedded processors are widely used in a wide range of domains, from low-power to safety-critical applications. By providing prominent features such as varied peripheral support and flexibility for partial or major design modifications, Field-Programmable Gate Arrays (FPGAs) are commonly used to implement either an entire embedded system or a Hardware Description Language (HDL)-based processor, known as a soft-core processor. FPGA-based designs, however, suffer from high power consumption, large die area, and low performance, which hinder the common use of soft-core processors in low-power embedded systems. In this paper, we present an efficient reconfigurable architecture to implement soft-core embedded processors in SRAM-based FPGAs by using characteristics such as low utilization and fragmented accessibility of the comprising units. To this end, we integrate the low-utilized functional units into efficiently designed Look-Up Table (LUT) based Reconfigurable Units (RUs). To further improve the efficiency of the proposed architecture, we use a set of efficient Configurable Hard Logics (CHLs) that implement frequent Boolean functions, while the other functions are still implemented by LUTs. We have evaluated the effectiveness of the proposed architecture by implementing the Berkeley RISC-V processor and running MiBench benchmarks. We have also examined the applicability of the proposed architecture on an alternative open-source processor (i.e., LEON2) and a Digital Signal Processing (DSP) core. Experimental results show that the proposed architecture, compared to conventional LUT-based soft-core processors, improves area footprint, static power, energy consumption, and total execution time by 30.7%, 32.5%, 36.9%, and 6.3%, respectively.
... Yiannacouras et al. [18] present a comprehensive set of architectural modifications on the Altera Nios II processor, including pipeline depth, multiply/divide units and shifter implementation. Unfortunately, FPGAs are seldom an option for IoT, as these devices are very cost-sensitive and energy efficiency is mandatory. ...
Article
Contemporary embedded systems require low-power solutions while still keeping a minimum performance level, and this is even more acute in the Internet of Things (IoT) domain, with its vast design space. This work proposes a configurable RISC processor associated to a design flow that includes a hardware synthesis flow and a software toolchain. This design flow is useful to explore design space and trade-offs of processor cores for IoT applications, by enabling multiple hardware configurations with variable degrees of complexity, while maintaining compatibility with the chosen instruction set architecture, which is itself configurable. Results rely on example designs targeting a 65 nm technology and post-mapped hardware simulations of two benchmarks sets, the CoreMark and Mälardalen suites. These results indicate that substantial power savings can be obtained by tailoring the architecture to a given application class, while reducing hardware complexity and maintaining performance figures. Findings show that the proposed processor provides an interesting resource to target low-end and middle-sized IoT applications, while demonstrating that reducing hardware complexity usually leads to the best trade off between performance and power.
... There are also other frameworks which explore the design space of softcore processors such as SPREE (Yiannacouras et al., 2005), UNUM (Dave and Pellauer, 2005), PEAS-III (Kitajima et al., 2001), Xtensa Xplorer (Gonzalez, 2000) and ARChitect (ARC, 2012) not allowing, however, the gateware acceleration of RTOS functionalities. ...
Thesis
Full-text available
Embedded systems are increasingly more complex computational systems, often heterogeneous and also with real-time requirements, supporting sophisticated and demanding software tasks. To deal with this complexity, real-time constraints and increasingly shorter time-to-market, Real-Time Operating Systems (RTOSes) are used to provide an abstraction layer on top of the hardware, providing several mechanisms to simplify and coordinate the system's behavior. These mechanisms induce latencies and large CPU time consumption, which consequently increase overhead and contribute to the system's performance degradation. This thesis proposes to study and implement tools and methodologies that, considering the application's requirements and the programmable hardware's restrictions, alleviate this overhead through hardware acceleration, by: (1) incorporating RTOSes' primitives and structures in the CPU, whenever possible; (2) providing customization capabilities to enable design space exploration (DSE) while migrating such primitives and structures, as well as, application specific functionalities to gateware and (3) offering an agnostic solution to promote portability among different RTOSes. The implemented solution contributes to the real-time embedded systems field by presenting novel micro-architectural features to cope with real-time requirements. This thesis presents a co-designed software/hardware multithreaded architecture that promotes configurability, determinism, performance, energy-efficiency (to some extent) and portability from the outset. Experimental results demonstrate that the implemented solution surpasses the state of the art, by providing a complete and agnostic solution which is independent of any specific RTOS, with only a small cost on hardware area. Appropriate benchmarking shows the benefits of the implemented solution on tests targeting FreeRTOS and μCOSII. http://hdl.handle.net/1822/40435
... In the current FPGA evaluation on Altera DE4 board, four VP cores are integrated into the Multi-Vector System. Two VP cores use 32 lanes, and An SPREE [16] scalar processor is used to execute the scalar operations of Multi-Vector System. The SPREE is a 3-stage MIPS pipeline with full forwarding core and has a 4K-bit branch history table for branch prediction. ...
Article
Full-text available
With the increase in the density and performance of digital electronics, the demand for power-efficient high-performance computing (HPC) systems has increased for embedded applications. Existing embedded HPC systems suffer from issues of programmability, scalability, and portability. Therefore, a parameterizable and programmable high-performance processor system architecture is required to execute embedded HPC applications. In this work, we propose an Embedded Multi Vector-core System (EMVS), which executes the embedded application by managing multiple vectorized tasks and their memory operations. The system is designed and ported on an Altera DE4 FPGA development board. The performance of EMVS is compared with the Heterogeneous Multi-Processing Odroid XU3, Parallela and GPU Jetson TK1 embedded systems. Compared with these embedded systems, the results show that EMVS improves application and system performance by 19.28 and 10.22 times, respectively, and consumes 10.6 times less energy.
... An SPREE [19] scalar processor is used to program the VESPA system and perform scalar operations. The SPREE is a 3-stage MIPS pipeline with full forwarding core and has a 4K-bit branch history table for branch prediction. ...
Article
Full-text available
To manage power and memory wall effects, the HPC industry supports FPGA reconfigurable accelerators and vector processing cores for data-intensive scientific applications. FPGA-based vector accelerators are used to increase the performance of high-performance application kernels. Adding more vector lanes does not improve performance if the processor/memory performance gap dominates. In addition, performance degrades if on/off-chip communication time becomes more critical than computation time. The system incurs multiple delays due to the application's irregular data arrangement and complex scheduling scheme. Therefore, just like generic scalar processors, all classes of vector machine, from vector supercomputers to vector microprocessors, are required to have data management and access units that improve the on/off-chip bandwidth and hide main memory latency. In this work, we propose an Advanced Programmable Vector Memory Controller (PVMC), which boosts noncontiguous vector data accesses by integrating descriptors of memory patterns, a specialized on-chip memory, a memory manager in hardware, and multiple DRAM controllers. We implemented and validated the proposed system on an Altera DE4 FPGA board. The PVMC is also integrated with an ARM Cortex-A9 processor on the Xilinx Zynq All-Programmable System on Chip architecture. We compare the performance of a system with vector and scalar processors without PVMC. When compared with a baseline vector system, the results show that the PVMC system transfers data sets up to 1.40x to 2.12x faster, achieves between 2.01x to 4.53x of speedup for 10 applications and consumes 2.56 to 4.04 times less energy.
... Many embedded systems are built using FPGA for research and product development, which increases the number of soft processor uses. While hard processors are available in modern FPGA, which are much faster than the soft processors and do not consume LUT, but they have a few drawbacks [2]. Firstly, the number of built-in hard processors in the FPGA may be sub-optimal or redundant for the application. ...
Preprint
RISC-V, an open instruction set architecture, is getting the attention of soft processor developers. Implementing only the basic 32-bit integer instruction set of RISC-V, defined as RV32I, might be satisfactory for embedded systems. However, multiplication and division instructions are not present in RV32I; they are instead defined in the M-extension. Several research projects have proposed both RV32I and RV32IM processors. However, there is no indication of how much performance can be improved by adding the M-extension to RV32I, that is, when we should consider adding the M-extension to a soft processor and how much the hardware resource requirements will increase. In this paper, we propose an extension of the RVCoreP soft processor (which implements only the RV32I instruction set) to support RISC-V M-extension instructions. A simple fork-join method is used to expand the execution capability to support M-extension instructions as well as possible future enhancements. We then benchmark the design using the Dhrystone, Coremark, and Embench programs. We found that RV32IM is 1.87 and 3.13 times better in performance for the radix-4 and DSP multipliers, respectively. In addition, our RV32IM implementation is 13% better than the equivalent RISC-V processor.
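As a rough illustration of why the M-extension matters, the Python sketch below models the kind of shift-and-add loop that software typically falls back on when a core such as RV32I has no multiply instruction. It is only a sketch under stated assumptions: the function name `soft_mul32` and the constant `MASK32` are hypothetical, and the code is neither the RVCoreP design nor any particular compiler's runtime routine.

```python
MASK32 = 0xFFFFFFFF  # hypothetical helper: keeps values within 32 bits


def soft_mul32(a: int, b: int) -> int:
    """Return the low 32 bits of a * b using only shifts, adds and bit tests."""
    a &= MASK32
    b &= MASK32
    product = 0
    while b:
        if b & 1:                      # lowest multiplier bit set: add the shifted multiplicand
            product = (product + a) & MASK32
        a = (a << 1) & MASK32          # next partial product
        b >>= 1                        # consume one multiplier bit
    return product
```

The loop runs once per significant bit of the multiplier, so a single multiplication can cost tens of instructions in software, whereas a hardware multiplier completes it in a few cycles.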
... The deparser compiler ( §4) is described in Python and generates synthesizable VHDL code for the proposed deparser architecture. The generated architecture leverages the inherent configurability of FPGAs to avoid hardware constructs that cannot be efficiently implemented on FPGAs, such as crossbars or barrel shifters [1,25]. The simulation environment is based on cocotb [11], which allows using several off-the-shelf Python packages, such as Scapy, to generate test cases. ...
Conference Paper
The P4 language has drastically changed the networking field as it allows to quickly describe and implement new networking applications. Although a large variety of applications can be described with the P4 language, current programmable switch architectures impose significant constraints on P4 programs. To address this shortcoming, FPGAs have been explored as potential targets for P4 applications. P4 applications are described using three abstractions: a packet parser, match-action tables, and a packet deparser, which reassembles the output packet with the result of the match-action tables. While implementations of packet parsers and match-action tables on FPGAs have been widely covered in the literature, no general design principles have been presented for the packet deparser. Indeed, implementing a high-speed and efficient deparser on FPGAs remains an open issue because it requires a large amount of interconnections and the architecture must be tailored to a P4 program. As a result, in several works where a P4 application is implemented on FPGAs, the deparser consumes a significant proportion of chip resources. Hence, in this paper, we address this issue by presenting design principles for efficient and high-speed deparsers on FPGAs. As an artifact, we introduce a tool that generates an efficient vendor-agnostic deparser architecture from a P4 program. Our design has been validated and simulated with a cocotb-based framework. The resulting architecture is implemented on Xilinx Ultrascale+ FPGAs and supports a throughput of more than 200 Gbps while reducing resource usage by almost 10× compared to other solutions.
... The FPGA based Generic Multi-Vector Processor architecture is presented in Figure 1. The architecture of FPGA based Generic Multi-Vector Processor is further subdivided into the Scalar core, the Scheduler and the Memory System. 1) Scalar Core: An SPREE [7] is integrated in the design to program the Multi-Vector Processor architecture and to execute general purpose scalar instructions. SPREE is a scalar RISC processor, having 3-stage MIPS pipeline architecture and branch prediction support by using a 4K-bit history table. ...
... The FPGA based Generic Multi-Vector System architecture is shown in Figure 1. The system architecture is further divided into the Scalar core, the Scheduler and the Memory System. 1) Scalar Core: An SPREE [11] scalar processor is used to program the Multi-Vector System and perform scalar operations. The SPREE is a 3-stage MIPS pipeline with full forwarding core and has a 4K-bit branch history table for branch prediction. ...
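For readers unfamiliar with the term, a branch history table can be sketched as an array of small saturating counters indexed by the branch address. The Python model below is illustrative only: the snippets give just the total size (4K bits), so the split into 2048 two-bit counters, the indexing by word-aligned program counter, and the class name `BranchHistoryTable` are all assumptions and not a description of the SPREE hardware.

```python
class BranchHistoryTable:
    """Hypothetical model: 2048 two-bit saturating counters (4K bits in total)."""

    def __init__(self, entries: int = 2048):
        self.entries = entries
        self.counters = [1] * entries  # start every entry at "weakly not taken"

    def _index(self, pc: int) -> int:
        # Assumes 4-byte instructions, so the two low PC bits carry no information.
        return (pc >> 2) % self.entries

    def predict(self, pc: int) -> bool:
        """Predict taken when the counter is in one of its two upper states."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Move the counter one step toward the actual outcome, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

With two-bit counters, a single mispredicted iteration (for example a loop exit) does not immediately flip the prediction for a branch that is usually taken.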
Data
With the increase in FPGA density and performance, the demand for multiple high-performance computing (HPC) units has increased in various scientific and technological fields. Multi-scalar processor architectures do not give the best performance on FPGAs for HPC applications. This performance degradation demands a parameterizable high-performance processor architecture to process HPC applications. In this article, we propose an FPGA-based Multi-Vector Processor Architecture by integrating an efficient scheduler into an existing Programmable Vector Memory Controller (PVMC). The proposed design, known as the Multi-Vector Processor Architecture (MVPA), proficiently handles multiple vectorized tasks and their data movements. The system is tested on an Altera FPGA DE4 development board. The MVPA system results are compared with a generic Multi-Vector Processor and multi-scalar core systems. The results show that the MVPA system handles computation tasks more efficiently and improves system performance between 8.1x to 30.1x and 1.99x to 4.31x against scalar multi-core and generic Multi-Vector Processor systems respectively for 10 applications.
... Then control signals created by controllers transfer through the pipeline. But in distributed decoding mode, controllers are distributed across multiple stages, and each controller only creates the control signals related to the core units in the same stage [9,10]. In this paper, the supply-matching method uses a hybrid decoding mode, which means Tuse uses centralized decoding mode and Tnew uses distributed decoding mode. ...
Article
Full-text available
In order to improve processor throughput, pipelining is widely used to implement instruction-level parallelism. However, this technique also leads to data hazards, which have a great influence on performance. This paper proposes a method called supply-matching to detect and resolve data hazards efficiently. The logic of bypassing and stalling can be easily realized through this method. Furthermore, an RTL description of instructions is also introduced in this paper to reduce resource utilization. The case study was conducted on a five-stage microprocessor based on the PowerPC architecture with different approaches. Experimental results show that our method requires fewer resources and achieves better performance.
Article
Parameterized components are becoming more commonplace in system design. The process of customizing parameter values for a particular application, called tuning, can be a challenging task for a designer. Here we focus on the problem of tuning a parameterized soft-core microprocessor to achieve the best performance on a particular application, subject to size constraints. We map the tuning problem to a well-established statistical paradigm called design of experiments (DoE), which involves the design of a carefully selected set of experiments and a sophisticated analysis that has the objective to extract the maximum amount of information about the effects of the input parameters on the experiment. We apply the DoE method to analyze the relation between input parameters and the performance of a soft-core microprocessor for a particular application, using only a small number of synthesis/execution runs. The information gained by the analysis in turn drives a soft-core tuning heuristic. We show that using DoE to sort the parameters in order of impact results in application speedups of 6x-17x versus an un-tuned base soft-core. When compared to a previous single-factor tuning method, the DoE-based method achieves 3x-6x application speedups, while requiring about the same tuning runtime. We also show that tuning runtime can be reduced by 40-45% by using predictive tuning methods already built into a DoE tool.
Article
This paper discusses ongoing research into the design and development of custom configurable hardware for use with file formats that to date require special software for access, with the goal of improved system performance and flexibility. In particular, a field programmable gate array media player is introduced for use with RealMedia files. Preliminary results are promising, and the successful completion of this research will lead to the study and design of many more of these types of systems devices.
Conference Paper
State of the art FPGAs allow the implementation of small to medium sized Systems-on-Chip (SoCs) where configurability is key in order to achieve design goals. Thus, SoCs are frequently designed around soft extensible processors, which provide a tradeoff between design flexibility and fast time to market. This paper presents the impact of micro-architectural features on several design metrics of a multithreading extensible processor. Using the MiBench benchmark, it is shown how Custom Computational Units (CCUs) can significantly increase performance while providing lower power solutions than software-only implementations. An efficient architecture that facilitates the insertion of CCUs is described and the effect of multithreading and thread scheduling policies on the design metrics is also demonstrated. Results show that multithreading policies can have positive impact on key parameters (e.g., up to 20% increase on performance and up to 10% energy savings in the given application), depending on application characteristics as well as micro-architectural features.
Article
Field-programmable gate arrays (FPGAs) are increasingly used to implement embedded digital systems; however, the hardware design necessary to do so is time-consuming and tedious. The amount of hardware design can be reduced by employing a microprocessor for less-critical computation in the system. Often this microprocessor is implemented in the FPGA's reprogrammable fabric as a soft processor; such processors presently have simple architectures and moderate performance. Our goal is to scale the performance of existing soft processors, hence expanding their suitability to more critical computation. To this end we propose extending soft processors with vector extensions to exploit the abundant data parallelism found in many embedded kernels. Such a soft vector processor can execute these kernels much faster than a single core, hence reducing the need for hardware implementations. We observe this improved execution speed through experimentation with the vector extended soft processor architecture (VESPA), which is designed, implemented, and evaluated on real FPGA hardware. VESPA is shown to effectively scale performance up to 32 lanes, while providing substantial architectural flexibility to create a fine-grained design space. With these characteristics, and portability across FPGA devices, soft vector processors can provide exact-fit architectures which can efficiently and more easily implement data parallel workloads over custom FPGA hardware design.
Conference Paper
Full-text available
Energy consumption is an issue in information and communication technologies (ICT) in general, and in cellular communications in particular. Base station sites are the main energy consumers in a mobile network. Their energy consumption contributes significantly to greenhouse gas (CO2) emissions. Hence, service providers are looking for energy-saving solutions for their existing and future deployments, which in turn reduce the CO2 footprint and operational costs. Few studies have assessed the energy requirements of the baseband functions of a base station (eNodeB), because their contribution to energy consumption is low compared with more energy-hungry components such as power amplifiers, the analog front end, and protocols. In this paper, we assess energy consumption in the MIMO-OFDM baseband functions of an LTE eNodeB on a typical multi-core processor. In particular, implementing such computationally complex algorithms energy-efficiently poses challenges such as selecting less complex algorithms, architecture selection, algorithm-architecture mapping, and energy management strategies. We give guidelines for addressing these challenges.
Conference Paper
Full-text available
In this article, we present a method for learning computer architecture that is founded on problem-based learning and soft processors. Our main goal is to propose and develop a method that stimulates, facilitates, and optimizes the process of learning computer architecture. To verify the method, we proposed the design and development of processors for wireless sensor network (WSN) nodes. We conclude that, given the opportunity to manipulate a real processor and change its characteristics, students will feel more interested and motivated, which will make it easier to learn computer architecture concepts. The contribution of this article is the creation of a method that motivates, facilitates, and improves the quality of learning.
Conference Paper
While Coarse-Grained Reconfigurable Architectures (CGRAs) are very efficient at handling regular, compute-intensive loops, their weakness at control-intensive processing and the need for frequent reconfiguration require another processor, for which usually a main processor is used. To minimize the overhead arising in such collaborative execution, we integrate a dedicated sequential processor (SP) with a reconfigurable array (RA), where the crucial problem is how to share the memory between SP and RA while keeping the SP's memory access latency very short. We present a detailed architecture, control, and program example of our approach, focusing on our optimized on-chip shared memory organization between SP and RA. Our preliminary results demonstrate that our optimized memory architecture is very effective in reducing kernel execution times (23.5% compared to a more straightforward alternative), and our approach can reduce the RA control overhead and other sequential code execution time in kernels significantly, resulting in up to 23.1% reduction in kernel execution time, compared to the conventional system using the main processor for sequential code execution.
Conference Paper
Modern computing systems increasingly consist of multiple processor cores. From cell phones to datacenters, multicore computing has become the standard. At the same time, our understanding of the performance impact resource sharing has on these platforms is limited, and therefore, prevents these systems from being fully utilized. As the capacity of FPGAs has grown, they have become a viable method for emulating architecture designs as they offer increased performance and visibility into runtime behaviour compared to simulation. With future systems trending towards asymmetric and heterogeneous systems, and thus further increasing complexity, a framework that enables research in this area is highly desirable. In this work, we present PolyBlaze: a multicore MicroBlaze based system with Linux Symmetric Multi-Processor (SMP) support on an FPGA. Starting with a single-core, Linux supported, MicroBlaze we detail the changes to the platform, both in hardware and software, required to bring Linux SMP support to the MicroBlaze. We then outline the series of tests performed on our platform to demonstrate both its stability (e.g. more than two weeks of up time) and scalability (up to eight cores on an FPGA, with resource usage increasing linearly with the number of cores).
Conference Paper
Full-text available
In this work, we propose a Programmable Vector Memory Controller (PVMC), which boosts noncontiguous vector data accesses by integrating descriptors of memory patterns, a specialized local memory, a memory manager in hardware, and multiple DRAM controllers. We implemented and validated the proposed system on an Altera DE4 FPGA board. We compare the performance of our proposal with a vector system without PVMC as well as a scalar only system. When compared with a baseline vector system, the results show that the PVMC system transfers data sets up to 2.2x to 14.9x faster, achieves between 2.16x to 3.18x of speedup for 5 applications and consumes 2.56 to 4.04 times less energy.
Article
In this paper we propose a multi-channel speech pickup system that enhances call quality in hands-free communication, using the ALTERA Nios-II processor. The multi-channel speech pickup system uses a delay-and-sum beamformer with a zero-padding interpolator. The system is implemented on the Nios-II processor and processes I/O data in real time. The results of the proposed embedded speech pickup system agree well with those of a computer simulation (MATLAB) and a conventional DSP processor (TMS320C6711). The proposed method is more effective than previous methods in cost and design time. The hardware uses 3,649 of 5,980 logic elements (LEs) on the chip (61%).
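The delay-and-sum idea named in this abstract can be sketched in a few lines of Python: each microphone channel is delayed so that sound from the target direction lines up across channels, and the aligned channels are averaged. The sketch assumes whole-sample delays that are already known, so it omits the zero-padding interpolator the abstract mentions; the function name `delay_and_sum` and its parameters are hypothetical and not taken from the paper.

```python
def delay_and_sum(channels, delays):
    """Average equal-length channels after delaying each by a whole number of samples.

    channels: list of sample lists, one per microphone (all the same length).
    delays:   list of non-negative integer delays, one per channel (assumed known).
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for i in range(d, n):
            out[i] += samples[i - d]   # sample i of the delayed channel is sample i - d of the input
    return [v / len(channels) for v in out]


# Usage: the same impulse reaches microphone 1 two samples before microphone 0,
# so delaying channel 1 by two samples aligns the two copies.
aligned = delay_and_sum([[0, 0, 1, 0, 0], [1, 0, 0, 0, 0]], [0, 2])
```

Signals arriving from the steered direction add coherently, while sound from other directions is misaligned and partly averages out.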
Article
This paper compares the delay and area of a comprehensive set of processor building block circuits when implemented on custom CMOS and FPGA substrates, then uses these results to show how soft processor microarchitectures should be different from those of hard processors. We find that the ratios of the custom CMOS versus FPGA area for different building blocks varies considerably more than the speed ratios, thus, area ratios have more impact on microarchitecture choices. Complete processor cores on an FPGA use 17-27× more area (“area ratio”) than the same design implemented in custom CMOS. Building blocks with dedicated hardware support on FPGAs such as SRAMs, adders, and multipliers are particularly area-efficient (2-7×), while multiplexers and content-addressable memories (CAM) are particularly area-inefficient (>100×). Applying these results, we find out-of-order soft processors should use physical register file organizations to minimize CAM size.
Article
In this work, we present PolyBlaze, a scalable and configurable multicore platform for FPGA-based embedded systems and systems research. PolyBlaze is an extension of the MicroBlaze soft processor, leveraging the configurability of the MicroBlaze and bringing it into the multicore era with Linux Symmetric Multi-Processor (SMP) support. This work details the hardware modifications required for the MicroBlaze processor and its software stack to enable fully validated SMP operations, including atomic operation support, shared interrupts and timers, and exception handling. New in this work, we present a scalable and flexible memory hierarchy optimized for Field Programmable Gate Arrays (FPGAs), which manages atomic operations and provides support for future flexible memory hierarchies and heterogeneous systems. Also new is an in-depth analysis of key performance characteristics, including memory bandwidth, latency, and resource usage. For all system configurations, bandwidth is found to scale linearly with the addition of processor cores until the memory interface is saturated. Additionally, average memory latency remains constant until the memory interface is saturated; after which, it scales linearly with each additional processor core.
Article
Full-text available
This paper describes the Altera Stratix logic and routing architecture. The primary goals of the architecture were to achieve high performance and logic density. We give an overview of the entire device, and then focus on the logic and routing architecture. The Stratix logic architecture is based on a cluster of ten 4-input LUTs and its routing consists of staggered routing lines. We describe the development of the routing architecture, including its directional bias and its direct-drive routing, which reduces both area and delay. The logic array block and logic cell design are also described, along with new routing structures within the logic array block and logic element features.
Conference Paper
Full-text available
In recent years the challenge of high-performance, low-power retargetable embedded systems has been faced with different technological and architectural solutions. In this paper we present a new configurable unit explicitly designed to implement additional reconfigurable pipelined datapaths, suitable for the design of reconfigurable processors. A VLIW reconfigurable processor has been implemented on silicon in a standard 0.18 μm CMOS technology to prove the effectiveness of the proposed unit. Testing on a signal processing algorithms benchmark showed speedups from 4.3x to 13.5x and energy consumption reduction up to 92%.
Conference Paper
Full-text available
This paper describes the Altera Stratix II™ logic and routing architecture. This architecture features a novel adaptive logic module (ALM) that is based on a 6-LUT, but can be partitioned into two smaller LUTs to efficiently implement circuits containing a range of LUT sizes that arises in conventional synthesis flows. This provides a performance increase of 15% in the Stratix II architecture while reducing area by 2%. The ALM also includes a more powerful arithmetic structure that can perform two bits of arithmetic per ALM, and perform a sum of up to three inputs. The routing fabric adds a new set of fast inputs to the routing multiplexers for another 3% improvement in performance, while other improvements in routing efficiency cause another 6% reduction in area. These changes in combination with other circuit and architecture changes in Stratix II contribute 27% of an overall 51% performance improvement (including architecture and process improvement). The architecture changes reduce area by 10% in the same process, and by 50% after including process migration.
Conference Paper
Full-text available
In this paper, the effectiveness of the ASIP (Application Specific Instruction set Processor) design system PEAS-III is evaluated through experiments. Examples in experiments are a MIPS R3000 compatible processor, DLX, a simple RISC controller, and PEAS-I core. While they are simple in-order pipelined processors, they have enough facilities for real embedded system design. Through experiments, easiness of design and modification for improvement and design quality in terms of performance and hardware cost are discussed. It has been confirmed that the design method used in PEAS-III is effective to design space exploration for simple pipelined processors.
Conference Paper
Full-text available
The development of application specific instruction set processors comprises several design phases: architecture exploration, software tools design, system verification and design implementation. The LISA processor design platform (LPDP) based on machine descriptions in the LISA language provides one common environment for these design phases. Required software tools for architecture exploration and application development can be generated from one sole specification. This paper focuses on the implementation phase and the generation of synthesizable HDL code from a LISA model. The derivation of the architectural structure, decoder and even approaches for the implementation of the data path are presented. Moreover, the synthesis results of a generated and a handwritten implementation of a low-power DVB-T post processing unit are compared.
Article
Full-text available
Architecture description languages are widely used to perform architecture exploration for application-driven designs, whereas the RT-level is the commonly accepted level for hardware implementation. For this reason, design parameters such as timing, area or power consumption cannot be taken into consideration accurately during design space exploration. Design automation tools currently used to bridge this gap are either limited in the flexibility provided or only generate fragments of the architecture. This paper presents a synthesis tool which preserves the full flexibility of the architecture description language LISA, while being able to generate the complete architecture on RT-level using SystemC. This paper also presents two real world architecture case studies to prove the feasibility of our approach.
Article
Full-text available
This paper discusses the design of a MIPS-I processor core using VHDL. The control structure of this processor is distributed, with a small controller in each pipeline stage controlling sequencing of operations and communication with adjacent pipeline stages. Instruction flow management is performed using asynchronous communication signals. Due to its high-level description and distributed control structure, the core can easily be extended. Thus, instruction set extension hardware/software co-evaluation can be performed efficiently using rapid prototyping.
Article
As embedded systems continue to face increasingly higher performance requirements, deeply pipelined processor architectures are being employed to meet desired system performance. System architects critically need modeling techniques to rapidly explore and evaluate candidate architectures based on area, clock frequency, power, and performance constraints. We present an exploration framework for pipelined processors. We use the EXPRESSION Architecture Description Language (ADL) to capture a wide spectrum of processor architectures. The ADL has been used to enable performance driven exploration by generating a software toolkit from the ADL specification. In this paper, we present how to automatically generate synthesizable RTL from the ADL specification using a functional abstraction technique. Automatic generation of RTL enables rapid exploration of candidate architectures under given design constraints such as area, clock frequency, power, and performance. Our exploration results demonstrate the power of reuse in composing heterogeneous architectures using functional abstraction primitives allowing for a reduction in the time for specification and exploration by at least an order of magnitude.
Article
This paper gives an overview of methods used for design space exploration (DSE) of micro-architectures and systems. The DSE problem generally considers two orthogonal issues: (I) How can a single design point be evaluated, (II) how can the design space be covered during the exploration process? The latter question arises since an exhaustive exploration of the design space is usually prohibitive due to the sheer size of the design space. We explain trade-offs linked to the choice of appropriate evaluation and coverage methods. The designer has to balance the following issues: the accuracy of the evaluation, the time it takes to evaluate one design point (including the implementation of the evaluation model), the precision/granularity of the design space coverage, and, last but not least, the possibilities for automating the exploration process. We also summarize common representations of the design space and compare current system and micro-architecture level design frameworks. This review eases the choice of a decent exploration policy by providing a comprehensive survey and classification of recent related work. It is focused on system-on-a-chip designs, particularly those used for network processors. These systems are heterogeneous in nature using multiple computation, communication, memory, and peripheral resources.
Conference Paper
The Arithmetic-Logic-Unit (ALU) is at the heart of a modern microprocessor, and its size and speed are often significant contributors to the overall processor's cost and performance. This paper presents the design of the ALU used in Altera's NIOS 2.0 soft processor implemented on Altera's Apex 20KE FPGA architecture. This ALU enabled the 32-bit NIOS 2.0 to consume only 1200 LEs and run at 85 MHz. This is a 50% size reduction and 70% speed improvement over its predecessor, NIOS 1.1. The Logic-element (LE) is the basic building block within the Apex architecture. Making full use of the advanced features of the LE has resulted in this novel ALU design. A functional representation of the logic is used to describe how the ALU performs the core set of NIOS instructions, and an LE representation shows the amount of logic-resources needed for the implementation. The cost of additional features such as a barrel-shifter and custom instructions is also described. Likely worst-case delays for different routing and logic elements are used to estimate the ALU's speed. Further speed and size optimizations are also presented from which it is possible to create ALUs ranging in speed from 87 MHz to over 100 MHz.
Conference Paper
Summary form only given. Altera's SOPC Builder Tool enables engineers to create tailor-made systems in an FPGA with a short development cycle; one of the most popular components in SOPC Builder is Altera's NIOS II processor. As well as ease of use and flexibility, the NIOS II family of processors offers up to 200 DMIPS of performance and can cost as little as 35 cents worth of programmable logic. This high level of performance has been achieved by tailoring the processor architecture to fully exploit the FPGA resources used. Logic, registers, memory, and multipliers have different relative costs in an FPGA when compared to an ASIC; in an FPGA, registers and memories are relatively cheap, whereas logic and, in particular, the implementation of multiplexers can be of relatively high cost. These cost differences have an influence on how engineers should design for FPGAs, and defined the design of NIOS II at the architectural level. This paper presents some novel techniques for implementing multiplexers and barrel-shifters efficiently, using the NIOS II processor as an example. These techniques are useful for improving FPGA designs in general, and have typically led to area reductions and performance improvements of 20%.
Conference Paper
In this paper an architectural level processor design environment PEAS-III is proposed. Pipelined processors designed by this system can include multi-cycle operation, delayed branch and external interrupt. The data path and control logic of the processor are generated from the clock based micro-operation description of instructions. The ease of large design space exploration and effectiveness of the system have been confirmed through experiments using several subsets of MIPS R3000 instruction set.
Conference Paper
This paper discusses the design of a MIPS-I processor kernel using VHDL. The control structure of this processor is distributed, with a small controller in each pipeline stage controlling sequencing of operations and communication with adjacent pipeline stages. Instruction flow management is performed using asynchronous communication signals. Due to its high-level description and distributed control structure, the kernel can easily be extended. Thus, instruction set extension hardware/software co-evaluation can be performed efficiently using rapid prototyping.
Article
To find the best designs, architects must rapidly simulate many design alternatives and have confidence in the results. Unfortunately, the most prevalent simulator construction methodology, hand-writing monolithic simulators in sequential programming languages, yields simulators that are hard to retarget, limiting the number of designs explored, and hard to understand, instilling little confidence in the model. Simulator construction tools have been developed to address these problems, but analysis reveals that they do not address the root cause, the error-prone mapping between the concurrent, structural hardware domain and the sequential, functional software domain. This paper presents an analysis of these problems and their solution, the Liberty Simulation Environment (LSE). LSE automatically constructs a simulator from a machine description that closely resembles the hardware, ensuring fidelity in the model. Furthermore, through a strict but general component communication contract, LSE enables the creation of highly reusable component libraries, easing the task of rapidly exploring ever more exotic designs.
Altera Corporation. Private Communication
  • R Cliff