International Journal of Embedded Systems

Published by Inderscience
Online ISSN: 1741-1076
Print ISSN: 1741-1068
We describe a cryptographic processor for elliptic curve cryptography (ECC). ECC is evolving as an attractive alternative to other public-key schemes such as RSA by offering the smallest key size and the highest strength per bit. The processor performs point multiplication for elliptic curves over binary polynomial fields GF(2^m). In contrast to other designs that only support one curve at a time, our processor is capable of handling arbitrary curves without requiring reconfiguration. More specifically, it can handle both named curves as standardized by NIST as well as any other generic curves up to a field degree of 255. Efficient support for arbitrary curves is particularly important for the targeted server applications that need to handle requests for secure connections generated by a multitude of heterogeneous client devices. Such requests may specify curves which are infrequently used or not even known at implementation time. Our processor implements 256-bit modular multiplication, division, addition and squaring. The multiplier constitutes the core function as it executes the bulk of the point multiplication algorithm. We present a novel digit-serial modular multiplier that uses a hybrid architecture to perform the reduction operation needed to reduce the multiplication result: hardwired logic is used for fast reduction of named curves and the multiplier circuit is reused for reduction of generic curves. The performance of our FPGA-based prototype, running at a clock frequency of 66.4 MHz, is 6955 point multiplications per second for named curves over GF(2^163) and 3308 point multiplications per second for generic curves over GF(2^163).
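As a hedged illustration of the field arithmetic this abstract centres on, the following Python sketch multiplies two elements of GF(2^m) and reduces the product with the NIST B-163 polynomial. The bit-serial loop is an illustrative simplification, not the paper's digit-serial, hybrid-reduction datapath, which also accepts arbitrary reduction polynomials for generic curves.

```python
# Minimal software model of modular multiplication in GF(2^163), assuming the
# NIST B-163 reduction polynomial x^163 + x^7 + x^6 + x^3 + 1.
M = 163
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # irreducible polynomial

def gf2m_mul(a: int, b: int) -> int:
    """Carry-less multiply of two field elements, then reduce mod POLY."""
    prod = 0
    while b:                      # schoolbook polynomial multiplication
        if b & 1:
            prod ^= a
        a <<= 1
        b >>= 1
    for bit in range(prod.bit_length() - 1, M - 1, -1):
        if prod >> bit & 1:       # cancel every bit at position >= M
            prod ^= POLY << (bit - M)
    return prod

assert gf2m_mul(1, 0b1011) == 0b1011                 # multiplying by 1 is identity
assert gf2m_mul(1 << 162, 0b10).bit_length() <= M    # result stays in the field
```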
 
Summary form only given. We present efficient solutions for the noncontiguous linear placement of data paths for reconfigurable fabrics. A strip-based architecture is assumed for the reconfigurable fabric. A preorder tree-expression or a general graph is placed in a strip, which can have active and/or inactive preplaced cores representing blockages and/or cores available for reuse. Two very efficient algorithms are proposed to solve the simpler problem of noncontiguous placement with blockages but without core reuse for tree graphs. The linear ordering obtained with any of the above algorithms is used as input for a third efficient algorithm to solve the problem of noncontiguous placement with both active and inactive cores. A fourth algorithm is proposed to solve the problem of noncontiguous placement with both core and connectivity reuse. Simulation results are reported.
 
Microprocessors are designed with a fixed set of microarchitectural resources. However, resource requirements vary across programs and within a program as it passes through different phases of execution. This mismatch between the microarchitecture and program requirements leads to sub-optimal power/performance. We present an adaptive microarchitecture that dynamically adapts to changing program requirements in order to achieve power efficiency with minimal performance loss. The microarchitecture employs four multi-configuration units which are controlled by a phase-based tuning algorithm. We show, via simulation, that the best-performing tuning algorithm achieves a significant reduction in leakage power with a performance loss of only ~1%.
 
This paper presents a hardware/software co-design method for the implementation of an AAC audio decoder. The approach not only accounts for the characteristics of the algorithm but also provides numerical criteria for evaluating the alternative implementations. The overall system is first analysed and profiled with the ARM profiler. Based on the profiling results, the decoder is then partitioned into a software part and a hardware part. The software part implements the decision-intensive operations needed for parsing audio bitstreams. The hardware part is a dedicated unit for the regular, computation-intensive operations in AAC audio decoding.
 
We show in this paper how to contract the TPN state space into a graph that captures all its CTL* properties. This graph, called the Atomic State Class Graph (ASCG), is finite if and only if the model is bounded. To achieve this objective, we use a refinement technique similar to those proposed in Berthomieu and Vernadat (2003) and Yoneda and Ryuba (1998). In such a technique, an intermediate contraction of the TPN state space is first built and then refined until CTL* properties are restored. Compared with the approaches in Berthomieu and Vernadat (2003) and Yoneda and Ryuba (1998), we use inclusion abstraction during all phases of the construction process while reducing the complexity of computations. Our approach allows us to construct smaller ASCGs in shorter times (more than five times faster in certain cases).
 
This paper discusses the design and implementation of a new generation of digital stethoscopes capable of collecting and processing body sound without the need for a personal computer and a hardware interface. The cost of the proposed device is a fraction of that of the data acquisition system used with current digital stethoscopes to collect body sound in a digital format. The new design uses system-on-chip technology and hardware-software co-design to integrate all the functions needed by this application into a single field programmable gate array (FPGA). The new design strategy saves hardware, space, and power consumption. It also allows for signal processing and data interpretation in the same device. The body sound device has been implemented and tested. Its performance compares very favourably to that of existing PC-based digital stethoscopes.
 
The source code analysis technique of graph-based program slicing is extended to model interactions across the hardware-software boundary in the context of embedded systems. Specifically, this work proposes: a set of inter-process dependences to model software interacting with hardware; an asynchronous concurrency representation of dependences present in embedded systems; an algorithm to compute context-sensitive slices that can transitively follow dependences from software, through hardware, and back to software. A prototype tool applies the proposed worklist algorithm to several test cases. Additionally, a detailed, step-by-step example demonstrates its operation on a device driver interacting with hardware.
 
Scratch Pad Memories (SPMs) have received considerable attention lately as on-chip memory building blocks. The main characteristic that distinguishes an SPM from a conventional cache memory is that the data flow is controlled by software. The main focus of this paper is the management of an SPM space shared by multiple applications that can potentially share data. The proposed approach has three major components: a compiler analysis phase, a runtime space partitioner, and a local partitioning phase. Our experimental results show that the proposed approach leads to the minimum completion time among all the alternative memory partitioning schemes tested.
 
Behavioural synthesis has received considerable attention recently and new action-oriented hardware specification formalisms have been proposed. We call such formalisms Concurrent Action Oriented Specifications (CAOS). CAOS models have low-granularity concurrent atomic action descriptions with a semantics similar to Dijkstra's guarded command language. Such models have been shown to generate efficient hardware designs, and the importance of making the CAOS-based synthesis process power-aware cannot be overstated. In this paper, we formulate the problems of power-optimal synthesis for CAOS, discuss several heuristics and show some numerical examples illustrating the use of such heuristics during CAOS-based synthesis.
 
Energy efficiency has become a primary design criterion for mobile multimedia devices. Prior work has proposed saving energy through coordinated adaptation in multiple system layers, in response to changing application demands and system resources. The scope and frequency of adaptation pose a fundamental conflict in such systems. The Illinois GRACE project addresses this conflict through a hierarchical solution which combines (1) infrequent (expensive) global adaptation that optimizes energy for all applications in the system and (2) frequent (cheap) per-application (or per-app) adaptation that optimizes for a single application at a time. This paper demonstrates the benefits of the hierarchical adaptation through a second-generation prototype, GRACE-2. Specifically, it shows that in a network-bandwidth-constrained environment, per-app adaptation yields significant energy benefits over and above global adaptation.
 
With recent developments in compression and network technology, streaming media has been adopted on the internet and on intranets, and streaming technology will play a key role in the future development of fast networks. Original video/audio data are pre-compressed by a video/audio compression algorithm and stored on storage devices. When a client issues a request, the stream server retrieves the video/audio data from the storage devices and delivers it over the network. This paper discusses a simple and effective real-time scheduling algorithm, the adaptive Layer-Based Least-Laxity-First (LB-LLF) scheduling algorithm, to improve the output quality of video/audio on a network and to achieve a synchronised playback effect. The proposed algorithm considers real-time constraints, the unequal priorities of the layers of a scalable media stream, and a good trade-off between coding efficiency and drifting error. This guarantees effective use of the available channel bandwidth and better playback quality at the client.
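The laxity-driven choice at the core of such a scheduler can be sketched as follows; the StreamJob fields, the tie-break on layer index, and pick_next are illustrative assumptions, not the paper's exact LB-LLF rule, which additionally adapts to the layered structure of the stream.

```python
# A hedged sketch of least-laxity-first selection among pending stream jobs.
from dataclasses import dataclass

@dataclass
class StreamJob:
    deadline: float   # absolute playback deadline (s)
    remaining: float  # transmission time still needed (s)
    layer: int        # 0 = base layer (most important)

def pick_next(jobs: list[StreamJob], now: float) -> StreamJob:
    """Send the job with the least laxity; prefer lower layers on ties."""
    # Laxity = time left until the deadline minus time needed to finish.
    return min(jobs, key=lambda j: (j.deadline - now - j.remaining, j.layer))

jobs = [StreamJob(deadline=1.0, remaining=0.3, layer=1),
        StreamJob(deadline=0.8, remaining=0.3, layer=0)]
assert pick_next(jobs, now=0.0).layer == 0   # the tighter laxity wins
```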
 
In recent years there has been significant interest in the area of Hardware-based Genetic Algorithms (HGA) implemented using a Field Programmable Gate Array (FPGA). This paper presents a hardware-based genetic optimiser applied to tuning an adaptive antenna receiver. The proposed architecture employs a combination of pipelining and parallelisation to achieve significant speedups over a software implementation. The proposed HGA is implemented on a prototyping board with a Xilinx Virtex-E FPGA and reaches a speedup factor of more than 500 compared to the software implementation.
 
Existing compilation techniques for coarse-grained reconfigurable arrays are closely related to approaches from the DSP world. These approaches employ several loop transformations, like pipelining or temporal partitioning, but they are not able to exploit the full parallelism of a given algorithm and the computational potential of a typical 2-dimensional array. In this paper: we present an overview of constraints which have to be considered when mapping applications to coarse-grained reconfigurable arrays; we present our design methodology for mapping regular algorithms onto massively parallel arrays which is characterised by loop parallelisation in the polytope model; and, in a first case study, we adapt our design methodology for targeting reconfigurable arrays. The case study shows that the presented regular mapping methodology may lead to highly efficient implementations taking into account the constraints of the architecture.
 
Embedded real-time systems must satisfy not only logical functional requirements but also para-functional properties such as timeliness, Quality of Service (QoS) and reliability. We have developed a model-based tool called Time Weaver which enables the modeling of functional and para-functional behaviors of real-time systems. It also performs automated schedulability analysis, and generates glue code to integrate the final runtime executable for the system. Its extensive glue code generation capabilities include the ability to insert inter-processor communications code at arbitrary software boundaries. In other words, from a functional point of view, a software component may be viewed as a single logical entity but from the tool point of view, the component can be partitioned into two or more pieces running on different nodes. This capability opens up many different possibilities to map (partitioned) software components to hardware nodes. The objective of this deployment is to minimize hardware requirements while satisfying the timing constraints of the software. The classical approach to addressing this problem is to use bin-packing techniques. In this paper, we study Partitioning Bin Packing, an extension to bin-packing algorithms to exploit the capability of partitioning software modules into smaller pieces. We analytically show that the number of bins required can be reduced. We also evaluate a number of heuristics to minimize not only the number of processors (bins) needed but also the network bandwidth required by communicating software modules that are partitioned across different processors. We find that a significant reduction in the number of bins is possible. Finally, we show how different heuristics lead to different tradeoffs in processing vs network needs.
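To see why partitioning can reduce the bin count, consider this simplified first-fit variant in which an item may spill into a new bin once the current ones are full; the greedy split rule is an assumption for illustration, and the paper's heuristics additionally charge network bandwidth for pieces that communicate across processors.

```python
# A hedged sketch of first-fit bin packing with splittable items (modules).
def partitioning_first_fit(sizes: list[float], capacity: float) -> list[list[float]]:
    bins: list[float] = []          # used capacity per bin
    layout: list[list[float]] = []  # pieces placed in each bin
    for size in sorted(sizes, reverse=True):
        while size > 1e-9:
            for i, used in enumerate(bins):
                if used < capacity - 1e-9:          # first bin with free space
                    piece = min(size, capacity - used)
                    bins[i] += piece
                    layout[i].append(piece)
                    size -= piece
                    break
            else:                                   # no space anywhere: new bin
                bins.append(0.0)
                layout.append([])
    return layout

# Splitting lets three items of size 0.6 fit in two unit bins instead of three.
assert len(partitioning_first_fit([0.6, 0.6, 0.6], 1.0)) == 2
```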
 
Block ciphers are used to encrypt data and provide data confidentiality. For interoperability reasons, it is desirable to support a variety of block ciphers efficiently. Of the basic operations in block ciphers, only bit permutation is very slow on existing processors, followed by integer multiplication. Although new permutation instructions proposed recently can accelerate bit permutations in general-purpose processors, reducing the number of instructions needed to achieve an arbitrary n-bit permutation from O(n) to O(log2 n), the data dependency between permutation instructions prevents them from being executed in fewer than log2 n cycles, even on superscalar processors. Since Application-Specific Instruction-Set Processors (ASIPs) have fewer constraints on maintaining standard processor datapath and control conventions, six alternative ASIP approaches are proposed in this paper to achieve arbitrary 64-bit permutations in one or two cycles without increasing the cycle time. These approaches use new BFLY and IBFLY instructions. We also compare these approaches and their efficiency in performing arbitrary 64-bit permutations.
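For intuition, one butterfly stage can be modelled in software as a set of conditional swaps of bit pairs at a fixed distance; a cascade of log2(64) = 6 such stages plus the inverse-butterfly stages realises any 64-bit permutation. The function below is an illustrative model only, and its control-bit layout is an assumption rather than the encoding of the proposed BFLY instruction.

```python
# A hedged software model of one butterfly (BFLY) stage on a 64-bit word.
N = 64

def bfly_stage(x: int, stage: int, ctrl: int) -> int:
    """Conditionally swap bit pairs at distance d = 2**stage."""
    d = 1 << stage
    out = x
    pair = 0                              # index of the current pair's control bit
    for base in range(0, N, 2 * d):       # each block of 2*d bits
        for i in range(base, base + d):   # d independent pairs per block
            if ctrl >> pair & 1:
                lo, hi = out >> i & 1, out >> (i + d) & 1
                if lo != hi:              # swap by toggling both positions
                    out ^= (1 << i) | (1 << (i + d))
            pair += 1
    return out

assert bfly_stage(0b01, stage=0, ctrl=1) == 0b10   # swaps bits 0 and 1
```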
 
Field-programmable analog arrays (FPAAs) provide a method for rapidly prototyping analog systems. While currently available FPAAs vary in architecture and interconnect design, they are often limited in size and flexibility. For FPAAs to be as useful and marketable as modern digital reconfigurable devices, new technologies must be explored to provide area-efficient, accurately programmable analog circuitry that can be easily integrated into a larger digital/mixed-signal system. By leveraging recent advances in floating-gate transistors, a new generation of FPAAs is achievable that will dramatically advance the current state of the art in terms of size, functionality, and flexibility.
 
Model-Driven Development (MDD) is an emerging paradigm that uses Domain-Specific Modelling Languages (DSMLs) to provide 'correct-by-construction' capabilities for many software development activities. This paper describes an MDD tool suite called Component Synthesis using Model-Integrated Computing (CoSMIC), a collection of DSMLs that support the development, configuration, deployment, and validation of component-based DRE systems. We also describe how we have applied CoSMIC to an avionics mission computing application to resolve key component-based DRE system development challenges. Our results show that the design-, deployment- and Quality Assurance (QA)-time capabilities provided by CoSMIC help to eliminate key complexities associated with development of QoS-enabled component middleware applications.
 
Today's embedded computing applications are characterised by increased functionality, and hence increased design complexity and processing requirements. The resulting design spaces are vast, and designers are typically able to evaluate only small subsets of solutions due to a lack of efficient design tools. In this paper, we propose an architectural-level design methodology that provides a means for comprehensive design space exploration for smart camera applications, enabling designers to select higher-quality solutions and providing substantial savings in the overall cost of the system. We present efficient, accurate and intuitive models for performance estimation and validate them with experiments.
 
In this paper, a SystemC-based system-level design methodology is proposed, which enables the designer to reason about the architecture at a much higher level of abstraction. The goal of this methodology is to define a system architecture which provides sufficient performance, flexibility and cost efficiency as required by demanding applications, such as broadband networking or wireless communications. Co-simulating multiple levels of abstraction simultaneously enables the reuse of abstract models for the functional verification of synthesisable implementation models. We share our experiences with special emphasis on the architecture exploration phase, where several architectural alternatives are evaluated with respect to their impact on the system performance.
 
In this paper, we describe a new dynamically reconfigurable neuron hardware architecture based on the modified Xilinx Picoblaze microcontroller and the self-organising learning array (SOLAR) algorithm reported earlier. This architecture aims to use hundreds of traditional reconfigurable field programmable gate arrays (FPGAs) to build the SOLAR learning machine. SOLAR has many advantages over traditional neural network hardware implementations. Neurons are optimised for area and speed, and the whole system is dynamically self-reconfigurable at runtime. The system architecture is expandable to a large multiple-chip system.
 
Run-time reconfiguration of field programmable devices can change their internal structure and behaviour in response to dynamic requests. Thus, reconfigurable systems with programmable fabrics can offer a cost-effective solution to address the multiple functionalities of today's applications. This paper recognises the cost benefits that such run-time adaptability can provide and proposes a novel reconfigurable architecture synthesis methodology to achieve a cost-effective reconfigurable system solution. The proposed architecture synthesis methodology converts a recognised dynamic environment into an assembled micro-level system. New design steps of the methodology identify a multi-task and multi-mode workload, determine an appropriate reconfiguration granularity and synthesise a workload-specific static architecture for a run-time reconfigurable system that enables on-chip assembly of pre-constructed components. The experimental results show the cost benefits of the proposed methodology, which saves 73% of area and 29.8% of power compared to a fixed-design approach for implementing multiple visual processors.
 
System-on-chip (SoC) is developing as a new paradigm in electronic system design. It allows an entire hardware/software system to be built on a single chip, using pre-designed components. This paper examines the achievements and future of a novel approach and flow for the efficient design of application-specific multiprocessor systems-on-chip (called GAMSoC). The approach is based on a generic architecture model which is used as a template throughout the design process. The key characteristics of this model are its great modularity, flexibility and scalability, which make it reusable for a large class of applications. In the flow, architectural parameters are first extracted from a high-level system specification and then used to instantiate architectural components, such as processors, memory modules, IP hardware blocks, and on-chip communication networks. The flow includes the generation of hardware/software wrappers that adapt the processor to the on-chip communication network in an application-specific way. The feasibility and effectiveness of this approach are illustrated by significant demonstration examples.
 
High-assurance systems require a level of rigor, in both design and analysis, not typical of conventional systems. This paper provides an overview of the Multiple Independent Levels of Security and Safety (MILS) approach to high-assurance system design for security and safety critical embedded systems. MILS enables the development of a system using manageable units, each of which can be analyzed separately, avoiding costly analysis required of more conventional designs. MILS is particularly well suited to embedded systems that must provide guaranteed safety or security properties.
 
Reconfigurable ALU Array (RAA) architectures – representing a popular class of Coarse-grained Reconfigurable Architectures – are gaining in popularity especially for media applications due to their flexibility, regularity, and efficiency. In such architectures, memory is critical not only for configuration data but also for the heavy data traffic required by the application. In this paper, we offer a scheme for system designers to quickly estimate the performance of media applications on RAA architectures. Our experimental results demonstrate the flexibility of our memory architecture evaluation scheme as well as the varying effects of the memory architectures on the application performance.
 
This paper introduces an efficient hardware approach to reducing register file energy consumption by turning unused registers into a low-power state. Forwarding the register fields of the fetched instruction to the decode stage allows the registers required by the current instruction to be identified (instruction predecode) and lets the control logic turn them back on. Registers are returned to the low-power state after the instruction uses them. This technique achieves an 85% energy reduction with no performance penalty.
 
The increasing logic density of current Field Programmable Gate Arrays (FPGA) enables the integration of whole systems on one programmable chip. Using concepts of partial dynamic reconfiguration allows the adaptation of complex systems to changing requirements at run-time. In this paper we present a realisable approach for dynamic system integration on Xilinx Virtex FPGAs. In contrast to existing approaches that consider fixed slots for module placement, our approach allows fine-grained placement of modules with variable width along a horizontal communication infrastructure. By simulation we show that the proposed 1D approach outperforms 2D approaches in terms of device utilisation and external fragmentation.
 
If-conversion refers to a compiler optimisation that eliminates conditional branches by transforming a control flow region into an equivalent set of conditional instructions. VLIW architectures, used in the design of embedded multimedia processors, can support predicated execution with a limited number of conditional instructions or a full predicated ISA. We present in this paper a speculation framework and a set of Static Single Assignment (SSA) transformations to incrementally build if-converted regions on architectures supporting the select model of conditional moves. This framework has been further extended to support a configurable set of predicated instructions and used to explore architectural variants with predicated memory instructions. We implemented this SSA if-conversion algorithm in the Open64 code generator for the ST231 processor. We show an average improvement of 31% in cycle count without code size penalty.
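The core transformation is easy to picture: both arms of a branch are computed speculatively and a conditional move selects the live value. The toy Python below mirrors only the data flow; the paper performs this on SSA form inside the Open64 code generator for the ST231.

```python
# A hedged illustration of if-conversion under the select model.
def branchy(c: bool, a: int, b: int) -> int:
    if c:                   # control dependence: a conditional branch
        x = a + 1
    else:
        x = b * 2
    return x

def if_converted(c: bool, a: int, b: int) -> int:
    t1 = a + 1              # both paths computed speculatively
    t2 = b * 2
    return t1 if c else t2  # a 'select' (conditional move) replaces the branch

assert all(branchy(c, 3, 4) == if_converted(c, 3, 4) for c in (True, False))
```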
 
We present a design space exploration for applications using runtime-reconfigurable FPGAs. The studied example is a mechatronic control system which changes between different controller tasks at runtime. For each task we implement six alternative distributed-arithmetic designs with area/computation-time trade-offs. Area and timing values are estimated and later compared to synthesis results. For exchanging controllers at runtime we propose three different mappings to the FPGA. Given the application characteristics and the reconfiguration speed of the target FPGA, our analysis derives the optimal selection of the alternative task implementations and the corresponding mapping.
 
With the ever-growing complexity of modern microprocessors and the increasing demands for performance (speed, power, cost), understanding the actual nature of the workloads to be run on a target platform is crucial to meeting these requirements. This paper explores the behaviour of industry-based Java benchmarks for embedded applications, looking at the interactions both at the native architectural level and at the virtual machine level. We find that even with an architecturally optimised interpreter, the interpretation cost of a single bytecode is nearly 21 native instructions, yielding an average CPI of 2.67 on a typical ARM platform.
 
This paper presents a new approach for the evaluation of FPGA routing resources in the presence of faulty switches and wires. Switch stuck-open faults (switch permanently off) and stuck-closed faults (switch permanently on) as well as wire faults are addressed. This study is directly related to fault tolerance of the interconnect for testing and reconfiguration at manufacturing time and at run-time. Signal routing in the presence of faulty resources is analysed at the switch block and switch block array levels. Probabilistic routing (routability) is used as the figure of merit for evaluating the programmable interconnect resources of FPGA architectures. The proposed approach is based on finding a permutation (one-to-one mapping) between the input and output endpoints. A probabilistic approach is also presented to evaluate fault-tolerant routing for the entire FPGA by connecting switch blocks in chains, as required for testing and to account for the I/O pin restrictions of an FPGA chip. Results are reported for various commercial and academic FPGA architectures.
 
Recently, new Video-on-Demand (VoD) architectures using batching, patching and periodic broadcasting have been introduced that are much more scalable than traditional unicast VoD systems. However, the problem of designing an efficient server to implement these new multicast VoD architectures has received little attention. While existing server designs using round-based schedulers can still be used, results show that such designs are sub-optimal as they do not exploit the characteristics of fixed-schedule periodic broadcasting channels. This work addresses this challenge by presenting an efficient server design that can increase system capacity by up to 60% compared to traditional video server designs.
 
In this paper, we present an overview of the Artemis workbench, which provides modelling and simulation methods and tools for efficient performance evaluation and exploration of heterogeneous embedded multimedia systems. More specifically, we describe the Artemis system-level modelling methodology, including its support for gradual refinement of architecture performance models as well as for calibration of the system-level models. We show that this methodology allows for architectural exploration at different levels of abstraction while maintaining high-level and architecture independent application specifications. Moreover, we illustrate these modelling aspects using a case study with a Motion-JPEG application.
 
Reducing energy consumption is one of the main concerns in the design and implementation of embedded real-time systems. For this reason, the current generation of processors allows voltage and operating frequency to be varied to balance computational speed against energy consumption. This technique is called Dynamic Voltage Scaling (DVS). When applying DVS to hard real-time systems, it is important to provision for the worst-case computational requirement, otherwise a task may miss some timing constraint. However, the probability of a task executing for its worst-case execution time is very low. In this paper, we show how to exploit probabilistic information about the execution time of a task in order to reduce the energy consumed by the processor. Optimal speed assignments and transition points are found using a very general model of the processor. The model accounts for the processor idle power and for both the time and the energy overheads due to frequency transitions. We also show how these results can be applied to some significant cases.
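The benefit of probabilistic information can be sketched numerically: run the cycles that almost every job needs at a low speed and reserve the high speed for the rarely executed tail. The cubic power model, the probability mass function and the single transition point below are assumptions for illustration (and any such schedule must still meet the deadline in the worst case); the paper's model additionally charges idle power and transition overheads.

```python
# A hedged sketch of expected energy under a two-speed schedule.
def expected_energy(pmf: dict[int, float], split: int,
                    f_lo: float, f_hi: float, k: float = 1.0) -> float:
    """E[energy], assuming per-cycle energy k*f^2 (power k*f^3 over time 1/f)."""
    total = 0.0
    for cycles, prob in pmf.items():
        lo_cycles = min(cycles, split)     # cycles run at the low speed
        hi_cycles = cycles - lo_cycles     # tail cycles run at the high speed
        total += prob * k * (lo_cycles * f_lo**2 + hi_cycles * f_hi**2)
    return total

pmf = {60: 0.9, 100: 0.1}   # 90% of jobs finish in 60 cycles; WCET is 100
slow_then_fast = expected_energy(pmf, split=60, f_lo=0.6, f_hi=1.0)
always_fast = expected_energy(pmf, split=0, f_lo=0.6, f_hi=1.0)
assert slow_then_fast < always_fast   # exploiting the distribution saves energy
```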
 
Summary form only given. Placement time is an overhead on the application execution time in an online placement system. In a partially reconfigurable system, the inherent parallelism of the reconfigurable hardware can be exploited to speed up the placement process. We present three different architectures for two-dimensional online placement. Each architecture makes different trade-offs between area usage, memory requirement and execution time. These architectures are capable of achieving very fast placement while using a very small number of hardware resources.
 
Generalised Rate Monotonic Scheduling (GRMS) theory has now been widely adopted in practice and supported by open standards. In recent years, Ada runtimes and COTS RTOSs supporting GRMS have met the DO-178B flight control standard (http://www.ddci.com/products_darts.shtml; http://ic.arc.nasa.gov/publications/pdf/2000-0213.pdf). This creates strong incentives for the avionics industry to migrate from a traditional cyclical-executive-based system, called a federated architecture, to a GRMS-based system to take advantage of GRMS's flexibility and theoretical schedulability analysis. However, the industry is reluctant to migrate due to the enormous potential re-certification cost. This paper presents a novel software architecture called a Logical Federated Architecture (LFA) that greatly reduces the re-certification cost due to the change of scheduling methods.
 
This paper presents a HW/SW co-design methodology for the design of resource-limited embedded devices. The design methodology takes advantage of system abstraction and attached vertical and horizontal co-design flows. The vertical co-design flow targets a system consisting of a standard processor and custom hardware as a co-processor. The horizontal co-design flow implements the system's functionality entirely in software, but the underlying processor is developed at the same time. The focus of the paper is a detailed description of the two design flows with particular attention to power awareness.
 
In this paper, we study the problem of reducing both the dynamic and leakage energy consumption for real-time systems with (m,k)-constraints, which require that at least m out of any k consecutive jobs of a task meet their deadlines. Two energy-efficient scheduling approaches incorporating both dynamic voltage scheduling (DVS) and dynamic power down (DPD) are proposed in this paper. The first one statically determines the mandatory jobs that need to meet their deadlines in order to satisfy the (m,k)-constraints, and the second one does so dynamically. The simulation results demonstrate that, with more accurate workload estimation, our proposed techniques outperform previous approaches in both overall and idle energy reduction while providing the (m,k)-guarantee.
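For the static case, a common way to fix the mandatory jobs offline is the evenly spaced pattern due to Ramanathan: job j is mandatory iff j = floor(ceil(j*m/k)*k/m). Whether the paper uses exactly this pattern is an assumption on my part; the sketch below only demonstrates that such a pattern satisfies the (m,k)-constraint.

```python
# A hedged sketch of a static (m,k) mandatory/optional job partition.
from math import ceil, floor

def is_mandatory(j: int, m: int, k: int) -> bool:
    """Evenly spaced mandatory-job pattern for an (m,k)-constraint."""
    return j == floor(ceil(j * m / k) * k / m)

# For (m, k) = (2, 3): every window of 3 consecutive jobs has >= 2 mandatory.
pattern = [is_mandatory(j, 2, 3) for j in range(9)]
assert all(sum(pattern[i:i + 3]) >= 2 for i in range(7))
```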
 
The goal of Dynamic Voltage Scaling (DVS) is to maximize the energy savings while ensuring that applications' real-time requirements are met. Accurate predictions of task run-times are necessary to compute an appropriate CPU frequency that achieves high energy savings, avoids deadline misses, and reduces the overheads caused by frequent changes between different frequency levels. This paper experimentally explores an architecture based on the XScale PXA255 processor and shows that workload-awareness is not only required for accurate predictions of utilization, but also that in systems with a discrete number of frequency levels, the energy savings achieved by existing dual-speed DVS approaches (where an optimal theoretical CPU speed is computed and then approximated by choosing the two neighboring discrete speed levels) are suboptimal. As a consequence, this work introduces an online approach to dual-speed DVS that formulates a model for speed selection based on the workload characteristics of the current task set and computes a frequency pair that yields the best possible energy savings for a given task set and workload.
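The baseline dual-speed scheme the paper starts from can be sketched as follows: compute the ideal frequency from the predicted utilisation, bracket it with the two neighbouring discrete levels, and split the cycles so the deadline is met exactly. The frequency list and the closed-form split are illustrative assumptions; the paper's contribution is precisely that picking the pair from workload characteristics can beat this neighbouring-levels choice.

```python
# A hedged sketch of classic dual-speed DVS frequency-pair selection.
def dual_speed(levels: list[float], f_max: float, utilisation: float):
    f_ideal = utilisation * f_max
    lower = max((f for f in levels if f <= f_ideal), default=min(levels))
    upper = min((f for f in levels if f >= f_ideal), default=max(levels))
    if lower == upper:
        return lower, upper, 1.0
    # Fraction of cycles at `lower` so the average speed equals f_ideal:
    # alpha = (1/f_ideal - 1/upper) / (1/lower - 1/upper), rearranged below.
    alpha = (upper - f_ideal) / (upper - lower) * (lower / f_ideal)
    return lower, upper, alpha

# PXA255-like levels (MHz); 55% predicted utilisation at 400 MHz.
lo, hi, alpha = dual_speed([100, 200, 300, 400], f_max=400, utilisation=0.55)
assert (lo, hi) == (200, 300) and 0 < alpha < 1
```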
 
From April 3rd to April 8th 2005, a scientific workshop on the topic of "Power-aware Computing Systems" was held at Schloss Dagstuhl, Germany. The main seminar result is a classification of the obstacles, and therefore research directions, with respect to power consumption for different classes of devices, ranging from ultra-low-power devices, through handheld devices, to servers and workstations. The seminar then identified the impact that addressing power concerns at different levels can have. This paper summarises the objectives and structure as well as the outcome of that workshop.
 
The power density inside high-performance systems continues to rise with every process technology generation, thereby increasing the operating temperature and creating "hot spots" on the die. As a result, the performance, reliability and power consumption of the system degrade. To avoid these "hot spots", "temperature-aware" design has become a must. For low-power embedded systems, though, it is not clear whether similar thermal problems occur. These systems have very different characteristics from the high-performance ones: they consume a hundred times less power, they are based on a multi-processor architecture with lots of embedded memory, and they rely on cheap packaging solutions. In this paper, we investigate the need for temperature-aware design in low-power systems-on-a-chip and provide guidelines to delimit the conditions under which temperature-aware design is needed.
 
We present two techniques to reduce the power consumption in FPGAs. The first technique uses two supply voltages: timing-critical paths run on the normal Vdd, while the non-critical ones save power by using a lower Vdd. Our programmable dual-Vdd architectures and Vdd assignment algorithms provide an average power saving of 61% across the MCNC benchmarks. The second technique targets applications where configuration time is crucial. It uses Asymmetric SRAM (ASRAM) cells, instead of high-Vt SRAM cells, to implement the configuration memory. Our bit-inversion algorithm further reduces leakage by increasing the number of ASRAM cells that are in their preferred state.
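A slack-driven Vdd assignment can be sketched as a simple feasibility test: a resource may switch to the low supply only if its path delay, scaled by the low-Vdd slowdown, still fits in the clock period. The constant slowdown factor and the per-block granularity are my assumptions for illustration; the paper's algorithms operate on the placed FPGA netlist and account for level converters.

```python
# A hedged sketch of dual-Vdd assignment by timing slack.
def assign_vdd(path_delays: dict[str, float], period: float,
               slowdown: float = 1.4) -> dict[str, str]:
    """Assign VddL where the slowed-down delay still meets the clock period."""
    return {blk: ("VddL" if delay * slowdown <= period else "VddH")
            for blk, delay in path_delays.items()}

print(assign_vdd({"critical": 9.5, "slack_rich": 5.0}, period=10.0))
# {'critical': 'VddH', 'slack_rich': 'VddL'}
```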
 
FPGAs are becoming increasingly attractive thanks to improvements in their capacity and performance. Today, FPGAs represent an efficient design solution for numerous systems. Moreover, since FPGAs are important to the electronics industry, it becomes necessary to improve their security, particularly for SRAM FPGAs, since they are more vulnerable than other FPGA technologies. This paper proposes a solution to improve the security of SRAM FPGAs through flexible bitstream encryption. This proposition is distinct from other works because it uses the latest capabilities of SRAM FPGAs, such as partial dynamic reconfiguration and self-reconfiguration. It does not need an external battery to store the secret key. It opens a new way of application partitioning oriented by the security policy.
 
Many recent dynamic voltage-scaling (DVS) algorithms use hardware events (such as cache misses, memory bus transactions, or instruction execution rates) as the basis for deciding how much a program region can be slowed down with acceptable performance loss. Although these approaches result in power savings, the hardware events measured are at best indirectly related to execution time and clock frequency. We propose a new metric for evaluating the performance loss caused by DVS, a metric that is logically related to clock frequency and execution time, namely the percentage drop in cycles. Further, we show that we can predict with high accuracy the execution time of a code region at any clock frequency after measuring the total number of cycles spent in that region for two clock frequencies: the maximum and the second-highest clock frequency. Measurements using several real-world applications show that this 'two-point' model predicts execution times with an accuracy that is greater than 95% in many cases. This result can be used to develop low-overhead DVS algorithms that are more system-aware than many of the current algorithms, which rely on measuring indirect effects.
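The two-point idea follows from a simple model: on-chip work costs a fixed number of cycles, while memory stalls cost fixed time, so total cycles behave like cycles(f) = C_cpu + T_mem * f. Two measurements determine both unknowns, and execution time at any other frequency follows. The linear model form and the sample numbers below are a reconstruction from the abstract, not necessarily the paper's exact formulation.

```python
# A hedged sketch of the 'two-point' execution-time model.
def fit_two_point(f1: float, c1: float, f2: float, c2: float):
    """Fit cycles(f) = c_cpu + t_mem * f from two (frequency, cycles) points."""
    t_mem = (c1 - c2) / (f1 - f2)   # seconds of frequency-independent stalls
    c_cpu = c1 - t_mem * f1         # frequency-dependent cycle count
    return c_cpu, t_mem

def predict_time(f: float, c_cpu: float, t_mem: float) -> float:
    return c_cpu / f + t_mem        # predicted execution time at frequency f

# Hypothetical measurements at the two highest frequencies, 600 and 500 MHz.
c_cpu, t_mem = fit_two_point(600e6, 9.0e8, 500e6, 8.0e8)
print(predict_time(300e6, c_cpu, t_mem))   # predicted time at 300 MHz: 2.0 s
```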
 
Modern embedded systems execute a small set of applications or even a single one repeatedly. Specialising cache configurations to a particular application is well known to have great benefits for performance and power. However, it has been shown in recent years that the behaviour of an application varies from phase to phase. Tuning the cache configuration to fit the target application in each phase gives a further improvement in power consumption. This work presents a mechanism which determines the optimal configuration for each phase of an execution. By applying the corresponding cache configuration for each time interval of an execution on the L1 instruction cache, this work shows that on average a 91.6% energy saving is obtained compared with the average energy consumption of all four-way set-associative caches in the search space, and on average a 5.29% power reduction is achieved compared with the energy consumption of the benchmarks under their respective globally optimal cache configurations.
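Conceptually, the mechanism reduces to picking, for every interval, the configuration with the lowest energy rather than one global winner. The configuration names and energy numbers below are placeholders for values a cache simulator would report; only the selection logic is being illustrated.

```python
# A hedged sketch of per-phase cache-configuration selection.
def per_phase_best(energy: dict[str, list[float]]) -> list[str]:
    """energy[cfg][i] = energy of interval i under cfg; best cfg per interval."""
    intervals = len(next(iter(energy.values())))
    return [min(energy, key=lambda cfg: energy[cfg][i]) for i in range(intervals)]

energy = {"8KB-2way": [1.0, 3.0, 1.1],
          "16KB-4way": [1.5, 2.0, 1.6]}
per_phase = per_phase_best(energy)                           # adapts per interval
global_best = min(energy, key=lambda cfg: sum(energy[cfg]))  # one config for all
assert sum(energy[cfg][i] for i, cfg in enumerate(per_phase)) <= sum(energy[global_best])
```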
 
In this study, we designed and implemented the gated-Vss technique across a wide range of embedded workloads and evaluated the effects of varying parameters on the static power, specifically for caches (180, 130, 100, and 70 nm technology nodes, using a customised simulator). The static power increased by 50% when the cache sizes were increased from 16K to 128K for the L1 cache and from 256K to 2M for the L2 cache. Gated-Vss is an ideal solution for larger caches, especially with small latencies. Our results also indicate that the static power savings are higher when the embedded systems have dense memory operations and lower misprediction rates.
 
Reduction of the power consumption in portable wireless receivers is important for cellular systems, including UMTS and IMT2000. This paper explores the architectural design-space and methodologies for reducing the dynamic power dissipation in the Direct Sequence Code Division Multiple Access (DS-CDMA) downlink RAKE receiver. At the algorithm level, we investigate the tradeoffs of reduced precision and arithmetic complexity on the receiver performance. We then present and analyse two architectures for implementing the reference and reduced complexity receivers, with respect to dynamic power dissipation. The combined effect of reduced precision and complexity reduction leads to a 37.44% power savings.
 
In this paper, we present a functional partitioning method for low-power real-time distributed embedded systems whose constituent nodes are systems-on-a-chip (SOCs). The system-level specification is assumed to be given as a set of task graphs. The goal is to partition the task graphs so that each partitioned segment is implemented as an SOC and the embedded system is realized as a distributed system of SOCs. Unlike most previous synthesis and partitioning tools, this technique merges partitioning and system synthesis (allocation, assignment, and scheduling) into one integrated process; both are implemented within a genetic algorithm. Genetic algorithms can escape local minima and explore the partitioning and synthesis design space efficiently. Through integration with an existing SOC synthesis tool, the proposed partitioning technique satisfies both the hard real-time constraints and the SOC area constraint of each partitioned segment. Under these constraints, our tool performs multi-objective optimization. Thus, with a single run of the tool, it produces multiple distributed SOC-based embedded system architectures that trade off the overall distributed system price and power consumption. Experimental results show the efficacy of our technique.
 
On-chip implementation of multiprocessor systems needs to planarise the interconnect networks onto the silicon floorplan. Compared with traditional ASIC/SoC architectures, Multiprocessor Systems-on-Chip (MPSoC) node processors are homogeneous, and MPSoC network topologies are regular. Therefore, traditional ASIC floorplanning methodologies that perform macro placement are not suitable for MPSoC designs. We propose REGULAY, an automated MPSoC physical planning methodology, which can generate an optimal floorplan for different topologies under different design constraints. Compared with traditional floorplanning approaches, REGULAY shows significant advantages in reducing the total interconnect wirelength while preserving the regularity and hierarchy of the network topology.
 
While a communication network is a critical component for an efficient system-on-chip multiprocessor, there are few approaches available to help with system-level architectural exploration of such a specialised interconnection network. This paper presents an integrated modelling, simulation and implementation tool. The network architecture can be co-simulated with embedded-software to obtain cycle-accurate performance metrics. This allows an energy and performance tuning of the NOC. Next it can be converted into VHDL for implementation. We discuss our approach by designing a flexible network-on-chip and present implementation results. The performance of our automatically generated network is comparable with a reference design directly developed in HDL.
 
Summary form only given. We consider the problem of compiling programs, written in a general high-level programming language, into hardware circuits executed by an FPGA (field programmable gate array) unit. In particular, we consider the problem of synthesizing nested loops that frequently access array elements stored in an external memory (outside the FPGA). We propose an aggressive compilation scheme, based on loop unrolling and code flattening techniques, where array references from/to the external memory are overlapped with uninterrupted hardware evaluation of the synthesized loop's circuit. We implement a restricted programming language called DOL based on the proposed compilation scheme, and our experimental results provide preliminary evidence that aggressive compilation can be used to compile large code segments into circuits, including overlapping of hardware operations and memory references.
 