ACM Transactions on Embedded Computing Systems

Published by Association for Computing Machinery
Online ISSN: 1539-9087
Publications
Conference Paper
Sub-50 nm CMOS technologies are affected by significant variability, which causes power and performance variations among nominally similar cores in MPSoC platforms. This undesired heterogeneity threatens execution predictability and energy efficiency. We propose two techniques to allocate sets of barrier-synchronized tasks (representative of a wide class of image processing workloads) onto variability-affected MPSoCs. The first technique models allocation as an ILP and achieves optimal results, but requires an off-line solver. The second technique adopts a two-stage heuristic approach and can be adapted to work on-line. We tested our approach on the virtual prototype of a next-generation industrial multi-core platform. Experimental results demonstrate that our approach minimizes deadline violations while increasing energy efficiency.
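As a flavor of such a heuristic, the sketch below implements only a greedy first stage (longest task first, onto the core with the earliest finish time under measured per-core speeds); all names and numbers are hypothetical, and the paper's actual two-stage heuristic also includes a refinement stage.

def allocate(tasks, core_speeds):
    # tasks: task id -> cycle count; core_speeds: core id -> relative speed
    load = {c: 0.0 for c in core_speeds}      # accumulated cycles per core
    placement = {}
    for tid, cycles in sorted(tasks.items(), key=lambda kv: -kv[1]):
        # place each task on the core that would finish it earliest
        best = min(core_speeds, key=lambda c: (load[c] + cycles) / core_speeds[c])
        placement[tid] = best
        load[best] += cycles
    # the barrier completes when the slowest core finishes
    makespan = max(load[c] / core_speeds[c] for c in core_speeds)
    return placement, makespan

# Three nominally identical cores whose effective speeds differ under variability.
cores = {"core0": 1.00, "core1": 0.91, "core2": 0.83}
tasks = {f"t{i}": w for i, w in enumerate([40, 25, 25, 20, 10, 10])}
print(allocate(tasks, cores))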
 
The results of an optimal discretely variable voltage allocation corresponding to our LP solution by Alloc-vtcap for the example in Figure 5.  
Conference Paper
This paper presents a set of new, important results for the problem of task scheduling and voltage allocation on dynamically variable voltage processors for minimizing total processor energy consumption. The contributions are twofold: (1) for given multiple discrete supply voltages and tasks with arbitrary arrival-time/deadline constraints, we propose a voltage allocation technique that produces a feasible task schedule with optimal processor energy consumption; (2) we then extend the problem to include the case in which tasks have non-uniform load (i.e., switched) capacitances, and solve it optimally.
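A classical building block behind results of this kind is that, with discrete levels, a single task's ideal continuous speed is best approximated by splitting its cycles across the two adjacent available levels so that the deadline is met exactly. The sketch below shows that split for one task; the level values are illustrative, and the paper's algorithm additionally handles task sets with arbitrary arrival times, deadlines, and non-uniform capacitances.

def split_two_levels(cycles, deadline, levels):
    # levels: (frequency, energy_per_cycle) pairs, ascending by frequency
    f_ideal = cycles / deadline
    feasible = [l for l in levels if l[0] >= f_ideal]
    assert feasible, "task infeasible even at the highest level"
    hi = feasible[0]                      # lowest level meeting the deadline
    lower = [l for l in levels if l[0] < f_ideal]
    if not lower:
        return [(cycles, hi)]             # no slack to exploit
    lo = lower[-1]                        # adjacent lower level
    # run x cycles at f_hi and the rest at f_lo so the deadline is met exactly
    x = hi[0] * (cycles - deadline * lo[0]) / (hi[0] - lo[0])
    return [(cycles - x, lo), (x, hi)]

# 10^6 cycles due in 8 ms, with levels at 100 and 200 MHz:
plan = split_two_levels(1e6, 0.008, [(100e6, 1.0), (200e6, 2.6)])
print(plan)   # 600k cycles at 100 MHz, 400k cycles at 200 MHz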
 
Conference Paper
JPEG XR is an emerging image coding standard, based on HD Photo developed by Microsoft. It achieves compression performance roughly twice that of the de facto image coding standard, JPEG, and also has an advantage over JPEG 2000 in terms of computational cost. JPEG XR is expected to become widespread on many devices, including embedded systems, in the near future. In this paper, we propose a novel architecture for JPEG XR encoding. In previous architectures, entropy coding was the throughput bottleneck because it was implemented as a sequential algorithm to handle data with dependencies. We found that there is no dependency within intra-macroblock data, so all the encoding processes, including entropy coding, can safely be pipelined. The proposed fully pipelined architecture achieves 100 Mpixel/sec at 125 MHz, which previous works could not achieve.
 
Conference Paper
Memory access can account for about half of a microprocessor system's power consumption. Customizing a microprocessor cache's total size, line size, and associativity to a particular program is well known to have tremendous benefits for performance and power. Customizing caches has until recently been restricted to core-based flows, in which a new chip will be fabricated. However, several configurable cache architectures have been proposed recently for use in pre-fabricated microprocessor platforms. Tuning those caches to a program nevertheless remains a cumbersome task left to designers, assisted in part by recent computer-aided design (CAD) tuning aids. We propose to move that CAD on-chip, which can greatly increase the acceptance of configurable caches. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune the cache to an executing program. We carefully designed the heuristic to avoid any cache flushing, since flushing is power and performance costly. By simulating numerous Powerstone and MediaBench benchmarks, we show that such a dynamic self-tuning cache can reduce memory-access energy by 45% to 55% on average, and by as much as 97%, compared with a four-way set-associative base cache, completely transparently to the programmer.
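In the spirit of the heuristic described above, the sketch below tunes one parameter at a time instead of exhaustively searching all configurations; energy_of is a hypothetical callback standing in for the measurement of one tuning interval, and the real on-chip heuristic additionally orders its steps to avoid flush-inducing configurations.

def tune_cache(energy_of, sizes, line_sizes, assocs):
    # Search size first, then line size, then associativity, so each
    # step reuses the best value found so far.
    size = min(sizes, key=lambda s: energy_of(s, line_sizes[0], assocs[0]))
    line = min(line_sizes, key=lambda l: energy_of(size, l, assocs[0]))
    assoc = min(assocs, key=lambda a: energy_of(size, line, a))
    return size, line, assoc

# Toy stand-in for measured energy (arbitrary shape, for demonstration only):
energy = lambda s, l, a: abs(s - 4096) / 4096 + abs(l - 32) / 64 + 0.1 * a
print(tune_cache(energy, [1024, 2048, 4096, 8192], [16, 32, 64], [1, 2, 4]))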
 
Conference Paper
Over the last decade there has been an increasing emphasis on driver-assistance systems in the automotive domain. In this paper we report our work on designing a camera-based surveillance system embedded in a "smart" car door. Such a camera is used to monitor the ambient environment outside the car - e.g., the presence of obstacles such as approaching cars or cyclists that might collide with the car door if it is opened - and automatically control the car door operations. This is an enhancement to the currently available side-view mirrors, which the driver/passenger checks before opening the car door. The focus of this paper is on fast and robust image processing algorithms specifically targeting such a smart car door system. The requirement is to quickly detect traffic objects of interest from gray-scale images captured by omnidirectional cameras. Whereas known algorithms for object extraction from the image processing literature rely on color information and are sensitive to shadows and illumination changes, our proposed algorithms are highly robust, can operate on gray-scale images (color images are not available in our setup), and output results in real time. To illustrate this, we present a number of experimental results based on image sequences captured from real-life traffic scenarios.
 
Basic FPGA Architecture  
Conference Paper
In this paper, we propose a placement method for island-style FPGAs based on recursive bi-partitioning followed by the application of space-filling curves. Experimental results show a 55% improvement in cost compared to the random initial placement of the popular tool VPR. The solutions thus obtained require 44.5% fewer moves during final iterative refinement by ultra-low-temperature simulated annealing, while solution quality is on average 0.1% better. This establishes the utility of the method for fast reconfiguration of FPGA-based co-processors.
 
Article
We are interested in verifying dynamic properties of finite state reactive systems under fairness assumptions by model checking. The systems we want to verify are specified through a top-down refinement process. In order to deal with the state explosion problem, we have proposed in previous works to partition the reachability graph, and to perform the verification on each part separately. Moreover, we have defined a class, called Bmod, of dynamic properties that are verifiable by parts, whatever the partition. We decide if a property P belongs to Bmod by looking at the form of the Büchi automaton that accepts the negation of P. However, when a property P belongs to Bmod, the property f => P, where f is a fairness assumption, does not necessarily belong to Bmod. In this paper, we propose to use the refinement process in order to build the parts on which the verification has to be performed. We then show that with such a partition, if a property P is verifiable by parts and if f is the expression of the fairness assumptions on a system, then the property f => P is still verifiable by parts. This approach is illustrated by its application to the chip card protocol T=1 using the B engineering design language.
 
Illustration of one acyclic task scheduled on a processor. Three task instances indexed by k − 1, k, and k + 1 are plotted.
The scheduled behaviors of Γ within [9.29, 9.63] seconds. The upper figure is produced by TrueTime, the lower figure is produced by the dynamic timing model. Jitters are marked by arrows.
Chen and Mora's battery model
Article
This paper establishes a novel analytical approach to quantify robustness of scheduling and battery management for battery supported cyber-physical systems. A dynamic schedulability test is introduced to determine whether tasks are schedulable within a finite time window. The test is used to measure robustness of a real-time scheduling algorithm by evaluating the strength of computing time perturbations that break schedulability at runtime. Robustness of battery management is quantified analytically by an adaptive threshold on the state of charge. The adaptive threshold significantly reduces the false alarm rate for battery management algorithms to decide when a battery needs to be replaced.
 
Hybrid system of Example 3  
Article
In this paper, we address the problem of safety verification of nonlinear hybrid systems. A hybrid symbolic-numeric method is presented to compute exact inequality invariants of hybrid systems efficiently. Numerical invariants of a hybrid system can be obtained by solving a bilinear SOS program via the PENBMI solver or an iterative method; modified Newton refinement and rational vector recovery techniques are then applied to obtain exact polynomial invariants with rational coefficients that exactly satisfy the invariant conditions. Experiments on some benchmarks are given to illustrate the efficiency of our algorithm.
 
Article
Bluetooth Low Energy (BLE) is a wireless protocol well suited for ultra-low-power sensors running on small batteries. BLE is described as a new protocol in the official Bluetooth 4.0 specification. To design energy-efficient devices, the protocol provides a number of parameters that need to be optimized within an energy, latency, and throughput design space. To minimize power consumption, the protocol parameters have to be optimized for a given application. Therefore, an energy model that can predict the energy consumption of a BLE-based wireless device for different parameter settings is needed. As BLE differs significantly from the original Bluetooth, models for Bluetooth cannot easily be applied to the BLE protocol. Over the last year, a couple of energy models for BLE have been proposed; however, none of them models all the operating modes of the protocol. This paper presents a precise energy model of the BLE protocol that allows the computation of a device's power consumption in all possible operating modes. To the best of our knowledge, our proposed model is not only one of the most accurate known so far (because it accounts for all protocol parameters), but it is also the only one that models all the operating modes of BLE. Furthermore, we present a sensitivity analysis of the different parameters on the energy consumption and evaluate the accuracy of the model using both discrete-event simulation and actual measurements. Based on this model, guidelines are presented that help system designers choose the right parameters for optimizing the energy consumption of a given application.
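To give a flavor of such a model, the sketch below estimates the average current of a peripheral in connected mode from the connection interval alone; the numbers are placeholders, and the paper's model covers all operating modes (advertising, scanning, connection setup, and so on) and all protocol parameters.

def avg_current(conn_interval, event_time, event_current, sleep_current):
    # One connection event per interval: charge spent awake plus charge
    # spent sleeping, divided by the interval length.
    q_event = event_current * event_time
    q_sleep = sleep_current * (conn_interval - event_time)
    return (q_event + q_sleep) / conn_interval

# e.g. 1 s interval, 2 ms events at 15 mA, 1 uA sleep -> ~31 uA average
print(avg_current(1.0, 0.002, 15e-3, 1e-6))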
 
Article
This paper presents a novel approach that enhances the performance of 16-bit Thumb code. We have observed that throughout Thumb code there exist Thumb instruction pairs that are equivalent to a single ARM instruction. We have developed enhancements to the processor microarchitecture and the Thumb instruction set to exploit this property. We enhance the Thumb instruction set by incorporating Augmenting eXtensions (AX). A Thumb instruction pair that can be combined into a single ARM instruction is replaced by an AXThumb instruction pair by the compiler. The AX instruction is coalesced with the immediately following Thumb instruction to generate a single ARM instruction at decode time. The enhanced microarchitecture ensures that coalescing does not introduce pipeline delays or increase cycle time, thereby reducing both instruction counts and cycle counts. Using AX instructions and coalescing hardware, we are also able to support efficient predicated execution in 16-bit mode.
 
Experimental Tasks 
Example Time Constraints Specified Using Time/Utility Functions  
Energy Model Settings 
Normalized Energy and AUR vs. Load with Step TUFs under Energy Setting E2, with statistical performance requirements ν_i = 1 and ρ_i = 0.96.
Article
We present an energy-efficient, utility accrual, real-time scheduling algorithm called the Resource-constrained Energy-Efficient Utility Accrual Algorithm (or ReUA). ReUA considers an application model where activities are subject to time/utility function (TUF) time constraints, resource dependencies including mutual exclusion constraints, and statistical performance requirements including activity (timeliness) utility bounds that are probabilistically satisfied. Further, ReUA targets mobile embedded systems where system-level energy consumption is also a major concern. For such a model, we consider the scheduling objectives of (1) satisfying the statistical performance requirements; and (2) maximizing the system-level energy efficiency. At the same time, resource dependencies must be respected. Since the problem is NP-hard, ReUA makes resource allocations using statistical properties of application cycle demands and heuristically computes schedules with a polynomial-time cost. We analytically establish several timeliness and non-timeliness properties of the algorithm. Further, our simulation experiments illustrate the algorithm's effectiveness.
 
Article
Sensor networks, a novel paradigm in distributed wireless communication technology, have been proposed for use in various applications including military surveillance and environmental monitoring. These systems could deploy heterogeneous collections of sensors capable of observing and reporting on various dynamic properties of their surroundings in a time-sensitive manner. Such systems suffer bandwidth, energy, and throughput constraints that limit the quantity of information transferred from end to end. These factors, coupled with unpredictable traffic patterns and dynamic network topologies, make the task of designing optimal protocols for such networks difficult. Mechanisms to perform data-centric aggregation utilizing application-specific knowledge provide a means of augmenting throughput, but have limitations due to their lack of adaptation and reliance on application-specific decisions. We therefore propose a novel aggregation scheme that adaptively performs application-independent data aggregation in a time-sensitive manner. Our work isolates aggregation decisions into a module that resides between the network and the data link layer and does not require any modifications to the currently existing MAC and network layer protocols. We take advantage of queuing delay and the broadcast nature of wireless communication to concatenate network units into an aggregate, using a novel adaptive feedback scheme to schedule the delivery of this aggregate to the MAC layer for transmission. In our evaluation we show that end-to-end transmission delay is reduced by as much as 80% under heavy traffic loads. Additionally, we show as much as a 50% reduction in transmission energy consumption with the addition of only 2 bytes of header overhead per network unit. Theoretical analysis...
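The sketch below caricatures the scheme's placement between the network and data link layers: network units are concatenated until a size or delay bound fires, and a feedback hook adapts the delay bound to channel load. All class names, methods, and constants are hypothetical, not the paper's implementation.

class Aggregator:
    def __init__(self, max_bytes=100, delay=0.02):
        self.buf, self.nbytes, self.deadline = [], 0, None
        self.max_bytes, self.delay = max_bytes, delay

    def enqueue(self, unit, now, mac_send):
        # buffer the network unit; start the delay timer on the first one
        if self.deadline is None:
            self.deadline = now + self.delay
        self.buf.append(unit)
        self.nbytes += len(unit)
        if self.nbytes >= self.max_bytes or now >= self.deadline:
            mac_send(b"".join(self.buf))        # one MAC-layer transmission
            self.buf, self.nbytes, self.deadline = [], 0, None

    def feedback(self, channel_busy_fraction):
        # adaptive part: aggregate longer when the channel is congested
        self.delay = 0.005 + 0.05 * channel_busy_fraction

agg = Aggregator()
agg.enqueue(b"x" * 60, now=0.000, mac_send=print)
agg.enqueue(b"y" * 60, now=0.001, mac_send=print)   # 120 bytes -> flushed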
 
Article
Lower threshold voltages in deep sub-micron technologies cause more leakage current, increasing static power dissipation. This trend, combined with the trend of larger/more cache memories dominating die area, has prompted circuit designers to develop SRAM cells with low-leakage operating modes (e.g., sleep mode). Sleep mode reduces static power dissipation but data stored in a sleeping cell is unreliable or lost. So, at the architecture level, there is interest in exploiting sleep mode to reduce static power dissipation while maintaining high performance.
 
Flow diagram for heterogeneous memory management on embedded systems. The thick arrows denote the primary flow of compilation and synthesis, while the thin arrows provide supporting data.
Normalized simulated runtimes for all benchmarks with varying memory configurations. Each group of bars displays the cases, for all benchmarks, when the SRAM size is equal to the program data size, 20% of the program data size (for all three allocation schemes), and the case when all program data is placed in DRAM.  
Normalized runtimes for benchmarks with varying SRAM size. DRAM and EEPROM sizes are fixed. X-axis = SRAM size, as a percentage of the total data size for that program. Y-axis = runtime, normalized to 1.0 for SRAM size = 100% of data size. Note the steep jump in runtime as the SRAM size approaches zero.
Normalized runtimes of Alternative 1 and Alternative 2 stack formulations for the BMM benchmark with varying SRAM size.  
Article
This paper presents a technique for the efficient compiler management of software-exposed heterogeneous memory. In many lower-end embedded chips, often used in micro-controllers and DSP processors, heterogeneous memory units such as scratch-pad SRAM, internal DRAM, external DRAM, and ROM are visible directly to the software, without automatic management by a hardware caching mechanism. Instead, the memory units are mapped to different portions of the address space. Caches are avoided because of their cost and power consumption, and because they make it difficult to guarantee real-time performance. For this important class of embedded chips, the allocation of data to different memory units to maximize performance is the responsibility of the software.
 
Article
Most compilers ignore the problems of limited code space in embedded systems. Designers of embedded software often have no better alternative than to manually reduce the size of the source code or even the compiled code. Besides being tedious and error-prone, such optimization results in obfuscated code which is difficult to maintain and reuse. In this paper, we present a code-size-directed compiler. We phrase register allocation and code generation as an integer linear programming problem where the upper bound on the code size can simply be expressed as an additional constraint. Our experiments show that our compiler, when applied to two commercial microcontroller programs, generates code as compact as carefully crafted code.
 
Size distribution (in million cycles) for the sub-band filter events.
Article
Managing energy consumption has become vitally important to battery-operated portable and embedded systems. A dynamic voltage scaling (DVS) technique reduces the processor's dynamic power consumption quadratically at the expense of linearly decreasing performance. Reducing energy using DVS in the context of real-time systems should consider this tradeoff. In this paper, we introduce a novel collaborative approach between the compiler and the operating system (OS) that uses fine-grained information about the execution times of a real-time application to reduce energy consumption. We use the compiler to annotate an application's source code with path-dependent information called power management hints (PMHs). This information captures the temporal behavior of the application, which varies across different execution paths. During program execution, the OS periodically changes the processor's frequency and voltage based on the temporal information provided by the PMHs. These speed adaptation points are called power management points (PMPs). We evaluate our scheme using two embedded applications: a video decoder and an automatic target recognition application. Our scheme shows an energy reduction of up to 79% over no power management and up to 50% over a static power management scheme.
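A minimal sketch of what could happen at a power management point, assuming the PMHs have recorded the worst-case cycles remaining along the executed path (all names and values are hypothetical):

def speed_at_pmp(wc_cycles_left, time_to_deadline, freqs):
    # choose the lowest discrete frequency that still meets the deadline
    required = wc_cycles_left / time_to_deadline
    for f in sorted(freqs):
        if f >= required:
            return f
    return max(freqs)          # saturate if the deadline is already tight

# 3e6 worst-case cycles left, 25 ms to the deadline -> 120 MHz suffices
print(speed_at_pmp(3e6, 0.025, [60e6, 120e6, 150e6, 300e6]))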
 
Article
This paper describes the motivation, design, implementation, and experimental evaluation (on sharply resource-constrained devices) of a self-configuring localization system using radio beacons. We identify beacon density as an important parameter in determining localization quality, which saturates at a transition density. We develop algorithms to improve localization quality by (i) automating the placement of new beacons at low densities (HEAP) and (ii) rotating functionality among redundant beacons to increase system lifetime at high densities (STROBE).
 
Article
The network processor market is one of the fastest growing segments of the microprocessor industry today. In spite of this increasing market importance, there does not exist a common framework to compare the performance of different network processor designs. Our primary goal in this study is to fill this gap by creating the NetBench benchmarking suite. NetBench is designed to represent network processor workloads. NetBench contains 11 programs that form 18 different applications. The programs are selected from all levels of packet processing: small, low-level code fragments as well as large application-level programs are included in the suite. These applications are representative of the network processor applications in the market. Using the SimpleScalar simulator to model an ARM processor, we study these programs in detail and compare key characteristics such as instructions per cycle, instruction distribution, cache behavior, and branch prediction accuracy with the programs from MediaBench. Using statistical analysis, we show that the simulation results for the programs in NetBench have significantly different characteristics than the programs in MediaBench. Finally, we present performance measurements from the Intel IXP1200 network processor to show how NetBench can be utilized.
 
Article
We present design patterns used by software components in the TinyOS sensor network operating system. They differ significantly from traditional software design patterns because of the constraints of sensor networks and TinyOS's focus on static allocation ...
 
Article
Real-time multimedia applications are increasingly being mapped onto MPSoC (multiprocessor system-on-chip) platforms containing hardware--software IPs (intellectual property), along with a library of common scheduling policies such as EDF and RM. The choice ...
 
Article
Building distributed real-time embedded systems requires a stringent methodology, from early requirement capture to full implementation. However, there is a strong link between the requirements and the final implementation (e.g., scheduling and resource dimensioning). Therefore, a rapid prototyping process based on the automation of tedious and error-prone tasks (analysis and code generation) is required to speed up the development cycle. In this article, we show how the AADL (Architecture Analysis and Design Language), which appeared in late 2004, helps solve these issues thanks to a dedicated tool suite. We then detail the prototyping process and its current implementation: Ocarina.
 
Article
The SSP is a hardware implementation of a subset of the JVM for use in high-consequence embedded applications. In this context, a majority of the activities belonging to class loading, as it is defined in the specification of the JVM, can be performed statically. Static class loading has the net result of dramatically simplifying the design of the SSP as well as increasing its performance. The functionality of the class loader can be implemented using strategic programming techniques. The incremental nature of strategic programming is amenable to formal verification. This article gives an overview of the core class loading activities for the SSP, their implementation in the strategic programming language TL, and outlines the approach to formal verification of the implementation.
 
Article
An important correctness criterion for software running on embedded microcontrollers is stack safety: a guarantee that the call stack does not overflow. Our first contribution is a method for statically guaranteeing stack safety of interrupt-driven embedded software using an approach based on context-sensitive dataflow analysis of object code. We have implemented a prototype stack analysis tool that targets software for Atmel AVR microcontrollers and tested it on embedded applications compiled from up to 30,000 lines of C. We experimentally validate the accuracy of the tool, which runs in under 10 sec on the largest programs that we tested. The second contribution of this paper is the development of two novel ways to reduce stack memory requirements of embedded software.
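The core bound can be pictured as a small recursion over an acyclic call graph: a function's worst-case depth is its own frame plus the deepest callee subtree, and a non-reentrant interrupt adds its own worst depth on top. The sketch below (hypothetical call graph and frame sizes) only hints at the idea; the actual tool works context-sensitively on AVR object code and tracks interrupt masking.

def max_stack(call_graph, frame, root):
    # worst-case depth = own frame + deepest callee subtree (graph assumed acyclic)
    callees = call_graph.get(root, [])
    return frame[root] + (max(max_stack(call_graph, frame, c) for c in callees)
                          if callees else 0)

cg = {"main": ["f", "g"], "f": ["g"], "g": [], "irq": ["g"]}
fr = {"main": 32, "f": 48, "g": 16, "irq": 24}
# one non-reentrant interrupt: it can preempt at the deepest point
print(max_stack(cg, fr, "main") + max_stack(cg, fr, "irq"))   # 96 + 40 = 136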
 
Computational model: several threads of execution communicate via shared state variables and receive signals.

public class RS232ReceiveInterruptHandler extends InterruptHandler {
    private RS232 rs232;
    private InterruptControl interruptControl;
    private byte UartRxBuffer[];
    private short UartRxWrPtr;
    ...
    protected void handle() {
        synchronized (this) {
            UartRxBuffer[UartRxWrPtr++] = rs232.P0_UART_RX_TX_REG;
            if (UartRxWrPtr >= UartRxBuffer.length)
                UartRxWrPtr = 0;
        }
        rs232.P0_CLEAR_RX_INT_REG = 0;
        interruptControl.RESET_INT_PENDING_REG = RS232.CLR_UART_RX_INT_PENDING;
    }
}
Device object classes and board factory classes  
Memory layout of the JOP JVM  
Article
Embedded systems use specialized hardware devices to interact with their environment, and since they have to be dependable, it is attractive to use a modern, type-safe programming language like Java to develop programs for them. Standard Java, as a platform-independent language, delegates access to devices, direct memory access, and interrupt handling to some underlying operating system or kernel, but in the embedded systems domain resources are scarce and a Java Virtual Machine (JVM) without an underlying middleware is an attractive architecture. The contribution of this article is a proposal for Java packages with hardware objects and interrupt handlers that interface to such a JVM. We provide implementations of the proposal directly in hardware, as extensions of standard interpreters, and finally with an operating system middleware. The latter solution is mainly seen as a migration path allowing Java programs to coexist with legacy system components. An important aspect of the proposal is that it is compatible with the Real-Time Specification for Java (RTSJ).
 
Discrete abstraction of the continuous state-space for the thermostat model: each box or line on the right-hand side corresponds to a consistent vector b ∈ B^10 for the predicates as specified in equation (1).
Illustration of the computation of continuous successors. After two iterations the new abstract state (l, b′) is reachable from (l, b).
The graph of reachable abstract states for the thermostat model
Article
Predicate abstraction has emerged to be a powerful technique for extracting finite-state models from infinite-state discrete programs. This paper presents algorithms and tools for reachability analysis of hybrid systems by combining the notion of predicate abstraction with recent techniques for approximating the set of reachable states of linear systems using polyhedra. Given a hybrid system and a set of predicates, we consider the finite discrete quotient whose states correspond to all possible truth assignments to the input predicates. The tool performs an on-the-fly exploration of the abstract system. We present the basic techniques for guided search in the abstract state-space, optimizations of these techniques, implementation of these in our verifier, and case studies demonstrating the promise of the approach. We also address the completeness of our abstraction-based verification strategy by showing that predicate abstraction of hybrid systems can be used to prove bounded safety.
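The abstraction map itself is simple to state: a concrete state is sent to the tuple of truth values of the user-supplied predicates. A toy point-wise sketch follows (the verifier of course works symbolically, with polyhedral approximations of reachable sets, rather than on sampled points):

def abstract_state(x, predicates):
    # truth assignment to the input predicates = one abstract state
    return tuple(p(x) for p in predicates)

preds = [lambda x: x > 0, lambda x: x < 10]   # two predicates -> at most 4 abstract states
print(abstract_state(5, preds))    # (True, True)
print(abstract_state(-3, preds))   # (False, True)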
 
Article
Currently, system-on-chip (SoC) designs are becoming increasingly complex, with more and more components being integrated into a single SoC design. Communication between these components increasingly dominates critical system paths and frequently becomes the source of performance bottlenecks. It, therefore, becomes imperative for designers to explore the communication space early in the design flow. Traditionally, system designers have used Pin-Accurate Bus Cycle Accurate (PA-BCA) models for early communication space exploration. These models capture all of the bus signals and strictly maintain cycle accuracy, which is useful for reliable performance exploration but results in slow simulation speeds for complex designs, even when they are modeled using high-level languages. Recently, there have been several efforts to use the Transaction-Level Modeling (TLM) paradigm for improving simulation performance in BCA models. However, these transaction-based BCA (T-BCA) models capture a lot of details that can be eliminated when exploring communication architectures. In this paper, we extend the TLM approach and propose a new transaction-based modeling abstraction level (CCATB) to explore the communication design space. Our abstraction level bridges the gap between the TLM and BCA levels, and yields an average performance speedup of 120% over PA-BCA and 67% over T-BCA models. The CCATB models are not only faster to simulate, but are also extremely accurate and take less time to model compared to both T-BCA and PA-BCA models. We describe the mechanisms that produce the speedup in CCATB models and also analyze how the achieved simulation speedup scales with design complexity. To demonstrate the effectiveness of using CCATB for exploration, we present communication space exploration case studies from the broadband communication and multimedia application domains.
 
Article
Inexpensive, reliable hard disk storage is increasingly required in both businesses and the home. As disk capacities increase and multiple drives are combined in one system, the probability of multiple disk failures increases. Through the adoption of RAID 6, the capability to recover from up to two simultaneous disk failures becomes available. In this article, we present three different RAID 6 implementations, each tailored to support different target applications and optimized to reduce overall hardware resource utilization. We present an optimal Reed-Solomon-based RAID 6 implementation for arrays of four disks. We also present the RAID 6 hardware solution with the smallest hardware resource utilization as well as the highest throughput for disk arrays of up to 15 drives. Finally, we present an implementation supporting up to 255 disks in a single array.
 
Article
Home and office network gateways often employ a cost-effective embedded network processor to handle their network services. Such network gateways have received strong demand for applications dealing with intrusion detection, keyword blocking, antivirus, and antispam. Accordingly, we were motivated to propose an appropriate fast scalable automaton-matching (FSAM) hardware to accelerate embedded network processors. Although automaton-matching algorithms are robust with deterministic matching time, there is still plenty of room for improving their average-case performance. FSAM employs novel prehash and root-index techniques to accelerate the matching for the nonroot states and the root state, respectively, in automaton-based hardware. The prehash approach uses some hashing functions to pretest the input substring for the nonroot states, while the root-index approach handles multiple bytes in one single matching for the root state. Also, FSAM is applied to a prevalent automaton algorithm, Aho-Corasick (AC), which is often used in many content-filtering applications. When implemented in FPGA, FSAM can perform at the rate of 11.1 Gbps with a pattern set of 32,634 bytes, demonstrating that our proposed approach can use a small logic circuit to achieve competitive performance, although a larger memory is used. Furthermore, the number of patterns in FSAM is not limited by the amount of internal circuits and memories; if high-speed external memories are employed, FSAM can support up to 21,302 patterns while maintaining similar high performance.
 
Article
DRAM (dynamic random-access memory) energy consumption in low-power embedded systems can be very high, exceeding that of the data cache or even that of the processor. This paper presents and evaluates a scheme for reducing the energy consumption of SDRAM (synchronous DRAM) memory access by a combination of techniques that take advantage of SDRAM energy efficiencies in bank and row access. This is achieved by using small, cachelike structures in the memory controller to prefetch an additional cache block(s) on SDRAM reads and to combine block writes to the same SDRAM row. The results quantify the SDRAM energy consumption of MiBench applications and demonstrate significant savings in SDRAM energy consumption, 23%, on average, and reduction in the energy-delay product, 44%, on average. The approach also improves performance: the CPI is reduced by 26%, on average.
 
Article
Memory accesses represent a major bottleneck in embedded systems power and performance. Traditionally, designers tried to alleviate this problem by relying on a simple cache hierarchy, or a limited use of special purpose memory modules such as stream buffers. Although real-life applications contain a large number of memory references to a diverse set of data structures, a significant percentage of all memory accesses in the application are generated from a few memory instructions that exhibit predictable, well-known access patterns; this creates an opportunity for memory customization, targeting the needs of these access patterns. We present APEX, an approach that extracts, analyzes and clusters the most active access patterns in the application, and aggressively customizes the memory architecture to match the needs of the application. Moreover, though the memory modules are important, the rate at which the memory system can produce the data for the CPU is significantly impacted by the connectivity architecture between the memory subsystem and the CPU. Thus, it is critical to consider the connectivity architecture early in the design flow, in conjunction with the memory architecture. We couple the exploration of memory modules together with their connectivity, to evaluate a wide range of cost, performance, and energy connectivity architectures. We use a heuristic to prune the design space, guiding the exploration towards the most promising designs. We present experiments on a set of large real-life benchmarks, showing significant performance improvements for varied cost and power characteristics, allowing the designer to evaluate customized memory and connectivity configurations for embedded systems.
 
Article
The increasing complexity of embedded systems requires modeling at higher levels of abstraction. Transaction level modeling (TLM) has been proposed to abstract communication for high-speed system simulation and rapid design space exploration. Although being widely accepted for its high performance and efficiency, TLM often exhibits a significant loss in model accuracy. In this article, we systematically analyze and quantify the speed/accuracy trade-off in TLM. To this end, we provide a classification of TLM abstraction levels based on model granularity and define appropriate metrics and test setups to quantitatively measure and compare the performance and accuracy of such models. Addressing several classes of embedded communication protocols, we apply our analysis to three common bus architectures, the industry-standard AMBA advanced high-performance bus (AHB) as an on-chip parallel bus, the controller area network (CAN) as an off-chip serial bus, and the Motorola ColdFire Master Bus as an example for a custom embedded processor bus. Based on the analysis of these individual busses, we then generalize our results for a broader conclusion. The general TLM trade-off offers gains of up to four orders of magnitude in simulation speed, generally however, at the price of low accuracy. We conclude further that model granularity is the key to efficient TLM abstraction, and we identify conditions for accuracy of abstract models. As a result, this article provides general guidelines that allow the system designer to navigate the TLM trade-off effectively and choose the most suitable model for the given application with fast and accurate results.
 
Article
This paper proposes and evaluates compile-time and instruction-set techniques for improving the accuracy of signal-processing algorithms run on fixed-point embedded processors. These techniques are proposed in the context of a profile-guided floating- to fixed-point compiler-based conversion process. A novel fixed-point scaling algorithm (IRP) is introduced that exploits correlations between values in a program by applying fixed-point scaling, retaining as much precision as possible without causing overflow. This approach is extended into a more aggressive scaling algorithm (IRP-SA) by leveraging the modulo nature of 2's complement addition and subtraction to discard most significant bits that may not be redundant sign-extension bits. A complementary scaling technique (IDS) is then proposed that enables the fixed-point scaling of a variable to be parameterized, depending upon the context of its definitions and uses. Finally, a novel instruction-set enhancement, fractional multiplication with internal left shift (FMLS), is proposed to further leverage interoperand correlations uncovered by the IRP-SA scaling algorithm. FMLS preserves a different subset of the full product's bits than traditional fractional fixed-point or integer multiplication. On average, FMLS combined with IRP-SA improves accuracy on processors with uniform bitwidth register architectures by the equivalent of 0.61 bits of additional precision for a set of signal-processing benchmarks (up to 2 bits). Even without employing FMLS, the IRP-SA scaling algorithm achieves additional accuracy over two previous fixed-point scaling algorithms by averages of 1.71 and 0.49 bits. Furthermore, as FMLS combines multiplication with a scaling shift, it reduces execution time by an average of 9.8%. An implementation of IDS, specialized to single-nested loops, is found to improve the accuracy of a lattice filter benchmark by the equivalent of more than 16 bits of precision.
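For contrast with the paper's correlation-aware algorithms, the baseline they improve on is plain range-based scaling: give a signal just enough integer bits to cover its measured range and keep the rest as fraction bits. A sketch, with hypothetical names:

import math

def qformat(max_abs, word=16):
    # integer bits to avoid overflow; one bit reserved for the sign
    int_bits = max(0, math.ceil(math.log2(max_abs)))
    return int_bits, word - 1 - int_bits

def to_fixed(x, frac_bits):
    return round(x * (1 << frac_bits))    # scale, then quantize

int_bits, frac_bits = qformat(2.7)        # range (-2.7, 2.7) -> Q2.13
print(int_bits, frac_bits, to_fixed(1.375, frac_bits))   # 2 13 11264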
 
Article
Accurate and fast system modeling is central to the rapid design space exploration needed for embedded-system design. With fast, complex SoCs playing a central role in such systems, system designers have come to require MIPS-range simulation speeds and near-cycle accuracy. The sophisticated simulation frameworks that have been developed for high-speed system performance modeling do not address power consumption, although it is a key design constraint. In this paper, we define a simulation-based methodology for extending system performance modeling frameworks to also include power modeling. We demonstrate the use of this methodology with a case study of a real, complex embedded system, comprising the Intel XScale embedded microprocessor, its WMMX SIMD coprocessor, L1 caches, SDRAM, and the on-board address and data buses. We describe detailed power models for each of these components and validate them against physical measurements from hardware, demonstrating that such frameworks enable designers to model both power and performance at high speeds without sacrificing accuracy. Our results indicate that the power estimates obtained are accurate within 5% of physical measurements from hardware, while simulation speeds consistently exceed a million instructions per second (MIPS).
 
Article
Applications in the signal processing domain are often modeled by dataflow graphs. Due to heterogeneous complexity requirements, these graphs contain both dynamic and static dataflow actors. In previous work, we presented a generalized clustering approach for these heterogeneous dataflow graphs in the presence of unbounded buffers. This clustering approach allows the application of static scheduling methodologies for static parts of an application during embedded software generation for multiprocessor systems. It systematically exploits the predictability and efficiency of the static dataflow model to obtain latency and throughput improvements. In this article, we present a generalization of this clustering technique to dataflow graphs with bounded buffers, therefore enabling synthesis for embedded systems without dynamic memory allocation. Furthermore, a case study is given to demonstrate the performance benefits of the approach.
 
Article
SystemC is a system-level modeling language that can be used effectively for hardware/software co-design. Since a major goal of SystemC is to enable verification at higher levels of abstraction, attention is now turning to formal verification approaches for SystemC. In this article, we propose an approach for the formal verification of SystemC designs, and for this purpose we provide the semantics of SystemC using Labeled Transition Systems (LTS). An actor-based language, Rebeca, is used as an intermediate language. SystemC designs are mapped to Rebeca models and then the Rebeca verification toolset is used to verify LTL and CTL properties. To tackle state-space explosion, the Rebeca model checkers offer some reduction policies that make them appropriate for SystemC verification. The approach also benefits from modular verification and program slicing techniques applied on Rebeca models. To show the applicability of our approach, we verified a single-cycle MIPS design and two hardware/software co-designs. The results show that our approach can effectively be used both in hardware and hardware/software co-verification.
 
A simple example of the use of classes in Ptolemy II.
Derived relation as in Figure 6, but with the edges labeled with the distance function s.
Article
Actor-oriented components emphasize concurrency and temporal semantics and are used for modeling and designing embedded software and hardware. Actors interact with one another through ports via a messaging schema that can follow any of several concurrent semantics. Domain-specific actor-oriented languages and frameworks are common (Simulink, LabVIEW, SystemC, etc.). However, they lack many modularity and abstraction mechanisms that programmers have become accustomed to in object-oriented components, such as classes, inheritance, interfaces, and polymorphism, except as inherited from the host language. This paper shows a form that such mechanisms can take in actor-oriented components, gives a formal structure, and describes a prototype implementation. The mechanisms support actor-oriented class definitions, subclassing, inheritance, and overriding. The formal structure imposes structural constraints on a model (mainly the "derivation invariant") that lead to a policy to govern inheritance. In particular, the structural constraints permit a disciplined form of multiple inheritance with unambiguous inheritance and overriding behavior. The policy is based formally on a generalized ultrametric space with some remarkable properties. In this space, inheritance is favored when actors are "closer" (in the generalized ultrametric), and we show that when inheritance can occur from multiple sources, one source is always unambiguously closer than the other.
 
Article
We consider concurrent models of computation where “actors” (components that are in charge of their own actions) communicate by exchanging messages. The interfaces of actors principally consist of “ports,” which mediate the exchange of messages. Actor-oriented architectures contrast with and complement object-oriented models by emphasizing the exchange of data between concurrent components rather than transformation of state. Examples of such models of computation include the classical actor model, synchronous languages, data-flow models, process networks, and discrete-event models. Many experimental and production languages used to design embedded systems are actor oriented and based on one of these models of computation. Many of these models of computation benefit considerably from having access to causality information about the components. This paper augments the interfaces of such components to include such causality information. It shows how this causality information can be algebraically composed so that compositions of components acquire causality interfaces that are inferred from their components and the interconnections. We illustrate the use of these causality interfaces to statically analyze timed models and synchronous language compositions for causality loops, and data-flow models for deadlock. We also show that causality analysis for each communication cycle can be performed independently and in parallel, and that it is only necessary to analyze one port for each cycle. Finally, we give a conservative approximation technique for handling dynamically changing causality properties.
 
Article
Reconfiguration delay is one of the major barriers in the way of dynamically adapting a system to its application requirements. The run-time reconfiguration delay is quite comparable to the application latency for many classes of applications and might ...
 
Article
Safety-critical embedded systems often operate in harsh environmental conditions that necessitate fault-tolerant computing techniques. In addition, many safety-critical systems execute real-time applications that require strict adherence to task deadlines. These embedded systems are also energy-constrained, since system lifetime is determined largely by the battery lifetime. In this paper, we investigate dynamic adaptation techniques based on checkpointing and dynamic voltage scaling (DVS) for fault tolerance and power management. We first present schedulability tests that provide the criteria under which checkpointing can provide fault tolerance and real-time guarantees. We then present an adaptive checkpointing scheme in which the checkpointing interval for a task is dynamically adjusted during execution, and checkpoints are inserted based not only on the available slack, but also on the occurrences of faults. Next, we combine adaptive checkpointing with DVS to achieve power reduction. Finally, we develop an adaptive checkpointing scheme for a set of multiple tasks in real-time systems. An offline preprocessing based on linear programming is used to determine the parameters that are provided as inputs to the online adaptive checkpointing procedure. Simulation results show that compared to previous methods, the proposed adaptive checkpointing approach increases the likelihood of timely task completion in the presence of faults. When combined with DVS, adaptive checkpointing also leads to considerable energy savings.
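As a point of reference for the adaptive scheme, the classical static rule (Young's first-order approximation) spaces checkpoints by sqrt(2 * C * MTBF); the paper's contribution is precisely to move away from such fixed intervals and re-space checkpoints online using available slack and observed faults. A sketch of the static baseline, with illustrative numbers:

import math

def young_interval(checkpoint_cost, mtbf):
    # first-order optimum trading checkpoint overhead against re-execution
    return math.sqrt(2 * checkpoint_cost * mtbf)

# 50 ms per checkpoint, one fault expected per 100 s -> checkpoint every ~3.2 s
print(young_interval(0.05, 100.0))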
 
Article
Adaptation is increasingly used in the development of safety-critical embedded systems, in particular to reduce hardware needs and to increase availability. However, composing a system from many reconfigurable components can lead to a huge number of possible system configurations, inducing a complexity that cannot be handled during system design. To overcome this problem, we propose a new component-based modeling and verification method for adaptive embedded systems. The component-based modeling approach facilitates abstracting a composition of components to a hierarchical component. In the hierarchical component, the number of possible configurations of the composition is reduced to a small number of hierarchical configurations. Only these hierarchical configurations have to be considered when the hierarchical component is used in further compositions such that design complexity is reduced at each hierarchical level. In order to ensure well-definedness of components, we provide a model of computation enabling the formal verification of critical requirements of the adaptation behavior.
 
VMAC Abstraction
Packet Transmission in VMAC
Water Mark and Admission Decision
Article
As wireless devices and sensors are increasingly deployed on people, researchers have begun to focus on wireless body-area networks. Applications of wireless body sensor networks include healthcare, entertainment, and personal assistance, in which sensors collect physiological and activity data from people and their environments. In these body sensor networks, quality of service is needed to provide reliable data communication over prioritized data streams. This article proposes BodyQoS, the first running QoS system demonstrated on an emulated body sensor network. BodyQoS adopts an asymmetric architecture, in which most processing is done on a resource-rich aggregator, minimizing the load on resource-limited sensor nodes. A virtual MAC is developed in BodyQoS to make it radio-agnostic, allowing a BodyQoS to schedule wireless resources without knowing the implementation details of the underlying MAC protocols. Another unique property of BodyQoS is its ability to provide adaptive resource scheduling. When the effective bandwidth of the channel degrades due to RF interference or body fading effect, BodyQoS adaptively schedules remaining bandwidth to meet QoS requirements. We have implemented BodyQoS in NesC on top of TinyOS, and evaluated its performance on MicaZ devices. Our system performance study shows that BodyQoS delivers significantly improved performance over conventional solutions in combating channel impairment.
 
The DASAT cache organization.
Spec95 benchmarks: average memory access time of the conventional caches and the DASAT cache.
Article
This article presents the design of a simple hardware-controlled, high performance cache system. The design supports fast access time, optimal utilization of temporal and spatial localities adaptive to given applications, and a simple dynamic fetching mechanism with different fetch sizes. Support for dynamically varying the fetch size makes the cache equally effective for general-purpose as well as multimedia applications. Our cache organization and operational mechanism are especially designed to maximize temporal locality and spatial locality, selectively and adaptively. Simulation shows that the average memory access time of the proposed cache is equal to that of a conventional direct-mapped cache with eight times as much space. In addition, the simulations show that our cache achieves better performance than a 2-way or 4-way set associative cache with twice as much space. The average miss ratio, compared with the victim cache with 32-byte block size, is improved by about 41% or 60% for general applications and multimedia applications, respectively. It is also shown that power consumption of the proposed cache is around 10% to 60% lower than other cache systems that we examine. Our cache system thus offers high performance with low power consumption and low hardware cost.
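The headline comparison above uses the standard average-memory-access-time metric, AMAT = hit time + miss rate x miss penalty. In code, with illustrative numbers:

def amat(hit_time, miss_rate, miss_penalty):
    # average memory access time, in cycles
    return hit_time + miss_rate * miss_penalty

# 1-cycle hit, 5% miss rate, 40-cycle miss penalty -> 3.0 cycles on average
print(amat(1, 0.05, 40))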
 
Article
This article presents our methodology for implementing self-adaptiveness within an OS-based, reconfigurable embedded system according to objectives such as quality of service, performance, or power consumption. We detail our approach to separating application-specific decisions from hardware/software-implementation decisions at the system level. The former are related to the efficiency control of applications and are based on the knowledge of application engineers. The latter are generic and address the choice between various hardware and software implementations according to user objectives. The decision management is implemented as an adaptive closed-loop model. We describe how each design step may be implemented and especially how we solved the issue of stability. Finally, we present a video-tracking application implemented on an FPGA to demonstrate the effectiveness of our solution; results are given for a system built around a NIOS soft-core with the μCOS II RTOS and new services for managing hardware and software tasks transparently.
 
Article
In this paper, we propose a novel scheduling framework for a dynamic real-time environment with energy constraints. This framework dynamically adjusts the CPU voltage/frequency so that no task in the system misses its deadline and the total energy savings of the system are maximized. In this paper, we consider only realistic, discrete-level speeds. Each task in the system consumes a certain amount of energy, which depends on the speed chosen for execution. The process of selecting speeds for execution while maximizing the energy savings of the system requires the exploration of a large number of combinations, which is too time-consuming to be computed online. Thus, we propose an integrated heuristic methodology, which executes an optimization procedure in a low computation time. This scheme allows the scheduler to handle power-aware real-time tasks with low cost while maximizing the use of the available resources and without jeopardizing the temporal constraints of the system. Simulation results show that our heuristic methodology is able to generate power-aware scheduling solutions with near-optimal performance.
 
Article
Adaptive filters are widely used in many applications of digital signal processing; digital communications and digital video broadcasting are just two examples. Traditionally, small embedded systems have employed the least computationally intensive adaptive filter algorithms, such as normalized least mean squares (NLMS). This article shows that FPGA devices are a highly suitable platform for more computationally intensive adaptive algorithms. We present an optimized core which implements GSFAP. GSFAP is an algorithm with far superior adaptation properties to NLMS, and with only slightly higher computational complexity. To further optimize resource requirements we use logarithmic arithmetic, rather than conventional floating point, within the custom core. Our design makes effective use of the pipelined logarithmic addition units, and takes advantage of the very low cost of logarithmic multiplication and division. The resulting GSFAP core can be clocked at more than 80 MHz on a one-million-gate Xilinx XC2V1000-4 device. The core can be used to implement adaptive filters of orders 20 to 1000 performing echo cancellation on speech signals at sampling rates exceeding 50 kHz. For comparison, we implemented a similar NLMS core and found that although it is slightly smaller than the GSFAP core and allows a higher signal sampling rate for the corresponding filter orders, the GSFAP core has adaptation properties that are much superior to NLMS, and our core can provide very sophisticated adaptive filtering capabilities for resource-constrained embedded systems.
 
Conference Paper
Loop caches provide an effective method for decreasing memory hierarchy energy consumption by storing frequently executed code in a more energy efficient structure than the level one cache. However, due to code structure restrictions and/or costly design time pre-analysis efforts, previous loop cache designs are not suitable for all applications and system scenarios. In this paper, we present an adaptive loop cache that is amenable to a wide range of system scenarios, providing an additional 20% average instruction memory hierarchy energy savings (with individual benchmark energy savings as high as 69%) compared to the best previous loop cache design.
 
Performing address register assignment followed by simple offset assignment generates memory sub-layouts that must be placed in memory. The problem of finding a placement that minimizes overhead is called the memory-layout permutation problem.  
Article
In digital signal processors (DSPs), variables are accessed using k address registers. The problem of finding a memory layout, for a set of variables, that minimizes the address-computation overhead is known as the General Offset Assignment (GOA) problem. The most common approach to this problem is to partition the set of variables into k partitions and to assign each partition to an address register, thus effectively decomposing the GOA problem into several Simple Offset Assignment (SOA) problems. Many heuristic-based algorithms have been proposed in the literature to approximate solutions to both the variable partitioning and the SOA problems. However, the address-computation overheads of the resulting memory layouts are not accurately evaluated. This article presents an evaluation of memory layouts that uses Gebotys' optimal address-code generation technique. The use of this evaluation method leads to a new optimization problem: the Memory Layout Permutation (MLP) problem. We then use Gebotys' technique and an exhaustive solution to the MLP problem to evaluate heuristic-based offset-assignment algorithms. The memory layouts produced by each algorithm are compared against each other and against the optimal layouts. The results show that even for small access sequences with 12 variables or fewer, current heuristics may produce memory layouts with address-computation overheads up to two times higher than the overhead of an optimal layout.
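The cost being minimized can be made concrete with a toy evaluator: under simple offset assignment, consecutive accesses to variables laid out more than one word apart force an explicit address computation, while neighbors are reached with free post-increment/decrement. A sketch (the layout and access sequence are made up):

def soa_cost(layout, accesses):
    # free post-increment/decrement covers neighbors; anything farther costs 1
    pos = {v: i for i, v in enumerate(layout)}
    return sum(1 for a, b in zip(accesses, accesses[1:])
               if abs(pos[a] - pos[b]) > 1)

print(soa_cost(["a", "b", "c", "d"], ["a", "b", "a", "c", "d", "b"]))   # 2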
 
Top-cited authors
Reinhard Wilhelm
  • Universität des Saarlandes
Guillem Bernat
  • Independent Researcher
Frank Mueller
  • North Carolina State University
Jakob Engblom
  • Intel Sweden
Isabelle Puaut
  • IRISA - Institut de Recherche en Informatique et Systèmes Aléatoires