Journal of Systems Architecture

Published by Elsevier
Print ISSN: 1383-7621
We address the problem of developing distributed simulation techniques to analyze Extended Concurrent Algebraic Term Nets (ECATNets). ECATNets are a kind of High-Level Algebraic Nets used for specifying various aspects of distributed and parallel systems. The conservative and optimistic approaches of Distributed Discrete Event Simulation (DDES) are used as the starting point to develop a simulation framework for studying their behaviour. The ECATNet model to be simulated is partitioned into several connected subnets simulated in parallel by several Logical Processes (LPs).
In this paper, we suggest an LRU/distance-aware combined second-level (L2) cache for scalable CC-NUMA multiprocessors, which is composed of a traditional LRU cache and an additional cache maintaining the distance information of individual cache blocks. The LRU cache selects a victim using age information, while the distance-aware cache does this using distance information. Both work together to reduce the overall distance effectively upon cache misses by keeping long-distance blocks as well as recently used blocks. It has been observed that the proposed cache outperforms the traditional LRU cache by up to 28% in the execution time. It is also found to perform even better than an LRU cache of twice the size.
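The victim-selection idea behind the combined cache can be sketched as follows. The split into two portions, the promotion/demotion policy, and all names are illustrative assumptions, not the paper's exact design:

```python
class CombinedSet:
    """Toy model of one set of the combined L2 cache: an LRU-managed
    portion plus a distance-aware portion. On overflow, the LRU side
    evicts its oldest block into the distance side, which in turn evicts
    the block whose home node is *nearest* (cheapest to refetch), so
    long-distance blocks survive longer. Illustrative sketch only."""

    def __init__(self, lru_ways, dist_ways):
        self.lru = []    # (tag, distance), most recently used last
        self.dist = []   # (tag, distance)
        self.lru_ways, self.dist_ways = lru_ways, dist_ways

    def access(self, tag, distance):
        """Return True on a hit; on a miss, insert and evict as needed."""
        for i, (t, d) in enumerate(self.lru):
            if t == tag:
                self.lru.append(self.lru.pop(i))   # refresh LRU age
                return True
        for i, (t, d) in enumerate(self.dist):
            if t == tag:
                self.dist.pop(i)
                self._insert(t, d)                 # promote reused block
                return True
        self._insert(tag, distance)
        return False

    def _insert(self, tag, distance):
        self.lru.append((tag, distance))
        if len(self.lru) > self.lru_ways:
            self.dist.append(self.lru.pop(0))      # demote oldest block
            if len(self.dist) > self.dist_ways:
                nearest = min(range(len(self.dist)),
                              key=lambda i: self.dist[i][1])
                self.dist.pop(nearest)             # keep far blocks
```

Under this policy a remote block evicted from the LRU side can outlive blocks fetched from nearby nodes, which is the intended bias toward reducing overall distance.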
In many applications in VLSI CAD, a given netlist has to be partitioned into smaller sub-designs that can be handled much more easily. In this paper we present a new recursive bi-partitioning algorithm that is especially applicable when a large number of final partitions, e.g. more than 1000, has to be computed. The algorithm consists of two steps. Based on recursive splits, the problem is divided into several sub-problems, with more run time invested as the recursion depth increases. In this way an initial solution is determined very quickly. The core of the method is a second step, in which a very powerful greedy algorithm is applied to refine the partitions. Experimental results are given that compare the new approach to state-of-the-art tools. The experiments show that the new approach outperforms the standard techniques with respect to both run time and quality. Furthermore, the memory usage is very low, reduced by more than a factor of four in comparison to other methods.
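The two-step structure can be sketched on an undirected netlist graph (adjacency dict). The bisection heuristic, the refinement rule, and the omission of balance constraints are all simplifications assumed for illustration:

```python
def cut_size(adj, part):
    """Number of graph edges whose endpoints lie in different parts."""
    return sum(1 for u in adj for v in adj[u] if u < v and part[u] != part[v])

def recursive_bisect(nodes, depth):
    """Step 1: split the node list in half `depth` times -> 2**depth parts.
    (A real tool would split along the netlist structure, not list order.)"""
    if depth == 0:
        return {n: 0 for n in nodes}
    half = len(nodes) // 2
    left = recursive_bisect(nodes[:half], depth - 1)
    right = recursive_bisect(nodes[half:], depth - 1)
    shift = 2 ** (depth - 1)
    return {**left, **{n: p + shift for n, p in right.items()}}

def greedy_refine(adj, part):
    """Step 2: repeatedly move each node to the part holding most of its
    neighbours while that reduces the cut (balance constraints omitted)."""
    improved = True
    while improved:
        improved = False
        for u in adj:
            counts = {}
            for v in adj[u]:
                counts[part[v]] = counts.get(part[v], 0) + 1
            best = max(counts, key=counts.get)
            if counts.get(best, 0) > counts.get(part[u], 0):
                part[u] = best
                improved = True
    return part
```

On a graph of two triangles split badly by the initial bisection, the greedy pass recovers the natural two-way partition with zero cut.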
We describe V-SAT, a tool for performing design space exploration of System-On-Chip (SOC) architectures. The key components of V-SAT include EXPRESSION, a language for specification of the architecture, SIMPRESS, a simulator generator for analysis/evaluation of the architecture, and the V-SAT GUI front-end for easy specification and detailed analysis. We give a brief overview of the components (EXPRESSION, SIMPRESS and the GUI) and, using an example DLX architecture, demonstrate V-SAT's usefulness in exploration for an embedded SOC codesign flow by specifying and evaluating several modifications to the pipeline structure of the processor. We believe that V-SAT provides a powerful environment both for early design space exploration and for the detailed design of SOC architectures.
Unicast algorithms in off-line routing have been used for one-to-one communication between a source node and a destination node in an n-dimensional hypercube, denoted H<sub>n</sub>. These algorithms cannot be used for distributed routing, since each node knows the status of its neighbors only, whereas in off-line routing each node knows the status of all nodes. In this paper, we propose a new method for distributed routing. It can also be used for off-line routing, where it avoids the cost of collecting global information about the faulty nodes as well as information about whether H<sub>n</sub> is k-safe or not. The minimum requirement of the proposed method is that the path between the source and the destination be connected; hence, it may work even when H<sub>n</sub> is disconnected.
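A purely local routing decision of the kind the abstract describes — each node seeing only its own neighbors' status — can be sketched for one hop in H<sub>n</sub>. This is a generic greedy step, not the paper's method, which adds machinery to guarantee progress:

```python
def route_step(current, dest, faulty, n):
    """One hop of a local routing decision in H_n (nodes are n-bit
    integers, neighbours differ in one bit): prefer a fault-free
    neighbour that corrects a differing address bit, else deroute
    through any fault-free neighbour. Livelock avoidance is omitted."""
    diff = current ^ dest
    for b in range(n):
        if diff >> b & 1:
            nb = current ^ (1 << b)
            if nb not in faulty:
                return nb              # productive hop toward dest
    for b in range(n):
        nb = current ^ (1 << b)
        if nb not in faulty:
            return nb                  # deroute around faults
    return None                        # all neighbours faulty
```

Note that the decision uses only the status of the current node's neighbors, matching the distributed-knowledge assumption.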
Many time-critical applications require predictable performance. Tasks corresponding to these applications have deadlines to be met despite the presence of faults. Failures can happen either due to processor faults or due to task errors. To tolerate both processor and task failures, the copies of every task have to be mutually excluded in space and also in time in the schedule. We assume that each task has two versions, namely a primary copy and a backup copy. We believe that the position of the backup copy in the task queue relative to the position of the primary copy (the distance) is a crucial parameter affecting the performance of any fault-tolerant dynamic scheduling algorithm. To study its effect, we make fault-tolerant extensions to the well-known myopic scheduling algorithm, a dynamic scheduling algorithm capable of handling resource constraints among tasks. We have conducted extensive simulations to study the effect of the distance parameter on the schedulability of the fault-tolerant myopic scheduling algorithm.
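The mutual-exclusion requirement on the two copies can be stated as a small predicate. This checks only the resulting placement of a primary/backup pair; the distance parameter in the paper controls how far apart the copies end up in the queue, which is not modelled here:

```python
def backup_feasible(p_proc, p_start, p_wcet, b_proc, b_start, b_wcet, deadline):
    """Mutual exclusion of the two task copies: in space (different
    processors, tolerating a processor fault) and in time (the backup
    starts only after the primary would have finished, tolerating a
    task error), while the backup still meets the deadline.
    All parameter names are illustrative assumptions."""
    return (b_proc != p_proc
            and b_start >= p_start + p_wcet
            and b_start + b_wcet <= deadline)
```

For example, a primary on P1 over [0, 3) with a backup on P2 over [3, 7) meets a deadline of 10, while placing the backup on P1, starting it at 2, or starting it at 7 violates one of the three conditions.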
This paper proposes a new analytical approach for evaluating the effects of a repair process on the defect level of multichip module (MCM) systems at assembly. Repair of MCMs is usually required to improve the yield and quality of these systems, while preserving cost effectiveness. In the proposed approach, we develop a novel quality model, which is solved analytically in O(rN<sup>3</sup>) (where r is the maximum number of allowed repair cycles and N is the number of chips in the MCM). The proposed model is based on a previously proposed quality model for MCMs which did not incorporate the effect of a repair process on the defect level. The proposed model relates the defect level to various figures of merit of repair, such as the probability of successfully repairing a fault (referred to as repairability), the probability of damaging the system and the maximum allowed number of repair cycles. Parametric results show that due to the repair process the overall defect level decreases as the MCM yield increases; however, there exists a bound on the number of repair cycles that permits an increase in repairability. Using these results, it is possible to predict a more accurate value of the defect level of MCMs by taking into account the different parameters affecting the repair process, while realistically reducing the defect level of the final MCM product.
This paper presents an improved register-transfer level functional partitioning approach for testability. Building on earlier work (Kuchcinski and Peng, 1994), the proposed method first identifies the hard-to-test points based on data path testability and control state reachability. These points will be made directly accessible by DFT techniques. Then the actual partitioning procedure is performed by a quantitative clustering algorithm which clusters directly interconnected components based on a newly proposed global testability of the data path and global state reachability of the control part. After each clustering step, we use a new estimation method, based partially on explicit re-calculation and partially on gradient techniques, for incremental testability and state reachability analysis to update the test properties of the circuit. This process is iterated until the design is partitioned into several disjoint subcircuits, each of which can be tested independently. The control part is then modified to control the circuit in normal and test mode accordingly. Test quality is therefore improved by independent test generation and application for every partition, combining the effect of the data path with the control part. Experimental results show the advantages of the proposed algorithm compared with other conventional approaches.
Portable products are being used increasingly. Because these systems are battery powered, reducing power consumption is vital. In this report we describe the properties of low-power design and techniques to exploit them at the architecture level of the system. We focus on minimizing capacitance, avoiding unnecessary and wasteful activity, and reducing voltage and frequency. We review energy reduction techniques in the architecture and design of a hand-held computer and its wireless communication system, including error control, system decomposition, communication and MAC protocols, and low-power short-range networks.
A wide variety of in-vehicle devices such as camera sensors, navigation systems, telematics and communication equipment have been incorporated into vehicles to realize Intelligent Transport Systems (ITS) applications. Because an efficient standardized network is required, the ITS Data Bus (IDB) has been discussed as a means to carry high-speed multimedia data for audio, video and other real-time ITS applications. For connecting devices in a standardized manner, the IDB network has an architecture with a gateway, called the vehicle interface, located between the automaker's proprietary network and the standardized IDB network. IEEE 1394 (also known as iLink or FireWire), which can transport multimedia data for consumer electronics, is a good candidate for the IDB network. In this paper, we analyze the issues that arise when the existing AV/C protocol (an application layer protocol over IEEE 1394) is used to form the IDB network. In addition, we designed and implemented the vehicle interface protocol as a higher layer of IEEE 1394 to address the AV/C protocol issues and realize the whole IDB network architecture.
This paper presents an overview of a proposed protocol in a real time operating system to provide application-transparent fault tolerance services. Fault tolerance is achieved by saving consistent states (checkpoints) of the processes belonging to a real time application. Most existing checkpoint protocols are not suitable for real time environments due to the uncontrolled delay introduced when a checkpoint is saved. Our approach extends the real time operating system services to save a checkpoint when the user performs a system call. This scheme allows the real time application designer to know the temporal specifications of every system call (checkpointing included). A real time application is composed of several real time processes that share data using the interprocess communication facilities provided by the operating system: message queues, shared memory, signals, etc. Because of these communications, the operating system has to ensure the consistency of checkpoints. Current real time checkpointing schemes use global checkpointing: when a process saves a checkpoint, all the processes in the system save one. The proposed solution is to track process interactions (dependencies) in order to reduce the number of processes involved in a checkpoint.
We describe a simulation environment that allows us to simulate the standard peripheral component interconnect (PCI) bus protocol, as well as modified PCI protocols. While there are standard benchmarks (such as the SPEC [IEEE Comput. 33 (7) (2000) 28] benchmarks) available for processor simulation, database system simulation, and now even for simulating embedded systems (from the EDN Embedded Microprocessor Benchmark Consortium, EEMBC), there are no standard benchmarks for simulating computer buses in general and, specifically, for simulating the PCI bus. To address this problem we describe a methodology for gathering information about PCI traffic from a real system, and for using this information to generate PCI cycles that drive the simulator for both standard and modified PCI protocols. Finally, we use the simulation environment to run experiments with various parameters of the standard PCI protocols, and with an extension that involves transferring a hint about the expected latency on the data bus at the time the target ends the current burst transaction.
The latest image compression standard, JPEG 2000, is well tuned for diverse applications, which places varying throughput demands on its building blocks. A JPEG 2000 encoder with scalability is therefore favorable for its ability to meet different throughput requirements. On the other hand, the large data streams underline the importance of bandwidth optimization in designing the encoder. The initial specification, especially its loop organization and array indices, describes the data manipulations and consequently influences the outcome of the architecture implementation. There is therefore a clear need for exploration support, and we believe the emphasis should lie on loop-level steering. In this paper, we apply loop transformation techniques to a scalable embedded JPEG 2000 encoder design during the architectural exploration stage, considering not only the balance of throughput among different blocks, but also the reduction of data transfer. The architecture is prototyped on a Xilinx FPGA.
This paper describes how the new Ada 2005 timing event and execution time control features were implemented for the GNAT bare-board Ravenscar run-time environment on the Atmel AVR32 architecture. High accuracy for execution time measurement was achieved by accounting for the effects of interrupts and executing entries by proxy. The implementation of timing events was streamlined by using a single alarm mechanism both for timing events and waking up tasks. Test results on the overhead and accuracy of the implemented features are presented. While the implementation is for the AVR32, it may serve as a blueprint for implementations on other architectures. It is also discussed how the presented design could be transferred to other systems such as C/POSIX and RTSJ.
In this paper, we propose a high performance elliptic curve cryptographic processor over GF(2<sup>163</sup>), one of the five binary fields recommended by the National Institute of Standards and Technology (NIST) for the Elliptic Curve Digital Signature Algorithm (ECDSA). The proposed architecture is based on the López–Dahab elliptic curve point multiplication algorithm and uses Gaussian normal basis for GF(2<sup>163</sup>) field arithmetic. To achieve high throughput rates, we design two new word-level arithmetic units over GF(2<sup>163</sup>) and derive parallelized elliptic curve point doubling and point addition algorithms with uniform addressing based on the López–Dahab method. We implement our design using a Xilinx XC4VLX80 FPGA device, where it uses 24,263 slices and has a maximum frequency of 143 MHz. Our design is roughly 4.8 times faster, with twice the hardware complexity, compared with the previous hardware implementation proposed by Shu et al. Therefore, the proposed elliptic curve cryptographic processor is well suited to elliptic curve cryptosystems requiring high throughput rates, such as network processors and web servers.
The torus network has become increasingly important to multicomputer design because of its many features, including scalability, high bandwidth and a fixed node degree. Multicast communication is a significant operation in multicomputer systems and can be used to support several other collective communication operations. This paper presents an efficient algorithm, TTPM, to find a deadlock-free multicast wormhole routing in two-dimensional torus parallel machines. The introduced algorithm is designed such that messages can be sent to any number of destinations within two start-up communication phases; hence the name Torus Two Phase Multicast (TTPM) algorithm. An efficient routing function is developed and used as a basis for the introduced algorithm. Also, TTPM allows some intermediate nodes that are not in the destination set to perform multicast functions. This feature allows flexibility in multicast path selection and therefore improves performance. Performance results of a simulation study on torus networks are discussed to compare the TTPM algorithm with a previous algorithm.
In this paper, we address the problem of mapping the divide-and-conquer paradigm onto 3D mesh and torus interconnection networks. Since the binary tree is not an efficient computation structure, we select the binomial tree as the computation structure. We propose an algorithm for divide and conquer on 3D meshes/tori. We then give the dilation of this algorithm for any 3D mesh whose size is a power of 2; the congestion of this embedding is 1, since each binomial tree consists of two edge-disjoint binomial trees B(n − 1). The communication times of the proposed algorithm under store-and-forward routing are evaluated with respect to specific values of the message ratio α. The results for wormhole routing are better than those for store-and-forward routing because of the non-unit dilation of the embedding. The efficiency of the proposed algorithm is also investigated. If the sequential algorithm has a complexity (number of computations) that is quadratic in the size of the data, then the proposed algorithm is cost-optimal when wormhole routing is used. Under store-and-forward routing, the number of computations of the sequential algorithm does not determine whether the proposed algorithm is cost-optimal: the communication time is dominant, and the computation time has less effect than the communication time.
The Austrian Research Centre Seibersdorf and its IT Department have been involved in the development of critical computer systems and in standardization in this field for many years (SAFECOMP '89, '90, '91, '93, IEC SC 65A WG9 and WG10, IEC TC 56, partners in the European initiative ESPITI and the networks ENCRESS and OLOS). The certification process for ISO 9001 started with a pre-audit in December 1993, and the certificate was successfully achieved at the end of June 1994. ISO 9000-3 (somewhat more process-related than ISO 9001) and the ESA Software Engineering Standards (lifecycle model, process models) were the key input to the Quality Management (QM) System of the IT Department. Additionally, the Department of Information Technology successfully applied for a BOOTSTRAP license early in 1994. Four members of the staff of the IT Department are currently qualified as external BOOTSTRAP assessors. In preparation for ISO 9000 certification and during BOOTSTRAP training we learnt much about organizations, process improvement and project management, especially by critically reviewing our own processes, as well as by reviewing the impact and relevance of the schemes to be followed when ISO 9000 certification or BOOTSTRAP licensing is the goal. Direct as well as indirect business benefits were achieved.
This paper presents practical experience in planning, implementing, and adopting a Quality Management System (QMS) for embedded systems development at VTT Electronics. The main objective for the development of the QMS was to make it practical for real-life embedded systems research and development (R&D) projects. We have applied the ISO 9001 standard and ISO 9000-3 guidelines in building the quality system. From our personnel's point of view, the most important parts of the system have been document skeletons and plan templates that were made accessible to everyone. One of the major quality improvements resulting from the use of the QMS has been the enhanced predictability of R&D projects. This makes it possible for the organization to concentrate on essential matters. From our clients' point of view, we have clearly improved the quality of our R&D services in terms of the customer satisfaction index. Based on QMS auditing activities, we have been able to evaluate and tune our R&D procedures in a systematic manner. We have decided to use Total Quality Management and Quality Award frameworks in the further development of the QMS. Customer service and human resources management are examples of important areas for further quality improvement.
Priority Ceiling Protocol (PCP) is a well-known resource access protocol for hard real-time systems. However, it has a problem of ceiling blocking which imposes a great hindrance to task scheduling in mixed real-time systems where tasks may have different criticality. In this paper, a new resource access protocol called the Conditional Abortable Priority Ceiling Protocol (CA-PCP) is proposed. It resolves the problem of ceiling blocking by incorporating a conditional abort scheme into the PCP. The new protocol inherits all the desirable properties of the PCP and the Ceiling Abort Protocol (CAP) which is yet another modification of the PCP. In the proposed protocol, a condition is defined to control the abort of a job so that the schedulability of the system will not be affected. Performance study has been done to compare the CA-PCP with the PCP. The results indicate that CA-PCP can significantly improve the performance of a system if the lengths of the abortable critical sections are well chosen.
In this paper we use Generalized Stochastic Petri Nets (GSPNs) and Stochastic Well-formed Nets (SWNs) for the performance analysis of Asynchronous Transfer Mode (ATM) Local Area Networks (LANs) that adopt the Available Bit Rate (ABR) service category in its Relative Rate Marking (RRM) version. We also consider a peculiar version of RRM ABR called Stop & Go ABR; this is a simplified ABR algorithm designed for the provision of best-effort services in low-cost ATM LANs, according to which sources can transmit only at two different cell rates, the Peak Cell Rate (PCR) and Minimum Cell Rate (MCR). Results obtained from the solution of GSPN models of simple ATM LAN setups comprising RRM or Stop & Go ABR users, as well as Unspecified Bit Rate (UBR) users, are first validated through detailed simulations, and then used to show that Stop & Go ABR is capable of providing good performance and fairness in a number of different LAN configurations. We also develop SWN models of homogeneous ABR LANs, that efficiently and automatically exploit system symmetries allowing the investigation of larger LAN configurations.
Image processing requires high computational power, plus the ability to experiment with algorithms. Recently, reconfigurable hardware devices in the form of field programmable gate arrays (FPGAs) have been proposed as a way of obtaining high performance at an economical price. At present, however, users must program FPGAs at a very low level and have a detailed knowledge of the architecture of the device being used. They do not therefore facilitate easy development of, or experimentation with, image processing algorithms. To try to reconcile the dual requirements of high performance and ease of development, this paper reports on the design and realisation of an FPGA based image processing machine and its associated high level programming model. This abstract programming model allows an application developer to concentrate on the image processing algorithm in hand rather than on its hardware implementation. The abstract machine is based on a PC host system with a PCI-bus add-on card containing Xilinx XC6200 series FPGA(s). The machine's high level instruction set is based on the operators of image algebra. XC6200 series FPGA configurations have been developed to implement each high level instruction.
Dynamic branch prediction in high-performance processors is a specific instance of a general time series prediction problem that occurs in many areas of science. Most branch prediction research focuses on two-level adaptive branch prediction techniques, a very specific solution to the branch prediction problem. An alternative approach is to look to other application areas and fields for novel solutions to the problem. In this paper, we examine the application of neural networks to dynamic branch prediction. We retain the first level history register of conventional two-level predictors and replace the second level PHT with a neural network. Two neural networks are considered: a learning vector quantisation network and a backpropagation network. We demonstrate that a neural predictor can achieve misprediction rates comparable to conventional two-level adaptive predictors and suggest that neural predictors merit further investigation.
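Neither of the paper's two networks (LVQ, backpropagation) is reproduced here. As an illustration of the general scheme — a first-level history register feeding a learned second level in place of the PHT — a simple perceptron predictor, a different but related neural approach, can be sketched:

```python
class PerceptronPredictor:
    """Toy neural branch predictor: a perceptron over the global branch
    history register replaces the pattern history table. Threshold and
    encoding follow common perceptron-predictor practice; this is a
    stand-in, not the LVQ/backpropagation networks of the paper."""

    def __init__(self, hist_len):
        self.h = hist_len
        self.w = [0] * (hist_len + 1)            # w[0] is the bias weight
        self.hist = [1] * hist_len               # +1 = taken, -1 = not taken
        self.theta = int(1.93 * hist_len + 14)   # common training threshold

    def predict(self):
        y = self.w[0] + sum(w * x for w, x in zip(self.w[1:], self.hist))
        return y >= 0, y                         # (predicted taken?, confidence)

    def update(self, taken):
        pred, y = self.predict()
        t = 1 if taken else -1
        if pred != taken or abs(y) <= self.theta:
            self.w[0] += t                       # train on mispredict or low margin
            for i in range(self.h):
                self.w[i + 1] += t * self.hist[i]
        self.hist = self.hist[1:] + [t]          # shift the history register
```

On a strictly alternating branch — a pattern a plain 2-bit counter mispredicts constantly — the perceptron quickly learns to weight the most recent history bit negatively and predicts correctly.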
We describe an efficient, high-abstraction-level, multi-port memory control unit (MCU) capable of providing data at maximum throughput. This MCU has been developed to take full advantage of FPGA parallelism. Multiple parallel processing entities are possible in modern FPGA devices, but this parallelism is lost when they try to access external memories. To address the problem of multiple entities accessing shared data, we propose an architecture with multiple abstract access ports (AAPs) for accessing one external memory. Bearing in mind that hardware designs in FPGA technology are generally slower than memory chips, it is feasible to build a memory access scheduler by using a suitable arbitration scheme based on a fast memory controller with AAPs running at slower frequencies. In this way, multiple processing units connected through the AAPs can make memory transactions at their slower frequencies, and the memory access scheduler can serve all these transactions at the same time by taking full advantage of the memory bandwidth.
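One suitable arbitration scheme of the kind mentioned is round-robin among the ports; the sketch below is a generic software model of that idea, not the MCU's actual arbiter:

```python
def round_robin_arbiter(requests, last_granted, n_ports):
    """Grant one requesting AAP per memory slot, scanning from the port
    after the previous grant, so every requesting port is served within
    n_ports slots (no starvation). Toy stand-in for the memory access
    scheduler's arbitration; the real unit may use a different scheme."""
    for i in range(1, n_ports + 1):
        port = (last_granted + i) % n_ports
        if port in requests:
            return port
    return None                     # no port is requesting this slot
```

Because the memory controller runs faster than the processing units, several such grants fit into one slow-clock period, which is what lets all AAPs appear concurrently served.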
Realizations of genetic algorithms (GAs) on a tree-shaped parallel computer architecture are presented, using different levels of parallelism. In addition, basic models for parallel GAs are considered. The tree-shaped parallel computer system GAPA (Genetic Algorithm Parallel Accelerator), with special hardware for GA computation, is described in detail. Mappings for centralized and distributed GA models are also given, and their performance has been measured for different population sizes.
Currently there is significant interest in the design and implementation of embedded systems where the hardware and software subsystems are developed concurrently in order to meet design constraints. We present a development environment for general-purpose systems, where the objective is to accelerate the performance of software-based applications, which are specified by C programs. Such programs may be partitioned into hardware and software subsystems — a speed-critical region of the software is implemented in an FPGA in order to provide the performance acceleration. We also discuss two versions of the underlying system hardware architecture. Practical examples are given to illustrate our approach.
This paper presents a hardware accelerator which can effectively improve the security and the performance of virtually any RSA cryptographic application. The accelerator integrates two crucial security- and performance-enhancing facilities: an RSA processor and an RSA key-store. An RSA processor is a dedicated hardware block which executes the RSA algorithm. An RSA key-store is a dedicated device for securely storing RSA key-pairs. We chose RSA since it is by far the most widely adopted standard in public key cryptography. We describe the main functional blocks of the hardware accelerator and their interactions, and comment on the architectural solutions we adopted for maximizing security and performance while minimizing the cost in terms of hardware resources. We then present an FPGA-based implementation of the proposed architecture, which relies on a Commercial Off The Shelf (COTS) programmable hardware board. Finally, we evaluate the system in terms of performance and chip area occupation, and comment on the design trade-offs resulting from different levels of parallelism.
A large-scale reconfigurable data-path processor (LSRDP) implemented with single-flux quantum (SFQ) circuits is introduced, which is integrated with a general purpose processor to accelerate data flow graphs (DFGs) extracted from scientific applications. A number of applications were selected and analyzed throughout the LSRDP design procedure. The various design steps, and particularly the DFG mapping process, are discussed, and our techniques for optimizing the accelerator area are presented as well. Different design alternatives are examined by exploring the LSRDP design space, and an appropriate architecture is determined for the accelerator. Preliminary experiments demonstrate the capability of the designed architecture to achieve performance of up to 210 Gflops for the applications considered.
In high level synthesis, module selection, scheduling, and resource binding are inter-dependent tasks. For a selected module set, the best schedule/binding should be generated in order to accurately assess the quality of a module selection. Exhaustively enumerating all module selections and constructing a schedule and binding for each one of them can be extremely expensive. In this paper, we present an iterative framework, called WiZard, to solve the module selection problem under resource, latency, and power constraints. The framework associates a utility measure with each module. This measure reflects the usefulness of the module for a given design goal. Using modules with high utility values should result in superior designs. We propose a heuristic which iteratively perturbs module utility values until they lead to good module selections. Our experiments show that by keeping modules with high utility values, WiZard can drastically reduce the module exploration space (approximately 99.2% reduction). Furthermore, the module selections formed by these modules belong to the superior solutions in the enumerated set (top 15%).
Wide-issue and high-frequency processors require not only a low-latency but also a high-bandwidth memory system to achieve high performance. Previous studies have shown that using multiple small single-ported caches instead of a monolithic large multi-ported one for the L1 data cache can be a scalable and inexpensive way to provide higher bandwidth. Various schemes for directing the memory references have been proposed in order to closely match the performance of an ideal multi-ported cache. However, most existing designs seldom take dynamic data access patterns into consideration and thus suffer from access conflicts within one cache and unbalanced loads between the caches. We observe in this paper that if the data references in a program can be grouped into several regions (access regions) that allow parallel accesses, providing separate small caches (access region caches) for these regions may give better performance. A register-guided memory reference partitioning approach is proposed that effectively identifies these semantic regions and organizes them into multiple caches adaptively to maximize concurrent accesses. The base register name, not its content, in the memory reference instruction is used as a basic guide for instruction steering. After the initial assignment to a specific access region cache per the base register name, a reassignment mechanism is applied to capture the access pattern when the program moves across its access regions. In addition, a distribution mechanism is introduced to adaptively allow access regions to extend or shrink among the physical caches to further reduce potential conflicts.
The simulations of SPEC CPU2000 benchmarks have shown that the semantic-based scheme can reduce the conflicts effectively, and obtain considerable performance improvement in terms of IPC; with 8 access region caches, 25–33% higher IPC is achieved for integer benchmark programs than a comparable 8-banked cache, while the benefit is less for floating-point benchmark programs, 19% at most.
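The base steering rule — the base register *name* selecting a cache, with a reassignment hook on top — can be sketched as follows. The assignment policy and interface names are illustrative assumptions; the paper's distribution mechanism is omitted:

```python
class RegionCacheSteering:
    """Toy model of register-guided steering: each memory reference is
    directed to one of N small caches keyed by the base register name
    (not its contents), so references through the same base register
    land in the same access region cache. Reassignment lets a register
    follow the program into a different access region."""

    def __init__(self, n_caches):
        self.n = n_caches
        self.assign = {}           # base register name -> cache index
        self.next_cache = 0

    def cache_for(self, base_reg):
        if base_reg not in self.assign:
            # initial assignment: spread new base registers round-robin
            self.assign[base_reg] = self.next_cache % self.n
            self.next_cache += 1
        return self.assign[base_reg]

    def reassign(self, base_reg, cache_idx):
        """Invoked when monitoring shows the register now addresses a
        different access region (e.g., to rebalance conflicts)."""
        self.assign[base_reg] = cache_idx % self.n
```

Keying on the register name keeps the steering decision available early in the pipeline, before the register value (and thus the address) is known.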
The explosive growth of the Web, the increasing popularity of PCs and the advances in high-speed network access have brought distributed computing into the mainstream. To simplify network programming and to realize component-based software architecture, distributed object models have emerged as standards. One of those models is the distributed component object model (DCOM), a protocol that enables software components to communicate directly over a network in a reliable and efficient manner. In this paper, we investigate aspects of DCOM concerning software architecture and security mechanisms. We also describe the concept of role-based access control (RBAC), which began with the multi-user and multi-application on-line systems pioneered in the 1970s, and we investigate how role-based access control can be enforced as a security provider within DCOM, especially in the access security policy.
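The core RBAC check — permissions attach to roles, users acquire roles, and access is decided by walking user to roles to permissions — can be sketched generically. DCOM's actual security provider interfaces are not modelled; names here are illustrative:

```python
class RBAC:
    """Minimal role-based access control: an access check succeeds iff
    some role held by the user carries the requested permission.
    Centralizing policy on roles (rather than per-user ACLs) is the
    property that makes RBAC attractive as a DCOM security provider."""

    def __init__(self):
        self.user_roles = {}       # user -> set of roles
        self.role_perms = {}       # role -> set of permissions

    def grant_role(self, user, role):
        self.user_roles.setdefault(user, set()).add(role)

    def allow(self, role, permission):
        self.role_perms.setdefault(role, set()).add(permission)

    def check(self, user, permission):
        return any(permission in self.role_perms.get(r, ())
                   for r in self.user_roles.get(user, ()))
```

Changing what a role may do then updates every user holding that role at once, which is the administrative win over per-user permission lists.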
Effective address calculations for load and store instructions must compete for ALUs with other instructions, and hence extra latencies might be incurred in data cache accesses. Fast address generation is an approach proposed to reduce cache access latencies. This paper presents a fast address generator that can eliminate most effective address computations by storing the computed effective addresses of previous load/store instructions in a dummy register file. Experimental results show that this fast address generator can reduce the effective address computations of load and store instructions by about 74% on average for the SPECint2000 benchmarks and cut execution times by 8.5%. Furthermore, when multiple dummy register files are deployed, the fast address generator eliminates over 90% of the effective address computations of load and store instructions and improves average execution times by 9.3%.
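As a rough illustration of the dummy register file idea (the structure and update policy here are assumptions, not the paper's exact design): a later load or store whose base register has not been redefined can derive its effective address from a previously stored one by an offset difference, skipping the full ALU add:

```python
# Sketch of fast address generation via a "dummy register file" that
# remembers the last effective address computed per base register.
# Invalidation on base-register writes is assumed but simplified.

class FastAddressGenerator:
    def __init__(self):
        self.dummy = {}  # base_reg -> (last_offset, last_effective_addr)

    def effective_address(self, base_reg, base_value, offset):
        if base_reg in self.dummy:
            last_off, last_ea = self.dummy[base_reg]
            ea = last_ea + (offset - last_off)  # cheap offset delta
        else:
            ea = base_value + offset            # fall back to the ALU add
        self.dummy[base_reg] = (offset, ea)
        return ea

    def invalidate(self, base_reg):
        # the base register was written by another instruction: entry stale
        self.dummy.pop(base_reg, None)
```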
The importance of accounting for interrupts in multiprocessor real-time schedulability analysis is discussed and three interrupt accounting methods, namely quantum-centric, task-centric, and processor-centric accounting, are analyzed and contrasted. Additionally, two special cases, dedicated interrupt handling (i.e., all interrupts are processed by one processor) and timer multiplexing (i.e., all jobs are released by a single hardware timer), are considered and corresponding analysis is derived. All discussed approaches are evaluated in terms of schedulability based on interrupt costs previously measured on a Sun Niagara multicore processor. The results show that there is no single “best” accounting technique that is always preferable, but rather that the relative performance of each approach varies significantly based on task set composition, i.e., the number of tasks and the maximum utilization.
Approaches for maximizing throughput of self-timed multiply–accumulate units (MACs) are developed and assessed using the NULL convention logic paradigm. In this class of self-timed circuits, functional correctness is independent of any delays in circuit elements, through circuit construction, and independent of any wire delays, through the isochronic fork assumption, where wire delays are assumed to be much less than gate delays. Self-timed circuits therefore provide distinct advantages for System-on-a-Chip applications. First, a number of alternative MAC algorithms are compared and contrasted in terms of throughput and area to determine which approach will yield the maximum throughput with the least area. It was determined that two algorithms that meet these criteria well are the Modified Baugh–Wooley and Modified Booth2 algorithms. Dual-rail non-pipelined versions of these algorithms were first designed using the threshold combinational reduction method [3]. The non-pipelined designs were then optimized for throughput using the gate-level pipelining method [4]. Finally, each design was simulated using Synopsys to quantify the advantage of the dual-rail pipelined Modified Baugh–Wooley MAC, which yielded a speedup of 2.5 over its initial non-pipelined version. This design also required 20% fewer gates than the dual-rail pipelined Modified Booth2 MAC that had the same throughput. The resulting design employs a three-stage feed-forward multiply pipeline connected to a four-stage feedback multifunctional loop to perform a 72+32×32 MAC in 12.7 ns on average using a 0.25 μm CMOS process at 3.3 V, thus outperforming other delay-insensitive/self-timed MACs in the literature.
Computer architects usually evaluate new designs using cycle-accurate processor simulation. This approach provides detailed insight into processor performance, power consumption and complexity. However, only configurations in a subspace can be simulated in practice due to long simulation times and limited resources, leading to suboptimal conclusions which might not apply to the larger design space. In this paper, we propose a performance prediction approach which employs state-of-the-art techniques from experiment design, machine learning and data mining. According to our experiments on single- and multi-core processors, our prediction model generates highly accurate estimates for unsampled points in the design space and shows robustness in the worst-case prediction. Moreover, the model provides quantitative interpretation tools that help investigators efficiently tune design parameters and remove performance bottlenecks.
Emulation of one architecture on another is useful when the architecture is under design, when software must be ported to a new platform or is being developed for systems which are still under development, or for embedded systems that have insufficient resources to support the software development process. Emulation using an interpreter is typically slower than normal execution by up to 3 orders of magnitude. Our approach instead translates the program from the original architecture to another architecture while faithfully preserving its semantics at the lowest level. The emulation speeds are comparable to, and often faster than, programs running on the original architecture. Partial evaluation of architectural features is used to achieve such impressive performance, while permitting accurate statistics collection. Accuracy is at the level of the number of clock cycles spent executing each instruction (hence the description cycle-accurate).
Randomising set index functions can reduce the number of conflict misses in data caches by spreading the cache blocks uniformly over all sets. Typically, the randomisation functions compute the exclusive ors of several address bits. Not all randomising set index functions perform equally well, which calls for the evaluation of many set index functions. This paper discusses and improves a technique that tackles this problem by predicting the miss rate incurred by a randomisation function, based on profiling information. A new way of looking at randomisation functions is used, namely the null space of the randomisation function. The members of the null space describe pairs of cache blocks that are mapped to the same set. This paper presents an analytical model of the error made by the technique and uses this to propose several optimisations to the technique. The technique is then applied to generate a conflict-free randomisation function for the SPEC benchmarks.
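A minimal sketch of these ideas, under the simplifying assumption that each index bit is the parity (XOR) of a masked subset of block address bits: two block addresses map to the same set exactly when their bitwise difference (XOR) lies in the null space of the randomisation function:

```python
# Randomised set indexing by XOR of address bits, and the null-space
# conflict test described in the abstract. Mask values are illustrative.

def set_index(addr: int, masks) -> int:
    """Each mask selects the address bits XORed into one index bit."""
    idx = 0
    for bit, mask in enumerate(masks):
        parity = bin(addr & mask).count("1") & 1
        idx |= parity << bit
    return idx

def in_null_space(diff: int, masks) -> bool:
    # diff = a ^ b; blocks a and b conflict (same set) iff every
    # index bit sees an even number of flipped address bits
    return all(bin(diff & m).count("1") % 2 == 0 for m in masks)
```

Enumerating the small members of the null space over a profiled address trace is what lets the technique predict miss rates without simulating each candidate function.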
Mobile Agent technology is an interesting concept for large network-based system structures. In this paper we describe a Java-based agent environment which integrates agent execution platforms into World Wide Web servers, promoting a world-wide infrastructure for Mobile Agents. This system has been developed within the framework of the MEMPHIS IST project, which involves the definition, development and validation of a new system for commercially significant multilingual premium services, focused on thin clients such as mobile phones, Personal Digital Assistants, etc. The concept relies on the extraction of multilingual information from various web-based content sources in different formats. The system is composed of three different modules, namely the Acquisition, Transformation and Distribution Modules. The content acquisition is based on dynamic user profiles. The acquired content is transformed into a meta-representation, which enables content storage independent of both the source and destination format and language. The output format is based on the requirements of the employed thin client terminals, and the output document is distributed via the appropriate networks. The end-user receives the required information as premium services via the chosen output medium. The system supports both pull and push services. An evaluation of the system is also presented.
Technology mapping for multiplexor (MUX) based field programmable gate arrays (FPGAs) has been widely considered. Here, a new algorithm is proposed that applies techniques from logic synthesis during technology mapping, i.e., the target technology is considered in the minimization process. Binary decision diagrams (BDDs) are used as an underlying data structure combining both structural and functional properties. The algorithm uses local don't cares obtained by a greedy algorithm. To evaluate a netlist, a fast technology mapper is used. Since most of the changes to a netlist are local, re-mapping can also be done locally, allowing a fast but reliable evaluation after each modification. Both area and delay minimization are addressed in this paper. We compare the approach to several previously published algorithms; in most cases their results can be further improved. Compared to SIS, an improvement of 23% for area and 18% for delay can be observed on average.
Parallel computing and distributed computing have traditionally evolved as two separate research disciplines. Parallel computing has addressed problems of communication-intensive computation on tightly-coupled processors while distributed computing has been concerned with coordination, availability, timeliness, etc., of more loosely coupled computations. Current trends, such as parallel computing on networks of conventional processors and Internet computing, suggest the advantages of unifying these two disciplines. Actors provide a flexible model of computation which supports both parallel and distributed computing. One may evaluate the utility of a programming paradigm in terms of four criteria: expressiveness, portability, efficiency, and performance predictability. We discuss how the Actor model and programming methods based on it support these goals. In particular, we provide an overview of the state of the art in Actor languages and their implementation. Finally, we place this work in the context of recent developments in middleware, the Java language, and agents.
In this paper, we present a dual-actuator logging disk architecture to minimize write access latencies. We reduce small synchronous write latency using the notion of logging writes, i.e. writing to free sectors near the current disk head location. However, we show through analytic models and simulations that logging writes by itself is not sufficient to reduce write access latencies, particularly in environments with writes to new data and intermixed reads and writes. Therefore, we augment the logging write method with the addition of a second disk actuator. Our models and simulations show that the addition of the second actuator offers significant performance benefits over a normal disk across a wide range of disk access patterns, and comparisons to strictly logging disk architectures also show advantages across a range of access patterns.
We present a hybrid routing protocol for both pure and hybrid ad hoc networks which uses the mechanisms of swarm intelligence to select next hops. Our protocol, Ad hoc Networking with Swarm Intelligence (ANSI), is a congestion-aware routing protocol which, owing to the self-organizing mechanisms of swarm intelligence, is able to collect more information about the local network and make more effective routing decisions than traditional MANET protocols. Once routes are found, ANSI maintains routes along a path from source to destination effectively by using swarm intelligence techniques, and is able to gauge the slow deterioration of a link and restore a path along newer links as and when necessary. ANSI is thus more responsive to topological fluctuations. ANSI is designed to work over hybrid ad hoc networks: ad hoc networks which consist of both lower-capability mobile wireless devices and higher-capability wireless devices which may or may not be mobile. In addition, ANSI works with multiple interfaces and with both wired and wireless interfaces. Our simulation study compared ANSI with AODV in both hybrid and pure ad hoc network scenarios using both TCP and UDP data flows. The results show that ANSI achieves better results (in terms of packet delivery, number of packets sent, end-to-end delay, and jitter) than AODV in most simulation scenarios. In addition, ANSI achieves this performance with fewer route errors than AODV. Lastly, ANSI performs more consistently, as indicated by the lower variation (measured as the width of the confidence intervals) of the observed values in the experiments. We show that ANSI’s performance is aided both by its superior handling of routing information and by its congestion awareness, though we see that congestion awareness in ANSI comes at a price.
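The pheromone mechanics that swarm-intelligence protocols of this kind build on can be sketched as follows; the reinforcement and evaporation rules here are generic ant-colony-style illustrations, not ANSI's actual rules:

```python
# Generic swarm-intelligence next-hop selection sketch: each neighbour
# carries a pheromone value reinforced by successful deliveries and
# decayed (evaporated) over time, so stale or deteriorating links fade.

import random

class SwarmRouter:
    def __init__(self, neighbours, decay=0.1):
        self.pheromone = {n: 1.0 for n in neighbours}
        self.decay = decay

    def choose_next_hop(self):
        # pick a neighbour with probability proportional to its pheromone
        total = sum(self.pheromone.values())
        r = random.uniform(0.0, total)
        acc = 0.0
        for n, p in self.pheromone.items():
            acc += p
            if r <= acc:
                return n
        return n  # guard against floating-point rounding at the boundary

    def reinforce(self, neighbour, amount=1.0):
        # a forward/backward ant confirmed delivery via this neighbour
        self.pheromone[neighbour] += amount

    def evaporate(self):
        for n in self.pheromone:
            self.pheromone[n] *= (1.0 - self.decay)
```

The gradual evaporation is what lets such a protocol "gauge the slow deterioration of a link" rather than reacting only to hard failures.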
In many cases, product families are established on top of a successful pilot product. While this approach provides an option to measure many concrete attributes like performance and memory footprint, the adequateness and adaptability of the pilot's architecture cannot be fully verified. Yet, these properties are crucial business enablers for the whole product family. In this paper, we discuss an architectural assessment of one such seminal system, intended for monitoring electronic subsystems of a mobile machine, which is to be extended to support a wide range of different types of products. This paper shows how well the assessment reveals possible problems and existing flexibilities in the assessed system, and in this way helps different stakeholders in their further decisions.
The development of multimedia streaming over wireless networks faces many challenges, of which handling mobility and highly variable bandwidth are the two major ones. Using scalable video content can solve the variable bandwidth problem only if the streaming architecture is able to react without latency. In this article, we present NetMoVie, an intermediate architecture based on a real-time protocol which is able to adapt streams to the constraints of the wireless channel.
Microarchitecture optimizations, in general, exploit the gross program behavior for performance improvement. Programs may be viewed as consisting of different “phases” which are characterized by variation in a number of processor performance metrics. Previous studies have shown that many of the performance metrics remain nearly constant within a “phase”. Thus, the change in program “phases” may be identified by observing the change in the values of these metrics. This paper aims to exploit the time varying behavior of programs for processor adaptation. Since the resource usage is not uniform across all program “phases”, the processor operates at varying levels of underutilization. During phases of low available Instruction Level Parallelism (ILP), resources may not be fully utilized, while in other phases more resources may be required to exploit all the available ILP. Thus, dynamically scaling the resources based on program behavior is an attractive mechanism for power–performance trade-off. In this paper we develop per-phase regression models to exploit the phase behavior of programs and adequately allocate resources for a target power–performance trade-off. Modeling processor performance–power using such a regression model is an efficient method to evaluate an architectural optimization quickly and accurately. We also show that the per-phase regression model is better suited than a “unified” regression model that does not use phase information. Further, we describe a methodology to allocate processor resources dynamically by using regression models which are developed at runtime. Our simulation results indicate that average energy savings of 20% can be achieved with respect to a maximally configured system with negligible impact on performance for most of the SPEC-CPU and MEDIA benchmarks.
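The per-phase modelling idea can be sketched with an ordinary least-squares fit per phase; this is illustrative only (the paper's regression models, metrics, and runtime construction are more elaborate):

```python
# Per-phase regression sketch: fit one linear model per detected program
# phase instead of a single "unified" model over the whole run.

import numpy as np

class PerPhaseRegression:
    def __init__(self):
        self.models = {}  # phase id -> coefficient vector

    def fit(self, phase, X, y):
        # ordinary least squares with an intercept term, per phase;
        # X: samples x features (e.g. resource sizes), y: target metric
        A = np.column_stack([np.ones(len(X)), X])
        coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
        self.models[phase] = coeffs

    def predict(self, phase, x):
        c = self.models[phase]
        return c[0] + np.dot(c[1:], x)
```

Because each phase gets its own coefficients, a configuration that is adequate in a low-ILP phase is not forced onto a high-ILP phase, which is the advantage over the unified model.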
A strategy to implement adaptive routing in irregular networks is presented and analyzed in this work. A simple and widely applicable deadlock avoidance method, applied to a ring embedded in the network topology, constitutes the basis of this high-performance packet switching. This adaptive router improves the network capabilities by allocating more resources to the fastest and most used virtual network, thus narrowing the performance gap with regular topologies. A thorough simulation process, which obtains statistically reliable measurements of irregular network behavior, has been carried out to evaluate it and compare with other state-of-the-art techniques. In all the experiments, our router exhibited the best behavior in terms of maximum/sustained performance and sensitivity to the network topology.
Many embedded systems have stringent real-time constraints. An effective technique for meeting real-time constraints is to keep the processor utilization on each node at or below the schedulable utilization bound, even though each task’s actual execution time may have large uncertainties and deviate considerably from its estimated value. Recently, researchers have proposed solutions based on Model Predictive Control (MPC) for the utilization control problem. Although these approaches can handle a limited range of execution time estimation errors, the system may suffer performance deterioration or even become unstable under large estimation errors. In this paper, we present two online adaptive optimal control techniques: one based on Recursive Least Squares (RLS) model identification combined with a Linear Quadratic (LQ) optimal controller, the other based on Adaptive Critic Design (ACD). Simulation experiments demonstrate that both the LQ optimal controller and the ACD-based controller perform better than the MPC-based controller, and that the ACD-based controller has the smallest aggregate tracking errors.
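The RLS identification step can be sketched for an assumed first-order utilization model u(k+1) = a·u(k) + b·r(k), where r(k) is the rate adjustment; the recursion below is the textbook RLS update with exponential forgetting, not the paper's exact controller:

```python
# Recursive Least Squares (RLS) sketch for identifying the parameters
# [a, b] of u(k+1) = a*u(k) + b*r(k) online from observed utilizations.

import numpy as np

class RLSIdentifier:
    def __init__(self, n_params=2, forgetting=0.98):
        self.theta = np.zeros(n_params)      # parameter estimates [a, b]
        self.P = np.eye(n_params) * 1000.0   # covariance; large = uncertain
        self.lam = forgetting                # forgetting factor

    def update(self, phi, y):
        # phi: regressor vector [u(k), r(k)]; y: observed u(k+1)
        Pphi = self.P @ phi
        gain = Pphi / (self.lam + phi @ Pphi)
        err = y - phi @ self.theta
        self.theta = self.theta + gain * err
        self.P = (self.P - np.outer(gain, Pphi)) / self.lam
        return self.theta
```

The identified (a, b) would then feed an LQ design each period; the forgetting factor is what lets the model track the drifting execution times the abstract mentions.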
Several routing protocols have been presented for unicast traffic in MPSoCs. Exploiting unicast routing algorithms for multicast traffic increases the likelihood of deadlock and congestion. In order to avoid deadlock for multicast traffic, the Hamiltonian path strategy was introduced. The traditional Hamiltonian path routing protocols supporting both unicast and multicast traffic are based on deterministic models, leading to lower performance. In this paper, we propose an adaptive routing protocol for both unicast and multicast traffic without using virtual channels. The proposed method maximizes the degree of adaptiveness of the routing functions, which are based on the Hamiltonian path, while guaranteeing deadlock freedom. Furthermore, the unicast and multicast aspects of the presented method have each been investigated extensively. Results obtained with both synthetic and real traffic models show that the proposed adaptive method has lower latency and power dissipation than previously proposed path-based multicasting algorithms, with negligible hardware overhead.
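The Hamiltonian path strategy underlying such protocols can be sketched for a 2D mesh. The snake-order labelling below is the standard construction; the adaptive routing function proposed in the paper is more involved, but the deadlock-freedom argument rests on the same monotone-label restriction:

```python
# Hamiltonian path labelling for a 2D mesh (boustrophedon / snake order).
# Restricting hops to monotonically increasing (or decreasing) labels
# rules out cyclic channel dependencies, hence deadlock.

def hamiltonian_label(x: int, y: int, width: int) -> int:
    # even rows run left-to-right, odd rows right-to-left
    return y * width + (x if y % 2 == 0 else width - 1 - x)

def legal_hops(x, y, width, height, ascending=True):
    """Neighbours reachable without violating the label ordering."""
    here = hamiltonian_label(x, y, width)
    hops = []
    for nx, ny in [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]:
        if 0 <= nx < width and 0 <= ny < height:
            lbl = hamiltonian_label(nx, ny, width)
            if (lbl > here) == ascending:
                hops.append((nx, ny))
    return hops
```

Whenever legal_hops returns more than one neighbour, an adaptive router may choose among them (e.g. by congestion), which is the degree of adaptiveness the paper seeks to maximize.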
At present, the critical computations of real-time systems are guaranteed before run-time by performing a worst-case analysis of the system's timing and resource requirements. The result is that real-time systems are engineered to have spare capacity under normal operation. A challenge of current research is to make use of this spare capacity, in order to satisfy requirements for adaptivity in the system. Adaptivity can be implemented by optional computations with firm deadlines, which can be guaranteed at run-time by the use of flexible scheduling. This report assumes that the algorithms which attempt to guarantee optional computations at run-time actually run on the same processor as the optional and critical computations themselves. The report starts with a brief survey of the complex requirements for adaptivity within real-time systems. Such requirements can include task hierarchies composed of interdependent subtasks, each with its own utility. Evidence is cited which indicates that the run-time support for a computational model supporting all such requirements would incur overheads so large that little spare capacity would remain for the optional computations themselves. Following this, the report presents a constrained computational model which, it is claimed, could be cost-effectively supported at run-time. The model is nevertheless general enough to satisfy many of the requirements for adaptivity. The constrained model uses a Best Effort Admissions Policy to arbitrate between three categories of optional computation, each with its own utility level. The viability of the constrained model is demonstrated by simulation studies which compare the performance of the model to that of a First-Come-First-Served Admissions Policy.
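A best-effort admissions policy of the general kind described can be sketched as follows; the utility-density rule is an assumed illustration of the idea (an arrival may displace lower-utility admitted computations), not necessarily the report's exact algorithm:

```python
# Best-effort admission sketch: unlike First-Come-First-Served, an
# arriving optional computation competes with already-admitted ones
# on utility per unit of capacity demanded.

def best_effort_admit(admitted, candidate, capacity):
    """admitted: list of (utility, demand) pairs; returns the new
    admitted set after considering the candidate."""
    pool = sorted(admitted + [candidate],
                  key=lambda t: t[0] / t[1], reverse=True)
    chosen, used = [], 0.0
    for utility, demand in pool:
        if used + demand <= capacity:
            chosen.append((utility, demand))
            used += demand
    return chosen
```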
A performance model of circuit switched k-ary n-cubes with deterministic routing has recently been proposed by Sharma and Varvarigos [IEEE Trans. Parallel Distrib. Syst. 8 (4) (1997) 349]. Many studies have revealed that using adaptive routing along with virtual channels considerably improves network performance over deterministic routing. This paper presents the first analytical model for calculating the message latency in circuit switched k-ary n-cube networks with adaptive routing. The main feature of the proposed model is the use of Markov chains to compute the path set-up time and to capture the effects of using virtual channels to reduce message blocking in the network. The mean waiting time that a message experiences at a source node is calculated using an M/G/1 queueing system. The validity of the model is demonstrated by comparing analytical results with those obtained through simulation experiments.
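The source-node waiting time mentioned above comes from an M/G/1 queue, for which the standard Pollaczek–Khinchine result gives the mean wait; the snippet below is only that textbook formula, not the paper's full latency model:

```python
# Pollaczek-Khinchine mean waiting time for an M/G/1 queue:
#   W = lambda * E[S^2] / (2 * (1 - rho)),  with rho = lambda * E[S]

def mg1_mean_wait(arrival_rate: float, mean_service: float,
                  second_moment_service: float) -> float:
    rho = arrival_rate * mean_service
    assert rho < 1.0, "queue must be stable (rho < 1)"
    return arrival_rate * second_moment_service / (2.0 * (1.0 - rho))
```

As a sanity check, exponential service with mean 1 (so E[S^2] = 2) at arrival rate 0.5 reduces this to the M/M/1 wait lambda/(mu*(mu - lambda)) = 1.0.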
Top-cited authors
Tam Duy Nguyen
  • Ho Chi Minh City University of Science
Krishna Priyanka Kadiyala
  • Texas Christian University
Fatemeh Jalali
  • IBM Research
Jason P. Jue
  • University of Texas at Dallas
Ashkan Yousefpour