Article

Reconfigurable Computing: A Survey of Systems and Software

Authors: Katherine Compton, Scott Hauck

Abstract

Due to its potential to greatly accelerate a wide variety of applications, reconfigurable computing has become a subject of a great deal of research. Its key feature is the ability to perform computations in hardware to increase performance, while retaining much of the flexibility of a software solution. In this survey we explore the hardware aspects of reconfigurable computing machines, from single chip architectures to multi-chip systems, including internal structures and external coupling. We also focus on the software that targets these machines, such as compilation tools that map high-level algorithms directly to the reconfigurable substrate. Finally, we consider the issues involved in run-time reconfigurable systems, which re-use the configurable hardware during program execution.


... In the last decade, a new computing paradigm, Reconfigurable Computing (RC), has emerged [14]. RC systems overcome the limitations of the processor and the IC technology. ...
... RC systems overcome the limitations of the processor and the IC technology. RC systems benefit from the flexibility offered by software and the performance offered by hardware [23], [14]. RC has successfully accelerated a wide variety of applications including cryptography and signal processing [22]. ...
... RC has successfully accelerated a wide variety of applications including cryptography and signal processing [22]. This achievement requires reconfigurable hardware, such as an FPGA, and a software design environment that aids in the creation of configurations for the reconfigurable hardware [14]. ...
Preprint
The problem of finding the solution of Partial Differential Equations (PDEs) plays a central role in modeling real-world problems. Over the past years, Multigrid solvers have shown their robustness over other techniques, due to their high convergence rate, which is independent of the problem size. For this reason, many attempts to exploit the inherent parallelism of Multigrid have been made to achieve the desired efficiency and scalability of the method. Yet, most efforts fail in this respect due to many factors (time, resources) governed by software implementations. In this paper, we present a hardware implementation of the V-cycle Multigrid method for finding the solution of a 2D-Poisson equation. We use Handel-C to implement our hardware design, which we map onto available Field Programmable Gate Arrays (FPGAs). We analyze the implementation performance using the FPGA vendor's tools. We demonstrate the robustness of Multigrid over other iterative solvers, such as Jacobi and Successive Over Relaxation (SOR), in both hardware and software. We compare our findings with a C++ version of each algorithm. The obtained results show better performance when compared to existing software versions.
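As a point of reference for the solver comparison above, the sketch below shows a plain C++ weighted-Jacobi smoothing sweep for the 2D Poisson equation, the kind of kernel a V-cycle Multigrid implementation applies at every level before restricting the residual to a coarser grid. It is only a minimal software illustration under assumed grid size, array layout, and relaxation weight, not the paper's Handel-C design.

```cpp
#include <vector>
#include <cstddef>

// One weighted-Jacobi smoothing sweep for -laplace(u) = f on an n x n grid with
// spacing h; boundary values are assumed to be stored in u already. This is the
// smoother a V-cycle applies before restricting the residual to a coarser grid.
void jacobi_sweep(std::vector<double>& u, const std::vector<double>& f,
                  std::size_t n, double h, double omega = 0.8) {
    std::vector<double> u_new(u);                 // Jacobi reads only old values
    for (std::size_t i = 1; i + 1 < n; ++i) {
        for (std::size_t j = 1; j + 1 < n; ++j) {
            double gs = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j] +
                                u[i * n + j - 1] + u[i * n + j + 1] +
                                h * h * f[i * n + j]);
            u_new[i * n + j] = (1.0 - omega) * u[i * n + j] + omega * gs;
        }
    }
    u.swap(u_new);
}
```

A SOR sweep differs mainly in updating the solution in place with an over-relaxation factor, which is why Jacobi and SOR are natural software baselines for the hardware comparison.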
... A fundamental design challenge of DSA is to maximize performance, efficiency and flexibility while still sparing the user from detailed low-level hardware specifics [7]. Therefore, with reconfigurable computing, modern embedded systems can achieve great performance while still maintaining the flexibility of software [8]. Nowadays, reconfigurable computing is being used in many application domains and new trends, such as radar systems [9], cryptographic systems [10], video processing [11], neural networks [12], real-time systems [13], and satellites [14]. ...
... This function is described in Eqs. (8) and (9). ...
Article
Full-text available
One of the main challenges of embedded system design lies in the natural heterogeneity of these systems. We can say that embedded systems are electronic systems designed and programmed to tackle a specific application. Each application has its requirements, although embedded systems often combine many domain-specific subsystems. Considering this context, the design of embedded systems can be extremely challenging, including system modeling, simulation, formal verification, and the synthesis to a correct implementation. To manage the complexity of such systems, the design should start at higher levels of abstraction, based on formal models, without considering the low-level characteristics of the underlying software or hardware. These high-level formal meta-models, named models of computation (MoC), define a set of rules that dictate how computations should be performed and how they should communicate with each other, along with other information such as the notion of time. In this paper, we present as the main contribution a set of rules and interfaces that enable the proper mixing of different MoC domains in a framework for complex embedded system design, thus allowing a heterogeneous system composition at a high abstraction level, including the synchronous, synchronous dataflow, and scenario-aware dataflow MoCs. We model both part of an avionic system and a reconfigurable RISC-V processor using these MoCs and the proposed interfaces as a case study showing the applicability brought by our proposal.
... Besides the fabrication of an ASIC, another alternative for designing a DSA is to utilize a reconfigurable computing (RC) system [25,36,44,118]. In essence, RC systems are programmable computing systems in which specialized digital circuitry can be synthesized from different levels of abstraction, without recourse to integrated circuit development. ...
... Due in part to the aforementioned challenges regarding amenability, there has yet to be a "killer app" for FPGA devices [112]. Whereas the development of GPUs was clearly motivated by graphics applications and, now, general-purpose scientific computing [87], FPGAs evolved from a considerably smaller market targeting the development of "glue logic" and the prototyping of ASIC devices [25,36,44,118]. In general, the larger demand for GPUs has continually led to a significant discrepancy in subsidies for research/development, supply, and device costs. Until recently, it was not uncommon for the latest GPUs to cost a few hundred dollars while the latest FPGAs were at least $10,000 [112]. ...
Article
Full-text available
This paper establishes the potential of accelerating the evaluation phase of tree-based genetic programming through contemporary field-programmable gate array (FPGA) technology. This exploration stems from the fact that FPGAs can sometimes leverage increased levels of both data and function parallelism, as well as superior power/energy efficiency, when compared to general-purpose CPU/GPU systems. In this investigation, we introduce a fixed-depth, tree-based architecture that can fully parallelize tree evaluation for type-consistent primitives that are unrolled and pipelined. We show that our accelerator on a 14nm FPGA achieves an average speedup of 43× when compared to a recent open-source GPU solution, TensorGP, implemented on 8nm process-node technology, and an average speedup of 4,902× when compared to a popular baseline GP software tool, DEAP, running parallelized across all cores of a 2-socket, 28-core (56-thread), 14nm CPU server. Despite our single-FPGA accelerator being 2.4× slower on average when compared to the recent state-of-the-art Operon tool executing on the same 2-processor, 28-core CPU system, we show that this single-FPGA system is 1.4× better than Operon in terms of performance-per-watt. Importantly, we also describe six future extensions that could provide at least a 64–192× speedup over our current design. Therefore, our initial results provide considerable motivation for the continued exploration of FPGA-based GP systems. Overall, any success in significantly improving runtime and energy efficiency could potentially enable novel research efforts through faster and/or less costly GP runs, similar to how GPUs unlocked the power of deep learning during the past fifteen years.
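To make the evaluation workload concrete, the following C++ sketch evaluates a full, fixed-depth expression tree stored level by level in an array; every node at the same depth is independent, which is the property the accelerator exploits by unrolling and pipelining the tree in hardware. The node encoding, primitive set, and protected division are illustrative assumptions rather than the paper's architecture.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// A full binary tree of depth D has 2^D - 1 nodes, stored in level order.
// Leaves hold constants or variable indices; internal nodes hold an opcode.
enum class Op { Add, Sub, Mul, Div, Var, Const };

struct Node {
    Op op;
    double value;   // constant value, or variable index for Op::Var
};

template <std::size_t D>
double eval_tree(const std::array<Node, (1u << D) - 1>& tree,
                 const std::array<double, 4>& vars) {
    // Evaluate bottom-up; every node at the same depth is independent,
    // which is what allows full unrolling/pipelining in hardware.
    std::array<double, (1u << D) - 1> out{};
    for (std::size_t i = tree.size(); i-- > 0;) {
        const Node& n = tree[i];
        double l = (2 * i + 1 < tree.size()) ? out[2 * i + 1] : 0.0;
        double r = (2 * i + 2 < tree.size()) ? out[2 * i + 2] : 0.0;
        switch (n.op) {
            case Op::Add:   out[i] = l + r; break;
            case Op::Sub:   out[i] = l - r; break;
            case Op::Mul:   out[i] = l * r; break;
            case Op::Div:   out[i] = (std::fabs(r) > 1e-9) ? l / r : 1.0; break; // protected division
            case Op::Var:   out[i] = vars[static_cast<std::size_t>(n.value)]; break;
            case Op::Const: out[i] = n.value; break;
        }
    }
    return out[0];
}
```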
... This method involves predicting pixel values within a frame based on neighbouring pixels, effectively reducing redundancy and enhancing compression efficiency. By leveraging spatial correlations between adjacent pixels, intra prediction significantly contributes to the overall compression performance of H.264 [1]. ...
... The concept of image inpainting was first introduced by Bertamio et al. [1]. The method was inspired by the real inpainting process of artists. ...
Article
This article explores the implementation and comparative analysis of two advanced intra prediction algorithms for H.264 video encoding: the Cross Entropy-based Intra Prediction Algorithm and the Error Entropy-based Rate-Distortion Optimization (RDO) Intra Prediction Algorithm. Both algorithms aim to optimize the selection of prediction modes for 4x4 macroblocks, enhancing compression efficiency and maintaining image quality. The Cross Entropy-based algorithm focuses on minimizing cross-entropy loss between the original and predicted macroblocks, while the Error Entropy-based RDO algorithm calculates distortion using error entropy and combines it with rate calculations to determine the optimal mode based on RD cost. The effectiveness of these algorithms is demonstrated through detailed mode selection analysis, highlighting their potential to improve video encoding performance.
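The rate-distortion criterion referred to above is conventionally written as a Lagrangian cost; a generic form, with the article's error-entropy measure standing in for distortion, is sketched below (the symbols are generic, not the article's exact notation):

```latex
J(m) = D(m) + \lambda \, R(m), \qquad m^{*} = \arg\min_{m \in \mathcal{M}} J(m)
```

Here D(m) is the distortion for mode m (estimated in the article from the entropy of the prediction error), R(m) is the rate needed to signal the mode and code its residual, lambda is the Lagrange multiplier, and M is the set of candidate 4x4 intra prediction modes.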
... The QFT algorithm transforms the quantum state | ⟩ from the computational basis to the Fourier basis [3]. The quantum state in Fourier basis is often mathematically expressed as shown in Equation 14. Figure 6 illustrates a circuit of three qubits, where QFT is applied. ...
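For reference, the textbook form of the transform mentioned above maps an n-qubit computational-basis state to a uniform superposition with linearly increasing phases, equivalently a product of single-qubit states; this standard statement is given below and may differ in notation from the cited paper's Equation 14:

```latex
\mathrm{QFT}\,|x\rangle
  = \frac{1}{\sqrt{2^{n}}}\sum_{k=0}^{2^{n}-1} e^{2\pi i x k/2^{n}}\,|k\rangle
  = \frac{1}{\sqrt{2^{n}}}\bigotimes_{j=1}^{n}\left(|0\rangle + e^{2\pi i x/2^{j}}\,|1\rangle\right)
```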
... GPUs are suitable for applications that offer loop parallelism, such as computer vision [21,33] and deep learning [22,24,48]. On the other hand, FPGAs are best for applications that can exhibit pipeline parallelism, due to the combination of various on-device resources (i.e., blocks of memory, registers, logic slices) that can be dynamically reconfigured to compose custom hardware blocks [14]. For instance, financial technology [53] is a domain that can exploit FPGAs for accelerating math operations (e.g., sine and cosine), as FPGAs can execute these operations in few clock cycles. ...
Preprint
Full-text available
In this article, we present TornadoQSim, an open-source quantum circuit simulation framework implemented in Java. The proposed framework has been designed to be modular and easily expandable for accommodating different user-defined simulation backends, such as the unitary matrix simulation technique. Furthermore, TornadoQSim features the ability to interchange simulation backends that can simulate arbitrary quantum circuits. Another novel aspect of TornadoQSim over other quantum simulators is the transparent hardware acceleration of the simulation backends on heterogeneous devices. TornadoQSim employs TornadoVM to automatically compile parts of the simulation backends onto heterogeneous hardware, thereby addressing the fragmentation in development due to the low-level heterogeneous programming models. The evaluation of TornadoQSim has shown that the transparent utilization of GPU hardware can result in up to 506.5x performance speedup when compared to the vanilla Java code for a fully entangled quantum circuit of 11 qubits. The other evaluated quantum algorithms were the Deutsch-Jozsa algorithm (493.10x speedup for an 11-qubit circuit) and the quantum Fourier transform algorithm (518.12x speedup for an 11-qubit circuit). Finally, the best TornadoQSim unitary-matrix implementation has been evaluated against a semantically equivalent simulation via Qiskit. The comparative evaluation has shown that the simulation with TornadoQSim is faster for small circuits, while for large circuits Qiskit outperforms TornadoQSim by an order of magnitude.
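As a rough, language-agnostic illustration of what a simulation backend in such a framework has to do (this is not TornadoQSim's Java code, and the gate and qubit-ordering conventions are assumptions), the C++ sketch below applies a single-qubit gate to a state vector by mixing amplitude pairs in place, which is the core kernel that benefits from offloading to a GPU:

```cpp
#include <complex>
#include <vector>
#include <cstddef>

using amp = std::complex<double>;

// Apply a 2x2 gate g to qubit q (0 = least significant) of an n-qubit state vector.
// Amplitudes are paired by flipping bit q; each pair is mixed by the gate matrix.
void apply_single_qubit_gate(std::vector<amp>& state, const amp g[2][2], unsigned q) {
    const std::size_t stride = std::size_t{1} << q;
    for (std::size_t base = 0; base < state.size(); base += 2 * stride) {
        for (std::size_t i = base; i < base + stride; ++i) {
            amp a0 = state[i];
            amp a1 = state[i + stride];
            state[i]          = g[0][0] * a0 + g[0][1] * a1;
            state[i + stride] = g[1][0] * a0 + g[1][1] * a1;
        }
    }
}

// Example: put qubit 0 of a 2-qubit register into superposition with a Hadamard.
//   std::vector<amp> psi(4, 0.0); psi[0] = 1.0;
//   const double s = 1.0 / std::sqrt(2.0);
//   amp H[2][2] = {{s, s}, {s, -s}};
//   apply_single_qubit_gate(psi, H, 0);
```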
... FPGA (Field Programmable Gate Array), through the utilization of specific hardware description languages, allows for flexible adjustments to its internal circuit structure [1]. Compared to dedicated or general-purpose chips, FPGA demonstrates advantages such as multi-threaded processing capabilities, efficient parallel computing mechanisms, and significantly shortened design cycles. ...
Chapter
Full-text available
With the widespread adoption of FPGAs, the inevitable rise of security threats, including Hardware Trojans, poses significant risks. Detecting such trojans is crucial as they can lead to severe consequences. Consequently, various detection methods have been proposed. However, conventional approaches typically require extensive datasets for model training. To tackle this issue, this paper presents a customizable Hardware Trojan detection framework. This framework is designed to generate ample samples for FPGA hardware security testing, ensuring robust detection capabilities.
... Results on processing speed and complexity on the FPGA are given, as well as a comparison between performances of FPGA versus software implementation of a test algorithm. The outlook presents some ideas for further investigations with this architecture, exploiting dynamic reconfiguration capabilities of FPGAs [4], [5]. ...
Preprint
Visual information processing will play an increasingly important role in future electronic systems. In many applications, e.g. video surveillance cameras, the data throughput of microprocessors is not sufficient and power consumption is too high. Instruction profiling on a typical test algorithm has shown that pixel address calculations are the dominant operations to be optimized. Therefore AddressLib, a structured scheme for pixel addressing, was developed that can be accelerated by AddressEngine, a coprocessor for visual information processing. In this paper, the architectural design of AddressEngine is described, which in the first step supports a subset of the AddressLib. Dataflow and memory organization are optimized during architectural design. AddressEngine was implemented in an FPGA and was tested with the MPEG-7 Global Motion Estimation algorithm. Results on processing speed and circuit complexity are given and compared to a pure software implementation. The next step will be the support for the full AddressLib, including segment addressing. An outlook on further investigations on dynamic reconfiguration capabilities is given.
... The term RC covers essentially everything from ASICs to microprocessors [34] which at least partly utilizes unconventional principles and methods. The EMPA architecture suggested above does not fit the taxonomy [33]. ...
Preprint
Computing is still based on the 70-year-old paradigms introduced by von Neumann. The need for more performant, comfortable and safe computing has forced the development and use of several tricks in both hardware and software. Until now, technology has enabled performance to increase without changing the basic computing paradigms. The recent stalling of single-threaded computing performance, however, requires redesigning computing to be able to provide the expected performance. To do so, the computing paradigms themselves must be scrutinized. The limitations caused by a too-restrictive interpretation of the computing paradigms are demonstrated, an extended computing paradigm is introduced, ideas about changing elements of the computing stack are suggested, and some implementation details of both hardware and software are discussed. The resulting new computing stack offers considerably higher computing throughput, a simplified hardware architecture, drastically improved real-time behavior and, in general, a simpler and more efficient computing stack.
... Algorithms running on such SIMD components make use of hundreds or thousands of parallel processing units and outperform multi-core CPUs by orders of magnitude for appropriate tasks. Custom hardware such as ASICs or reconfigurable hardware such as FPGAs achieve much higher performance than software for tasks like data encryption [27]. Recent trends on mobile devices also couple two general-purpose processors of different complexity and speed on a single chip [28], activating the fast but power-consuming processor only when needed. ...
Preprint
The actor model of computation has gained significant popularity over the last decade. Its high level of abstraction makes it appealing for concurrent applications in parallel and distributed systems. However, designing a real-world actor framework that subsumes full scalability, strong reliability, and high resource efficiency requires many conceptual and algorithmic additives to the original model. In this paper, we report on designing and building CAF, the "C++ Actor Framework". CAF aims to provide a concurrent and distributed native environment for scaling up to very large, high-performance applications, and equally well down to small constrained systems. We present the key specifications and design concepts---in particular a message-transparent architecture, type-safe message interfaces, and pattern matching facilities---that make native actors a viable approach for many robust, elastic, and highly distributed developments. We demonstrate the feasibility of CAF in three scenarios: first for elastic, upscaling environments, second for including heterogeneous hardware like GPGPUs, and third for distributed runtime systems. Extensive performance evaluations indicate ideal runtime behaviour for up to 64 cores at very low memory footprint, or in the presence of GPUs. In these tests, CAF continuously outperforms the competing actor environments Erlang, Charm++, SalsaLite, Scala, ActorFoundry, and even OpenMPI.
... This flexibility is essential for keeping pace with evolving technologies, such as the transition from 4G to 5G in telecommunications, where new protocols and modulation schemes are continuously introduced. Furthermore, the ability to quickly reconfigure FPGAs means that multiple signal processing tasks can be performed on a single device, reducing the need for multiple specialized chips and simplifying hardware design [137]. ...
Article
Full-text available
Future Free Space Optical (FSO) communication systems have the potential to communicate data at very high rates with very high levels of integrity over distances of up to a few kilometers (for terrestrial links). This technology has also been a candidate for high-speed and highly reliable (BER ~10^-9) communication links between satellites in geosynchronous orbits and ground stations. Since the free space optical medium can induce many forms of distortion (atmospheric turbulence effects, optical beam wander, etc.), the use of a channel code to detect and correct errors during the process of information transfer over the channel is essential. A correctly designed channel code can reduce the raw BER from unacceptable values to values that can be tolerated in many applications. Different kinds of error-correcting codes are used for FSO, such as Hamming codes, Reed-Solomon codes, Turbo codes, and LDPC codes. Keywords: FSO (free space optics), FPGA (field programmable gate array), BER (bit error rate), FEC (forward error correction).
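As a small, concrete example of the simplest code family listed above (a sketch only, not the FEC actually proposed for the FSO link), the following C++ function encodes a 4-bit message into a Hamming(7,4) codeword, which lets the receiver locate and correct any single flipped bit per block:

```cpp
#include <bitset>
#include <cstdint>

// Hamming(7,4): data bits d1..d4 plus parity bits p1, p2, p3 placed at
// positions 1, 2 and 4 (1-indexed), so a single flipped bit can be located
// by recomputing the three parities at the receiver.
std::bitset<7> hamming74_encode(std::uint8_t data) {
    bool d1 = data & 1, d2 = data & 2, d3 = data & 4, d4 = data & 8;
    bool p1 = d1 ^ d2 ^ d4;   // covers codeword positions 3, 5, 7
    bool p2 = d1 ^ d3 ^ d4;   // covers codeword positions 3, 6, 7
    bool p3 = d2 ^ d3 ^ d4;   // covers codeword positions 5, 6, 7
    std::bitset<7> cw;
    cw[0] = p1; cw[1] = p2; cw[2] = d1;   // positions 1..3
    cw[3] = p3; cw[4] = d2; cw[5] = d3;   // positions 4..6
    cw[6] = d4;                           // position 7
    return cw;
}
```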
... This enhancement has expanded their capacity to effectively manage a broader spectrum of complex tasks, including social simulation (Park et al., 2023; Wang et al., 2023f; Hua et al., 2023), software development (Osika, 2023; Qian et al., 2023), game playing (Wang et al., 2023a; Wang et al., 2023d; Gong et al., 2023), and scientific research. In order to study the cooperative dynamics of autonomous agents more pertinently, we choose software development (Mills, 1976) as a representative scenario, due to its complexity that demands a blend of natural and programming language skills (Mills, 1976), the processuality that often requires an in-depth understanding of coding and continuous alterations (Barki et al., 1993), and the objectivity of code that can provide quantifiable feedback (Compton and Hauck, 2002). ...
... Designing and implementing FPGA-based systems require specialized knowledge of hardware description languages (HDLs) such as VHDL or Verilog, as well as familiarity with FPGA toolchains and development environments [5]. Additionally, the iterative design and verification process can be resource-intensive, potentially increasing development costs [6]. To mitigate these challenges, high-level synthesis (HLS) tools and development frameworks have emerged, simplifying FPGA programming and making it accessible to a broader range of developers [7]. ...
Article
Full-text available
This review paper explores the applications of digital signal processing (DSP) using field-programmable gate arrays (FPGAs). FPGAs have emerged as an ideal platform for implementing complex DSP algorithms due to their high flexibility, reconfigurability, and parallel processing capabilities. The paper begins with an overview of FPGA architecture and highlights the advantages of using FPGAs over other DSP technologies. Various applications of FPGAs in DSP are discussed, including digital filtering, fast Fourier transforms (FFT), encoding and decoding in communication systems, and power optimization. Case studies and practical examples from real-world projects are provided to demonstrate the efficiency and effectiveness of FPGAs in DSP applications. Finally, the paper addresses the challenges and limitations associated with using FPGAs for DSP and offers insights into future research directions in this field. This review underscores that FPGAs, by providing a powerful and flexible platform, can significantly enhance the performance and efficiency of DSP systems.
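To ground the digital-filtering use case discussed above, here is a minimal C++ sketch of a fixed-point FIR filter written in the flat, loop-based style that HLS tools typically map onto a pipelined multiply-accumulate datapath spread across FPGA DSP slices; the tap count, bit widths, and data layout are illustrative assumptions.

```cpp
#include <array>
#include <cstdint>

constexpr int NUM_TAPS = 8;

// One output sample of a direct-form FIR filter with 16-bit samples and
// coefficients. The shift register and the multiply-accumulate loop are the
// parts an HLS tool would unroll and pipeline into parallel DSP slices.
std::int32_t fir_step(std::int16_t x,
                      const std::array<std::int16_t, NUM_TAPS>& coeff,
                      std::array<std::int16_t, NUM_TAPS>& delay_line) {
    // Shift the delay line and insert the new sample.
    for (int i = NUM_TAPS - 1; i > 0; --i) {
        delay_line[i] = delay_line[i - 1];
    }
    delay_line[0] = x;

    // Multiply-accumulate over all taps.
    std::int32_t acc = 0;
    for (int i = 0; i < NUM_TAPS; ++i) {
        acc += static_cast<std::int32_t>(coeff[i]) * delay_line[i];
    }
    return acc;
}
```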
... Re-configurable computing [106]. ...
Article
Full-text available
Image segmentation, a crucial process of dividing images into distinct parts or objects, has witnessed remarkable advancements with the emergence of deep learning (DL) techniques. The use of layers in deep neural networks, like object form recognition in higher layers and basic edge identification in lower layers, has markedly improved the quality and accuracy of image segmentation. Consequently, DL-based image segmentation has become commonplace in areas such as video analysis, facial recognition, etc. Grasping the applications, algorithms, current performance, and challenges is crucial for advancing DL-based medical image segmentation. However, there is a lack of studies delving into the latest state-of-the-art developments in this field. Therefore, this survey aimed to thoroughly explore the most recent applications of DL-based medical image segmentation, encompassing an in-depth analysis of various commonly used datasets, pre-processing techniques and DL algorithms. This study also investigated the state-of-the-art advancement done in DL-based medical image segmentation by analyzing their results and experimental details. Finally, this study discussed the challenges and future research directions of DL-based medical image segmentation. Overall, this survey provides a comprehensive insight into DL-based medical image segmentation by covering its application domains, model exploration, analysis of state-of-the-art results, challenges, and research directions—a valuable resource for multidisciplinary studies.
... The NOP presents a new approach to logical-causal and factual-executional processing based on minimal collaborative and precise notifying entities. In general, this approach is more efficient and decoupled when compared to traditional, coupled, and sequential programming paradigms, whose languages are used for software development and, more recently, also for hardware development using High-Level Synthesis (HLS) tools [17][18][19][20][21][22][23][24][25][26]. ...
... Researchers have proposed randomization and shufflingbased defense techniques for several decades, ranging from the first on n-version programming [34] to reconfigurable software and networks [35]. However, this approach to cybersecurity was formalized only recently under the umbrella of Moving Target Defense [1]. ...
Article
Full-text available
Moving Target Defense and Cyber Deception emerged in recent years as two key proactive cyber defense approaches, contrasting with the static nature of the traditional reactive cyber defense. The key insight behind these approaches is to impose an asymmetric disadvantage for the attacker by using deception and randomization techniques to create a dynamic attack surface. Moving Target Defense (MTD) typically relies on system randomization and diversification, while Cyber Deception is based on decoy nodes and fake systems to deceive attackers. However, current Moving Target Defense techniques are complex to manage and can introduce high overheads, while Cyber Deception nodes are easily recognized and avoided by adversaries. This paper presents DOLOS, a novel architecture that unifies Cyber Deception and Moving Target Defense approaches. DOLOS is motivated by the insight that deceptive techniques are much more powerful when integrated into production systems rather than deployed alongside them. DOLOS combines typical Moving Target Defense techniques, such as randomization, diversity, and redundancy, with cyber deception and seamlessly integrates them into production systems through multiple layers of isolation. We extensively evaluate DOLOS against a wide range of attackers, ranging from automated malware to professional penetration testers, and show that DOLOS is effective in slowing down attacks and protecting the integrity of production systems. We also provide valuable insights and considerations for the future development of MTD techniques based on our findings.
... HLS research has been previously summarized from different perspectives: [24][25][26][27] and [28] have described the historical evolution of HLS tools, primarily focusing on industry adoption. Zhang and Ng [29], Compton and Hauck [30] and Cardoso et al. [31] focus on the dynamic-reconfiguration support of HLS tools. We summarize the field from a different perspective, filling a gap in the literature; namely HLS languages' abstractions, focusing on the clash of hardware and software traditional views. ...
Article
Full-text available
Modern embedded image processing deployment systems are heterogeneous combinations of general-purpose and specialized processors, custom ASIC accelerators and bespoke hardware accelerators. This paper offers a primer on hardware acceleration of image processing, focusing on embedded, real-time applications. We then survey the landscape of High Level Synthesis technologies that are amenable to the domain, as well as new-generation Hardware Description Languages, and present our ongoing work on IMP-lang, a language for early stage design of heterogeneous image processing systems. We show that hardware acceleration is not just a process of converting a piece of computation into an equivalent hardware system: that naive approach offers, in most cases, little benefit. Instead, acceleration must take into account how data is streamed throughout the system, and optimize that streaming accordingly. We show that the choice of tooling plays an important role in the results of acceleration. Different tools, depending on the underlying language paradigm, produce wildly different results across performance, size, and power consumption metrics. Finally, we show that bringing heterogeneous considerations to the language level offers significant advantages to early design estimation, allowing designers to partition their algorithms more efficiently, iterating towards a convergent design that can then be implemented across heterogeneous elements accordingly.
... In that work, data is obtained from some sensors and, subsequently, processing is also carried out in the Fog. In addition, the use of a Field Programmable Gate Array (FPGA) is proposed, a programmable device that can be configured according to the user's needs, allowing the development of very efficient systems [52]. This architecture is not limited to the development of a specific system, but it can be adapted to different use cases. ...
Article
Full-text available
Ambient assisted living (AAL) proposes a vision of the future in which older people can remain in their homes on their own for as long as possible, guaranteeing care and attention thanks to intelligent systems capable of making their lives easier. In parallel, the Internet of Things (IoT) proposes environments where different ‘things’ surrounding the user are able to communicate with each other through the Internet. This allows the creation of intelligent environments, which, in turn, are a requirement of AAL systems. Therefore, there is a relevant synergy between AAL and IoT, where the latter allows the creation of more intelligent and transparent AAL systems for users. This paper presents a systematic literature review (SLR) of AAL systems supported by IoT. We have explored aspects of interest such as the types of systems, the most popular technologies used in their development and the degree of compliance regarding the characteristics that any system of this type should have. Besides, the difficulty of evaluating user satisfaction due to the lack of real evidence is analyzed. This SLR, carried out according to the procedure proposed by Kitchenham, is based on a selection of 61 papers from among 643 initial results published between 2015 and 2020. As a result of the analysis conducted, several challenges and opportunities that remain open in the field of IoT-supported AAL have been outlined.
... The proposed acceleration in the literature [11] outperformed the software approach in terms of performance and flexibility and integrated OpenCV functions implemented on ARM processors. Several types of hardware-accelerated systems for machine vision applications based on FPGA implementations are described briefly in [13][14][15][16][17]. System [18]. ...
Article
Advancements in image and video processing are growing over the years for industrial robots, autonomous vehicles, cryptography, surveillance, medical imaging and computer-human interaction applications. One of the major challenges in real-time image and video processing is the execution of complex functions and high computational tasks. To overcome this issue, hardware acceleration of different filter algorithms for both image and video processing is implemented on a Xilinx Zynq®-7000 System-on-Chip (SoC) device, which consists of dual-core Cortex™-A9 processors and provides the computing capability to work with software libraries using Vivado® High-Level Synthesis (HLS). The accelerated object detection algorithms include the Sobel-Feldman filter, posterize and threshold filter algorithms, implemented at 1920 x 1080 image resolution for real-time object detection. The implementation results exhibit effective resource utilization, such as 45.6% of logic cells, 51% of Look-up Tables (LUTs), 29.47% of Flip-flops, 15% of Block RAMs and 23.63% of DSP slices at a 100 MHz clock frequency, compared with previous works. There are a few reasons why tracking is preferable to detecting objects in each frame. When there are several objects, tracking helps maintain the identity of the various items across frames. Object detection may fail in some instances, but tracking may still be achievable, since it takes into account the location and appearance of the object in the previous frame. The key hurdles in real-time image and video processing applications are object tracking and motion detection. Some tracking algorithms are extremely fast because they perform a local search rather than a global search. Tracking algorithms such as meanshift, Regional Neural Network probabilistic data association, particle filter, nearest neighbor, Kalman filter and interactive multiple model (IMM) are available to estimate and predict the state of a system.
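For context on the simplest of the accelerated kernels mentioned above, the following plain C++ Sobel-Feldman sketch computes a gradient-magnitude image from an 8-bit grayscale frame; it shows the per-pixel arithmetic an HLS tool would pipeline, not the article's Vivado HLS code, and the row-major layout and saturation rule are assumptions.

```cpp
#include <vector>
#include <cstdint>
#include <cstdlib>
#include <algorithm>

// Sobel-Feldman edge magnitude for an 8-bit grayscale image stored row-major.
// Border pixels are left at zero for simplicity.
std::vector<std::uint8_t> sobel(const std::vector<std::uint8_t>& img, int w, int h) {
    std::vector<std::uint8_t> out(img.size(), 0);
    for (int y = 1; y < h - 1; ++y) {
        for (int x = 1; x < w - 1; ++x) {
            auto p = [&](int dx, int dy) {
                return static_cast<int>(img[(y + dy) * w + (x + dx)]);
            };
            // Horizontal and vertical 3x3 Sobel kernels.
            int gx = -p(-1, -1) - 2 * p(-1, 0) - p(-1, 1)
                     + p( 1, -1) + 2 * p( 1, 0) + p( 1, 1);
            int gy = -p(-1, -1) - 2 * p(0, -1) - p(1, -1)
                     + p(-1,  1) + 2 * p(0,  1) + p(1,  1);
            int mag = std::min(255, std::abs(gx) + std::abs(gy)); // cheap |G| approximation
            out[y * w + x] = static_cast<std::uint8_t>(mag);
        }
    }
    return out;
}
```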
... Implementing GPUs as overlay architectures on Field Programmable Gate Array (FPGA) devices could grant a higher degree of flexibility and customizability. FPGAs can provide reconfigurable hardware platforms that are capable of executing a wide number of existing algorithms [5], with the only drawbacks compared to custom accelerators being a lower operating frequency and higher power consumption. However, the large majority of the systems employing these devices are not taking full advantage of the additional reconfiguration capabilities, namely Partial Reconfiguration (PR) [6]- [8], which can allow for specialisation of part of the system architecture. ...
... Reconfiguration is commonly used in specialized logic to reach adaptivity. Designing and optimizing reconfigurable digital systems has historically been a challenge for developers [34]. Such a challenge is even harder for modern CPSs which may have to cope with strict and multiple constraints and requirements, such as extremely low power behavior or real-time reactivity. ...
Article
Full-text available
Image and video processing is one of the main driving application fields for the latest technology advancement of computing platforms, especially considering the adoption of neural networks for classification purposes. With the advent of Cyber Physical Systems, the design of devices for efficiently executing such applications became more challenging, due to the increase of the requirements to be considered, of the functionalities to be supported, as well as to the demand for adaptivity and connectivity. Heterogeneous computing and design automation are thus becoming essential. The former guarantees a variegated set of features under strict constraints (e.g., by adopting hardware acceleration), and the latter limits development time and cost (e.g., by exploiting model-based design). In this context, the literature is still lacking adequate tooling for the design and management of neural network hardware accelerators, which can be adaptable and customizable at runtime according to the user needs. In this work, a novel almost automated toolchain based on the Open Neural Network eXchange format is presented, allowing the user to shape adaptivity right on the network model and to deploy it on a runtime reconfigurable accelerator. As a proof of concept, a Convolutional Neural Network for human/animal classification is adopted to derive a Field Programmable Gate Array accelerator capable of trading execution time for power by changing the resources involved in the computation. The resulting accelerator, when necessary, can consume 30% less power on each layer, taking about 8% more time overall to classify an image.
... Reconfigurable systems have a wide range of advantages in terms of multi-ability (in various configurations, the system can execute multiple functions at different times), evolution (the system's configuration can be altered by removing, substituting, and adding new elements), and survivability (the system can still work despite a potential failure of one or more components) [1]. Reconfigurable systems have been widely adopted in various applications, including reconfigurable manufacturing systems [2], field-programmable gate arrays [3], various software concepts [4], aerospace [5], etc. One of the main applications of reconfigurable systems is demonstrated in self-reconfigurable robots, which can modify their morphology manually or automatically depending on the situation or environment. ...
Article
Full-text available
Reconfigurable robots are suitable for cleaning applications due to their high flexibility and ability to change shape according to environmental needs. However, continuous change in morphology is not an energy-efficient approach, given the limited battery capacity. This paper presents a metaheuristic-based framework to identify the optimal morphology of a reconfigurable robot, aiming to maximize the area coverage and minimize the energy consumption in the given map. The proposed approach exploits three different metaheuristic algorithms, namely, SMPSO, NSGA-II, and MACO, to generate the optimal morphology for every unique layout of a two-dimensional grid map by considering the path-length as the energy consumption. The novel feature of our approach is the implementation of the footprint-based Complete Coverage Path Planning (CCPP) adaptable for all possible configurations of reconfigurable robots. We demonstrate the proposed method in simulations and experiments using a Tetris-inspired robot with four blocks named Smorphi, which can reconfigure into an infinite number of configurations by varying its hinge angle. The optimum morphologies were identified for three settings, i.e., 2D indoor maps with obstacles and free spaces. The optimum morphology is compared with the standard Tetris shapes in the simulation and the real-world experiment. The results show that the proposed framework efficiently produces non-dominated solutions for choosing the optimal energy-efficient morphologies.
... Researchers have proposed randomization and shufflingbased defense techniques for several decades, ranging from the first on n-version programming [33] to reconfigurable software and networks [34]. However, this approach to cybersecurity was formalized only recently under the umbrella of Moving Target Defense [1]. ...
Preprint
Full-text available
Moving Target Defense and Cyber Deception emerged in recent years as two key proactive cyber defense approaches, contrasting with the static nature of the traditional reactive cyber defense. The key insight behind these approaches is to impose an asymmetric disadvantage for the attacker by using deception and randomization techniques to create a dynamic attack surface. Moving Target Defense typically relies on system randomization and diversification, while Cyber Deception is based on decoy nodes and fake systems to deceive attackers. However, current Moving Target Defense techniques are complex to manage and can introduce high overheads, while Cyber Deception nodes are easily recognized and avoided by adversaries. This paper presents DOLOS, a novel architecture that unifies Cyber Deception and Moving Target Defense approaches. DOLOS is motivated by the insight that deceptive techniques are much more powerful when integrated into production systems rather than deployed alongside them. DOLOS combines typical Moving Target Defense techniques, such as randomization, diversity, and redundancy, with cyber deception and seamlessly integrates them into production systems through multiple layers of isolation. We extensively evaluate DOLOS against a wide range of attackers, ranging from automated malware to professional penetration testers, and show that DOLOS is highly effective in slowing down attacks and protecting the integrity of production systems. We also provide valuable insights and considerations for the future development of MTD techniques based on our findings.
... However, the complexity of this software grows with the demands of the processing load, compromising the software in terms of execution time. One possibility would be the use of an ASIC (Application Specific Integrated Circuit) [22]. ...
Conference Paper
Several areas of knowledge need ways to make computing systems more efficient in terms of reducing response time. In particular, programs that employ routines based on interval arithmetic need shorter computation times, because the mathematical operations in these algorithms take the interval bounds into account, at additional computational cost. Different applications avoid interval algorithms because the system's commitment to execution time is a determining factor for the problem. Thus, to improve the performance of applications that use interval arithmetic, this work presents the use of an electronic device with a reconfigurable architecture to parallelize the computation of elementary arithmetic operations. The use of parallelism together with reconfigurable computing was analyzed, with the aim of accelerating interval-arithmetic algorithms. The performance gains and the corresponding resource consumption produced by the reconfiguration flexibility of the proposed architecture show that the use of a Field Programmable Gate Array would be a possible solution, since it combines the flexibility of software with the performance of hardware.
Article
Digital technology is advancing rapidly, and four major forces are converging: Cloud Computing (CC), Big Data (BD), Artificial Intelligence (AI), and the Internet of Things (IoT). Each has already transformed many domains on its own, but combined they offer an opportunity to completely rethink how data is handled everywhere: how it is produced, organized, analyzed, and used. This is not just about better devices; it is the foundation for smart, highly efficient large-scale systems that meet today's needs. Cloud computing acts as the backbone of this convergence, providing scalable resources and storage on demand. It gives companies the ability to manage and analyze large volumes of data from IoT devices without heavy investment in on-premises equipment. Big data, characterized by its volume, velocity, variety, and veracity, complements cloud computing with the infrastructure and methodologies needed to process and make sense of the vast pools of data coming from IoT devices; together, they build a strong foundation for data-driven decision making.
Article
Full-text available
Propelled by advancements in artificial intelligence, the demand for field-programmable devices has grown rapidly in the last decade. Among various state-of-the-art platforms, programmable integrated photonics emerges as a promising candidate, offering a new strategy to drastically enhance computational power for data-intensive tasks. However, intrinsic weak nonlinear responses of dielectric materials have limited traditional photonic programmability to the linear domain, leaving out the most common and complex activation functions used in artificial intelligence. Here we push the capabilities of photonic field-programmability into the nonlinear realm by meticulous spatial control of distributed carrier excitations and their dynamics within an active semiconductor. Leveraging the architecture of photonic nonlinear computing through polynomial building blocks, our field-programmable photonic nonlinear microprocessor demonstrates in situ training of photonic polynomial networks with dynamically reconfigured nonlinear connections. Our results offer a new paradigm to revolutionize photonic reconfigurable computing, enabling the handling of intricate tasks using a polynomial network with unparalleled simplicity and efficiency.
Chapter
Reconfigurable computing is breaking down the barrier between hardware and software design technologies. The segregation between the two has become more and more fuzzy because reconfigurable computing has now made it possible for hardware to be programmed and software to be synthesized. Reconfigurable computing can also be viewed as a trade-off between general-purpose computing and application specific design. Given the architecture and design flexibility, reconfigurable computing has catalyzed the progress in hardware-software codesign technology and a vast number of application areas such as scientific computing, biological computing, artificial intelligence, signal processing, security computing, and control-oriented design, to name a few. In the introduction section of this article, we briefly introduce what reconfigurable computing is and why it is used. Then, the resulting enhancements of hardware-software codesign methods and the techniques, tools, platforms, design and verification methodologies of reconfigurable computing will be introduced in the background section. Furthermore, we will introduce and compare some reconfigurable computing architectures. Finally, the future trends and conclusions will also be given. This article is aimed at a broad audience, including both readers not particularly well grounded in computer architecture and technically oriented readers.
Article
Full-text available
In this work we present SkelCoRe, a codesign framework based on Cole's algorithmic skeletons. The framework is aimed at structured, reconfigurable parallel programming of high-performance software. The goal is for the programmer to transparently exploit parallelism on FPGA-based reconfigurable computing systems. Since algorithmic skeletons can encapsulate the details of parallel implementation and reconfiguration, they can help programmers write parallel programs independently of the architecture of the available reconfigurable platform. The main contribution of this work is a codesign framework using parameterizable components or intellectual property blocks (IP cores), following Cole's algorithmic-skeleton approach to provide high-level structured parallelism. These components implement parallel models of computation in both software and hardware in the form of IP blocks that enable development through hardware/software codesign. This gives the programmer a sufficient level of abstraction to easily move functionality between software and hardware during the design-space exploration stage. As a first step we implemented a reconfigurable skeleton based on the "Pipeline" parallel computation model, called PipeSkel. It was implemented as a parameterized high-level template for developing the hardware and software components of an application at the same level of abstraction.
Chapter
Full-text available
Hotspots are Network-on-Chip (NoC) routers or modules which occasionally receive packetized traffic at a higher rate than they can process. This phenomenon reduces the performance of an NoC, especially in the case of wormhole flow-control. Such situations may also lead to deadlocks, raising the need for a hotspot prevention mechanism. Such a mechanism can potentially enable the system to adjust its behavior and prevent hotspot formation, subsequently sustaining performance and efficiency. This chapter presents an Artificial Neural Network-based (ANN) hotspot prediction mechanism, potentially triggering a hotspot avoidance mechanism before the hotspot is formed. The ANN monitors buffer utilization and reactively predicts the location of an about-to-be-formed hotspot, allowing enough time for the system to react to these potential hotspots. The neural network is trained using synthetic traffic models, and evaluated using both synthetic and real application traces. Results indicate that a relatively small neural network can predict hotspot formation with accuracy ranging between 76% and 92%. Keywords: Network-on-Chip hotspots, Artificial Neural Networks, VLSI systems
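To illustrate the kind of predictor described above (a sketch under assumed input and layer sizes, not the chapter's trained network), the following C++ structure runs a tiny feed-forward network with one hidden layer over a window of buffer-utilization readings and outputs a hotspot probability:

```cpp
#include <array>
#include <cmath>

constexpr int INPUTS = 8;   // assumed: buffer-utilization samples from neighbouring routers
constexpr int HIDDEN = 4;

struct HotspotNet {
    std::array<std::array<double, INPUTS>, HIDDEN> w1;  // input -> hidden weights
    std::array<double, HIDDEN> b1;
    std::array<double, HIDDEN> w2;                      // hidden -> output weights
    double b2;

    // Returns the predicted probability that a hotspot is about to form.
    double predict(const std::array<double, INPUTS>& utilization) const {
        std::array<double, HIDDEN> h{};
        for (int j = 0; j < HIDDEN; ++j) {
            double sum = b1[j];
            for (int i = 0; i < INPUTS; ++i) sum += w1[j][i] * utilization[i];
            h[j] = 1.0 / (1.0 + std::exp(-sum));        // sigmoid activation
        }
        double out = b2;
        for (int j = 0; j < HIDDEN; ++j) out += w2[j] * h[j];
        return 1.0 / (1.0 + std::exp(-out));
    }
};
```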
Article
Full-text available
At present, blockchain technology is becoming one of the new and effective technologies for protecting information. Blockchain technology is widely applied in creating cryptocurrencies, the Internet of Things (IoT), performing secure financial transactions, healthcare, and other fields. Although blockchain is a technology with high cryptographic strength, a number of security problems are being identified in it. In particular, hash functions are an important component of blockchain technology and are based on one-way functions. The security of the hashing process depends mainly on the bit width of the hash-function algorithm and on the logical functions and computations performed within the hash function. The faster the hashing process is carried out, the higher the mining efficiency of the blockchain. For this reason, this work proposes a new scheme for optimizing the blockchain hashing algorithm based on the PRCA (Proactive Reconfigurable Computing Architecture) intelligent system.
Chapter
Blockchain is a cutting-edge innovation that has reshaped the way society connects and transacts. It can be characterized as a chain of blocks that stores data with digital signatures in a distributed and decentralized network. Years of hacking and exploitation of cyber security systems have repeatedly demonstrated how a determined cyber attacker can compromise military and civilian organizations alike. The danger of sophisticated weapon systems being damaged or disabled by non-kinetic effects has forced militaries to develop a long-term and demonstrably smart defense for military systems. Blockchain, and its as-yet untested military uses, can shift the security vulnerabilities of some cyber systems from a single-point-of-failure model, in which an attacker only needs to compromise one node to subvert the system, to a majority-compromise model, in which a malicious actor cannot exploit a single point of failure. Quantum technology translates the principles of quantum physics into technological applications. In general, quantum technology has not yet reached maturity; nevertheless, it could have critical implications for the future of military sensing, encryption, and communications, as well as for legislative oversight, authorizations, and appropriations.
Article
Full-text available
Over the past six decades, the computing systems field has experienced significant transformations, profoundly impacting society with transformational developments, such as the Internet and the commodification of computing. Underpinned by technological advancements, computer systems, far from being static, have been continuously evolving and adapting to cover multifaceted societal niches. This has led to new paradigms such as cloud, fog, edge computing, and the Internet of Things (IoT), which offer fresh economic and creative opportunities. Nevertheless, this rapid change poses complex research challenges, especially in maximizing potential and enhancing functionality. As such, to maintain an economical level of performance that meets ever-tighter requirements, one must understand the drivers of new model emergence and expansion, and how contemporary challenges differ from past ones. To that end, this article investigates and assesses the factors influencing the evolution of computing systems, covering established systems and architectures as well as newer developments, such as serverless computing, quantum computing, and on-device AI on edge devices. Trends emerge when one traces technological trajectory, which includes the rapid obsolescence of frameworks due to business and technical constraints, a move towards specialized systems and models, and varying approaches to centralized and decentralized control. This comprehensive review of modern computing systems looks ahead to the future of research in the field, highlighting key challenges and emerging trends, and underscoring their importance in cost-effectively driving technological progress.
Article
Full-text available
Orbital Angular Momentum (OAM) provides a new angular or mode dimension for wireless communications and offers an intriguing approach to anti-jamming. The unprecedented demands for high-quality and seamless wireless services impose continuous challenges on existing cellular networks. Applications like enhanced mobile broadband (eMBB), ultra-reliable and low latency communications (URLLC), and massive machine type communications (mMTC) services are pushing the evolution of cellular systems towards the fifth generation (5G). We propose to use the orthogonality of OAM modes for anti-jamming in wireless communications. In particular, we develop a mode hopping (MH) scheme for anti-jamming within a narrow frequency band. We derive the closed-form expression of the bit error rate (BER) for the multiple-user scenario with our developed MH scheme. Our developed MH scheme can achieve the same anti-jamming results within a narrow frequency band as the conventional wideband FH scheme. We explore the challenges in the design of next generation transport layer protocols (NGTP) in 6G Terahertz communication-based networks. Furthermore, we propose a mode-frequency hopping (MFH) scheme, which jointly uses our developed MH scheme and the conventional FH scheme to further decrease the BER for wireless communication. In contrast, our experiments on Reconfigurable Intelligent Surfaces (RIS) reveal the RIS as an economically simple, new type of ultra-thin metamaterial inlaid with multiple sub-wavelength scatterers. We report our observations on achieving favorable propagation conditions by controlling the phase shifts of the reflected waves at the surface such that the received signals are directly reflected towards the receivers without any extra cost in power sources or hardware. This provides a revolutionary new approach to actively improve link quality and coverage, which sheds light on the future 6G. Achieving high-quality channel links in cellular communications via the design and optimization of RIS construction is explored in this work as a novel RIS-based smart radio technique. Unlike traditional antenna arrays, three unique characteristics of RIS are revealed in this work. First, the built-in programmable configuration of RIS enables analog beamforming inherently, without extra hardware or signal processing. Second, the incident signals can be controlled to partly reflect off and partly transmit through the RIS simultaneously, adding more flexibility to signal transmission. Third, RIS has no digital processing capability to actively send signals, nor any radio frequency (RF) components. One of the considerations is the use of Terahertz communications, which aims to provide 1 Tbps (terabits per second) and air latency of less than 100 μs. Further, 6G networks are expected to provide for more stringent Quality of Service (QoS) and mobility requirements. As such, it is necessary to develop novel channel estimation and communication protocols, design joint digital and RIS-based analog beamforming schemes, and perform interference control via mixed reflection and transmission. The aforementioned innovative use cases call for redefining the requirements of the upcoming 6G technology. 5G technology has abundant potential, but it cannot satisfy the stringent rate-reliability-latency requirements of the new applications. This work also highlights that the requirements and KPIs of 6G technology will be stricter and more diverse.
For example, we discuss a scenario in which, while the 5G network already operates in the very high frequency mm-wave region, 6G could require even higher frequencies for operation. The 6G technology will focus on achieving higher peak data rates, seamless ubiquitous connectivity, non-existent latency, high reliability, and strong security and privacy to provide the ultimate user experience. A section is devoted to a comparative study of the KPIs of both 5G and 6G.
Article
Full-text available
Field programmable gate arrays (FPGA's) reduce the turnaround time of application-specific integrated circuits from weeks to minutes. However, the high complexity of their architectures makes manual mapping of designs time consuming and error prone thereby offsetting any turnaround advantage. Consequently, effective design automation tools are needed to reduce design time. Among the most important is logic synthesis. While standard synthesis techniques could be used for FPGA's, the quality of the synthesized designs is often unacceptable. As a result, much recent work has been devoted to developing logic synthesis tools targeted to different FPGA architectures. The paper surveys this work. The three most popular types of FPGA architectures are considered, namely those using logic blocks based on look-up tables, multiplexers and wide AND/OR arrays. The emphasis is on tools which attempt to minimize the area of the combinational logic part of a design since little work has been done on optimizing performance or routability, or on synthesis of the sequential part of a design. The different tools surveyed are compared using a suite of benchmark designs.
Article
Full-text available
VLSI cell placement problem is known to be NP complete. A wide repertoire of heuristic algorithms exists in the literature for efficiently arranging the logic cells on a VLSI chip. The objective of this paper is to present a comprehensive survey of the various cell placement techniques, with emphasis on standard cell and macro placement. Five major algorithms for placement are discussed: simulated annealing, force-directed placement, min-cut placement, placement by numerical optimization, and evolution-based placement. The first two classes of algorithms owe their origin to physical laws, the third and fourth are analytical techniques, and the fifth class of algorithms is derived from biological phenomena. In each category, the basic algorithm is explained with appropriate examples. Also discussed are the different implementations done by researchers.
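As a concrete hint of the first algorithm class surveyed above, the following C++ sketch shows the core loop of a simulated-annealing placer: propose a swap of two random cells, evaluate the change in an estimated wirelength cost, always accept improvements, accept uphill moves with a temperature-dependent probability, then cool. The cost model, cooling schedule, and data structures are simplified assumptions rather than any particular tool's implementation.

```cpp
#include <vector>
#include <random>
#include <cmath>
#include <functional>
#include <utility>

// One simulated-annealing run over a placement, where placement[i] is the slot
// assigned to cell i and `cost` returns the total estimated wirelength.
void anneal(std::vector<int>& placement,
            const std::function<double(const std::vector<int>&)>& cost,
            double t = 100.0, double cooling = 0.95, int moves_per_temp = 1000) {
    std::mt19937 rng(12345);
    std::uniform_int_distribution<std::size_t> pick(0, placement.size() - 1);
    std::uniform_real_distribution<double> accept(0.0, 1.0);

    double current = cost(placement);
    while (t > 0.01) {
        for (int m = 0; m < moves_per_temp; ++m) {
            std::size_t a = pick(rng), b = pick(rng);
            std::swap(placement[a], placement[b]);            // propose: swap two cells
            double next = cost(placement);
            double delta = next - current;
            // Always accept improvements; accept uphill moves with prob exp(-delta/t).
            if (delta <= 0.0 || accept(rng) < std::exp(-delta / t)) {
                current = next;
            } else {
                std::swap(placement[a], placement[b]);        // reject: undo the swap
            }
        }
        t *= cooling;                                         // geometric cooling schedule
    }
}
```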
Article
Full-text available
In designing FPGAs, it is important to achieve a good balance between the number of logic blocks, such as Look-Up Tables (LUTs), and wiring resources. It is difficult to find an optimal solution. In this paper, we present an FPGA design methodology to efficiently find well-balanced FPGA architectures. The method covers all aspects of FPGA development, from the architecture-decision process to physical implementation. It has been used to develop a new FPGA that can implement circuits that are twice as large as those implementable with the previous version but with half the number of logic blocks. This indicates that the methodology is effective in developing well-balanced FPGAs.
Conference Paper
Full-text available
Floorplanning is an important problem in FPGA circuit mapping. As FPGA capacity grows, new innovative approaches will be required for efficiently mapping circuits to FPGAs. In this paper we present a macro-based floorplanning methodology suitable for mapping large circuits to large, high-density FPGAs. Our method uses clustering techniques to combine macros into clusters, and then uses a tabu-search-based approach to place clusters while enhancing both circuit routability and performance. Our method is capable of handling both hard (fixed size and shape) macros and soft (fixed size and variable shape) macros. We demonstrate our methodology on several macro-based circuit designs and compare the execution speed and quality of results with commercially available CAE tools. Our approach shows a dramatic speedup in execution time without any negative impact on quality.
Conference Paper
Full-text available
A reconfigurable architecture optimized for media processing, based on 4-bit arithmetic logic units (ALUs) and interconnect, is described. Together, these allow the area devoted to configuration bits and routing switches to be about 50% of the area of the basic CHESS array, leaving the rest available for user-visible functional units. CHESS flexibility in application mapping is largely due to the ability to feed ALUs with instruction streams generated within the array, generous provision of embedded block random access memory, and the ability to trade routing switches for small memories.
Conference Paper
Full-text available
We present a general approach to the FPGA technology mapping problem that applies to any logic block composed of lookup tables (LUTs) and can yield optimal solutions. The connections between LUTs of a logic block are modeled by virtual switches, which define a set of multiple-LUT blocks (MLBs) called an MLB-basis. We identify the MLB-bases for various commercial logic blocks. Given an MLB-basis, we formulate FPGA mapping as a mixed integer linear programming (MILP) problem to achieve both the generality and the optimality objectives. We solve the MILP models using a general-purpose MILP solver, and present the results of mapping some ISCAS'85 benchmark circuits with a variety of commercial FPGAs. Circuits of a few hundred gates can be mapped in reasonable time using the MILP approach directly. Larger circuits can be handled by partitioning them prior to technology mapping. We show that optimal or provably near-optimal solutions can be obtained for the large ISCAS'85 benchmark circuits using partitions defined by their high-level functions.
Conference Paper
Full-text available
Boolean-based routing transforms the geometric FPGA routing task into a single, large Boolean equation with the property that any assignment of input variables that "satisfies" the equation (that renders the equation identically "1") specifies a valid routing. The formulation has the virtue that it considers all nets simultaneously, and the absence of a satisfying assignment implies that the layout is unroutable. Initial Boolean-based approaches to routing used Binary Decision Diagrams (BDDs) to represent and solve the layout problem. BDDs, however, limit the size and complexity of the FPGAs that can be routed, leading these approaches to concentrate only on individual FPGA channels. In this paper, we present a new search-based Satisfiability (SAT) formulation that can handle entire FPGAs, routing all nets concurrently. The approach relies on a recently developed SAT engine (GRASP) that uses systematic search with conflict-directed non-chronological backtracking, capable of handling very large SAT instances. We present the first comparisons of search-based SAT routing results to other routers, and offer the first evidence that SAT methods can actually demonstrate the unroutability of a layout. Preliminary experimental results suggest that this approach to FPGA routing is more viable than earlier BDD-based methods.
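To make the Boolean-routing idea concrete, the toy sketch below encodes a three-net routing instance as a satisfiability problem (each net picks one track, conflicting nets must differ) and finds a satisfying assignment by brute force; the nets, candidate tracks, and conflict pairs are illustrative assumptions, and the enumeration merely stands in for a real SAT engine such as GRASP.

```python
from itertools import product

# Toy Boolean-satisfiability view of detailed routing: every net must pick
# exactly one candidate track, and nets whose candidates overlap in the same
# channel segment must not pick the same track.  Brute-force enumeration
# stands in for a real SAT engine; the instance below is an illustrative
# assumption.

nets = {
    "n1": ["t0", "t1"],        # candidate tracks per net
    "n2": ["t1", "t2"],
    "n3": ["t0", "t2"],
}
# pairs of nets that occupy overlapping channel segments and therefore
# may not share a track
conflicts = [("n1", "n2"), ("n2", "n3"), ("n1", "n3")]

def satisfies(assignment):
    return all(assignment[a] != assignment[b] for a, b in conflicts)

def route():
    names = list(nets)
    for choice in product(*(nets[n] for n in names)):
        assignment = dict(zip(names, choice))
        if satisfies(assignment):
            return assignment
    return None          # UNSAT: the layout is provably unroutable

if __name__ == "__main__":
    result = route()
    print("routing:" if result else "unroutable", result or "")
```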
Conference Paper
Full-text available
The Embedded System Block (ESB) of the APEX20K programmable logic device family from Altera Corporation includes the capability of implementing product term macrocells in addition to flexibly configurable ROM and dual port RAM. In product term mode, each ESB has 16 macrocells built out of 32 product terms with 32 literal inputs. The ability to reconfigure memory blocks in this way represents a new and innovative use of resources in a programmable logic device, requiring creative solutions in both the hardware and software domains. The architecture and features of this Embedded System Block are described.
Conference Paper
Full-text available
Increasing design densities on large FPGAs and greater demand for performance have called for special-purpose tools such as floorplanners, performance-driven routers, and more. In this paper we present a floorplanning-based design mapping solution that is capable of mapping macro-cell-based designs as well as hierarchical designs onto FPGAs. The mapping solution has been tested extensively on a large collection of designs. We not only outperform state-of-the-art CAE tools from industry in terms of execution time but also achieve much better performance in terms of timing. These methods are especially suitable for mapping designs onto very large FPGAs.
Conference Paper
Full-text available
The National Adaptive Processing Architecture (NAPA) is a major effort to integrate the resources needed to develop teraops-class computing systems based on the principles of adaptive computing. The primary goals for this effort include: (1) the development of an example NAPA component which achieves an order of magnitude cost/performance improvement compared to traditional FPGA-based systems, (2) the creation of a rich but effective application development environment for NAPA systems based on the ideas of compile-time functional partitioning, and (3) significant improvement of the base infrastructure for effective research in reconfigurable computing. This paper emphasizes the technical aspects of the architecture that achieve the first goal, while illustrating key architectural concepts motivated by the second and third goals.
Conference Paper
Full-text available
This paper presents a new approach to synthesize user-specified regions of a program to reconfigurable hardware (HW), under the assumption of "virtual HW" support. The automation of this approach is supported by a compiler front-end and by an HW compiler under development. The front-end starts from the Java bytecodes and, therefore, supports any language that can be compiled to the JVM (Java Virtual Machine) model. It extracts from the bytecodes all the dependencies inside and between basic blocks. This information is stored in representation graphs more suitable for efficiently exploiting the existing parallelism in the program than those typically used in high-level synthesis. From the intermediate representations the HW compiler exploits the temporal partitions at the behavior level, resolves memory access conflicts, and generates the VHDL descriptions at register-transfer level that will be mapped into the reconfigurable HW devices.
Conference Paper
Although run-time reconfigurable systems have been shown to achieve very high performance, the speedups over traditional microprocessor systems are limited by the cost of configuration of the hardware. In this paper, we explore the idea of configuration caching. We present techniques to carefully manage the configurations present on the reconfigurable hardware throughout program execution. Using the presented strategies, we show that the number of required reconfigurations is reduced, lowering the configuration overhead. We extend these techniques to a number of different FPGA programming models, and develop both lower bound and realistic caching algorithms for these structures.
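A minimal sketch of the configuration-caching idea, assuming a small on-chip store and a hypothetical reference trace: it counts reconfigurations under LRU eviction and under Belady's off-line optimal policy, showing how managing the resident configurations lowers the reconfiguration count.

```python
# Minimal sketch of configuration caching: count how many reconfigurations
# an on-chip store of `capacity` configurations needs under LRU eviction
# versus Belady's off-line optimal policy.  The reference trace and the
# capacity are illustrative assumptions, not figures from the surveyed paper.

def misses_lru(trace, capacity):
    cache, misses = [], 0
    for cfg in trace:
        if cfg in cache:
            cache.remove(cfg)          # refresh recency
        else:
            misses += 1
            if len(cache) == capacity:
                cache.pop(0)           # evict least recently used
        cache.append(cfg)
    return misses

def misses_optimal(trace, capacity):
    cache, misses = set(), 0
    for i, cfg in enumerate(trace):
        if cfg in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            # evict the resident configuration reused farthest in the future
            def next_use(c):
                rest = trace[i + 1:]
                return rest.index(c) if c in rest else float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(cfg)
    return misses

if __name__ == "__main__":
    trace = ["A", "B", "C", "A", "B", "D", "A", "B", "C", "D"]
    print("LRU reconfigurations:    ", misses_lru(trace, capacity=2))
    print("optimal reconfigurations:", misses_optimal(trace, capacity=2))
```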
Article
Because of the growth in VLSI circuit complexity, the trend in design is towards divide-and-conquer schemes, in which circuits are composed of blocks, standard macros, or custom macros. On the other hand, to allow the implementation of large digital circuits, increased-capacity target FPGAs are organized hierarchically. In this paper, we present a hierarchical FPGA floorplanning method which takes into account both the hierarchy of the design and the hierarchy of the target. The method aims at minimizing the timing and balancing cost of the floorplan and is based on automatic detection of macro blocks and assignment of them to the target FPGA's hierarchical zones.
Article
The Transmogrifier C hardware description language is almost identical to the C programming language, making it attractive to the large community of C-language programmers. This paper describes the semantics of the language and presents a Transmogrifier C compiler that targets the Xilinx 4000 FPGA. The compiler is operational and has produced several working circuits, including a graphics display driver.
Chapter
In this paper a Hardware/Software partitioning algorithm is presented. Appropriate cost and performance estimation functions were developed, as well as techniques for their automated calculation. The partitioning algorithm, which explores the parallelism in acyclic code regions, is part of a larger tool kit specific to custom computing machines. The tool kit includes a parallelising compiler, a hardware/software partitioning program, and a set of programs for performance estimation and system implementation. It speeds up computationally intensive tasks using an FPGA-based processing platform to augment the functionality of the processor with new operations and parallel capacity. An example is used to demonstrate the proposed partitioning techniques.
Conference Paper
The purpose of this paper is to describe Splash 2, an attached FPGA-based processor array. A comparison is made between this system and Splash 1: both their programming and algorithmic applications are discussed. Splash 2 proved to be highly successful, notwithstanding the known dangers of Second System Syndrome.
Conference Paper
Most advanced forms of security for electronic transactions rely on the public-key cryptosystems developed by Rivest, Shamir and Adleman. Unfortunately, these systems are only secure while it remains difficult to factor large integers. The fastest published algorithms for factoring large numbers have a common sieving step. These sieves collect numbers that are completely factored by a set of prime numbers that are known in advance. Furthermore, the time required to execute these sieves currently dominates the runtime of the factoring algorithms. We show how the sieving process can be mapped to the Mojave configurable computing architecture. The mapping exploits unique properties of the sieving algorithms to fully utilize the bandwidth of a multiple bank interleaved memory system. The sieve has been mapped to a single programmable hardware unit on the Mojave computer, and achieves a clock frequency of 16 MHz. The full system implementation sieves over 28 times faster than an UltraSPARC Workstation. A simple upgrade to 8ns SRAMs will result in a speedup factor of 160.
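The following toy sketch shows the shape of that common sieving step under assumed parameters: it scans an interval, divides out the primes of a fixed factor base, and keeps the candidates that factor completely over the base (the smooth numbers); the factor base and interval are illustrative, and a hardware sieve streams this loop over interleaved memory banks rather than a Python list.

```python
# Toy sketch of the sieving step shared by the factoring algorithms above:
# scan an interval of candidate values, divide out the primes of a fixed
# factor base, and keep the candidates that factor completely over the base
# ("smooth" numbers).  The factor base and interval are illustrative
# assumptions.

FACTOR_BASE = [2, 3, 5, 7, 11, 13]

def smooth_candidates(start, length):
    residues = list(range(start, start + length))
    for p in FACTOR_BASE:
        first = (-start) % p                 # first index divisible by p
        for i in range(first, length, p):
            while residues[i] % p == 0:
                residues[i] //= p
    # a candidate is smooth iff everything was divided out
    return [start + i for i, r in enumerate(residues) if r == 1]

if __name__ == "__main__":
    print(smooth_candidates(1000, 200))
```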
Conference Paper
As custom computing machines evolve, it is clear that a major bottleneck is the slow interconnection architecture between the logic and memory. This paper describes the architecture of a custom computing machine that overcomes the interconnection bottleneck by closely integrating a fixed-logic processor, a reconfigurable logic array, and memory into a single chip, called OneChip-98. The OneChip-98 system has a seamless programming model that enables the programmer to easily specify instructions without additional complex instruction decoding hardware. As well, there is a simple scheme for mapping instructions to the corresponding programming bits. To allow the processor and the reconfigurable array to execute concurrently, the programming model utilizes a novel memory-consistency scheme implemented in the hardware. To evaluate the feasibility of the OneChip-98 architecture, a 32-bit MIPS-like processor and several performance enhancement applications were mapped to the Transmogrifier-2 field programmable system. For two typical applications, the 2-dimensional discrete cosine transform and the 64-tap FIR filter, we were capable of achieving a performance speedup of over 30 times that of a stand-alone state-of-the-art processor.
Conference Paper
An algorithm is presented for partitioning a design in time. The algorithm divides a large, technology-mapped design into multiple configurations of a time-multiplexed FPGA. These configurations are rapidly executed in the FPGA to emulate the large design. The tool includes facilities for optimizing the partitioning to improve routability, for fitting the design into more configurations than the depth of the critical path, and for compressing the critical path of the design into fewer configurations, both to fit the design into the device and to improve performance. Scheduling results are shown for mapping designs into an 8-configuration time-multiplexed FPGA and for architecture investigation for a time-multiplexed FPGA.
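A minimal sketch of partitioning a design in time, under an assumed netlist and per-configuration capacity: nodes are assigned to successive configurations in topological order, so every predecessor lands in the same or an earlier configuration; the real tool's routability, depth, and critical-path optimizations are not modelled here.

```python
from collections import defaultdict, deque

# Sketch of temporal partitioning: nodes of a technology-mapped design are
# assigned to successive configurations of a time-multiplexed FPGA in
# topological order, so every predecessor ends up in the same or an earlier
# configuration; a new configuration opens whenever the current one is full.
# The netlist and the capacity are illustrative assumptions.

def temporal_partition(nodes, edges, capacity):
    preds = defaultdict(set)
    succs = defaultdict(set)
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)

    # topological order (Kahn's algorithm)
    indeg = {n: len(preds[n]) for n in nodes}
    ready = deque(n for n in nodes if indeg[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for s in succs[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    configs, current = [], []
    for n in order:
        if len(current) == capacity:
            configs.append(current)
            current = []
        current.append(n)
    configs.append(current)
    return configs

if __name__ == "__main__":
    nodes = ["a", "b", "c", "d", "e", "f"]
    edges = [("a", "c"), ("b", "c"), ("c", "d"), ("c", "e"),
             ("d", "f"), ("e", "f")]
    for i, cfg in enumerate(temporal_partition(nodes, edges, capacity=2)):
        print("configuration", i, ":", cfg)
```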
Conference Paper
This paper describes the Transmogrifier-2, a second generation multi-FPGA system. The largest version of the system will comprise 16 boards that each contain two Altera 10K50 FPGAs, four I-cube interconnect chips, and up to 8 Mbytes of memory. The inter-FPGA routing architecture of the TM-2 uses a novel interconnect structure, a non-uniform partial crossbar, that provides a constant delay between any two FPGAs in the system. The TM-2 architecture is modular and scalable, meaning that various sized systems can be constructed from the same board, while maintaining routability and the constant delay feature. Other features include a system-level programmable clock that allows single-cycle access to off-chip memory, and programmable clock waveforms with resolution to 10ns. The first Transmogrifier-2 boards have been manufactured and are functional. They have recently been used successfully in some simple graphics acceleration applications.
Conference Paper
The demand for high-speed FPGA compilation tools has occurred for three reasons: first, as FPGA device capacity has grown, the computation time devoted to placement and routing has grown more dramatically than the compute power of the available computers. Second, there exists a subset of users who are willing to accept a reduction in the quality of result in exchange for a high-speed compilation. Third, high-speed compile has been a long-standing desire of users of FPGA-based custom computing machines, since their compile time requirements are ideally closer to those of regular computers. This paper focuses on the placement phase of the compile process, and presents an ultra-fast placement algorithm targeted to FPGAs. The algorithm is based on a combination of multiple-level, bottom-up clustering and hierarchical simulated annealing. It provides superior area results over a known high-quality placement tool on a set of large benchmark circuits, when both are restricted to a short run time. For example, it can generate a placement for a 100,000-gate circuit in 10 seconds on a 300 MHz Sun UltraSPARC workstation that is only 33% worse than a high-quality placement that takes 524 seconds using a pure simulated annealing implementation. In addition, operating in its fastest mode, this tool can provide an accurate estimate of the wirelength achievable with good quality placement. This can be used, in conjunction with a routing predictor, to very quickly determine the routability of a given circuit on a given FPGA device.
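The bottom-up clustering half of such a flow can be sketched as follows, with an assumed connectivity graph and cluster target: cells are greedily merged by connection weight, and the resulting clusters would then be placed (for instance by an annealing loop) far faster than the flat netlist.

```python
# Sketch of bottom-up clustering for fast placement: cells are greedily
# merged pairwise by connection weight until a target number of clusters
# remains.  The connectivity graph and the target count are illustrative
# assumptions.

def cluster(cells, weighted_edges, target):
    clusters = {c: {c} for c in cells}
    weight = dict(weighted_edges)            # (a, b) -> connection weight

    def pair_weight(x, y):
        return sum(w for (a, b), w in weight.items()
                   if (a in clusters[x] and b in clusters[y]) or
                      (a in clusters[y] and b in clusters[x]))

    while len(clusters) > target:
        names = list(clusters)
        x, y = max(((p, q) for i, p in enumerate(names) for q in names[i + 1:]),
                   key=lambda pair: pair_weight(*pair))
        clusters[x] |= clusters.pop(y)       # absorb y into x
    return list(clusters.values())

if __name__ == "__main__":
    cells = ["c0", "c1", "c2", "c3", "c4", "c5"]
    edges = [(("c0", "c1"), 3), (("c1", "c2"), 1), (("c2", "c3"), 4),
             (("c3", "c4"), 1), (("c4", "c5"), 5), (("c0", "c5"), 1)]
    print(cluster(cells, edges, target=3))
```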
Conference Paper
TIERS is a new pipelined routing and scheduling algorithm implemented in a complete VirtualWire™ compilation and synthesis system. TIERS is described and compared to prior work both analytically and quantitatively. TIERS improves system speed by as much as a factor of 2.5 over prior work. TIERS routing results for both Altera and Xilinx based FPGA systems are provided.
Conference Paper
Multi-FPGA systems (MFSs) are used as custom computing machines, logic emulators and rapid prototyping vehicles. A key aspect of these systems is their programmable routing architecture: the manner in which wires, FPGAs and Field-Programmable Interconnect Devices (FPIDs) are connected. Several routing architectures for MFSs have been proposed [Arno92] [Butt92] [Hauc94] [Apti96] [Vuil96], and previous research has shown that the partial crossbar is one of the best existing architectures [Kim96] [Khal97]. In this paper we propose a new routing architecture, called the Hybrid Complete-Graph and Partial-Crossbar (HCGP), which has superior speed and cost compared to a partial crossbar. The new architecture uses both hard-wired and programmable connections between the FPGAs. We compare the performance and cost of the HCGP and partial crossbar architectures experimentally, by mapping a set of 15 large benchmark circuits into each architecture. A customized set of partitioning and inter-chip routing tools were developed, with particular attention paid to architecture-appropriate inter-chip routing algorithms. We show that the cost of the partial crossbar (as measured by the number of pins on all FPGAs and FPIDs required to fit a design) is on average 20% more than the new HCGP architecture and as much as 35% more. Furthermore, the critical path delays for designs implemented on the partial crossbar were on average 9% longer than on the HCGP architecture and up to 26% longer. Using our experimental approach, we also explore a key architecture parameter associated with the HCGP architecture, the proportion of hard-wired connections versus programmable connections, to determine its best value.
Conference Paper
Striped FPGA [1], or pipeline-reconfigurable FPGA, provides hardware virtualization by supporting fast run-time reconfiguration. In this paper we show that the performance of a striped FPGA depends on the reconfiguration pattern, the run-time scheduling of configurations through the FPGA. We study two main configuration scheduling approaches: Configuration Caching and Data Caching. We present a quantitative analysis of these scheduling techniques to compute their total execution cycles, taking into account the overhead caused by the IO with the external memory. Based on the analysis we can determine which scheduling technique works better for a given application and for the given hardware parameters.
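A toy cycle-count model in the spirit of that analysis is sketched below; both formulas and all parameters are assumptions made for illustration, not the paper's quantitative analysis.

```python
# Toy cost model: executing `stages` pipeline stages over `items` data words
# on a striped fabric that holds `resident` stages at once.  Under
# configuration caching the resident stages stay put and the data is
# re-streamed from external memory for each group of stages; under data
# caching a block of data stays on chip and the stage groups are streamed
# past it.  All parameters and both formulas are illustrative assumptions.

def ceil_div(a, b):
    return -(-a // b)

def configuration_caching_cycles(stages, items, resident, reconfig, io_per_item):
    groups = ceil_div(stages, resident)
    # per group: load the group's configurations, then stream all items
    return groups * (resident * reconfig + items * (io_per_item + 1))

def data_caching_cycles(stages, items, resident, reconfig, io_per_item, block):
    groups = ceil_div(stages, resident)
    blocks = ceil_div(items, block)
    # per data block: bring the block on chip once, then stream every stage
    # group past it, reconfiguring for each group
    return blocks * (block * io_per_item + groups * (resident * reconfig + block))

if __name__ == "__main__":
    common = dict(stages=32, items=4096, resident=8, reconfig=16, io_per_item=1)
    print("configuration caching:", configuration_caching_cycles(**common))
    print("data caching:         ", data_caching_cycles(**common, block=512))
```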
Conference Paper
The goal of this paper is to perform a timing optimization of a circuit described by a network of cells on a target structure whose connection delays have discrete values following its hierarchy. The circuit is modelled by a set of timed cones whose delay histograms allow their classification into critical, potentially critical and neutral cones according to predicted delays. The floorplanning is then guided by this cone structuring and has two innovative features: first, it is shown that the placement of the elements of the neutral cones has no impact on timing results, thus a significant reduction is obtained; second, despite a greedy approach, a near-optimal floorplan is achieved in a large number of examples.
Conference Paper
The paper presents an operating system (OS) for custom computing machines (CCMs) based on the Xputer paradigm. Custom computing tries to combine traditional computing with programmable hardware, attempting to gain from the benefits of both adaptive software and optimized hardware. The OS, running as an extension to the actual host OS, allows greater flexibility in deciding what parts of the application should run on the configurable hardware with structural code and what on the host hardware with conventional software. This decision can be taken late, at run-time, and dynamically, in contrast to early partitioning and deciding at compile-time as used currently on CCMs. Thus the CCM can be used concurrently by multiple users or applications without knowledge of each other. This raises programming and using CCMs to levels close to modern OSes for sequential von Neumann processors.
Conference Paper
FPGA based data encryption provides greater flexibility than ASICs and higher performance than software. Because FPGAs can be reprogrammed, they allow a single integrated circuit to efficiently implement multiple encryption algorithms. Furthermore, the ability to program FPGAs at runtime can be used to improve the performance through dynamic optimization. This paper describes the application of partial evaluation to an implementation of the Data Encryption Standard (DES). Each end user of a DES session shares a secret key, and this knowledge can be used to improve circuit performance. Key-specific encryption circuits require fewer resources and have shorter critical paths than the completely general design. By applying partial evaluation to DES on a Xilinx XC4000 series device we have reduced the CLB usage by 45% and improved the encryption bandwidth by 35%.
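A minimal sketch of the partial-evaluation idea, on an assumed toy netlist rather than the DES datapath: once a key bit is a known constant, an XOR gate fed by it collapses to a wire or an inverter, which is where the resource and critical-path savings come from.

```python
# Minimal sketch of partial evaluation for key-specific encryption circuits:
# with a key bit fixed, an XOR gate fed by it collapses to a plain wire
# (key bit 0) or an inverter (key bit 1), so the specialized netlist needs
# fewer gates and has shorter paths.  The tiny netlist format below is an
# illustrative assumption, not the DES datapath.

def specialize_xor_with_key(netlist, key_bits):
    """netlist: list of ('xor', out, data_in, key_name) gates."""
    specialized = []
    for gate, out, data_in, key_name in netlist:
        assert gate == "xor"
        if key_bits[key_name] == 0:
            specialized.append(("wire", out, data_in))     # x ^ 0 == x
        else:
            specialized.append(("not", out, data_in))      # x ^ 1 == ~x
    return specialized

if __name__ == "__main__":
    netlist = [("xor", "y0", "x0", "k0"),
               ("xor", "y1", "x1", "k1"),
               ("xor", "y2", "x2", "k2")]
    key = {"k0": 1, "k1": 0, "k2": 1}
    for g in specialize_xor_with_key(netlist, key):
        print(g)
```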
Conference Paper
This paper presents a new LUT based technology mapping approach for delay optimisation. To optimise the circuit delay after layout, the wire delays are taken into account in our delay model. In addition, an effective approach is proposed to trade-off the CLB delays and the wire delays so as to minimise the whole circuit delay. The trade-off is achieved in two phases, mapping for area optimisation followed by new delay reduction techniques. Based on a standard set of benchmark examples, experimental results after PPR layout have shown that the proposed approach outperforms state-of-the-art approaches.
Conference Paper
Current design tools support parameterisation of circuits, but the parameters are fixed at compile-time. In contrast, the circuits discussed in this paper fix their parameters at run-time. Run-time parameterised circuits can potentially out-perform custom VLSI hardware by optimising the FPGA circuit for a specific instance of a problem rather than for a general class of problem. This paper discusses the design of run-time parameterised circuits, and presents a study of run-time parameterised circuits for finite field operations on the Xilinx XC6200. The paper includes a comparison with implementation on a self-timed version of the XC6200 architecture, which illustrates the potential benefits of self-timing for dynamically reconfigurable systems.
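A software sketch of the run-time parameterisation idea for finite-field operations, under assumed parameters: once the constant operand is known at run time, a GF(2^8) multiply-by-constant collapses to a fixed structure, modelled here by a precomputed table; the reduction polynomial (the AES field) and the constant are illustrative, not taken from the XC6200 designs discussed.

```python
# Sketch of run-time parameterisation for finite-field arithmetic: once the
# constant multiplicand is known, a GF(2^8) multiply-by-constant reduces to
# a fixed pattern of shifts and XORs, modelled here as a 256-entry table
# standing in for the specialized circuit.  The reduction polynomial
# x^8 + x^4 + x^3 + x + 1 (0x11B) and the constant are illustrative
# assumptions.

POLY = 0x11B

def gf_mul(a, b):
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:        # reduce modulo the field polynomial
            a ^= POLY
        b >>= 1
    return result

def specialize_multiplier(constant):
    # "configure" a multiplier for one constant, as a run-time
    # parameterised circuit would
    return [gf_mul(x, constant) for x in range(256)]

if __name__ == "__main__":
    mul_by_3 = specialize_multiplier(3)
    print(hex(mul_by_3[0x57]))       # 0x57 * 0x03 in GF(2^8)
    print(hex(gf_mul(0x57, 0x83)))   # general multiply for comparison
```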
Conference Paper
New techniques have been developed for the technology mapping of FPGAs containing more than one size of look-up table. The Xilinx 4000 series is one such family of devices. These have a very large share of the FPGA market, and yet the associated technology mapping problem has hardly been addressed in the literature. Our method extends the standard techniques of functional decomposition and network covering. For the decomposition, we have extended the conventional bin-packing (cube-packing) algorithms so that they produce two sizes of bins. We have also enhanced them to explore several packing possibilities, and to include cube division and cascading of nodes. The covering step is based on the concept of flow networks and cut computation. We devised a theory that reduces the flow network sizes so that a dynamic programming approach can be used to compute the feasible cuts in the network. An iterative selection algorithm can then be used to compute the set cover of the network. Experimental results show good performance for the Xilinx 4K devices (about 25% improvement over MOFL and 10% over comparable algorithms in SIS in terms of CLBs).
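The two-bin-size decomposition step can be sketched as first-fit-decreasing packing of cube input counts into an assumed mix of 4-input and 3-input bins; real cube packing also tracks shared inputs and logic sharing, which this sketch omits.

```python
# Sketch of cube packing with two bin sizes, in the spirit of mapping to a
# device that mixes 4-input and 3-input look-up tables: pack "cubes" (here
# represented only by their input counts) first-fit-decreasing into the
# available bins.  The cube sizes and the bin mix are illustrative
# assumptions.

def pack_two_bin_sizes(cube_inputs, big_bins, small_bins, big_cap=4, small_cap=3):
    bins = ([["4-LUT", big_cap, []] for _ in range(big_bins)] +
            [["3-LUT", small_cap, []] for _ in range(small_bins)])
    unpacked = []
    for size in sorted(cube_inputs, reverse=True):     # first-fit-decreasing
        for b in bins:
            if b[1] >= size:
                b[1] -= size          # consume free inputs in this bin
                b[2].append(size)
                break
        else:
            unpacked.append(size)     # no bin could take this cube
    return bins, unpacked

if __name__ == "__main__":
    cubes = [4, 3, 3, 2, 2, 2, 1, 1]
    bins, leftover = pack_two_bin_sizes(cubes, big_bins=2, small_bins=3)
    for kind, free, contents in bins:
        print(kind, "holds cube sizes", contents, "free inputs:", free)
    print("unpacked cubes:", leftover)
```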
Conference Paper
Pipeline morphing is a simple but effective technique for reconfiguring pipelined FPGA designs at run time. By overlapping computation and reconfiguration, the latency associated with emptying and refilling a pipeline can be avoided. We show how morphing can be applied to linear and mesh pipelines at both word-level and bit-level, and explain how this method can be implemented using Xilinx 6200 FPGAs. We also present an approach using morphing to map a large virtual pipeline onto a small physical pipeline, and the trade-offs involved are discussed.
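A small latency sketch of the morphing argument, assuming one cycle to reconfigure a stage and illustrative stage and item counts: overlapping reconfiguration with computation avoids most of the drain-and-refill penalty.

```python
# Small sketch of the latency argument for pipeline morphing on a linear
# pipeline, assuming one cycle to reconfigure a stage.  With drain-and-
# refill, the pipeline is emptied, every stage is reconfigured, and the
# pipeline is refilled before the second computation; with morphing, each
# stage is reconfigured as soon as the last item of the first computation
# leaves it, so reconfiguration overlaps useful work.  Stage and item
# counts are illustrative assumptions.

def drain_and_refill_cycles(stages, items_a, items_b):
    run_a = stages + items_a - 1   # fill + stream computation A
    reconfig = stages              # fabric idle while every stage is rewritten
    run_b = stages + items_b - 1   # refill + stream computation B
    return run_a + reconfig + run_b

def morphing_cycles(stages, items_a, items_b):
    # B's items follow one cycle behind A's last item, because each stage is
    # morphed in the cycle after that last item passes through it
    run_a = stages + items_a - 1
    return run_a + 1 + items_b

if __name__ == "__main__":
    print("drain and refill:", drain_and_refill_cycles(8, 100, 100))
    print("morphing:        ", morphing_cycles(8, 100, 100))
```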
Conference Paper
Domain-specific systems represent an important opportunity for FPGA-based systems. They provide a way to exploit the low-level flexibility of the device through the use of domain-specific libraries of specialized circuit elements. Moreover, reconfigurability of the FPGAs is also exploited by reusing these circuit elements to implement a wide variety of applications within the specific domain. Application areas such as image processing, signal processing, graphics, DNA analysis, and other areas are important areas that can use this approach to achieve high levels of performance at reasonable cost. Unfortunately, current CAD tools are a relatively poor match for the design needs of such systems. However, some of the additions listed in this paper should not be too difficult to implement and they would greatly ease the design of such systems.
Book
Details the complete Splash 2 project: the hardware and software systems, their architecture and implementation, and the design process by which the architecture evolved from an earlier version machine. In addition to the description of the machine, this book explains why Splash 2 was engineered. It illustrates several applications in detail, allowing you to gain an understanding of the capabilities and the limitations of this kind of computing device. The Splash 2 program is significant for two reasons. First, it is part of a complete computer system that achieves supercomputer-like performance on a number of different applications. The second significant aspect is that this large system is capable of performing real computations on real problems. In order to understand what happens when the application programmer designs for it, it is necessary to see the system as a whole. This book looks in depth at one of the handful of data points in the design space of this new kind of machine.
Article
Guaranteeing or even estimating the routability of a portion of a placed FPGA remains difficult or impossible in most practical applications. In this paper we develop a novel formulation of both routing and routability estimation that relies on a rendering of the routing constraints as a single large Boolean equation. Any satisfying assignment to this equation specifies a complete detailed routing. By representing the equation as a Binary Decision Diagram (BDD), we represent all possible routes for all nets simultaneously. Routability estimation is transformed to Boolean satisfiability, which is trivial for BDDs. We use the technique in the context of a perfect routability estimator for a global router. Experimental results from a standard FPGA benchmark suite suggest the technique is feasible for realistic circuits, but refinements are needed for very large designs.
Article
In this paper, the routing problem for two-dimensional (2-D) field programmable gate arrays of a Xilinx-like architecture is studied. We first propose an efficient one-step router that makes use of the main characteristics of the architecture. Then we propose an improved approach that couples two greedy heuristics designed to avoid an undesired decaying effect, a dramatic degradation of router performance in the near-completion stages. This phenomenon is commonly observed in results produced by conventional deterministic routing strategies that use a single optimization cost function. Consequently, our results are significantly improved in both the number of routing tracks and routing segments by applying only low-complexity algorithms. On the tested MCNC and industrial benchmarks, the total number of tracks used by the best known two-step global/detailed router is 28% more than that used by our proposed method.
Article
We consider a board-level routing problem applicable to FPGA-based logic emulation systems such as the Realizer System [Varghese et al. 1993] and the Enterprise Emulation System [Maliniak 1992] manufactured by Quickturn Design Systems. Optimal algorithms have been proposed for the case where all nets are two-terminal nets [Chan and Schlag 1993; Mak and Wong 1995]. We show how multiterminal nets can be handled by decomposition into two-terminal nets. We show that the multiterminal net decomposition problem can be modeled as a bounded-degree hypergraph-to-graph transformation problem where hyperedges are transformed to spanning trees. A network flow-based algorithm that solves both problems is proposed. It determines if there is a feasible decomposition and gives one whenever such a decomposition exists.
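A minimal sketch of the decomposition step, under an assumed pin-distance metric: each multiterminal net is replaced by the two-terminal nets of a spanning tree over its terminals; the paper's network-flow formulation additionally enforces degree bounds and decides feasibility, which this sketch ignores.

```python
# Sketch of the basic decomposition step: a multiterminal net is replaced by
# the two-terminal nets of a spanning tree over its terminals (here a
# minimum spanning tree under an assumed pin-distance metric, built with
# Prim's algorithm).  The degree bounds and network-flow feasibility check
# of the surveyed paper are not modelled.

def mst_decomposition(terminals, dist):
    in_tree = {terminals[0]}
    edges = []
    while len(in_tree) < len(terminals):
        best = min(((u, v) for u in in_tree
                    for v in terminals if v not in in_tree),
                   key=lambda e: dist(*e))
        edges.append(best)
        in_tree.add(best[1])
    return edges

if __name__ == "__main__":
    # terminals given as assumed (x, y) pin positions on the emulator boards
    pins = [(0, 0), (4, 1), (1, 5), (6, 6)]
    manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
    for u, v in mst_decomposition(pins, manhattan):
        print("two-terminal net:", u, "->", v)
```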