Article

A system-level FPGA design methodology for video applications with weakly-programmable hardware components


Abstract

High-performance video applications with real-time requirements play an important role in diverse application fields and are often executed on advanced parallel processors or GPUs. For embedded scenarios with strict energy constraints, such as automotive image processing, FPGAs represent a feasible, power-efficient computing platform. Unfortunately, their hardware-driven design concept results in long development cycles and impedes their acceptance in industrial practice. Additionally, verification of the FPGA design's correctness and its performance figures is unavailable until a very late development stage, which is critical during design space exploration and integration into complex embedded systems. Weakly-programmable architectures, supporting design-time and run-time reuse via flexible hardware components, represent a promising and efficient FPGA development approach. However, they currently lack suitable design and verification methodologies for real-time scenarios. Therefore, this paper proposes a system-level FPGA development concept for video applications with weakly-programmable hardware components. It combines rapid software prototyping with component-based FPGA design and advanced formal real-time analysis and code generation techniques. The presented approach enables early verification of the application's correctness, including exact performance figures. It provides software-level verification of weakly-programmable hardware components and automated assembly of the final hardware design. The developed tools and their usability are demonstrated by a binarization and a dense block matching application, the latter representing a basic preprocessing step in automotive image processing for driver assistance systems. Compared to a hand-optimized variant, the generated hardware design achieves comparable performance and chip area figures without requiring significant hardware integration effort.


... Applications of advanced computer vision algorithms include video histograms and the color conversion systems found in modern cameras, as well as many video surveillance systems [3,4]. Although live video processing capability is not necessary for many applications, some, such as the color conversion and histogram equalization used in autonomous driving systems, require the input stream from the cameras to be processed in real time so that signals can be sent back to the powertrain and steering control units in a timely manner [5,6,7]. FPGAs are a good platform choice for real-time video processing because of their energy efficiency and their potential for highly parallelized computation [7,8]. ...
... Although live video processing capability is not necessary for many applications, some, such as the color conversion and histogram equalization used in autonomous driving systems, require the input stream from the cameras to be processed in real time so that signals can be sent back to the powertrain and steering control units in a timely manner [5,6,7]. FPGAs are a good platform choice for real-time video processing because of their energy efficiency and their potential for highly parallelized computation [7,8]. However, hardware development typically consumes more time and human resources than comparable software development [20,22]. ...
... However, hardware development typically consumes more time and human resources than comparable software development [20,22]. For traditional FPGA-based development, a good knowledge of digital logic circuits and of Hardware Description Languages (HDLs) such as Verilog and VHDL is necessary to construct and configure Register-Transfer Level (RTL) circuits on an FPGA [7,17]. ...
Article
Full-text available
Programming at a high abstraction level is known for its benefits: it can facilitate the development of digital image and video processing systems. Recently, high-level synthesis (HLS) has played a significant role in this field of study. Real-time image and video processing solutions needing a high throughput rate are often implemented on dedicated hardware such as FPGAs. Previous studies relied on traditional design flows using VHDL and Verilog to synthesize and validate the hardware; these flows are technically complex and time consuming. This paper introduces an alternative, novel approach: a Model-Based Design (MBD) workflow based on HDL Coder, the Vision HDL Toolbox, Simulink, and MATLAB, aimed at accelerating the design of image and video solutions. The main purpose of the present paper is to study the complexity of design development and to minimize the development time (time to market, TM) of conventional FPGA design. In this paper, the development time (TM) is effectively reduced by 60% by automatically generating the IP cores and downloading the modeled design through the Xilinx tools, while retaining the advantages of FPGAs relative to other devices such as ASICs and GPUs.
... Even in the video processing domain, the possible contexts and requirements are many. [10,11] address some of these contexts. [10] proposes an image/video processing platform based on Xilinx's MicroBlaze soft processor. ...
... The focus in [10] is on the hardwired pixel processing pipeline, which is synthesized by Synfora's PICO high-level synthesis (HLS) tool, while the MicroBlaze simply initializes and controls the pipeline. [11] builds on the concept of "weakly-programmable architectures" and enables early verification of the application's correctness as well as its performance figures. It also offers automated assembly of the final hardware design. ...
... We overlap in advocating the use of HLS in the pixel processing pipeline. As in [11], we deal with early verification of correctness and throughput as well as automated assembly of the pixel processing pipeline. However, we do not use an architecture based on weakly-programmable components. ...
Article
Full-text available
This paper describes flexible tools and techniques that can be used to efficiently design/generate quite a variety of hardware IP blocks for highly parameterized real-time video processing algorithms. The tools and techniques discussed in the paper include host software, FPGA interface IP (PCIe, USB 3.0, DRAM), high-level synthesis, RTL generation tools, synthesis automation as well as architectural concepts (e.g., nested pipelining), an architectural estimation tool, and verification methodology. The paper also discusses a specific use case to deploy the mentioned tools and techniques for hardware design of an optical flow algorithm. The paper shows that in a fairly short amount of time, we were able to implement 11 versions of the optical flow algorithm running on 3 different FPGAs (from 2 different vendors), while we generated and synthesized several thousand designs for architectural trade-off.
... It is a low-end embedded platform based on an ARM Cortex-A9 combined with an Artix-7 FPGA as a coprocessor. It is used in computer vision applications, for control purposes, and in machine learning [47,48,70,[75][76][77][78][79]. ...
Thesis
Full-text available
Simulation has become an important industrial tool that reduces both the cost and the time of developing and testing new products. In the automotive industry, the use of simulation is being extended to virtual sensing: through an accurate model of the vehicle combined with a state estimator, variables that are difficult or costly to measure can be estimated. The virtual sensing approach is limited by the low computational power of in-vehicle hardware, due to the strict timing, reliability, and safety requirements imposed by automotive standards. With new-generation hardware, the computational power of embedded platforms has increased. These platforms are based on heterogeneous processors, where the main processor is combined with a co-processor such as a Field Programmable Gate Array (FPGA). This thesis explores the implementation of a state estimator based on a multibody model of a vehicle on new-generation embedded hardware. Different implementation strategies are tested in order to explore the advantages that an FPGA can provide. A new state-parameter-input observer is developed, providing accurate estimations. The proposed observer is combined with an efficient multibody model of a vehicle, achieving real-time execution.
... Several works based on this device have been presented for automotive applications. It is used in computer vision applications such as signal recognition and the detection of pedestrians or obstacles [27][28][29][30][31][32]. For control purposes and machine learning, the works presented in [17,33] are also based on the Zynq-7000 XC7Z020. ...
Article
Full-text available
New products in the automotive and aerospace industries must provide increased energy efficiency and exceed previous levels of performance, safety, and reliability. To meet these expectations, the role of simulation continues to grow. Within this context, simulation models are used in real-time embedded applications such as advanced real-time control and virtual sensing. Both applications require the execution of simulation models in real time on embedded hardware, whose limited computational power is a major challenge in the adoption of model-based embedded applications. This research explores the use of multibody models for real-time embedded applications. It describes different techniques to accelerate parts or all of the multibody computations on ARM-based and/or FPGA-based hardware.
... When the PCIe bus is utilized to transmit video data, it must meet real-time requirements [4,5,22,27]. Thoma Y. et al. [25] focus on fast communication between FPGA and GPU through PCIe and propose a solution in which data need not go through the host's main memory but are transmitted directly between the two devices by means of DMA. ...
Article
Full-text available
The PCI Express (PCIe) interface has been extensively used in high-speed digital systems for multimedia communication. With the migration of video processing algorithms from the host to embedded hardware, multi-channel video capturing systems produce not only several channels of raw video data but also several types of auxiliary data, such as analyzed data and compressed streams. To display multi-channel video in real time and exploit the auxiliary data, conventional transmission strategies are no longer applicable, because the heterogeneous data cause frequent interactions and waste PCIe bandwidth. In this paper, an efficient PCIe transmission method for multi-channel video is presented. Firstly, for the transmission of multi-type video data, a dynamic splicing mechanism is proposed that combines the video analyzed data and the compressed stream with the raw video, avoiding individual transmission of the auxiliary data. Secondly, as the spliced data come from different channels, a conditional prefetching mechanism is employed to determine whether an entire video frame exists in another channel's buffer, so that multi-channel video data can be transmitted in one pass where possible. Finally, in the host-side driver, a direct kernel buffer access technique is used to improve application I/O request packet (IRP) performance, and to ensure the efficiency of the conditional prefetching, a DMA circular queue buffer and a timer self-feedback monitor are designed to avoid possible visit bursts and abnormal interruptions. Experimental results demonstrate that, compared with conventional methods, the proposed method reduces interrupt interactions by 60%, increases the number of transmission channels by 94%, and increases the application IRP count by 54%.
The peak transmission speed of PCIe reaches 155 MB/s, which meets the transmission requirements of 7 channels of 704 × 576 YUV raw video and their auxiliary data over a single 1-lane PCIe endpoint.
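A quick sanity check of the quoted figures is straightforward. The sketch below recomputes the aggregate bandwidth for seven 704 × 576 YUV channels; the frame rate (25 fps) and sampling format (YUV 4:2:2, 2 bytes per pixel) are assumptions, as the abstract does not state them:

```python
# Back-of-the-envelope check of the PCIe bandwidth figures quoted above.
# Assumed (not stated in the abstract): 25 fps PAL-rate video and
# YUV 4:2:2 sampling at 2 bytes per pixel.
W, H = 704, 576
BYTES_PER_PIXEL = 2       # YUV 4:2:2
FPS = 25                  # PAL frame rate
CHANNELS = 7

frame_bytes = W * H * BYTES_PER_PIXEL       # 811,008 bytes per frame
per_channel_mb = frame_bytes * FPS / 1e6    # ~20.3 MB/s per channel
total_mb = per_channel_mb * CHANNELS        # ~141.9 MB/s for all 7 channels

print(f"per channel: {per_channel_mb:.1f} MB/s, total: {total_mb:.1f} MB/s")
```

Under these assumptions the raw video alone needs roughly 142 MB/s, which is consistent with the claim that a 155 MB/s 1-lane endpoint can carry seven channels plus their auxiliary data.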
Chapter
With the increasing importance of the digital protection of China's intangible cultural heritage and the rapid development of information sharing, the Internet has become an essential platform for communication and dissemination. One important function of digital protection is the creation of digital representations of intangible cultural heritage, and its evolution directly shapes the field of digital protection of intangible cultural heritage in China. The core issue is therefore to use parallel hardware and software processing to set digital standards for specific artistic values of cultural heritage. Based on an extensive literature review and experiments, this paper applies a particle swarm optimization algorithm for hardware/software partitioning to the data analysis of a digital protection system tailored to the characteristics of intangible cultural heritage, and designs a digital protection system with parallel hardware and software processing. It further analyzes the types of critical-path scheduling in the digital protection system and the attributes (software power consumption/W, hardware power consumption/W) of each node in the system's target structure. The experiments show that the software and hardware power consumption data for nodes 1–10 in the target structure are accurate, and that the system is suitable for the digital protection of intangible cultural heritage.
Chapter
This paper presents efficient low-power compact hardware designs for common image processing functions including the median filter, smoothing filter, motion blurring, emboss filter, sharpening, Sobel, Roberts, and Canny edge detection. The designs were described in Verilog HDL. The Xilinx ISE design suite was used for code simulation, synthesis, implementation, and chip programming. The designs were all evaluated in terms of speed, area (number of LUTs and registers), and power consumption. Post placement and routing (Post-PAR) results show that they need very little area and consume very little power while achieving good frame rates even for high-resolution HDTV frames. This makes them suitable for real-time applications with stringent area and power budgets.
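As an illustration of one of the listed functions, the following is a behavioral reference model of Sobel edge detection, of the kind commonly used to verify an HDL implementation against; this sketch is illustrative only and is not taken from the chapter's Verilog designs:

```python
# Behavioral reference model of Sobel edge detection (one of the filters
# listed above); illustrative only, not taken from the chapter.
GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # horizontal-gradient kernel
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]  # vertical-gradient kernel

def sobel_magnitude(img, y, x):
    gx = gy = 0
    for dy in range(3):
        for dx in range(3):
            p = img[y + dy - 1][x + dx - 1]
            gx += GX[dy][dx] * p
            gy += GY[dy][dx] * p
    return abs(gx) + abs(gy)  # |gx| + |gy|: cheap magnitude approximation

img = [[0, 0, 255, 255]] * 4  # a vertical edge between columns 1 and 2
print(sobel_magnitude(img, 1, 1))  # strong response at the edge
```

The |gx| + |gy| magnitude approximation is typical for hardware designs, since it avoids the square root of the exact gradient magnitude.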
Article
Full-text available
Contemporary embedded systems, which process streaming data such as signal, audio, or video data, are an increasingly important part of our lives. Shared resources (e.g. memories) help to reduce the chip area and power consumption of these systems, saving costs in high volume consumer products. Resource sharing, however, introduces new timing interdependencies between system components, which must be analyzed to verify that the initial timing requirements of the application domain are still met. Graphs with synchronous dataflow (SDF) semantics are frequently used to model these systems. In this paper, we present a method to integrate resource sharing into SDF graphs. Using these graphs and a throughput constraint, we will derive deadlines for resource accesses and the amount of memory required for an implementation. Then we derive the resource load directly from the SDF description, and perform a formal schedulability analysis to check if the original timing constraints are still met. Finally, we perform an evaluation of our approach using an image processing application and present our results.
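The balance equations underlying SDF analysis can be made concrete with a small sketch. The code below computes the repetition vector of a hypothetical chain-structured SDF graph; the paper's contribution of modeling shared resources on top of SDF is not reproduced here:

```python
from math import gcd
from fractions import Fraction

# Repetition vector of a chain-structured SDF graph via the balance
# equations (hypothetical graph; the resource-sharing extension the
# paper adds on top of SDF is not modeled here).
def repetition_vector(edges):
    """edges[i] = (p, c): actor i produces p tokens per firing on the
    edge to actor i+1, which consumes c tokens per firing."""
    r = [Fraction(1)]
    for p, c in edges:
        r.append(r[-1] * p / c)     # balance: r[i] * p == r[i+1] * c
    scale = 1
    for x in r:                     # least common multiple of denominators
        scale = scale * x.denominator // gcd(scale, x.denominator)
    return [int(x * scale) for x in r]

print(repetition_vector([(2, 3), (1, 2)]))  # -> [3, 2, 1]
```

In an iteration of this example graph, actor 0 fires three times, actor 1 twice, and actor 2 once; throughput and buffer-size analyses build on exactly this vector.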
Conference Paper
Full-text available
The increasing complexity of modern embedded streaming applications imposes new challenges on system designers. For instance, applications have evolved to the point that, in many cases, hard-real-time execution on multiprocessor platforms is needed to meet their timing requirements. Moreover, in some cases a set of such applications must run simultaneously on the same platform, with support for accepting new incoming applications at run time. Dealing with all these challenges significantly increases the complexity of system design, yet the design time must remain acceptable. This requires the development of novel systematic and automated design methodologies driven by these challenges. In this paper, we propose such a methodology for the automated design of an embedded multiprocessor system that can run multiple hard-real-time streaming applications simultaneously. Our methodology does not need the complex and time-consuming design space exploration phase present in most current state-of-the-art multiprocessor design frameworks. In contrast, it applies a very fast yet accurate schedulability analysis to determine the minimum number of processors needed to schedule the applications, and the mapping of the applications' tasks to processors. Furthermore, our methodology enables the use of hard-real-time multiprocessor scheduling theory to schedule the applications in a way that guarantees temporal isolation and a given throughput for each application. We evaluate an implementation of our methodology on a set of real-life streaming applications and demonstrate that it can greatly reduce design time and effort while generating high-quality hard-real-time systems.
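As a simplified stand-in for the schedulability analysis described above, the sketch below determines a processor count by first-fit-decreasing partitioning of task utilizations onto unit-capacity processors; the paper's actual analysis is more refined, and the task set here is hypothetical:

```python
# First-fit-decreasing packing of task utilizations onto unit-capacity
# processors: a simplified stand-in for the fast schedulability analysis
# described above (task set is hypothetical, not from the paper).
def min_processors_first_fit(utilizations):
    bins = []  # current load of each processor
    for u in sorted(utilizations, reverse=True):
        for i, load in enumerate(bins):
            if load + u <= 1.0 + 1e-9:   # fits on an existing processor
                bins[i] += u
                break
        else:
            bins.append(u)               # open a new processor
    return len(bins)

tasks = [0.6, 0.5, 0.4, 0.3, 0.2]  # total utilization 2.0
print(min_processors_first_fit(tasks))  # -> 2
```

The appeal of such analyses over design space exploration is exactly what the abstract argues: they run in a fraction of a second even for large task sets.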
Conference Paper
Full-text available
Digital film processing is characterized by a resolution of at least 2K (2048 × 1536 pixels per frame at 30 bit/pixel and 24 pictures/s, a data rate of 2.2 Gbit/s); higher resolutions of 4K (8.8 Gbit/s) and even 8K (35.2 Gbit/s) are on their way. Real-time processing at this data rate is beyond the scope of today's standard and DSP processors, and ASICs are not economically viable due to the small market volume. As an answer to these challenges, an FPGA-based approach was followed in the FlexFilm project. The multi-board, multi-FPGA hardware/software architecture is based on Xilinx Virtex-II Pro FPGAs, which contain the reconfigurable image stream processing data path, large external SDRAM memories for multiple-frame storage, and a PCI Express communication backbone network. Different applications are supported on a single hardware platform by using different FPGA configurations. This paper focuses on the FlexWAFE framework, a component library consisting of parameterizable modules for real-time stream processing, including memory and communication controllers. Some of the library blocks' parameters are set at synthesis time via VHDL generics, while others are run-time configurable. This combination allows some flexibility without sacrificing FPGA area or speed.
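The quoted data rates follow directly from the stated frame parameters, as this small check shows (all figures are taken from the abstract itself):

```python
# Digital-film data rate recomputed from the frame parameters quoted
# above (2048 x 1536 pixels, 30 bit/pixel, 24 pictures/s).
def rate_gbit(w, h, bits_per_pixel=30, fps=24):
    return w * h * bits_per_pixel * fps / 1e9

r2k = rate_gbit(2048, 1536)
print(f"2K: {r2k:.2f} Gbit/s")  # ~2.26, quoted (rounded) as 2.2 Gbit/s
# 4K has 4x and 8K 16x the pixels of 2K; scaling the rounded 2.2 figure
# gives the quoted 8.8 and 35.2 Gbit/s.
```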
Conference Paper
Full-text available
With the emergence of accelerator devices such as multicores, graphics-processing units (GPUs), and field-programmable gate arrays (FPGAs), application designers are confronted with the problem of searching a huge design space that has been shown to have widely varying performance and energy metrics for different accelerators, different application domains, and different use cases. To address this problem, numerous studies have evaluated specific applications across different accelerators. In this paper, we analyze an important domain of applications, referred to as sliding-window applications, when executing on FPGAs, GPUs, and multicores. For each device, we present optimization strategies and analyze use cases where each device is most effective. The results show that FPGAs can achieve speedup of up to 11x and 57x compared to GPUs and multicores, respectively, while also using orders of magnitude less energy.
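The sliding-window pattern benchmarked in that study can be made concrete with a toy example: every output pixel is a function of a small input neighborhood, which is exactly the loop nest that FPGAs, GPUs, and multicores parallelize. A naive 3 × 3 mean filter (illustrative only, not from the paper):

```python
# The sliding-window pattern in its simplest form: a naive 3x3 mean
# filter where each output pixel depends on a small input neighborhood
# (pure-Python illustration; no FPGA/GPU code).
def sliding_window_mean(img, k=3):
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(r, h - r):          # interior pixels only
        for x in range(r, w - r):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    acc += img[y + dy][x + dx]
            out[y][x] = acc / (k * k)
    return out

img = [[float(x + y) for x in range(5)] for y in range(5)]
print(sliding_window_mean(img)[2][2])  # -> 4.0, the mean around (2, 2)
```

FPGAs do well on this pattern because the k × k inner loops unroll into a fixed pipeline fed by line buffers, so one output pixel can be produced per clock cycle.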
Conference Paper
Full-text available
Obstacle avoidance is one of the most important challenges for mobile robots as well as for future vision-based driver assistance systems. This task requires precise extraction of depth and the robust, fast detection of moving objects. To reach these goals, this paper treats vision as a process in space and time. It presents a powerful fusion of depth and motion information for image sequences taken from a moving observer. 3D position and 3D motion are estimated simultaneously for a large number of image points by means of Kalman filters, with no need for prior, error-prone segmentation. Thus, one obtains a rich 6D representation that allows the detection of moving obstacles even under partial occlusion of foreground or background.
Article
Full-text available
This paper presents the OpenDF framework and recalls that dataflow programming was once invented to address the problem of parallel computing. We discuss the problems with an imperative style, von Neumann programs, and present what we believe are the advantages of using a dataflow programming model. The CAL actor language is briefly presented and its role in the ISO/MPEG standard is discussed. The Dataflow Interchange Format (DIF) and related tools can be used for analysis of actors and networks, demonstrating the advantages of a dataflow approach. Finally, an overview of a case study implementing an MPEG-4 decoder is given.
Conference Paper
Full-text available
This article discusses the design of an application-specific MPSoC architecture dedicated to multiple target tracking (MTT). This application has its utility in driver assistance systems, more precisely in collision avoidance and warning systems. An automotive radar is used as the front-end sensor in our application. The article examines the tradeoffs that must be considered in realizing the entire MTT application in an embedded system. In our implementation of MTT, several independent parallel tasks have been identified and mapped onto a multiprocessor architecture to meet the deadlines imposed by the application. Our study demonstrates that the joint use of reconfigurable circuits (namely FPGAs) and MPSoCs facilitates the development of a flexible and efficient MTT system.
Conference Paper
Full-text available
Stream-oriented processing is an important methodology in FPGA-based parallel processing. Characteristics of stream-oriented computing include a high-data-rate flow from one or more data sources; a fixed-size, small stream payload (one byte to one word); compute-intensive operations on the data stream, usually in low-precision fixed point; access to small local memories holding coefficients and other constants; and occasional synchronization between computational phases. We describe language constructs, compiler technology, and hardware/software libraries embodying the Streams-C system, which has been developed to support stream-oriented computation on FPGA-based parallel computers. The language is implemented as a small set of library functions callable from a C language program. The Streams-C compiler synthesizes hardware circuits for multiple FPGAs as well as a multi-threaded software program for the control processor. Our system includes a functional simulation environment based on POSIX threads, allowing the programmer to simulate the collection of parallel processes and their communication at the functional level. Finally, we present an application written both in Streams-C and hand-coded in VHDL. Compared to the hand-crafted design, the Streams-C-generated circuit takes 3x the area and runs at half the clock rate. In terms of time to market, the hand-crafted design took an experienced hardware developer a month to develop, while the Streams-C design took a couple of days, for a productivity increase of 10x.
Article
Full-text available
We present cycle-static dataflow (CSDF), a new model for the specification and implementation of digital signal processing algorithms. The CSDF paradigm is an extension of synchronous dataflow that still allows static scheduling and, thus, a very efficient implementation of an application. In comparison with synchronous dataflow, it is more versatile because it also supports algorithms with a cyclically changing, but predefined, behavior. Our examples show that this capability results in a higher degree of parallelism and, hence, a higher throughput, shorter delays, and less buffer memory. Moreover, they indicate that CSDF is essential for modeling prescheduled components, like application-specific integrated circuits. Besides introducing the CSDF paradigm, we also derive necessary and sufficient conditions for the schedulability of a CSDF graph. We present and compare two methods for checking the liveness of a graph: the first checks the liveness of loops, and the second constructs a single-processor schedule for one iteration of the graph. Once schedulability is tested, a makespan-optimal schedule on a multiprocessor can be constructed. We also introduce the heuristic scheduling method of our graphical rapid prototyping environment (GRAPE).
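The cyclically changing rates of CSDF still admit a static balance computation. The following sketch computes, for a single edge with hypothetical cyclic rates (not an example from the paper), how many firings of each actor make up one graph iteration:

```python
from math import gcd

# Balance computation for a single CSDF edge A -> B whose rates change
# cyclically per firing (hypothetical rates, not from the paper).
# One iteration must contain whole rate cycles of each actor.
def csdf_repetitions(prod_cycle, cons_cycle):
    P, C = sum(prod_cycle), sum(cons_cycle)  # tokens moved per full cycle
    g = gcd(P, C)
    q_a = (C // g) * len(prod_cycle)  # firings of A per iteration
    q_b = (P // g) * len(cons_cycle)  # firings of B per iteration
    return q_a, q_b

# A produces 1 then 2 tokens on alternate firings; B always consumes 3.
print(csdf_repetitions([1, 2], [3]))  # -> (2, 1)
```

In the example, two firings of A produce 1 + 2 = 3 tokens, which one firing of B consumes, so the buffer returns to its initial state after each iteration, which is the condition that makes static scheduling possible.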
Article
Full-text available
For modern embedded systems in the realm of high-throughput multimedia, imaging, and signal processing, the complexity of embedded applications has reached a point where the performance requirements of these applications can no longer be supported by embedded system architectures based on a single processor. Thus, the emerging embedded system-on-chip platforms are increasingly becoming multiprocessor architectures. As a consequence, two major problems emerge, namely how to design and how to program such multiprocessor platforms in a systematic and automated way in order to reduce the design time and to satisfy the performance needs of applications executed on such platforms. As an efficient solution to these two problems, in this paper, we present the methodology and techniques implemented in a tool called Embedded System-level Platform synthesis and Application Mapping (ESPAM) for automated multiprocessor system design, programming, and implementation. ESPAM moves the design specification and programming from the Register Transfer Level and low-level C to a higher system level of abstraction. We explain how, starting from system-level platform, application, and mapping specifications, a multiprocessor platform is synthesized, programmed, and implemented in a systematic and automated way. The class of multiprocessor platforms we consider is introduced as well. To validate and evaluate our methodology, we used ESPAM to automatically generate and program several multiprocessor systems that execute three image processing applications, namely Sobel edge detection, Discrete Wavelet Transform, and Motion JPEG encoder. The performance of the systems that execute these applications is also presented in this paper.
Chapter
Ptolemy is an environment for simulation and prototyping of heterogeneous systems. It uses modern object-oriented software technology to model each subsystem in a natural and efficient manner, and to integrate these subsystems into a whole. Ptolemy encompasses all aspects of designing signal processing and communication systems, ranging from algorithms and communication strategies, through simulation, hardware and software design, and parallel computing, to generating real-time prototypes. The core of Ptolemy is a set of object-oriented class definitions that makes few assumptions about the system to be modeled; rather, standard interfaces are provided for generic objects, and more specialized application-specific objects are derived from these. Ptolemy has been used internally at Berkeley for approximately two years for a growing number of simulation efforts, such as signal processing, electric power network simulation, and wireless and broadband network simulation.
Article
High-performance computer architectures for advanced driver assistance systems have become increasingly important in automotive research in the last several years. In order to achieve an optimal and robust perception of the vehicle's surroundings, current driver assistance applications typically rely on multiple sensor systems that deliver large amounts of incoming data from different sensor types. Such sensors include optical systems, which consist of a multi-camera setup combined with complex preprocessing algorithms. These algorithms exhibit high computation and data transport demands, as real-time image processing of multiple input streams is a mandatory requirement for these systems. At the same time, however, future driver assistance systems must adhere to strict power consumption requirements and automotive cost constraints in order to be considered for integration in series vehicles. This paper addresses these power and cost problems and presents an FPGA-based high-performance computing platform combined with a flexible, weakly-programmable data flow architecture and an associated high-level prototyping framework, which targets an efficient acceleration of computation-intensive tasks in driver assistance applications. The usability and processing performance of the platform are demonstrated by an advanced Motion Estimation application, which represents a challenging preprocessing step in automotive image processing.
Article
A recent Gartner Dataquest study predicts that the total worldwide automotive semiconductor market will grow from $20.1 billion in 2007 to $25.9 billion by 2010. The study also predicts that revenue from automotive usage of FPGAs will triple to approximately $312 million during that same period [1]. Many of these FPGAs will be deployed in safety applications such as back-up cameras, lane departure warning systems, blind-spot warning systems, and adaptive cruise control. FPGAs will also be deployed in next-generation engine electronics, emissions control, navigation, and entertainment applications. Automotive systems engineers are adept at using Model-Based Design for implementing some of these embedded applications on DSPs and microcontrollers. Many of these engineers are new to FPGA design and waking up to a fragmented workflow that makes it harder to meet time-to-market and cost objectives. For example, engineers who are migrating their system designs from DSPs to FPGAs are discovering that additional verification steps, such as bit-true, cycle-accurate simulations, are required to ensure that the FPGA functions the same as the system specification. This is a time-consuming and error-prone activity involving file exchanges between the system designer and the FPGA designer. Geographically distributed teams face an even bigger challenge, since the system engineer and the FPGA designer may be many miles away from each other. This paper illustrates how Model-Based Design integrates the worlds of system designers, FPGA designers, and verification engineers to increase productivity and produce correct-by-construction designs that match the system specification.
Using the concept of executable design specification, this paper discusses how Model-Based Design streamlines both design and verification of FPGAs for automotive applications in two important automotive workflows: • FPGA design and production deployment to low-volume high-processing power applications such as driver-assistance and infotainment systems. • FPGA use for prototyping in high-volume applications such as engine and steering control, where the final production deployment will be an ASIC. In this workflow, the proof of concept work is done using FPGAs.
Conference Paper
Camera-based systems in series vehicles have gained in importance in the past several years, which is documented, for example, by the introduction of front-view cameras and applications such as traffic sign or lane detection by all major car manufacturers. Besides a pure or enhanced visualization of the vehicle's environment, camera systems have also been extensively used for the design and implementation of complex driver assistance functions in diverse research scenarios, as they offer the possibility to extract both depth and motion information of static and moving objects. However, the evolution of existing computation-intensive vision applications from research vehicles toward series integration is currently a challenging task, which is due to the absence of high-performance computer architectures that adhere to the existing strict power and cost constraints. This paper addresses this challenge and explores FPGA-based dense block matching, which enables the calculation of depth information and motion estimation on shared hardware resources, regarding its applicability in intelligent vehicles. This includes the introduction of design scalability in time and space, thereby supporting customized application implementations and multiple camera setups. The presented modular concept also enables enhancements with pre- and post-processing features, which can be utilized to refine the obtained matching results. Its usability has been evaluated in diverse application scenarios and reaches high-performance image processing results of up to 740 GOPS at an acceptable power consumption of 11 W, rendering it a suitable candidate for future series vehicles.
Article
Camera-based driver assistance systems have attracted the attention of all major automotive manufacturers in the past several years and are increasingly utilized to differentiate a vendor's vehicles from its competitors. The calculation of depth information and Motion Estimation can be considered as two fundamental image processing applications in these systems, which have already been evaluated in diverse research scenarios. However, in order to push these computation-intensive features towards series integration, future in-vehicle implementations must adhere to the automotive industry's strict power consumption and cost constraints. As an answer to this challenge, this paper presents a high-performance FPGA-based dense block matching solution, which enables the calculation of both object motion and the extraction of depth information on shared hardware resources. This novel single-design approach significantly reduces the amount of logic resources required, resulting in valuable cost and power savings. The acquired sensor information can be fused into 3D positions with an associated 3D motion vector, which enables a robust perception of the vehicle's environment. The modular implementation offers enhanced configuration features at design and execution time and achieves up to 418 GOPS at a moderate power consumption of 10 W, providing a flexible solution for a future series integration.
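Dense block matching of this kind typically minimizes a sum-of-absolute-differences (SAD) cost over a search window. The following full-search sketch in plain Python is illustrative only: the function names, image layout (list of rows), and window handling are assumptions, not the paper's FPGA implementation.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b)
                          for a, b in zip(ra, rb))

def best_match(ref, target, bx, by, n, search):
    """Find the displacement (dx, dy) minimizing the SAD between the
    n x n block at (bx, by) in `ref` and candidate blocks in `target`,
    within a +/- `search` window (exhaustive full search)."""
    block = [row[bx:bx + n] for row in ref[by:by + n]]
    best = (0, 0, float("inf"))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            # skip candidates that fall outside the target image
            if x < 0 or y < 0 or y + n > len(target) or x + n > len(target[0]):
                continue
            cand = [row[x:x + n] for row in target[y:y + n]]
            cost = sad(block, cand)
            if cost < best[2]:
                best = (dx, dy, cost)
    return best
```

The same SAD kernel serves both stereo disparity (horizontal-only search) and motion estimation (2D search), which is the resource-sharing idea the abstract describes.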
Article
This paper presents IMAPCAR, a 100 GOPS programmable highly parallel vision processor LSI consuming less than 2 W of power for in-vehicle vision tasks of driver assistance systems. First, the requirements of vision processors for driver assistance systems as well as the characteristics of vision tasks for safety are summarized. Next, features in the design of IMAPCAR are described in detail, which, compared with a previous design, improve the performance for major vision tasks by a factor of 2.5 while reducing power by 50%. Design choices taken by other in-vehicle vision processors are also compared and analyzed. Finally, technology perspectives of future in-vehicle vision processors are discussed. Keywords: Highly parallel processor architecture – Image processing – Image recognition – SIMD architecture – In-vehicle vision processor – Parallel language
Article
Pedestrian detection is one of the most important components in driver-assistance systems. In this paper, we propose a monocular vision system for real-time pedestrian detection and tracking during nighttime driving with a near-infrared (NIR) camera. Three modules (region-of-interest (ROI) generation, object classification, and tracking) are integrated in a cascade, and each utilizes complementary visual features to distinguish the objects from the cluttered background in the range of 20-80 m. Based on the common fact that the objects appear brighter than the nearby background in nighttime NIR images, efficient ROI generation is done based on the dual-threshold segmentation algorithm. As there is large intraclass variability in the pedestrian class, a tree-structured, two-stage detector is proposed to tackle the problem through training separate classifiers on disjoint subsets of different image sizes and arranging the classifiers based on Haar-like and histogram-of-oriented-gradients (HOG) features in a coarse-to-fine manner. To suppress the false alarms and fill the detection gaps, template-matching-based tracking is adopted, and multiframe validation is used to obtain the final results. Results from extensive tests on both urban and suburban videos indicate that the algorithm can produce a detection rate of more than 90% at the cost of about 10 false alarms/h and perform as fast as the frame rate (30 frames/s) on a Pentium IV 3.0-GHz personal computer, which also demonstrates that the proposed system is feasible for practical applications and enjoys the advantage of low implementation cost.
Conference Paper
In collision warning systems for automotive applications the response time of a system is very important, since a precise response is useless if it comes too late. In this paper a fast collision warning system is presented, which uses a stereo camera as a sensor. The algorithms used allow a fast response of the system by making use of parallel processing. The parallel processing algorithms have been implemented and tested using an Nvidia Tesla C1060 GPU, programmed using the Nvidia CUDA API. A processing-time comparison between the CPU-based and optimized GPU-based versions of the algorithms is also presented.
Conference Paper
This paper describes a new architecture and the corresponding implementation of a stereo vision system that covers the entire stereo vision process including noise reduction, rectification, disparity estimation, and visualization. Dense disparity estimation is performed using the non-parametric rank transform and semi-global matching (SGM), which is among the top performing stereo matching methods and outperforms locally-based methods in terms of quality of disparity maps and robustness under difficult imaging conditions. Stream-based processing of the SGM despite its non-scan-aligned, complex data dependencies is achieved by a scalable, systolic-array-based architecture. This architecture fulfills the demands of real-world applications regarding frame rate, depth resolution and low resource usage. The architecture is based on a novel two-dimensional parallelization concept for the SGM. An FPGA implementation on a Xilinx Virtex-5 generates disparity maps of VGA images (640×480 pixel) with a 128 pixel disparity range under real-time conditions (30 fps) at a clock frequency as low as 39 MHz.
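The core of SGM is a per-scanline cost aggregation recurrence. The pure-Python sketch below implements the standard recurrence along a single direction; the penalty parameters `p1`, `p2` and the list-of-lists cost layout are illustrative assumptions, and the paper's two-dimensional systolic-array parallelization is not modeled.

```python
def aggregate_scanline(costs, p1, p2):
    """Semi-global matching aggregation along one scanline direction:
    L(p, d) = C(p, d) + min(L(p-1, d), L(p-1, d-1)+P1,
                            L(p-1, d+1)+P1, min_k L(p-1, k)+P2)
            - min_k L(p-1, k)
    `costs[p][d]` is the matching cost at pixel p, disparity d."""
    n, dmax = len(costs), len(costs[0])
    L = [list(costs[0])]                    # first pixel: raw costs
    for p in range(1, n):
        prev = L[-1]
        prev_min = min(prev)                # min_k L(p-1, k)
        row = []
        for d in range(dmax):
            best = min(
                prev[d],                                        # same disparity
                (prev[d - 1] + p1) if d > 0 else float("inf"),  # small change
                (prev[d + 1] + p1) if d < dmax - 1 else float("inf"),
                prev_min + p2,                                  # large jump
            )
            # subtracting prev_min keeps the accumulated cost bounded
            row.append(costs[p][d] + best - prev_min)
        L.append(row)
    return L
```

A full SGM implementation sums such aggregated costs over several path directions (typically 4 or 8) before selecting the minimum-cost disparity per pixel.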
Article
New standards in signal, multimedia, and network processing for embedded electronics are characterized by computationally intensive algorithms and a need for high flexibility due to swift changes in specifications. In order to meet the demanding challenges of increasing computational requirements and stringent constraints on area and power consumption in the field of embedded engineering, there is a gradual trend towards coarse-grained parallel embedded processors. Furthermore, such processors are enabled with dynamic reconfiguration features for supporting time- and space-multiplexed execution of the algorithms. However, the formidable problem of efficiently mapping applications (mostly loop algorithms) onto such architectures has been a hindrance to their mass acceptance. In this paper we present (a) a highly parameterizable, tightly coupled, and reconfigurable parallel processor architecture together with the corresponding power breakdown and reconfiguration time analysis of a case study application, (b) a retargetable methodology for mapping of loop algorithms, (c) a co-design framework for modeling, simulation, and programming of such architectures, and (d) loosely coupled communication with a host processor.
Article
Reconfigurable systems can offer the high spatial parallelism and fine-grained, bit-level resource control traditionally associated with hardware implementations, along with the flexibility and adaptability characteristic of software. While reconfigurable systems create new opportunities for engineering and delivering high-performance programmable systems, the traditional approaches to programming and managing computations used for hardware systems (e.g., Verilog, VHDL) and software systems (e.g., C, Fortran, Java) are inappropriate and inadequate for exploiting reconfigurable platforms. To address this need, we develop a stream-oriented compute model, system architecture, and execution patterns which can capture and exploit the parallelism of spatial computations while simultaneously abstracting software applications from hardware details (e.g., timing, device capacity, and microarchitectural implementation details) and consequently allowing applications to scale to exploit newer, larger, and faster hardware platforms. Further, we describe hardware and software techniques that make this late-bound platform mapping viable and efficient.
Article
We consider the problem of document binarization as a pre-processing step for optical character recognition (OCR) for the purpose of keyword search of historical printed documents. A number of promising techniques from the literature for binarization, pre-filtering, and post-binarization denoising were implemented along with newly developed methods for binarization: an error diffusion binarization, a multiresolutional version of Otsu's binarization, and denoising by despeckling. The OCR in the ABBYY FineReader 7.1 SDK is used as a black box metric to compare methods. Results for 12 pages from six newspapers of differing quality show that performance varies widely by image, but that the classic Otsu method and Otsu-based methods perform best on average.
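The classic Otsu method referenced here selects the threshold that maximizes the between-class variance over the grayscale histogram. A minimal pure-Python sketch, assuming a 256-bin histogram (the paper's multiresolutional and error-diffusion variants are not shown):

```python
def otsu_threshold(hist):
    """Return the threshold t maximizing between-class variance
    w0 * w1 * (mu0 - mu1)^2 for a 256-bin grayscale histogram."""
    total = sum(hist)
    total_sum = sum(i * h for i, h in enumerate(hist))
    w0 = 0        # pixel count of the background class (<= t)
    sum0 = 0      # intensity sum of the background class
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mu0 = sum0 / w0                  # background mean
        mu1 = (total_sum - sum0) / w1    # foreground mean
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t
```

Binarization then maps each pixel to 0 or 1 by comparing it against the returned threshold.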
Conference Paper
This paper presents the design and implementation of a coarse-grained reconfigurable architecture, targeting digital signal processing applications. The proposed architecture is constructed from a mesh of resource cells, containing separated processing and memory elements that communicate via a hybrid interconnect network. Parameterizable design of resource cells enables flexible mapping of arbitrary applications at system compile-time, and the feature of dynamic reconfigurability provides mapping possibilities during system run-time to adapt to the current operational and processing conditions. Functionality and flexibility of the proposed architecture is demonstrated through mapping of a radix-22 FFT processor reconfigurable between 32 and 1024 points. Performance evaluation exhibits a great reconfigurability and execution time reduction when compared to a traditional DSP and ARM solution.
Conference Paper
Many real-time stereo vision systems are available on low-power platforms. They all either use a local correlation-like stereo engine or perform dynamic programming variants on a scan-line. However, when looking at high-performance global stereo methods as listed in the upper third of the Middlebury database, the low-power real-time implementations for these methods are still missing. We propose a real-time implementation of the semi-global matching algorithm with algorithmic extensions for automotive applications on a reconfigurable hardware platform, resulting in a low power consumption of under 3 W. The algorithm runs at 25 Hz, processing image pairs of size 750×480 pixels and computing stereo on a 680×400 image part with up to a maximum of 128 disparities.
Conference Paper
In this paper, we describe a simple language for parallel programming. Its semantics is studied thoroughly. The desirable properties of this language and its deficiencies are exhibited by this theoretical study. Basic results on parallel program schemata are given. We hope in this way to make a case for a more formal (i.e. mathematical) approach to the design of languages for systems programming and the design of operating systems, since there is wide disagreement among systems designers as to what are the best primitives for writing systems programs.
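The networks this language describes can be sketched as sequential processes that communicate only through FIFO channels with blocking reads, which is what makes the network's output deterministic regardless of scheduling. A minimal Python illustration using threads and queues; the process names and the `None` end-of-stream convention are illustrative assumptions, not part of the original formalism.

```python
import queue
import threading

def producer(out, values):
    # a process writes tokens to its output FIFO
    for v in values:
        out.put(v)
    out.put(None)  # end-of-stream marker

def scale(inp, out, k):
    # blocking reads on the input FIFO: the process waits until
    # a token is available, so the result is scheduling-independent
    while (v := inp.get()) is not None:
        out.put(v * k)
    out.put(None)

a, b = queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=producer, args=(a, [1, 2, 3])),
    threading.Thread(target=scale, args=(a, b, 10)),
]
for t in threads:
    t.start()
result = []
while (v := b.get()) is not None:
    result.append(v)
for t in threads:
    t.join()
```

However the two threads interleave, `result` is always `[10, 20, 30]` — the determinism property the theoretical study establishes.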
Conference Paper
Embedded multimedia systems often run multiple time-constrained applications simultaneously. These systems use multiprocessor systems-on-chip of which it must be guaranteed that enough resources are available for each application to meet its throughput constraints. This requires a task binding and scheduling mechanism that provides timing guarantees for each application independent of other applications while taking into account the available processor space, memory and communication bandwidth. Synchronous Dataflow Graphs (SDFGs) are used to model time-constrained multimedia applications. They allow modeling of cyclic, multi-rate dependencies between tasks. However, existing resource allocation techniques can only deal with acyclic and/or single-rate dependencies. Dependencies in an SDFG can be expressed in single-rate form, but then the problem size may increase exponentially, making resource allocation infeasible. This paper presents a new resource allocation strategy which works directly on SDFGs, building on an efficient technique to calculate throughput of a bound and scheduled SDFG. Experimental results show that the strategy is effective in terms of run-time and allocated resources.
Article
Editor's note:This article presents the history and evolution of HLS from research to industry adoption. The authors offer insights on why earlier attempts to gain industry adoption were not successful, why current HLS tools are finally seeing adoption, and what to expect as HLS evolves toward system-level design.—Andres Takach, Mentor Graphics
Article
Viper supports concurrent processing of audio, video, graphics, and communication data. Chip architecture: Viper receives, optionally decrypts, decodes, converts, and displays multiple media streams having different data formats. Besides MPEG-2 transport streams, the chip handles live video, audio, and various other stream types in compressed or uncompressed formats. Processing these media streams requires not only tremendous computational power but also real-time response...
Article
In this article the Autovision architecture is presented, a new Multi Processor System-on-Chip (MPSoC) architecture for future video-based driver assistance systems, using run-time reconfigurable hardware accelerator engines for video processing. According to various driving conditions (highway, city, sunlight, rain, tunnel entrance) different algorithms have to be used for video processing. These different algorithms require different hardware accelerator engines, which are loaded into the Autovision chip at run-time of the system, triggered by changing driving conditions. It was investigated how to use dynamic partial reconfiguration to load and operate the correct hardware accelerator engines in time, while removing unused engines in order to save precious chip area.
Article
The challenges posed by complex real-time digital image processing at high resolutions cannot be met by current state-of-the-art general-purpose or DSP processors, due to the lack of processing power. On the other hand, large arrays of FPGA-based accelerators are too inefficient to cover the needs of cost sensitive professional markets. We present a new architecture composed of a network of configurable flexible weakly programmable processing elements, Flexible Weakly programmable Advanced Film Engine (FlexWAFE). This architecture delivers both programmability and high efficiency when implemented on an FPGA basis. We demonstrate these claims using a professional next-generation noise reducer with more than 170G image operations/s at 80% FPGA area utilization on four Virtex II-Pro FPGAs. This article will focus on the FlexWAFE architecture principle and implementation on a PCI-Express board.
Article
The automotive market puts strict and often conflicting requirements on computer vision systems. On the one hand the algorithms require considerable computing power to work reliably in real-time and under a wide range of lighting conditions. On the other hand, the cost must be kept low, the package size must be small and the power consumption must be low. In addition, automotive qualified parts must be used both to withstand the harsh operating environment and to guarantee long product life. To meet all these conflicting requirements Mobileye developed the EyeQ, a complete 'system on a chip' (SoC) which has the computing power to support a variety of applications such as lane, vehicle and pedestrian detection. This paper describes the process of designing an ASIC to support a family of vision algorithms.
Article
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 153-166). Stream programs represent an important class of high-performance computations. Defined by their regular processing of sequences of data, stream programs appear most commonly in the context of audio, video, and digital signal processing, though also in networking, encryption, and other areas. Stream programs can be naturally represented as a graph of independent actors that communicate explicitly over data channels. In this work we focus on programs where the input and output rates of actors are known at compile time, enabling aggressive transformations by the compiler; this model is known as synchronous dataflow. We develop a new programming language, StreamIt, that empowers both programmers and compiler writers to leverage the unique properties of the streaming domain. StreamIt offers several new abstractions, including hierarchical single-input single-output streams, composable primitives for data reordering, and a mechanism called teleport messaging that enables precise event handling in a distributed environment. We demonstrate the feasibility of developing applications in StreamIt via a detailed characterization of our 34,000-line benchmark suite, which spans from MPEG-2 encoding/decoding to GMTI radar processing. We also present a novel dynamic analysis for migrating legacy C programs into a streaming representation. The central premise of stream programming is that it enables the compiler to perform powerful optimizations. We support this premise by presenting a suite of new transformations. 
We describe the first translation of stream programs into the compressed domain, enabling programs written for uncompressed data formats to automatically operate directly on compressed data formats (based on LZ77). This technique offers a median speedup of 15x on common video editing operations. We also review other optimizations developed in the StreamIt group, including automatic parallelization (offering an 11x mean speedup on the 16-core Raw machine), optimization of linear computations (offering a 5.5x average speedup on a Pentium 4), and cache-aware scheduling (offering a 3.5x mean speedup on a StrongARM 1100). While these transformations are beyond the reach of compilers for traditional languages such as C, they become tractable given the abundant parallelism and regular communication patterns exposed by the stream programming model.
Conference Paper
Streaming applications are often implemented as task graphs, in which data is communicated from task to task over buffers. Currently, techniques exist to compute buffer capacities that guarantee satisfaction of the throughput constraint if the amount of data produced and consumed by the tasks is known at design-time. However, applications such as audio and video decoders have tasks that produce and consume an amount of data that depends on the decoded stream. This paper introduces a dataflow model that allows for data-dependent communication, together with an algorithm that computes buffer capacities that guarantee satisfaction of a throughput constraint. The applicability of this algorithm is demonstrated by computing buffer capacities for an H.263 video decoder.
Conference Paper
SDF^3 is a tool for generating random Synchronous DataFlow Graphs (SDFGs), if desired with certain guaranteed properties like strong connectedness. It includes an extensive library of SDFG analysis and transformation algorithms as well as functionality to visualize them. The tool can create SDFG benchmarks that mimic DSP or multimedia applications.
Conference Paper
Motion estimation of a scene is an interesting problem in computer vision since it is the basis for the dynamic analysis of a scene. However, this task is computationally intensive for conventional processors. In this work, an FPGA-based hardware architecture for real-time motion estimation is proposed. The technique used for motion estimation is a variation of the optical flow algorithm where the problem is reformulated as a sum of overlapped basis functions and solved as a linear system. The proposed architecture is based on a systolic approach and is composed of parallel modules organized in a regular structure. The systolic processor accelerates the matrix operations required to achieve real-time performance. The architecture design is presented. Preliminary results are shown and discussed.
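The "solved as a linear system" step can be illustrated by the scalar least-squares formulation of optical flow, where image gradients within a window yield 2×2 normal equations A·[u, v]ᵀ = b. This is a hypothetical simplification of the paper's more general basis-function formulation, with illustrative flat lists of gradient samples:

```python
def optical_flow_window(ix, iy, it):
    """Least-squares optical flow (u, v) for one window.

    ix, iy, it are per-pixel spatial and temporal gradient samples;
    the brightness-constancy constraint Ix*u + Iy*v + It = 0 over the
    window yields the 2x2 normal equations solved here by Cramer's rule."""
    a11 = sum(g * g for g in ix)
    a22 = sum(g * g for g in iy)
    a12 = sum(gx * gy for gx, gy in zip(ix, iy))
    b1 = -sum(gx * gt for gx, gt in zip(ix, it))
    b2 = -sum(gy * gt for gy, gt in zip(iy, it))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-9:
        return None  # aperture problem: gradients do not constrain flow
    u = (a22 * b1 - a12 * b2) / det
    v = (a11 * b2 - a12 * b1) / det
    return u, v
```

The matrix accumulations (the sums forming A and b) are exactly the kind of regular arithmetic a systolic array accelerates well.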
Conference Paper
This paper presents a framework for driver-in-the-loop driver assistance systems. It is part of the "Smart Cars" project, which is a joint project between the Australian National University, NICTA and CSIRO. The requirements on a driver assistance system (DAS) are discussed as well as desirable properties of the user interface. To demonstrate the framework, a complete system capable of reading speed signs in real time, comparing them with the driver's gaze, and providing immediate feedback if the sign has not been noted by the driver is presented and experimentally evaluated.
Article
SymTA/S is a system-level performance and timing analysis approach based on formal scheduling analysis techniques and symbolic simulation. The tool supports heterogeneous architectures, complex task dependencies and context aware analysis. It determines system-level performance data such as end-to-end latencies, bus and processor utilisation, and worst-case scheduling scenarios. SymTA/S furthermore combines optimisation algorithms with system sensitivity analysis for rapid design space exploration. The paper gives an overview of current research interests in the SymTA/S project.
Article
Dataflow has proven to be an attractive computation model for programming digital signal processing (DSP) applications. A restricted version of dataflow, termed synchronous dataflow (SDF), that offers strong compile-time predictability properties, but has limited expressive power, has been studied extensively in the DSP context. Many extensions to synchronous dataflow have been proposed to increase its expressivity while maintaining its compile-time predictability properties as much as possible. We proposed a parameterized dataflow framework that can be applied as a meta-modeling technique to significantly improve the expressive power of any dataflow model that possesses a well-defined concept of a graph iteration. Indeed, the parameterized dataflow framework is compatible with many of the existing dataflow models for DSP, including SDF, cyclo-static dataflow, scalable synchronous dataflow, and Boolean dataflow. In this paper, we develop precise, formal semantics for parameterized synchronous dataflow (PSDF)—the application of our parameterized modeling framework to SDF—that allows data-dependent, dynamic DSP systems to be modeled in a natural and intuitive fashion. Through our development of PSDF, we demonstrate that desirable properties of a DSP modeling environment such as dynamic reconfigurability and design reuse emerge as inherent characteristics of our parameterized framework. An example of a speech compression application is used to illustrate the efficacy of the PSDF approach and its amenability to efficient software synthesis techniques. In addition, we illustrate the generality of our parameterized framework by discussing its application to cyclo-static dataflow, which is a popular alternative to the SDF model.
Article
Data flow is a natural paradigm for describing DSP applications for concurrent implementation on parallel hardware. Data flow programs for signal processing are directed graphs where each node represents a function and each arc represents a signal path. Synchronous data flow (SDF) is a special case of data flow (either atomic or large grain) in which the number of data samples produced or consumed by each node on each invocation is specified a priori. Nodes can be scheduled statically (at compile time) onto single or parallel programmable processors, so the run-time overhead usually associated with data flow evaporates. Multiple sample rates within the same system are easily and naturally handled. Conditions for correctness of an SDF graph are explained and scheduling algorithms are described for homogeneous parallel processors sharing memory. A preliminary SDF software system for automatically generating assembly language code for DSP microcomputers is described. Two new efficiency techniques are introduced: static buffering and an extension to SDF to efficiently implement conditionals.
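The a-priori production/consumption rates are what make static SDF scheduling possible: the balance equations r[u]·p = r[v]·c determine how often each node fires per graph iteration. A minimal sketch of solving them for the repetition vector, assuming a connected, consistent graph; the edge-tuple representation and function name are illustrative, and `math.lcm` requires Python 3.9+.

```python
from fractions import Fraction
from math import lcm

def repetition_vector(nodes, edges):
    """Solve the SDF balance equations r[u]*p = r[v]*c for the
    smallest positive integer repetition vector.
    edges is a list of (u, v, p, c): node u produces p tokens per
    firing on an arc consumed c tokens at a time by node v."""
    r = {n: None for n in nodes}
    r[nodes[0]] = Fraction(1)          # anchor one node's rate
    changed = True
    while changed:                     # propagate rates along edges
        changed = False
        for u, v, p, c in edges:
            if r[u] is not None and r[v] is None:
                r[v] = r[u] * p / c
                changed = True
            elif r[v] is not None and r[u] is None:
                r[u] = r[v] * c / p
                changed = True
    for u, v, p, c in edges:           # consistency check
        assert r[u] * p == r[v] * c, "inconsistent SDF graph"
    scale = lcm(*(f.denominator for f in r.values()))
    return {n: int(f * scale) for n, f in r.items()}
```

A static schedule then fires each node the computed number of times per iteration, returning every buffer to its initial token count.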
High costs of mask sets and design force industry change
  • P. Doe
System Generator for DSP
  • Xilinx Inc.
Caltrop—language report (draft). Technical memorandum
  • J. Eker
  • J. Janneck
WP416: Vivado design suite
  • T. Feist