Article

Exploring Shared Virtual Memory for FPGA Accelerators with a Configurable IOMMU

Abstract

A key enabler for the ever-increasing adoption of FPGA accelerators is the availability of frameworks that allow seamless coupling to general-purpose host processors. Embedded FPGA+CPU systems still rely heavily on copy-based host-to-accelerator communication, which complicates application development. In this paper, we present a hardware/software framework for enabling transparent, shared virtual memory for FPGA accelerators in embedded SoCs. It can use a hard-macro IOMMU if available, or a configurable soft-core IOMMU that we provide. We explore different TLB configurations and provide a comparison with other designs for shared virtual memory to gain insight into performance-critical IOMMU components. Experimental results using pointer-rich benchmarks show that our framework not only simplifies FPGA-accelerated application development but also achieves up to 13x speedup compared to traditional copy-based offloading.
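
To illustrate the programming-model difference behind such speedups, the hedged sketch below contrasts copy-based offloading of a pointer-rich data structure with zero-copy offloading under shared virtual memory. The API names (accel_malloc, accel_copy_in, accel_run, accel_offload_svm), the kernel identifier, and the serialization helper are hypothetical placeholders, not the framework's actual interface.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct node { uint32_t key; struct node *next; } node_t;

/* Hypothetical accelerator API, declared here for illustration only. */
extern void *accel_malloc(size_t bytes);
extern void  accel_copy_in(const void *buf, size_t bytes);
extern void  accel_run(int kernel_id);
extern void  accel_offload_svm(int kernel_id, void *virt_ptr);
extern void  serialize_list(const node_t *src, node_t *dst, size_t n_nodes);

enum { KERNEL_TRAVERSE = 0 };   /* illustrative kernel identifier */

/* Copy-based offload: the pointer-rich list must be flattened into a
 * contiguous staging buffer and copied to device-visible memory; results
 * would have to be copied back and de-serialized afterwards. */
void offload_copy_based(const node_t *list, size_t n_nodes)
{
    node_t *staging = accel_malloc(n_nodes * sizeof(node_t));
    serialize_list(list, staging, n_nodes);           /* flatten, fix up links */
    accel_copy_in(staging, n_nodes * sizeof(node_t));
    accel_run(KERNEL_TRAVERSE);
}

/* Zero-copy offload with shared virtual memory: the accelerator dereferences
 * the same virtual pointers as the host; the IOMMU translates its accesses. */
void offload_svm(node_t *list)
{
    accel_offload_svm(KERNEL_TRAVERSE, list);         /* just pass the pointer */
}
```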

Article
Due to the ongoing slowdown of Dennard scaling, heterogeneous hardware architectures are unavoidable for meeting the increasing demand for energy-efficient systems. However, one of the most important aspects that shape today’s computing landscape is the wide availability of software that can run on any system. Current applications that use accelerators, in contrast, are often especially tailored to a specific hardware setup and therefore not universally deployable. This is particularly true for reconfigurable logic, as its internal structure requires both the circuits and their integration to be designed. This makes reconfigurable accelerators inherently difficult to use and therefore less accessible for a general audience. Nevertheless, their balance of flexibility and efficiency puts them in a unique position between CPUs, GPUs, and ASICs. Therefore, one of the main challenges of future heterogeneous systems is to foster collaborative computing between these vastly different components while still being simple to use. Previous approaches mostly focused on subproblems instead of a holistic view of hardware and software in the context of commonplace usability. This paper analyzes the general demands on a reconfigurable platform and derives its requirements regarding accessibility and security. To this end, we investigate several key features such as hardware virtualization, system shared virtual memory, and the use of widespread programming paradigms. Then, we systematically build up such a platform based on the established ROCm GPU framework and its internal HSA standard. Finally, this new HERA methodology is demonstrated as a prototype.
Article
Emerging three-dimensional (3D) memory technologies, such as the Hybrid Memory Cube (HMC) and High Bandwidth Memory (HBM), provide high-bandwidth and massive memory-level parallelism. With the growing heterogeneity and complexity of computer systems (CPU cores and accelerators, etc.), efficiently integrating emerging memories into existing systems poses new challenges and requires detailed evaluation in a realistic computing environment. In this article, we propose MEG, an open source, configurable, cycle-exact, and RISC-V-based full-system emulation infrastructure using FPGA and HBM. MEG provides a highly modular hardware design and includes a bootable Linux image for a realistic software flow, so that users can perform cross-layer software-hardware co-optimization in a full-system environment. To improve the observability and debuggability of the system, MEG also provides a flexible performance monitoring scheme to guide the performance optimization. The proposed MEG infrastructure can potentially benefit broad communities across computer architecture, system software, and application software. Leveraging MEG, we present two cross-layer system optimizations as illustrative cases to demonstrate the usability of MEG. In the first case study, we present a reconfigurable memory controller to improve the address mapping of standard memory controller. This reconfigurable memory controller along with its OS support allows us to optimize the address mapping scheme to fully exploit the massive parallelism provided by the emerging three-dimensional (3D) memories. In the second case study, we present a lightweight IOMMU design to tackle the unique challenges brought by 3D memory in providing virtual memory support for near-memory accelerators. We provide a prototype implementation of MEG on a Xilinx VU37P FPGA and demonstrate its capability, fidelity, and flexibility on real-world benchmark applications. We hope MEG fills a gap in the space of publicly available FPGA-based full-system emulation infrastructures, specifically targeting memory systems, and inspires further collaborative software/hardware innovations.
Conference Paper
Full-text available
While emerging accelerator-centric architectures offer orders-of-magnitude performance and energy improvements, use cases and adoption can be limited by their rigid programming model. A unified virtual address space between the host CPU cores and customized accelerators can largely improve the programmability, which necessitates hardware support for address translation. However, supporting address translation for customized accelerators with low overhead is nontrivial. Prior studies either assume an infinite-sized TLB and zero page walk latency, or rely on a slow IOMMU for correctness and safety, which penalizes the overall system performance. To provide efficient address translation support for accelerator-centric architectures, we examine the memory access behavior of customized accelerators to drive the TLB augmentation and MMU designs. First, to support bulk transfers of consecutive data between the scratchpad memory of customized accelerators and the memory system, we present a relatively small private TLB design to provide low-latency caching of translations to each accelerator. Second, to compensate for the effects of the widely used data tiling techniques, we design a shared level-two TLB to serve private TLB misses on common virtual pages, eliminating duplicate page walks from accelerators working on neighboring data tiles that are mapped to the same physical page. This two-level TLB design effectively reduces page walks by 75.8% on average. Finally, instead of implementing a dedicated MMU which introduces additional hardware complexity, we propose simply leveraging the host per-core MMU for efficient page walk handling. This mechanism is based on our insight that the existing MMU cache in the CPU MMU satisfies the demand of customized accelerators with minimal overhead. Our evaluation demonstrates that the combined approach incurs only a 6.4% performance overhead compared to the ideal address translation.
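
The two-level translation scheme described above can be pictured with a short, hedged C sketch: a small private TLB per accelerator is probed first, a shared second-level TLB serves private misses, and only a miss in both triggers a page walk. The sizes, the direct-mapped organization, and the page_walk() helper are illustrative assumptions, not the paper's actual design parameters.

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT  12
#define L1_ENTRIES  16           /* small private TLB per accelerator   */
#define L2_ENTRIES  512          /* shared TLB serving all accelerators */

typedef struct { uint64_t vpn; uint64_t pfn; bool valid; } tlb_entry_t;

static tlb_entry_t l1_tlb[L1_ENTRIES];   /* one such array per accelerator in practice */
static tlb_entry_t l2_tlb[L2_ENTRIES];

extern uint64_t page_walk(uint64_t vpn); /* falls back to a page table walk */

uint64_t translate(uint64_t vaddr)
{
    uint64_t vpn = vaddr >> PAGE_SHIFT;

    tlb_entry_t *e1 = &l1_tlb[vpn % L1_ENTRIES];
    if (e1->valid && e1->vpn == vpn)                      /* private TLB hit  */
        return (e1->pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));

    tlb_entry_t *e2 = &l2_tlb[vpn % L2_ENTRIES];
    if (!(e2->valid && e2->vpn == vpn)) {                 /* shared TLB miss  */
        e2->vpn = vpn; e2->pfn = page_walk(vpn); e2->valid = true;
    }
    *e1 = *e2;                                            /* refill private TLB */
    return (e1->pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
}
```
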
Conference Paper
Full-text available
CPU-FPGA heterogeneous acceleration platforms have shown great potential for continued performance and energy efficiency improvement for modern data centers, and have captured great attention from both academia and industry. However, it is nontrivial for users to choose the right platform among various PCIe and QPI based CPU-FPGA platforms from different vendors. This paper aims to find out what microarchitectural characteristics affect the performance, and how. We conduct our quantitative comparison and in-depth analysis on two representative platforms: QPI-based Intel-Altera HARP with coherent shared memory, and PCIe-based Alpha Data board with private device memory. We provide multiple insights for both application developers and platform designers.
Article
Full-text available
This paper presents a sparse matrix partitioning strategy to improve the performance of SpMV on GPUs and multicore CPUs. This method has wide adaptability for different types of sparse matrices, unlike existing methods that only adapt to particular sparse matrices. In addition, our partitioning method obtains dense blocks by analyzing the probability distribution of non-zero elements in a sparse matrix, and results in a very low proportion of zero padding. We make the following significant contributions. (1) We present a partitioning strategy of sparse matrices based on probabilistic modeling of non-zero elements in a row. (2) We prove that our method has the highest mean density compared with other strategies according to certain given ratios of partition obtained from the computing powers of heterogeneous processors. (3) We develop a CPU-GPU hybrid parallel computing model for SpMV on GPUs and multicore CPUs in a heterogeneous computing platform. Our partitioning strategy has balanced load distribution and the performance of SpMV is significantly improved when a sparse matrix is partitioned into dense blocks using our method. The average performance improvement of our solution for SpMV is about 15.75 percent on multicore CPUs, compared to that of the other solutions. By considering the rows of a matrix in a unique order based on the probability mass function of the number of non-zeros in a row, the average performance improvement of our solution for SpMV is about 33.52 percent on GPUs and multicore CPUs of a heterogeneous computing platform, compared to that of the partitioning methods based on the original row order of a matrix.
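
For reference, the kernel being partitioned is sparse matrix-vector multiplication over a compressed sparse row (CSR) matrix; a plain, unpartitioned C version is sketched below. The probability-based partitioning into dense blocks itself is not reproduced here.

```c
#include <stddef.h>

/* y = A * x for a matrix A stored in CSR format. */
void spmv_csr(size_t n_rows,
              const size_t *row_ptr,   /* n_rows + 1 offsets into vals/col_idx */
              const size_t *col_idx,   /* column index of each non-zero        */
              const double *vals,      /* non-zero values                      */
              const double *x,         /* dense input vector                   */
              double *y)               /* dense output vector                  */
{
    for (size_t i = 0; i < n_rows; i++) {
        double acc = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            acc += vals[k] * x[col_idx[k]];   /* irregular access into x */
        y[i] = acc;
    }
}
```
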
Article
Full-text available
Decision tree classification (DTC) is a widely used technique in data mining algorithms known for its high accuracy in forecasting. As technology has progressed and available storage capacity in modern computers increased, the amount of data available to be processed has also increased substantially, resulting in much slower induction and classification times. Many parallel implementations of DTC algorithms have already addressed the issues of reliability and accuracy in the induction process. In the classification process, larger amounts of data require proportionately more execution time, thus hindering the performance of legacy systems. We have devised a pipelined architecture for the implementation of axis parallel binary DTC that dramatically improves the execution time of the algorithm while consuming minimal resources in terms of area. Scalability is achieved when connected to a high-speed communication unit capable of performing data transfers at a rate similar to that of the DTC engine. We propose a hardware accelerated solution composed of parallel processing nodes capable of independently processing data from a streaming source. Each engine processes the data in a pipelined fashion to use resources more efficiently and increase the achievable throughput. The results show that this system is 3.5 times faster than the existing hardware implementation of classification.
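
As a rough software analogue of the accelerated operation, the sketch below classifies one sample by walking an axis-parallel binary decision tree: each node compares a single attribute against a threshold until a leaf yields the class label. The node layout is an illustrative assumption, not the paper's hardware format.

```c
#include <stdint.h>

typedef struct {
    int16_t attr;        /* attribute index tested at this node; -1 for a leaf */
    float   threshold;   /* axis-parallel split value                          */
    int16_t left, right; /* child node indices                                 */
    int16_t label;       /* class label, valid only at leaves                  */
} dt_node_t;

/* Classify one feature vector by traversing the tree from the root. */
int classify(const dt_node_t *tree, const float *sample)
{
    int16_t n = 0;                                   /* start at the root */
    while (tree[n].attr >= 0)                        /* until a leaf      */
        n = (sample[tree[n].attr] <= tree[n].threshold)
              ? tree[n].left : tree[n].right;
    return tree[n].label;
}
```
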
Conference Paper
Full-text available
Graph-processing platforms are increasingly used in a variety of domains. Although both industry and academia are developing and tuning graph-processing algorithms and platforms, the performance of graph-processing platforms has never been explored or compared in depth. Thus, users face the daunting challenge of selecting an appropriate platform for their specific application. To alleviate this challenge, we propose an empirical method for benchmarking graph-processing platforms. We define a comprehensive process and a selection of representative metrics, datasets, and algorithmic classes. We implement a benchmarking suite of five classes of algorithms and seven diverse graphs. Our suite reports on basic (user-level) performance, resource utilization, scalability, and various overheads. We use our benchmarking suite to analyze and compare six platforms. We gain valuable insights for each platform and present the first comprehensive comparison of graph-processing platforms.
Article
Full-text available
In Network Intrusion Detection Systems (NIDSs), string pattern matching demands exceptionally high performance to match the content of network traffic against a predefined database (or dictionary) of malicious patterns. Much work has been done in this field; however, most of the prior work results in low memory efficiency (defined as the ratio of the required storage in bytes to the size of the dictionary in characters). Due to such inefficiency, state-of-the-art designs cannot support large dictionaries without using high-latency external DRAM. We propose an algorithm called "leaf-attaching" to preprocess a given dictionary without increasing the number of patterns. The resulting set of postprocessed patterns can be searched using any tree-search data structure. We also present a scalable, high-throughput, Memory-efficient Architecture for large-scale String Matching (MASM) based on a pipelined binary search tree. The proposed algorithm and architecture achieve a memory efficiency of 0.56 (for the Roget's dictionary) and 1.32 (for the Snort dictionary). As a result, our design scales well to support larger dictionaries. Implementations on 45 nm ASIC and a state-of-the-art FPGA device (for the latest Roget's and Snort dictionaries) show that our architecture achieves 24 and 3.2 Gbps, respectively. The MASM module can simply be duplicated to accept multiple characters per cycle, leading to scalable throughput with respect to the number of characters processed in each cycle. Dictionary update involves simply rewriting the content of the memory, which can be done quickly without reconfiguring the chip.
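
A simplified software analogue of the pipelined binary-search-tree lookup is sketched below: a sorted dictionary of patterns is probed with a binary search, one comparison per tree level (one pipeline stage in hardware). The leaf-attaching preprocessing and the streaming, per-position matching of the actual architecture are not reproduced.

```c
#include <string.h>

/* Returns the dictionary index of `candidate`, or -1 if it is not present.
 * `dict` must be sorted in lexicographic order. */
int dict_lookup(const char *const *dict, int n_patterns, const char *candidate)
{
    int lo = 0, hi = n_patterns - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;           /* one pipeline stage per level */
        int cmp = strcmp(candidate, dict[mid]);
        if (cmp == 0) return mid;               /* pattern found */
        if (cmp < 0)  hi = mid - 1;
        else          lo = mid + 1;
    }
    return -1;
}
```
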
Article
Full-text available
We describe the University of Florida Sparse Matrix Collection, a large and actively growing set of sparse matrices that arise in real applications. The Collection is widely used by the numerical linear algebra community for the development and performance evaluation of sparse matrix algorithms. It allows for robust and repeatable experiments: robust because performance results with artificially generated matrices can be misleading, and repeatable because matrices are curated and made publicly available in many formats. Its matrices cover a wide spectrum of domains, including those arising from problems with underlying 2D or 3D geometry (such as structural engineering, computational fluid dynamics, model reduction, electromagnetics, semiconductor devices, thermodynamics, materials, acoustics, computer graphics/vision, robotics/kinematics, and other discretizations) and those that typically do not have such geometry (optimization, circuit simulation, economic and financial modeling, theoretical and quantum chemistry, chemical process simulation, mathematics and statistics, power networks, and other networks and graphs). We provide software for accessing and managing the Collection, from MATLAB™, Mathematica™, Fortran, and C, as well as an online search capability. Graph visualization of the matrices is provided, and a new multilevel coarsening scheme is proposed to facilitate this task.
Article
Full-text available
Community detection and analysis is an important methodology for understanding the organization of various real-world networks and has applications in problems as diverse as consensus formation in social communities or the identification of functional modules in biochemical networks. Currently used algorithms that identify the community structures in large-scale real-world networks require a priori information such as the number and sizes of communities or are computationally expensive. In this paper we investigate a simple label propagation algorithm that uses the network structure alone as its guide and requires neither optimization of a predefined objective function nor prior information about the communities. In our algorithm every node is initialized with a unique label and at every step each node adopts the label that most of its neighbors currently have. In this iterative process densely connected groups of nodes form a consensus on a unique label to form communities. We validate the algorithm by applying it to networks whose community structures are known. We also demonstrate that the algorithm takes an almost linear time and hence it is computationally less expensive than what was possible so far.
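
The core of the algorithm described above translates almost directly into code. The sketch below is a minimal, unoptimized C version: labels start out unique and each node repeatedly adopts the most frequent label among its neighbors until no label changes. The tie-breaking and asynchronous update order of the paper are simplified, and the adjacency-list layout is an assumption.

```c
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

/* Graph in adjacency-list (CSR-like) form. */
typedef struct {
    int        n_nodes;
    const int *adj_ptr;   /* n_nodes + 1 offsets into adj */
    const int *adj;       /* concatenated neighbor lists  */
} graph_t;

void label_propagation(const graph_t *g, int *label, int max_iters)
{
    for (int v = 0; v < g->n_nodes; v++) label[v] = v;     /* unique initial labels */

    int *count = malloc((size_t)g->n_nodes * sizeof(int));
    if (!count) return;

    for (int it = 0; it < max_iters; it++) {
        bool changed = false;
        for (int v = 0; v < g->n_nodes; v++) {
            memset(count, 0, (size_t)g->n_nodes * sizeof(int)); /* unoptimized reset */
            int best = label[v];
            for (int k = g->adj_ptr[v]; k < g->adj_ptr[v + 1]; k++) {
                int l = label[g->adj[k]];
                if (++count[l] > count[best]) best = l;    /* most frequent label */
            }
            if (best != label[v]) { label[v] = best; changed = true; }
        }
        if (!changed) break;                               /* labels converged */
    }
    free(count);
}
```
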
Article
Shared virtual memory is key in heterogeneous systems on chip (SoCs) that combine a general-purpose host processor with a many-core accelerator, both for programmability and performance. In contrast to the full-blown, hardware-only solutions predominant in modern high-end systems, lightweight hardware-software co-designs are better suited in the context of more power- and area-constrained embedded systems and provide additional benefits in terms of flexibility and predictability. As a downside, the latter solutions require the host to handle, in software, both the synchronization in case of page misses and the miss handling itself. This may incur considerable run-time overheads. In this work, we present a novel hardware-software virtual memory management approach for many-core accelerators in heterogeneous embedded SoCs. It exploits an accelerator-side helper thread concept that enables the accelerator to manage its virtual memory hardware autonomously while operating cache-coherently on the page tables of the user-space processes of the host. This greatly reduces overhead with respect to host-side solutions while retaining flexibility. We have validated the design with a set of parameterizable benchmarks and real-world applications covering various application domains. For purely memory-bound kernels, the accelerator performance improves by a factor of 3.8 compared with host-based management and lies within 50% of a lower-bound ideal memory management unit.
Article
While demand for physically contiguous memory allocation is still alive, especially in embedded systems, existing solutions are insufficient. The most widely adopted solution is the reservation technique; though it serves allocation well, it can severely degrade memory utilization. There are hardware solutions such as Scatter/Gather DMA and IOMMUs, but the cost of this additional hardware is excessive for low-end devices. CMA is a software solution in Linux that aims to solve not only the allocation but also the memory utilization problem. However, in real environments, CMA can incur unpredictably long latency and can often fail to allocate contiguous memory due to its complex design. We introduce a new solution for the above problem, GCMA (Guaranteed Contiguous Memory Allocator). It guarantees not only memory space efficiency but also fast and reliable allocation by using the reservation technique while letting only immediately discardable data use the reserved area. Our evaluation on a Raspberry Pi 2 shows 15 to 130 times faster and more predictable allocation latency without system performance degradation compared to CMA.
Article
Binary content-addressable memory (BiCAM) is a popular high-speed search engine in hardware, which typically provides output in one clock cycle. But the speed of CAM comes at the cost of various disadvantages such as high latency, low storage density, and low architectural scalability. In addition, field-programmable gate arrays (FPGAs), which are used in many applications because of their advantages, do not have hard IPs for CAM. Since FPGAs have embedded IPs for random-access memories (RAMs), several RAM-based CAM architectures on FPGAs are available in the literature. However, these architectures are especially targeted at ternary CAMs, not BiCAMs; thus, the available RAM-based CAMs may not be fully beneficial for BiCAMs in terms of architectural design. Since modern FPGAs are rich in logic resources, why not configure them to design a BiCAM on FPGA? This paper presents a logic-based high-performance BiCAM architecture (LH-CAM) using a Xilinx FPGA. The proposed CAM is composed of CAM words and associated comparators. A sample LH-CAM of size 64×36 is implemented on a Xilinx Virtex-6 FPGA. Compared with the latest prior work, the proposed CAM is much simpler in architecture, storage efficient, reduces power consumption by 40.92%, and improves speed by 27.34%.
Article
While high-end heterogeneous systems are increasingly supporting heterogeneous uniform memory access (hUMA), their low-power counterparts still lack basic features like virtual memory support for accelerators. Instead of simply passing pointers, explicit data management involving copies is needed which hampers programmability and performance. In this work, we evaluate a mixed hardware/software solution for lightweight virtual memory support for many-core accelerators in heterogeneous embedded systems-on-chip. Based on an input/output translation lookaside buffer managed by a host kernel-level driver, and compiler extensions protecting the accelerator’s accesses to shared data, our solution is non-intrusive to the architecture of the accelerator cores, and enables zero-copy sharing of pointer-rich data structures.
Article
We present LINQits, a flexible hardware template that can be mapped onto programmable logic or ASICs in a heterogeneous system-on-chip for a mobile device or server. Unlike fixed-function accelerators, LINQits accelerates a domain-specific query language called LINQ. LINQits does not provide coverage for all possible applications---however, existing applications (re-)written with LINQ in mind benefit extensively from hardware acceleration. Furthermore, the LINQits framework offers a graceful and transparent migration path from software to hardware. LINQits is prototyped on a 2W heterogeneous SoC called the ZYNQ processor, which combines dual ARM A9 processors with an FPGA on a single die in 28nm silicon technology. Our physical measurements show that LINQits improves energy efficiency by 8.9 to 30.6 times and performance by 10.7 to 38.1 times compared to optimized, multithreaded C programs running on conventional ARM A9 processors.
Article
An increasing number of architectures provide virtual memory support through software-managed TLBs. However, software management can impose considerable penalties, which are highly dependent on the operating system's structure and its use of virtual memory. This work explores software-managed TLB design tradeoffs and their interaction with a range of operating systems, including monolithic and microkernel designs. Through hardware monitoring and simulations, we explore TLB performance for benchmarks running on a MIPS R2000-based workstation running Ultrix, OSF/1, and three versions of Mach 3.0. Results: New operating systems are changing the relative frequency of different types of TLB misses, some of which may not be efficiently handled by current architectures. For the same application binaries, total TLB service time varies by as much as an order of magnitude under different operating systems. Reducing the handling cost for kernel TLB misses reduces total TLB service time up to 40%. For TLBs between 32 and 128 slots, each doubling of the TLB size reduces total TLB service time up to 50%.
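
For context, a software-managed TLB handles a miss by trapping to a handler that walks the page table and refills one TLB slot, as in the hedged sketch below. The two-level table layout and the tlb_write()/raise_page_fault() primitives are illustrative assumptions, not the R2000's or any of the studied operating systems' actual interfaces.

```c
#include <stdint.h>

#define PAGE_SHIFT     12
#define PTES_PER_TABLE 1024

typedef struct { uint32_t pfn : 20, valid : 1; } pte_t;

extern pte_t *pt_root[PTES_PER_TABLE];                /* first-level page table  */
extern void   tlb_write(uint32_t vpn, uint32_t pfn);  /* refill one TLB slot     */
extern void   raise_page_fault(uint32_t vaddr);       /* hand off to the OS      */

/* Invoked by the TLB-miss trap; on return the faulting access is retried. */
void tlb_miss_handler(uint32_t fault_vaddr)
{
    uint32_t vpn = fault_vaddr >> PAGE_SHIFT;

    pte_t *second = pt_root[vpn >> 10];               /* index first level       */
    if (second == 0) { raise_page_fault(fault_vaddr); return; }

    pte_t pte = second[vpn & (PTES_PER_TABLE - 1)];   /* index second level      */
    if (!pte.valid) { raise_page_fault(fault_vaddr); return; }

    tlb_write(vpn, pte.pfn);                          /* refill, then retry      */
}
```
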
Conference Paper
Local memory is a key factor for the performance of accelerators in SoCs. Despite technology scaling, the gap between on-chip storage and memory footprint of embedded applications keeps widening. We present a solution to preserve the speedup of accelerators when scaling from small to large data sets. Combining specialized DMA and address translation with a software layer in Linux, our design is transparent to user applications and broadly applicable to any class of SoCs hosting high-throughput accelerators. We demonstrate the robustness of our design across many heterogeneous workload scenarios and memory allocation policies with FPGA-based SoC prototypes featuring twelve concurrent accelerators accessing up to 768MB out of 1GB-addressable DRAM.
Article
Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we designed and built a composable, reconfigurable hardware fabric based on field programmable gate arrays (FPGA). Each server in the fabric contains one FPGA, and all FPGAs within a 48-server rack are interconnected over a low-latency, high-bandwidth network. We describe a medium-scale deployment of this fabric on a bed of 1632 servers, and measure its effectiveness in accelerating the ranking component of the Bing web search engine. We describe the requirements and architecture of the system, detail the critical engineering challenges and solutions needed to make the system robust in the presence of failures, and measure the performance, power, and resilience of the system. Under high load, the large-scale reconfigurable fabric improves the ranking throughput of each server by 95% at a desirable latency distribution or reduces tail latency by 29% at a fixed throughput. In other words, the reconfigurable fabric enables the same throughput using only half the number of servers.
Conference Paper
Ultrasound imaging is one of the most important medical diagnostic methods. The bulkiness of state-of-the-art high-quality ultrasound devices, however, drastically limits their usability in important application scenarios. In this paper, we show how a portable medical ultrasound device can be built using many-core technology and programmable logic, combining low power consumption with high flexibility. We discuss a typical ultrasound image reconstruction algorithm and how it can be parallelized using a pipelined design that efficiently partitions the workload among heterogeneous processing elements. A special focus lies on the limited memory resources and data bandwidth between components. To tackle both problems, we use floating window buffers and approximate computations, and we minimize lookup table sizes using on-the-fly calculations. We evaluate the design on the Adapteva Parallella platform, which contains a power-efficient 16-core Epiphany coprocessor and a Zynq SoC including a dual-core ARM A9 processor and programmable logic. Experimental results show that parallel beamforming of 128 input channels to a 288×128 pixel ultrasound image can be achieved on the Parallella at a rate of 5.3 frames per second consuming only 2 W of dynamic power.
Conference Paper
While high-end heterogeneous systems are increasingly supporting heterogeneous uniform memory access (hUMA) as envisioned by the Heterogeneous System Architecture (HSA) foundation, their low-power counterparts targeting the embedded domain still lack basic features like virtual memory support for accelerators. As opposed to simply passing virtual address pointers, explicit data management involving copies is needed to share data between host processor and accelerators, which hampers programmability and performance. In this work, we present a mixed hardware/software solution to enable lightweight virtual memory support for many-core accelerators in heterogeneous embedded systems-on-chip (SoCs). Based on an input/output translation lookaside buffer (IOTLB), efficiently managed by a kernel-level driver module running on the host, our solution features a considerably lower design complexity compared to conventional input/output memory management units. Using our evaluation platform based on the Xilinx Zynq-7000 SoC with a many-core accelerator implemented in the programmable logic, we demonstrate the effectiveness of our solution and the benefits of virtual memory support for embedded heterogeneous SoCs.
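
The host-side part of such an IOTLB-based design can be pictured with the hedged sketch below: on an accelerator-side translation miss, the kernel-level driver resolves the faulting user-space page, pins it, and writes the translation into the IOTLB before letting the accelerator retry. All helper functions are hypothetical placeholders rather than the actual driver interface.

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Hypothetical driver helpers, for illustration only. */
extern uint64_t lookup_and_pin_user_page(uint64_t vaddr); /* returns PFN, or 0  */
extern void     iotlb_write(uint32_t slot, uint64_t vpn, uint64_t pfn);
extern void     accel_resume(void);                       /* restart the access */
extern void     accel_abort(void);                        /* signal the fault   */

void handle_iotlb_miss(uint64_t fault_vaddr, uint32_t victim_slot)
{
    uint64_t vpn = fault_vaddr >> PAGE_SHIFT;
    uint64_t pfn = lookup_and_pin_user_page(fault_vaddr);
    if (pfn == 0) {                  /* page not mapped in the user process */
        accel_abort();
        return;
    }
    iotlb_write(victim_slot, vpn, pfn);  /* install the translation          */
    accel_resume();                      /* accelerator retries the access   */
}
```
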
Article
Increasing power densities have led to the dark silicon era, for which heterogeneous multicores with different power and performance characteristics are promising architectures. This paper focuses on maximizing the overall system performance under a critical temperature constraint for heterogeneous tiled multicores, where all cores or accelerators inside a tile share the same voltage and frequency levels. For such architectures, we present a resource management technique that introduces power density as a novel system level constraint, in order to avoid thermal violations. The proposed technique then assigns applications to tiles by choosing their degree of parallelism and the voltage/frequency levels of each tile, such that the power density constraint is satisfied. Moreover, our technique provides runtime adaptation of the power density constraint according to the characteristics of the executed applications, and reacting to workload changes at runtime. Thus, the available thermal headroom is exploited to maximize the overall system performance.
Conference Paper
We present the hardware architecture and extensions of an Input-Output Memory Management Unit (IOMMU) utilized in heterogeneous SoCs that support full virtualization. The proposed IOMMU architecture offers unique innovative features supporting multiple concurrently active virtual machine instances (VMs) with zero-latency world-context switching and enabling address translation services for up to a thousand virtual domains while serving multiple devices. At the same time, the proposed design allows for serving multiple address translation requests in parallel and per-domain Translation Lookaside Buffer (TLB) invalidation.
Article
Hybrid Computing is a term that was originally used for computations performed on analog/digital hardware and was popular until the late 1970s. Complex computations done under real-time conditions, such as signal processing, were left in the analog domain as the conversion times of A/D converters, sampling rates, and clock speeds of processors were far too low to solve complex equations in reasonable times. Today, processor layouts still contain analog components in the I/O areas as amplifiers, sensors or A/D converters. In the area of logical or arithmetical computations they became absolutely irrelevant. The renaissance that the term Hybrid Computing has experienced in recent years comes from a combination of hardwired multicore microprocessors and configurable integrated circuits (FPGAs). This chapter focuses on Hybrid Computing and Hybrid Core Computing, which is a special form of Hybrid Computing introduced by Convey Computer Corporation.
Article
Coupling processors with acceleration hardware is an effective manner to improve energy efficiency of embedded systems. Many-core is nowadays a dominating design paradigm for SoCs, which opens new challenges and opportunities for designing HW blocks. Exploring acceleration solutions that naturally fit into well-established parallel programming models and that can be incrementally added on top of existing parallel applications is thus extremely important. In this paper we focus on tightly-coupled multi-core cluster architectures, representative of the basic building block of the most recent many-cores, and we enhance it with dedicated HW processing units (HWPU). We propose an architecture where the HWPUs share the same L1 data memory through which processors also communicate, implementing a zero-copy communication model. High-level synthesis (HLS) tools are used to generate HW blocks, then a custom wrapper interfaces the latter to the tightly coupled cluster. We validate our proposal on RTL models, running both synthetic workload and real applications. Experimental results demonstrate that on average our solution provides nearly identical performance to traditional private-memory coarse-grained accelerators, but it achieves up to 32 percent better performance/area/watt and it requires only minimal modifications to legacy parallel codes.
We present a method for accelerating server applications using a hybrid CPU+FPGA architecture and demonstrate its advantages by accelerating Memcached, a distributed key-value system. The accelerator, implemented on the FPGA fabric, processes request packets directly from the network, avoiding the CPU in most cases. The accelerator is created by profiling the application to determine the most commonly executed trace of basic blocks which are then extracted. Traces are executed speculatively within the FPGA. If the control flow exits the trace prematurely, the side effects of the computation are rolled back and the request packet is passed to the CPU. When compared to the best reported software numbers, the Memcached accelerator is 9.15× more energy efficient for common case requests.
Article
Heterogeneous computing systems combine different types of compute elements that share memory. A specific class of heterogeneous systems discussed in this paper pairs traditional general-purpose processing cores and accelerator units. While this arrangement enables significant gains in application performance, device driver overheads and operating system code path overheads can become prohibitive. The I/O interface of a processor chip is a well-suited attachment point from a system design perspective, in that standard server models can be augmented with application-specific accelerators. However, traditional I/O attachment protocols introduce significant device driver and operating system software latencies. With the Coherent Accelerator Processor Interface (CAPI), we enable attaching an accelerator as a coherent CPU peer over the I/O physical interface. The CPU peer features consist of a homogeneous virtual address space across the CPU and accelerator, and hardware-managed caching of this shared data on the I/O device. This attachment method greatly increases the opportunities for acceleration due to the much shorter software path length required to enable its use compared to a traditional I/O model.
Article
The reconfigurable mesh is a parallel model of computation, which exploits a massive amount of rather simple processing elements connected through a reconfigurable interconnection network. During the last decades, the model received strong interest and many researchers have devised algorithms for it. However, most of this work focuses on theoretical aspects. Due to some idealistic modeling assumptions only a few attempts have been made to implement the model and to study the practical use of reconfigurable meshes. In this paper, we leverage the reconfigurable mesh model to study potential architectures and programming models for future many-cores. We design a reconfigurable mesh in form of a scalable soft core array with a reconfigurable interconnect and implement it on FPGA technology in order to create a prototype platform. We present an overall hardware/software tool flow for generating and programming reconfigurable mesh prototypes. The new language ARMLang and a corresponding compiler facilitate the programming of the massively parallel processor arrays. To our knowledge, this work is the first practical study of word-level reconfigurable meshes. To analyze the performance of our implementation we study four algorithmic kernels from the application domains arithmetic, sorting, graph algorithms and imaging. For each kernel, we devise a reconfigurable mesh program in ARMLang, compile it to our soft core array and measure its runtime depending on the mesh size. Then, we compare the runtimes to two sequential implementations of the algorithms, which are executed on two single core systems. The results show that many-cores leveraging the reconfigurable mesh model can efficiently use a vast number of processing elements and that, for the chosen algorithms, they come close to optimally parallelized programs.
Conference Paper
Heterogeneous computing utilizing both CPU and FPGA requires access to data in the main memory from both devices. While a typical system relies on software executing on the CPU to orchestrate all data movements between the FPGA and the main memory, our demo presents a complementary FPGA-centric approach that allows gateware to directly access the virtual memory space as part of the executing process without involving the CPU. A caching address translation buffer was implemented alongside the user FPGA gateware to provide runtime mapping between virtual and physical memory addresses. The system was implemented on a commercial off-the-shelf FPGA add-on card to demonstrate the viability of such an approach in low-cost systems. Experiments demonstrated a reasonable performance improvement compared to a typical software-centric implementation, while the number of context switches between FPGA and CPU in both kernel and user mode was significantly reduced, freeing the CPU for other concurrent user tasks.
Conference Paper
Sparse matrix-vector multiplication (SMVM) is a crucial primitive used in a variety of scientific and commercial applications. Despite having significant parallelism, SMVM is a challenging kernel to optimize due to its irregular memory access characteristics. Numerous studies have proposed the use of FPGAs to accelerate SMVM implementations. However, most prior approaches focus on parallelizing multiply-accumulate operations within a single row of the matrix (which limits parallelism if rows are small) and/or make inefficient use of the memory system when fetching matrix and vector elements. In this paper, we introduce an FPGA-optimized SMVM architecture and a novel sparse matrix encoding that explicitly exposes parallelism across rows, while keeping the hardware complexity and on-chip memory usage low. This system compares favorably with prior FPGA SMVM implementations. For the over 700 University of Florida sparse matrices we evaluated, it also performs within about two thirds of CPU SMVM performance on average, even though it has 2.4× lower DRAM memory bandwidth, and within almost one third of GPU SMVM performance on average, even at 9x lower memory bandwidth. Additionally, it consumes only 25W, for power efficiencies 2.6x and 2.3x higher than CPU and GPU, respectively, based on maximum device power.
Conference Paper
Cooperation of CPU and hardware accelerators to accomplish computationally intensive tasks provides significant advantages in run-time speed and energy. Efficient management of data sharing among multiple computational kernels can rapidly turn into a complicated problem. The Accelerator Coherency Port (ACP) emerges as a possible solution by enabling hardware accelerators to issue coherent accesses to the memory space. In this paper, we quantify the advantages of using ACP over the traditional method of sharing data on the DRAM. We select the Xilinx ZYNQ as the target and develop an infrastructure to stress the ACP and high-performance (HP) AXI interfaces of the ZYNQ device. Hardware accelerators on both the HP and ACP AXI interfaces reach a full-duplex data processing bandwidth of over 1.6 GBytes/s running at 125 MHz on a XC7Z020-1C device. The effect of background DRAM and cache traffic on the performance of accelerators is analyzed. For a sample image filtering task, the cooperative operation of CPU and ACP accelerator (CPU-ACP) gains a speed-up of 1.2X over CPU and HP acceleration (CPU-HP). In terms of energy efficiency, an improvement of 2.5 nJ (> 20%) is shown for each byte of processed data. This is the first work that presents detailed practical comparisons of the speed and energy efficiency of various processor-accelerator memory sharing techniques in a configurable heterogeneous platform.
Conference Paper
We developed a custom FPGA-based Network Interface Controller named APEnet+ aimed at GPU-accelerated clusters for High Performance Computing. The card exploits peer-to-peer capabilities (GPU-Direct RDMA) for the latest NVIDIA GPGPU devices and the RDMA paradigm to perform fast direct communication between computing nodes, offloading the host CPU from network task execution. In this work we focus on the implementation of a virtual-to-physical address translation mechanism using the FPGA embedded soft-processor. Address management is the most demanding task for the NIC receiving side (we estimated up to 70% of the μC load), making it the main culprit for the data bottleneck. To improve the performance of this task and hence improve data transfer over the network, we added a specialized hardware logic block acting as a Translation Lookaside Buffer. This block makes use of a peculiar Content Addressable Memory implementation designed for scalability and speed. We present detailed measurements to demonstrate the benefits coming from the introduction of such custom logic: a substantial address translation latency reduction (from a measured value of 1.9 μs to 124 ns) and a performance enhancement of both host-bound and GPU-bound data transfers (up to ∼60% of bandwidth increase) in given message size ranges.
Conference Paper
Hough forests have emerged as a powerful and versatile method, which achieves state-of-the-art results on various computer vision applications, ranging from object detection over pose estimation to action recognition. The original method operates in offline mode, assuming to have access to the entire training set at once. This limits its applicability in domains where data arrives sequentially or when large amounts of data have to be exploited. In these cases, on-line approaches naturally would be beneficial. To this end, we propose an on-line extension of Hough forests, which is based on the principle of letting the trees evolve on-line while the data arrives sequentially, for both classification and regression. We further propose a modified version of off-line Hough forests, which only needs a small subset of the training data for optimization. In the experiments, we show that using these formulations, the classification results of classic Hough forests could be reached or even outperformed, while being orders of magnitude faster. Furthermore, our method allows for tracking arbitrary objects without requiring any prior knowledge. We present state-of-the-art tracking results on publicly available data sets.
We present a method for the detection of instances of an object class, such as cars or pedestrians, in natural images. Similarly to some previous works, this is accomplished via generalized Hough transform, where the detections of individual object parts cast probabilistic votes for possible locations of the centroid of the whole object; the detection hypotheses then correspond to the maxima of the Hough image that accumulates the votes from all parts. However, whereas the previous methods detect object parts using generative codebooks of part appearances, we take a more discriminative approach to object part detection. Towards this end, we train a class-specific Hough forest, which is a random forest that directly maps the image patch appearance to the probabilistic vote about the possible location of the object centroid. We demonstrate that Hough forests improve the results of the Hough-transform object detection significantly and achieve state-of-the-art performance for several classes and datasets.
Conference Paper
With the introduction of multithreaded programming for reconfigurable hardware, it is possible to map both sequential software and parallel hardware to a single CPU/FPGA platform using threads as a unifying development model. At the same time, platform FPGAs are a natural technology for implementing computationally intensive systems in the aerospace, automotive and industrial domains, as they combine high performance and flexibility with lower non-recurring engineering (NRE) costs when compared to low-volume ASIC solutions. The reusability and portability of hardware components in these safety-critical domains could be significantly improved by using multithreaded programming. However, the unique design considerations for memory virtualization, as required in safety-critical systems, are difficult to transfer directly from software to autonomous hardware threads. This paper presents a transparent and efficient way of augmenting current multithreaded and partially reconfigurable hardware runtime environments with dedicated, hardware-thread-aware memory address translation units to provide seamless memory translation for hardware threads. We show an analysis of the overheads, as well as an experimental evaluation of the latencies caused by address translation.
Article
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses the question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want. Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google
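
Since PageRank is named as a keyword, a single power-iteration pass over a link graph in CSR-like form is sketched below for reference; the damping factor of 0.85 and the dangling-node handling are the usual textbook choices, not details taken from this paper.

```c
/* One PageRank power-iteration step: next_rank is computed from rank.
 * Repeat until the ranks converge. */
void pagerank_iteration(int n,                 /* number of pages              */
                        const int *out_ptr,    /* n + 1 offsets into out_links */
                        const int *out_links,  /* destination page per link    */
                        const double *rank,
                        double *next_rank)
{
    const double d = 0.85;                     /* damping factor               */
    for (int i = 0; i < n; i++) next_rank[i] = (1.0 - d) / n;

    for (int src = 0; src < n; src++) {
        int deg = out_ptr[src + 1] - out_ptr[src];
        if (deg == 0) {                        /* dangling page: spread evenly */
            for (int i = 0; i < n; i++) next_rank[i] += d * rank[src] / n;
            continue;
        }
        double share = d * rank[src] / deg;
        for (int k = out_ptr[src]; k < out_ptr[src + 1]; k++)
            next_rank[out_links[k]] += share;  /* distribute rank along links  */
    }
}
```
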
Handling large data sets for high-performance embedded applications in heterogeneous systems-on-chip
  • P Mantovani
Observations and opportunities in architecting shared virtual memory for heterogeneous systems
  • J Vesely
Utilizing the IOMMU scalably
  • O Peleg
Supporting address translation for accelerator-centric architectures
  • J Cong