P.Y.K. Cheung

Imperial College London, London, England, United Kingdom

Publications (322) · 52.37 Total impact

  • ACM SIGARCH Computer Architecture News 06/2014; 41(5):35-40.
  • James J. Davis, Peter Y.K. Cheung
    ABSTRACT: As process scaling and transistor count inflation continue, silicon chips are becoming increasingly susceptible to faults. Although FPGAs are particularly vulnerable to these effects, their runtime reconfigurability offers unique opportunities for fault tolerance. This work presents an application combining algorithmic-level error detection with dynamic partial reconfiguration (DPR) to allow faults manifested within its datapath at runtime to be circumvented at low cost.
    2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM); 05/2014
  • ABSTRACT: The Sequential Monte Carlo (SMC) method is a simulation-based approach to compute posterior distributions. SMC methods often work well on applications considered intractable by other methods due to high dimensionality, but they are computationally demanding. While SMC has been implemented efficiently on FPGAs, design productivity remains a challenge. This paper introduces a design flow for generating efficient implementations of reconfigurable SMC designs. Through templating the SMC structure, the design flow enables efficient mapping of SMC applications to multiple FPGAs. The proposed design flow consists of a parametrisable SMC computation engine, and an open-source software template which enables efficient mapping of a variety of SMC designs to reconfigurable hardware. Design parameters that are critical to the performance and to the solution quality are tuned using a machine learning algorithm based on surrogate modelling. Experimental results for three case studies show that design performance is substantially improved after parameter optimisation. The proposed design flow demonstrates its capability of producing reconfigurable implementations for a range of SMC applications that have significant improvement in speed and in energy efficiency over optimised CPU and GPU implementations.
    2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM); 05/2014
  • ABSTRACT: The operation of FPGA systems, like most VLSI technology, is traditionally governed by static timing analysis, whereby safety margins for operating and manufacturing uncertainty are factored in at design-time. If we operate FPGA designs beyond these conservative margins we can obtain substantial energy and performance improvements. However, doing this carelessly would cause unacceptable impacts to reliability, lifespan and yield - issues which are growing more severe with continuing process scaling. Fortunately, the flexibility of FPGA architecture allows us to monitor and control reliability problems with a variety of runtime instrumentation and adaptation techniques. In this paper we develop a system for detecting timing faults in arbitrary FPGA circuits based on Razor-like shadow register insertion. Through a combination of calibration, timing constraint and adaptation of the CAD flow, we deliver low-overhead, trustworthy fault detection for FPGA-based circuits.
    2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM); 05/2014
  • Joshua M. Levine, Edward Stott, Peter Y.K. Cheung
    ABSTRACT: Timing margins in FPGAs are already significant and as process scaling continues they will have to grow to guarantee operation under increased variation. Margins enforce worst-case operation even in typical conditions and result in devices operating more slowly and consuming more energy than necessary. This paper presents a method of dynamic voltage and frequency scaling that uses online slack measurement to determine timing headroom in a circuit while it is operating and scale the voltage and/or frequency in response. Doing so can significantly reduce power consumption or increase throughput with a minimal overhead. The method is demonstrated on a number of benchmark circuits under a range of operating conditions, constraints and optimisation targets.
    Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays; 02/2014
  • Jianxiong Liu, Christos Bouganis, Peter Y.K. Cheung
    ABSTRACT: As the resolution of digital images increases, accessing raw image data from memory has become a major consideration during the design of image/video processing systems. This is because the bandwidth requirement and energy consumption of image access have increased. Inspired by the successful application of progressive image sampling techniques in many image processing tasks, this work proposes applying a similar concept within hardware systems to efficiently trade image quality for reduced memory bandwidth requirement and lower energy consumption. Based on this idea, a hardware system is proposed that is placed between the memory subsystem and the processing core of the design. The proposed system alters the conventional memory access pattern to progressively and adaptively access pixels from a target memory external to the system. The sampled pixels are used to reconstruct an approximation to the ground truth, which is stored in an internal image buffer for further processing. The system is prototyped on an FPGA and its performance evaluation shows that a saving of up to 85% of memory access time and 33%/45% of image acquisition time/energy is achieved on the benchmark image “lena” while maintaining a PSNR of about 30 dB.
    Design Automation and Test in Europe; 01/2014
  • J.S.J. Wong, P.Y.K. Cheung
    ABSTRACT: The key aspects of a good on-chip timing measurement platform are high measurement resolution, accuracy, and low area overhead. A measurement method based on transition probability (TP) has shown promising characteristics in all these areas. In this paper, the TP measurement method is examined through simulation to understand its apparent effectiveness and accuracy in measuring complex circuits. Timing uncertainties and logic glitch activities are considered in detail, and the effect of varying input vectors' probability distributions is analyzed to enable further accuracy improvements. Using a field-programmable gate array, the method is implemented and demonstrated as a modular on-chip test platform for testing complex arbitrary circuits. Practical circuits found in typical modular designs, including fixed/floating-point arithmetic and filter circuits, are chosen to evaluate the test platform. The resolution of the timing measurements ranges from 0.3 to 8.0 ps, and the measurement errors against reference measurements are found to be within 3.6%. The test platform can be applied to VLSI designs with minor area overhead, and provides designers with precise and accurate physical timing information of circuits.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 12/2013; 21(12):2307-2320. · 1.14 Impact Factor
  • Design & Test, IEEE. 12/2013; 30(6):50-59.
  • ABSTRACT: Aggressive transistor scaling will soon lead us to the physical upper-bound of process technology, where stochastic process variability dominates the timing performance of FPGA components. In this paper, a variation-aware partial-rerouting method is proposed to mitigate and take advantage of the effect of delay variability due to process variation. The variation in logic delay across each FPGA (variation map) is measured on commercial FPGAs and is used to assess the effectiveness and potential gain of the proposed method on current FPGA architectures. Our partial-rerouting method achieved 5.25% improvement in critical path delay under a delay variability of σ/μ = 0.3, and is considerably less time consuming than using variation-aware full chipwise routing, which gave a slightly better timing gain of 6.41% but requires 8x more execution time when optimising for 100 target FPGAs with unique variation maps.
    2013 International Conference on Field-Programmable Technology (FPT); 12/2013
  • ABSTRACT: Proximity Query (PQ) is a process to calculate the relative placement of objects. It is a critical task for many applications such as robot motion planning, but it is often too computationally demanding for real-time applications, particularly those involving human-robot collaborative control. This paper derives a PQ formulation which can support non-convex objects represented by meshes or cloud points. We optimise the proposed PQ for reconfigurable hardware by function transformation and reduced precision, resulting in a novel data structure and memory architecture for data streaming while maintaining the accuracy of results. Run-time reconfiguration is adopted for dynamic precision optimisation. Experimental results show that our optimised PQ implementation on a reconfigurable platform with four FPGAs is 58 times faster than an optimised CPU implementation with 12 cores, 9 times faster than a GPU, and 3 times faster than a double precision implementation with four FPGAs.
    2013 International Conference on Field-Programmable Technology (FPT); 12/2013
  • Jianxiong Liu, Christos Bouganis, Peter Y.K. Cheung
    ABSTRACT: Domain specific knowledge is useful in image processing applications where the target image to process is known to be of a particular image class. It is commonly used as prior knowledge, to model structured image classes such as human faces in order to break limitations posed by various problems. This paper proposes using a domain-specific codebook and corresponding sampling patterns, learned from example faces, to build a progressive image sampling algorithm specifically for face processing applications. Instead of accessing the whole target face image, the proposed system is able to progressively sample from it and approximate it during the process, allowing the process to stop when image quality is considered to have met the requirement. The proposed system is able to identify significant information from the target image and retrieve it at an early stage of sampling, without requiring the target image to be pre-processed as conventional PIT methods do. Therefore it is applicable to situations where such pre-processing is not possible. The experiment shows that the proposed method is able to efficiently sample and reconstruct face images to achieve significant improvement of PSNR over the state-of-the-art method.
    2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP); 12/2013
  • ABSTRACT: This paper presents a design space exploration framework for an FPGA-based soft processor that is built on the estimation of power and performance metrics using algorithm and architecture parameters. The proposed framework is based on regression trees, a popular machine learning technique, that can capture the relationship of low-level soft-processor parameters and high-level algorithm parameters of a specific application domain, such as image compression. In doing this, power and execution time of an algorithm can be predicted before implementation and on unseen configurations of soft processors. For system designers this can result in fast design space exploration at an early stage in design.
    Journal of Systems Architecture 11/2013; 59(10):1144–1156. · 0.69 Impact Factor
  • ABSTRACT: This paper presents a heterogeneous reconfigurable system for real-time applications applying particle filters. The system consists of an FPGA and a multi-threaded CPU. We propose a method to adapt the number of particles dynamically and utilise the run-time reconfigurability of the FPGA for reduced power and energy consumption. An application is developed which involves simultaneous mobile robot localisation and people tracking. It shows that the proposed adaptive particle filter can reduce up to 99% of computation time. Using run-time reconfiguration, we achieve 34% reduction in idle power and save 26-34% of system energy. Our proposed system is up to 7.39 times faster and 3.65 times more energy efficient than the Intel Xeon X5650 CPU with 12 threads, and 1.3 times faster and 2.13 times more energy efficient than an NVIDIA Tesla C2070 GPU.
    Proceedings of the 9th international conference on Reconfigurable Computing: architectures, tools, and applications; 03/2013
  • James J. Davis, Peter Y.K. Cheung
    ABSTRACT: While we reap the benefits of process scaling in terms of transistor density and switching speed, consideration must be given to the negative effects it causes: increased variation, degradation and fault susceptibility. Above device level, such phenomena and the faults they induce can lead to reduced yield, decreased system reliability and, in extreme cases, total failure after a period of successful operation. Although error detection and correction are almost always considered for highly sensitive and susceptible applications such as those in space, for other, more general-purpose applications they are often overlooked. In this paper, we present a parallel matrix multiplication accelerator running in hardware on the Xilinx Zynq system-on-chip platform, along with ‘bolt-on’ logic for detecting, locating and avoiding faults within its datapath. Designs of various sizes are compared with respect to resource overhead and performance impact. Our largest-implemented fault-tolerant accelerator was found to consume 17.3% more area, run at a 3.95% lower frequency and incur an 18.8% execution time penalty over its equivalent fault-susceptible design during fault-free operation.
    Field-Programmable Technology (FPT), 2013 International Conference on; 01/2013
  • ABSTRACT: In this article we present a variation-aware post placement and routing (P&R) retiming method to counteract process variation in FPGAs. Variation-aware retiming takes into account exact variation maps (measured on FPGAs) as opposed to statistical static timing analysis (SSTA) which models process variation with statistical distributions. Experiments are conducted using variation maps measured from 100 Cyclone III FPGAs, and the retiming algorithm is applied using MATLAB. We have shown that for circuits with several retiming choices of equivalent logic depth, up to 30% delay improvement can be achieved for a given variation coefficient of σ/μ = 0.3.
    Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on; 01/2013
  • ABSTRACT: Shadow registers, driven by a variable-phase clock, can be used to extract useful timing information from a circuit during operation. This paper presents Slack Measurement Insertion (SMI), an automated tool flow for inserting shadow registers into an FPGA design to enable measurement of timing slack. The flow provides a parameterised level of circuit coverage and results in minimal timing and area overheads. We demonstrate the process through its application to three complex benchmark designs.
    Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on; 01/2013
  • A. Powell, C. Bouganis, P.Y.K. Cheung
    ABSTRACT: This paper presents a power and execution time estimation framework for an FPGA-based soft processor when considering the implementation of image compression techniques. Using the proposed framework, a quick power consumption and execution time estimate can be obtained early in the design phase allowing system designers to estimate these performance metrics without the need of implementing the algorithm or generating all possible soft processor architectures. This estimate is performed using both high-level algorithm parameters and soft processor architecture parameters. For system designers this can result in fast design space exploration. The model can predict the execution time of an algorithm with an average of 139% less relative error than predictions using only architecture parameters with the same framework.
    Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on; 01/2012
  • ABSTRACT: Technology scaling causes increasing and unavoidable delay variability in FPGAs. This paper proposes a 2-stage variation-aware placement method that benefits from the optimality of a full-chipwise (chip-by-chip) placement but only requires a fraction of total execution time for a large number of FPGAs with different variation patterns. By classifying variation maps into a finite number of classes, variation-aware placement only needs to be executed based on the median map of each class to produce the placement for the other FPGAs (variation maps) in that class to save execution time. Our proposed method is implemented in a modified version of VPR 5.0 and verified using variation maps measured from 129 DE0 boards equipped with Cyclone III FPGAs. A mean timing gain of 7.36% is observed in 20 MCNC benchmarks with 16 clusters, while reducing execution time by a factor of 8 compared to full-chipwise placement.
    Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on; 01/2012
  • ABSTRACT: This paper presents an adaptive Sequential Monte Carlo approach for real-time applications. The Sequential Monte Carlo method is employed to estimate the states of dynamic systems using weighted particles. The proposed approach reduces the run-time computation complexity by adapting the size of the particle set. Multiple processing elements on FPGAs are dynamically allocated for improved energy efficiency without violating real-time constraints. A robot localisation application is developed based on the proposed approach. Compared to a non-adaptive implementation, the dynamic energy consumption is reduced by up to 70% without affecting the quality of solutions.
    Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on; 01/2012
  • ABSTRACT: Reliability, power consumption and timing performance are key considerations for the utilisation of field-programmable gate arrays. Online measurement techniques can determine the timing characteristics of an FPGA application while it is operating, and facilitate a range of benefits. Degradation can be monitored by tracking changes in timing performance, while power consumption can be reduced through dynamic voltage scaling (DVS) of the power supply to exploit any spare timing headroom. If higher performance is the objective, dynamic frequency scaling (DFS) can be used to maximise operating frequency. In both cases, online timing measurement of the application circuit is used to exploit favourable operating conditions. This work demonstrates a method of online measurement, achieved by sweeping the phase of a secondary clock signal, driving additional shadowing registers strategically added to the application design. The measurement technique and initial voltage and frequency scaling experiments are demonstrated on an Altera Cyclone III FPGA. Timing performance can be measured with a best case resolution of 96ps. The additional circuitry results in minimal overhead in terms of area and performance. Power savings of 23% dynamic and 13% static in an example circuit are achieved through DVS, or performance improvements of 21% through DFS, when compared with operating at nominal core voltage, or timing model FMax.
    Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on; 01/2012

Publication Stats

3k Citations
52.37 Total Impact Points

Institutions

  • 1970–2014
    • Imperial College London
      • Department of Electrical and Electronic Engineering
      • Department of Computing
      London, England, United Kingdom
  • 2011
    • Newcastle University
      Newcastle-on-Tyne, England, United Kingdom
  • 2009
    • Trinity College Dublin
      Dublin, Leinster, Ireland
  • 1997–2007
    • Imperial Valley College
      Imperial, California, United States
  • 2003–2005
    • Mahanakorn University of Technology
      Krung Thep, Bangkok, Thailand
  • 2000
    • Xilinx Inc.
      San Jose, California, United States
  • 1992–1996
    • University of London
      London, England, United Kingdom
  • 1994
    • Higher Institute for Applied Science and Technology
      Dimashq, Damascus City, Syria