P.Y.K. Cheung

Imperial College London, London, England, United Kingdom

Publications (311) · 46.01 Total impact

  • Joshua M. Levine, Edward Stott, Peter Y.K. Cheung
    ABSTRACT: Timing margins in FPGAs are already significant and as process scaling continues they will have to grow to guarantee operation under increased variation. Margins enforce worst-case operation even in typical conditions and result in devices operating more slowly and consuming more energy than necessary. This paper presents a method of dynamic voltage and frequency scaling that uses online slack measurement to determine timing headroom in a circuit while it is operating and scale the voltage and/or frequency in response. Doing so can significantly reduce power consumption or increase throughput with a minimal overhead. The method is demonstrated on a number of benchmark circuits under a range of operating conditions, constraints and optimisation targets.
    Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays; 02/2014
  • ABSTRACT: This paper presents a heterogeneous reconfigurable system for real-time applications applying particle filters. The system consists of an FPGA and a multi-threaded CPU. We propose a method to adapt the number of particles dynamically and utilise the run-time reconfigurability of the FPGA for reduced power and energy consumption. An application is developed which involves simultaneous mobile robot localisation and people tracking. It shows that the proposed adaptive particle filter can reduce computation time by up to 99%. Using run-time reconfiguration, we achieve a 34% reduction in idle power and save 26-34% of system energy. Our proposed system is up to 7.39 times faster and 3.65 times more energy efficient than the Intel Xeon X5650 CPU with 12 threads, and 1.3 times faster and 2.13 times more energy efficient than an NVIDIA Tesla C2070 GPU.
    Proceedings of the 9th international conference on Reconfigurable Computing: architectures, tools, and applications; 03/2013
  • James J. Davis, Peter Y.K. Cheung
    ABSTRACT: While we reap the benefits of process scaling in terms of transistor density and switching speed, consideration must be given to the negative effects it causes: increased variation, degradation and fault susceptibility. Above device level, such phenomena and the faults they induce can lead to reduced yield, decreased system reliability and, in extreme cases, total failure after a period of successful operation. Although error detection and correction are almost always considered for highly sensitive and susceptible applications such as those in space, for other, more general-purpose applications they are often overlooked. In this paper, we present a parallel matrix multiplication accelerator running in hardware on the Xilinx Zynq system-on-chip platform, along with ‘bolt-on’ logic for detecting, locating and avoiding faults within its datapath. Designs of various sizes are compared with respect to resource overhead and performance impact. Our largest-implemented fault-tolerant accelerator was found to consume 17.3% more area, run at a 3.95% lower frequency and incur an 18.8% execution time penalty over its equivalent fault-susceptible design during fault-free operation.
    Field-Programmable Technology (FPT), 2013 International Conference on; 01/2013
  • ABSTRACT: This paper presents a design space exploration framework for an FPGA-based soft processor that is built on the estimation of power and performance metrics using algorithm and architecture parameters. The proposed framework is based on regression trees, a popular machine learning technique, that can capture the relationship between low-level soft-processor parameters and high-level algorithm parameters of a specific application domain, such as image compression. In doing this, power and execution time of an algorithm can be predicted before implementation and on unseen configurations of soft processors. For system designers this can result in fast design space exploration at an early stage in design.
    Journal of Systems Architecture. 01/2013; 59(10):1144–1156.
  • ABSTRACT: In this article we present a variation-aware post placement and routing (P&R) retiming method to counteract process variation in FPGAs. Variation-aware retiming takes into account exact variation maps (measured on FPGAs) as opposed to statistical static timing analysis (SSTA) which models process variation with statistical distributions. Experiments are conducted using variation maps measured from 100 Cyclone III FPGAs, and the retiming algorithm is applied using MATLAB. We have shown that for circuits with several retiming choices of equivalent logic depth, up to 30% delay improvement can be achieved for a given variation coefficient of σ/μ = 0.3.
    Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on; 01/2013
  • ABSTRACT: Shadow registers, driven by a variable-phase clock, can be used to extract useful timing information from a circuit during operation. This paper presents Slack Measurement Insertion (SMI), an automated tool flow for inserting shadow registers into an FPGA design to enable measurement of timing slack. The flow provides a parameterised level of circuit coverage and results in minimal timing and area overheads. We demonstrate the process through its application to three complex benchmark designs.
    Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on; 01/2013
  • J.S.J. Wong, P.Y.K. Cheung
    ABSTRACT: The key aspects of a good on-chip timing measurement platform are high measurement resolution, accuracy, and low area overhead. A measurement method based on transition probability (TP) has shown promising characteristics in all these areas. In this paper, the TP measurement method is examined through simulation to understand its apparent effectiveness and accuracy in measuring complex circuits. Timing uncertainties and logic glitch activities are considered in detail, and the effect of varying input vectors' probability distributions is analyzed to enable further accuracy improvements. Using a field-programmable gate array, the method is implemented and demonstrated as a modular on-chip test platform for testing complex arbitrary circuits. Practical circuits found in typical modular designs, including fixed/floating-point arithmetic and filter circuits, are chosen to evaluate the test platform. The resolution of the timing measurements ranges from 0.3 to 8.0 ps, and the measurement errors against reference measurements are found to be within 3.6%. The test platform can be applied to VLSI designs with minor area overhead, and provides designers with precise and accurate physical timing information of circuits.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 01/2013; 21(12):2307-2320. · 1.22 Impact Factor
  • Design & Test, IEEE. 01/2013; 30(6):50-59.
  • A. Powell, C. Bouganis, P.Y.K. Cheung
    ABSTRACT: This paper presents a power and execution time estimation framework for an FPGA-based soft processor when considering the implementation of image compression techniques. Using the proposed framework, a quick power consumption and execution time estimate can be obtained early in the design phase allowing system designers to estimate these performance metrics without the need of implementing the algorithm or generating all possible soft processor architectures. This estimate is performed using both high-level algorithm parameters and soft processor architecture parameters. For system designers this can result in fast design space exploration. The model can predict the execution time of an algorithm with an average of 139% less relative error than predictions using only architecture parameters with the same framework.
    Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on; 01/2012
  • ABSTRACT: Technology scaling causes increasing and unavoidable delay variability in FPGAs. This paper proposes a 2-stage variation-aware placement method that benefits from the optimality of a full-chipwise (chip-by-chip) placement but only requires a fraction of the total execution time for a large number of FPGAs with different variation patterns. By classifying variation maps into a finite number of classes, variation-aware placement only needs to be executed on the median map of each class; the resulting placement is then reused for the other FPGAs (variation maps) in that class to save execution time. Our proposed method is implemented in a modified version of VPR 5.0 and verified using variation maps measured from 129 DE0 boards equipped with Cyclone III FPGAs. A mean timing gain of 7.36% is observed in 20 MCNC benchmarks with 16 clusters, while reducing execution time by a factor of 8 compared to full-chipwise placement.
    Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on; 01/2012
  • ABSTRACT: This paper presents an adaptive Sequential Monte Carlo approach for real-time applications. The Sequential Monte Carlo method is employed to estimate the states of dynamic systems using weighted particles. The proposed approach reduces the run-time computational complexity by adapting the size of the particle set. Multiple processing elements on FPGAs are dynamically allocated for improved energy efficiency without violating real-time constraints. A robot localisation application is developed based on the proposed approach. Compared to a non-adaptive implementation, the dynamic energy consumption is reduced by up to 70% without affecting the quality of solutions.
    Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on; 01/2012
  • ABSTRACT: Reliability, power consumption and timing performance are key considerations for the utilisation of field-programmable gate arrays. Online measurement techniques can determine the timing characteristics of an FPGA application while it is operating, and facilitate a range of benefits. Degradation can be monitored by tracking changes in timing performance, while power consumption can be reduced through dynamic voltage scaling (DVS) of the power supply to exploit any spare timing headroom. If higher performance is the objective, dynamic frequency scaling (DFS) can be used to maximise operating frequency. In both cases, online timing measurement of the application circuit is used to exploit favourable operating conditions. This work demonstrates a method of online measurement, achieved by sweeping the phase of a secondary clock signal, driving additional shadow registers strategically added to the application design. The measurement technique and initial voltage and frequency scaling experiments are demonstrated on an Altera Cyclone III FPGA. Timing performance can be measured with a best-case resolution of 96 ps. The additional circuitry results in minimal overhead in terms of area and performance. Power savings of 23% dynamic and 13% static in an example circuit are achieved through DVS, or performance improvements of 21% through DFS, when compared with operating at nominal core voltage, or timing model FMax.
    Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on; 01/2012
  • T. Mak, P.Y.K. Cheung, Kai-Pui Lam, W. Luk
    ABSTRACT: Dynamic routing is desirable because of its substantial improvement in communication bandwidth and intelligent adaptation to faulty links and congested traffic. However, implementation of adaptive routing in a network-on-chip system is not trivial and is further complicated by the requirements of deadlock-free and real-time optimal decision making. In this paper, we present a deadlock-free routing architecture which employs a dynamic programming (DP) network to provide on-the-fly optimal path planning and network monitoring for packet switching. Also, a new routing strategy called k-step look-ahead is introduced. This new strategy can substantially reduce the size of the routing table and maintain a high quality of adaptation, which leads to a scalable dynamic-routing solution with minimal hardware overhead. Our results, based on a cycle-accurate simulator, demonstrate the effectiveness of the DP network, which outperforms both the deterministic and adaptive-routing algorithms in average delay on various traffic scenarios by 22.3%. Moreover, the hardware overhead for the DP network is insignificant, based on the results obtained from the hardware implementations.
    IEEE Transactions on Industrial Electronics 09/2011; · 6.50 Impact Factor
  • ABSTRACT: Restoration methods, such as super-resolution (SR), largely depend on the accuracy of the point spread function (PSF). PSF estimation is an ill-posed problem, and a linear and uniform motion is often assumed. In real-life systems, this may deviate significantly from the actual motion, impairing subsequent restoration. To address the above, this work proposes a dynamically configurable imaging system that combines algorithmic video enhancement, field programmable gate array (FPGA)-based video processing and adaptive image sensor technology. Specifically, a joint blur identification and validation (BIV) scheme is proposed, which validates the initial linear and uniform motion assumption. For the cases that significantly deviate from that assumption, the real-time reconfiguration property of an adaptive image sensor is utilised, and the sensor is locally reconfigured to larger pixels that produce higher frame-rate samples with reduced blur. Results demonstrate that once the sensor reconfiguration gives rise to a valid motion assumption, highly accurate PSFs are estimated, resulting in improved SR reconstruction quality. To enable real-time reconstruction, an FPGA-based BIV architecture is proposed. The system's throughput is significantly higher than 25 fps, for frame sizes up to 1024 × 1024, and its performance is robust to noise for signal-to-noise ratio (SNR) as low as 20 dB.
    IET Computers & Digital Techniques 08/2011; · 0.28 Impact Factor
  • Yan Wu, P. Kuvinichkul, P.Y.K. Cheung, Y. Demiris
    ABSTRACT: The Theremin is an electronic musical instrument considered to be the most difficult to play. It requires the player's hands to have high precision and stability, as any position change within proximity of the instrument's antennae can make a difference to the pitch or volume. In a different direction to previous developments of Theremin-playing robots, we propose a Humanoid Thereminist System that goes beyond using only one degree of freedom, which opens up the possibility for a robot to acquire more complex skills, such as aerial fingering, and to include musical expression in playing the Theremin. The proposed system consists of two phases, namely a calibration phase and a playing phase, which can be executed independently. During the playing phase, the system takes input from a MIDI file and performs path planning using a combination of a minimum-energy strategy in joint space and feedback error correction for the next playing note. Three experiments have been conducted to evaluate the developed system quantitatively and qualitatively by playing a selection of music files. The experiments have demonstrated that the proposed system can effectively utilise multiple degrees of freedom while maintaining minimum pitch error margins.
    Robotics and Biomimetics (ROBIO), 2010 IEEE International Conference on; 01/2011
  • IET Computers & Digital Techniques. 01/2011; 5:271-286.
  • Ben Cope, Peter Y. K. Cheung, Wayne Luk, Lee W. Howes
    ABSTRACT: A systematic approach to customising Homogeneous Multi-Processor (HoMP) architectures is described. The approach involves a novel design space exploration tool and a parameterisable system model. Post-fabrication customisation options for using reconfigurable logic with a HoMP are classified. The adoption of the approach in exploring pre- and post-fabrication customisation options to optimise an architecture's critical paths is then described. The approach and steps are demonstrated using the architecture of a graphics processor. We also analyse on-chip and off-chip memory access for systems with one or more processing elements (PEs), and study the impact of the number of threads per PE on the amount of off-chip memory access and the number of cycles for each output. It is shown that post-fabrication customisation of a graphics processor can provide up to four times performance improvement for negligible area cost.
    T. HiPEAC. 01/2011; 4:63-83.
  • ABSTRACT: Literature suggests that timing performance degradation in VLSI could be a major concern in future process technologies. FPGAs are well suited to cope with this challenge, due to their flexibility at design-, manufacture- and run-time. Existing timing measurement techniques allow for the measurement of delay while the circuit is not operating, and reliability techniques allow for the detection of faults as they occur in operating circuits. Neither allows for the health of an operating circuit to be measured. The ability to monitor the health of a system can provide an early warning of impending failure. This information will enable measures to reduce the impact of, or avoid altogether, the failure. A good indication of the degree of degradation in an operating circuit is the available timing slack in a combinatorial circuit path, between registers, while the circuit is operating at speed. This work proposes a new time delay measurement technique that does not interfere with the circuit's normal operation. This is achieved by sweeping the phase of a secondary clock signal, driving additional shadow registers. These are connected to each circuit node to be measured, typically those on the most critical paths. The technique is able to measure the timing slack available in the circuit-under-test, while it is performing its usual function. The technique is demonstrated using a 12-stage LUT chain, and on an 8-bit ripple-carry adder, implemented on an Altera Cyclone III FPGA. It is able to measure the timing slack with a best-case resolution of 96 ps. The additional circuitry has minimal overhead in terms of area, power consumption, and timing. The increase in circuit delay due to extra fan-out load was measured to be 0.25% in the first example circuit.
    Proceedings of the ACM/SIGDA 19th International Symposium on Field Programmable Gate Arrays, FPGA 2011, Monterey, California, USA, February 27 – March 1, 2011; 01/2011
  • Justin S. Wong, Peter Y. K. Cheung
    ABSTRACT: The ability to measure delay of arbitrary circuits on FPGA offers many opportunities for on-chip characterisation and optimisation. This paper describes an improved delay measurement method by monitoring the transition probability at the output nodes as the operating frequency is swept. The new method uses optimised test vector generation to improve the accuracy of the test method. It is effectively demonstrated on a 4th order IIR filter circuit implemented on an Altera Cyclone III FPGA.
    Proceedings of the ACM/SIGDA 19th International Symposium on Field Programmable Gate Arrays, FPGA 2011, Monterey, California, USA, February 27 – March 1, 2011; 01/2011
  • ABSTRACT: This paper describes our approaches to raise the level of abstraction at which hardware suitable for accelerating computationally intensive applications can be specified. Field-programmable gate arrays are becoming adopted as a computational platform by the high-performance computing community, but there are challenges to extract maximum performance from these devices. Unlike other approaches, our focus is on data memory organization and input–output bandwidth considerations, which are the typical stumbling block of existing hardware compilation schemes. We describe our approaches, which are based on formal optimization techniques, and present some results showing the advantage of exposing the interaction between data memory system design and parallelism extraction to the compiler.
    Comput. J. 01/2011; 54:1-10.

Publication Stats

3k Citations
46.01 Total Impact Points


  • 1970–2014
    • Imperial College London
      • Department of Electrical and Electronic Engineering
      • Department of Computing
      London, England, United Kingdom
  • 2011
    • Newcastle University
      Newcastle-on-Tyne, England, United Kingdom
  • 2010
    • Universidad de Las Palmas de Gran Canaria
      Las Palmas, Canary Islands, Spain
  • 2009
    • Trinity College Dublin
      Dublin, Leinster, Ireland
  • 2007–2008
    • University of Peloponnese
      • Department of Computer Science and Technology
      Trípoli, Peloponnese, Greece
  • 1997–2007
    • Imperial Valley College
      Imperial, California, United States
  • 2003–2005
    • Mahanakorn University of Technology
      Krung Thep, Bangkok, Thailand
  • 2000
    • Xilinx Inc.
      San Jose, California, United States
  • 1992–1996
    • University of London
      London, England, United Kingdom
  • 1994
    • Higher Institute for Applied Science and Technology
      Dimashq, Damascus City, Syria