P.Y.K. Cheung

Newcastle University, Newcastle-on-Tyne, England, United Kingdom

Publications (180) · 36.32 Total impact

  • ABSTRACT: In this article we present a variation-aware post placement and routing (P&R) retiming method to counteract process variation in FPGAs. Variation-aware retiming takes into account exact variation maps (measured on FPGAs) as opposed to statistical static timing analysis (SSTA) which models process variation with statistical distributions. Experiments are conducted using variation maps measured from 100 Cyclone III FPGAs, and the retiming algorithm is applied using MATLAB. We have shown that for circuits with several retiming choices of equivalent logic depth, up to 30% delay improvement can be achieved for a given variation coefficient of σ/μ = 0.3.
    Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on; 01/2013
  • ABSTRACT: Shadow registers, driven by a variable-phase clock, can be used to extract useful timing information from a circuit during operation. This paper presents Slack Measurement Insertion (SMI), an automated tool flow for inserting shadow registers into an FPGA design to enable measurement of timing slack. The flow provides a parameterised level of circuit coverage and results in minimal timing and area overheads. We demonstrate the process through its application to three complex benchmark designs.
    Field Programmable Logic and Applications (FPL), 2013 23rd International Conference on; 01/2013
  • J.S.J. Wong, P.Y.K. Cheung
    ABSTRACT: The key aspects of a good on-chip timing measurement platform are high measurement resolution, accuracy, and low area overhead. A measurement method based on transition probability (TP) has shown promising characteristics in all these areas. In this paper, the TP measurement method is examined through simulation to understand its apparent effectiveness and accuracy in measuring complex circuits. Timing uncertainties and logic glitch activities are considered in detail, and the effect of varying input vectors' probability distributions is analyzed to enable further accuracy improvements. Using a field-programmable gate array, the method is implemented and demonstrated as a modular on-chip test platform for testing complex arbitrary circuits. Practical circuits found in typical modular designs, including fixed/floating-point arithmetic and filter circuits, are chosen to evaluate the test platform. The resolution of the timing measurements ranges from 0.3 to 8.0 ps, and the measurement errors against reference measurements are found to be within 3.6%. The test platform can be applied to VLSI designs with minor area overhead, and provides designers with precise and accurate physical timing information of circuits.
    IEEE Transactions on Very Large Scale Integration (VLSI) Systems 01/2013; 21(12):2307-2320. · 1.22 Impact Factor
  • Design & Test, IEEE. 01/2013; 30(6):50-59.
  • A. Powell, C. Bouganis, P.Y.K. Cheung
    ABSTRACT: This paper presents a power and execution time estimation framework for an FPGA-based soft processor when considering the implementation of image compression techniques. Using the proposed framework, a quick power consumption and execution time estimate can be obtained early in the design phase allowing system designers to estimate these performance metrics without the need of implementing the algorithm or generating all possible soft processor architectures. This estimate is performed using both high-level algorithm parameters and soft processor architecture parameters. For system designers this can result in fast design space exploration. The model can predict the execution time of an algorithm with an average of 139% less relative error than predictions using only architecture parameters with the same framework.
    Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on; 01/2012
  • ABSTRACT: Technology scaling causes increasing and unavoidable delay variability in FPGAs. This paper proposes a 2-stage variation-aware placement method that benefits from the optimality of a full-chipwise (chip-by-chip) placement but only requires a fraction of the total execution time for a large number of FPGAs with different variation patterns. By classifying variation maps into a finite number of classes, variation-aware placement only needs to be executed on the median map of each class; the resulting placement is then reused for the other FPGAs (variation maps) in that class, which saves execution time. Our proposed method is implemented in a modified version of VPR 5.0 and verified using variation maps measured from 129 DE0 boards equipped with Cyclone III FPGAs. A mean timing gain of 7.36% is observed in 20 MCNC benchmarks with 16 clusters, while reducing execution time by a factor of 8 compared to full-chipwise placement.
    Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on; 01/2012
  • ABSTRACT: This paper presents an adaptive Sequential Monte Carlo approach for real-time applications. The Sequential Monte Carlo method is employed to estimate the states of dynamic systems using weighted particles. The proposed approach reduces the run-time computation complexity by adapting the size of the particle set. Multiple processing elements on FPGAs are dynamically allocated for improved energy efficiency without violating real-time constraints. A robot localisation application is developed based on the proposed approach. Compared to a non-adaptive implementation, the dynamic energy consumption is reduced by up to 70% without affecting the quality of solutions.
    Field Programmable Logic and Applications (FPL), 2012 22nd International Conference on; 01/2012
  • ABSTRACT: Reliability, power consumption and timing performance are key considerations for the utilisation of field-programmable gate arrays. Online measurement techniques can determine the timing characteristics of an FPGA application while it is operating, and facilitate a range of benefits. Degradation can be monitored by tracking changes in timing performance, while power consumption can be reduced through dynamic voltage scaling (DVS) of the power supply to exploit any spare timing headroom. If higher performance is the objective, dynamic frequency scaling (DFS) can be used to maximise operating frequency. In both cases, online timing measurement of the application circuit is used to exploit favourable operating conditions. This work demonstrates a method of online measurement, achieved by sweeping the phase of a secondary clock signal, driving additional shadowing registers strategically added to the application design. The measurement technique and initial voltage and frequency scaling experiments are demonstrated on an Altera Cyclone III FPGA. Timing performance can be measured with a best case resolution of 96 ps. The additional circuitry results in minimal overhead in terms of area and performance. Power savings of 23% dynamic and 13% static in an example circuit are achieved through DVS, or performance improvements of 21% through DFS, when compared with operating at nominal core voltage, or timing model FMax.
    Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on; 01/2012
  • T. Mak, P.Y.K. Cheung, Kai-Pui Lam, W. Luk
    ABSTRACT: Dynamic routing is desirable because of its substantial improvement in communication bandwidth and intelligent adaptation to faulty links and congested traffic. However, implementation of adaptive routing in a network-on-chip system is not trivial and is further complicated by the requirements of deadlock-free and real-time optimal decision making. In this paper, we present a deadlock-free routing architecture which employs a dynamic programming (DP) network to provide on-the-fly optimal path planning and network monitoring for packet switching. Also, a new routing strategy called k-step look ahead is introduced. This new strategy can substantially reduce the size of the routing table and maintain a high quality of adaptation, which leads to a scalable dynamic-routing solution with minimal hardware overhead. Our results, based on a cycle-accurate simulator, demonstrate the effectiveness of the DP network, which outperforms both the deterministic and adaptive-routing algorithms in average delay on various traffic scenarios by 22.3%. Moreover, the hardware overhead for the DP network is insignificant, based on the results obtained from the hardware implementations.
    IEEE Transactions on Industrial Electronics 09/2011; · 6.50 Impact Factor
  • ABSTRACT: Restoration methods, such as super-resolution (SR), largely depend on the accuracy of the point spread function (PSF). PSF estimation is an ill-posed problem, and a linear and uniform motion is often assumed. In real-life systems, this may deviate significantly from the actual motion, impairing subsequent restoration. To address the above, this work proposes a dynamically configurable imaging system that combines algorithmic video enhancement, field programmable gate array (FPGA)-based video processing and adaptive image sensor technology. Specifically, a joint blur identification and validation (BIV) scheme is proposed, which validates the initial linear and uniform motion assumption. For the cases that significantly deviate from that assumption, the real-time reconfiguration property of an adaptive image sensor is utilised, and the sensor is locally reconfigured to larger pixels that produce higher frame-rate samples with reduced blur. Results demonstrate that once the sensor reconfiguration gives rise to a valid motion assumption, highly accurate PSFs are estimated, resulting in improved SR reconstruction quality. To enable real-time reconstruction, an FPGA-based BIV architecture is proposed. The system's throughput is significantly higher than 25 fps, for frame sizes up to 1024 × 1024, and its performance is robust to noise for signal-to-noise ratio (SNR) as low as 20 dB.
    IET Computers & Digital Techniques 08/2011; · 0.28 Impact Factor
  • Yan Wu, P. Kuvinichkul, P.Y.K. Cheung, Y. Demiris
    ABSTRACT: The Theremin is an electronic musical instrument considered to be the most difficult to play. It requires the player's hands to have high precision and stability, as any position change within proximity of the instrument's antennae can make a difference to the pitch or volume. In a different direction to previous developments of Theremin-playing robots, we propose a Humanoid Thereminist System that goes beyond using only one degree of freedom, which will open up the possibility for the robot to acquire more complex skills, such as aerial fingering, and to include musical expressions in playing the Theremin. The proposed system consists of two phases, namely a calibration phase and a playing phase, which can be executed independently. During the playing phase, the system takes input from a MIDI file and performs path planning using a combination of a minimum energy strategy in joint space and feedback error correction for the next playing note. Three experiments have been conducted to evaluate the developed system quantitatively and qualitatively by playing a selection of music files. The experiments have demonstrated that the proposed system can effectively utilise multiple degrees of freedom while maintaining minimum pitch error margins.
    Robotics and Biomimetics (ROBIO), 2010 IEEE International Conference on; 01/2011
  • E. Stott, J.S.J. Wong, P.Y.K. Cheung
    ABSTRACT: FPGAs are powerful platforms for investigating impending challenges associated with process scaling, such as variation and degradation. Their versatility allows us to gather empirical data and evaluate novel solutions. We carried out accelerated-life tests on modern FPGA devices and obtained a useful characterisation of the ageing processes that afflict them. We also quantified the potential benefits of three degradation mitigation strategies based on exploiting spare logic and interconnect resources. The work helps cement the role of reconfigurable logic as a vitally-important technology in the face of the uncertainties of future process scaling.
    Field Programmable Logic and Applications (FPL), 2010 International Conference on; 10/2010
  • B. Cope, P.Y.K. Cheung, W. Luk, L. Howes
    ABSTRACT: A systematic approach to the comparison of the graphics processor (GPU) and reconfigurable logic is defined in terms of three throughput drivers. The approach is applied to five case study algorithms, characterized by their arithmetic complexity, memory access requirements, and data dependence, and two target devices: the nVidia GeForce 7900 GTX GPU and a Xilinx Virtex-4 field programmable gate array (FPGA). Two orders of magnitude speedup, over a general-purpose processor, is observed for each device for arithmetic intensive algorithms. An FPGA is superior, over a GPU, for algorithms requiring large numbers of regular memory accesses, while the GPU is superior for algorithms with variable data reuse. In the presence of data dependence, the implementation of a customized data path in an FPGA exceeds GPU performance by up to eight times. The trends of the analysis to newer and future technologies are analyzed.
    IEEE Transactions on Computers 05/2010; · 1.38 Impact Factor
  • ABSTRACT: Hardware sharing can be used to reduce the area and the power dissipation of a design. This is of particular interest in the field of image and video compression, where an encoder must deal with different design tradeoffs depending on the characteristics of the signal to be encoded and the constraints imposed by the users. This paper introduces a novel methodology for exploring the design space based on the amount of hardware sharing between different functional blocks, giving as a result a set of feasible solutions which are broad in terms of hardware cost and throughput capabilities. The proposed approach, inspired by the notion of a partition in set theory, has been applied to optimize and to evaluate the sharing alternatives of a group of image and video compression key computational kernels when mapped onto a Xilinx Virtex-5 FPGA.
    Design, Automation & Test in Europe Conference & Exhibition (DATE), 2010; 01/2010
  • ABSTRACT: This paper is concerned with the application of geometric programming to the design of homogeneous field programmable gate array (FPGA) architectures. The paper builds on an increasing body of work concerned with modeling reconfigurable architectures, and presents a full area and delay model of an FPGA. We use a geometric programming framework to show how transistor sizing and high-level architecture parameter selection can now be solved as a concurrent optimization problem. We validate the model through the use of simulation program with integrated circuit emphasis (SPICE) models and the versatile place and route (VPR) FPGA architecture simulation tool. Not only does the optimization framework allow architectures to be optimized orders of magnitude faster than previous work, but the combined optimization can lead to different architectural conclusions compared to conventional methods by exploring the coupling between the two sets of optimization variables. Specifically, we show that as delay takes more significance in the objective of the optimization, there should be more lookup tables in a logic block, whereas conventional techniques suggest that there should be fewer lookup tables in an FPGA logic block.
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 01/2010; 29(8):1163-1176. · 1.09 Impact Factor
  • ABSTRACT: Super-resolution (SR) methods are largely affected by the accurate evaluation of the Point Spread Function (PSF) that is related to the input frames. When the frames are degraded by heavy motion blur, the PSFs are highly non-isotropic, which further complicates their estimation. The ill-posed nature of blur identification is usually addressed using the assumption of linear and uniform motion. However, in real-life systems, this may deviate significantly from the actual motion blur. To resolve the above, this work proposes combining a scheme that validates the initial motion assumption with the real-time reconfiguration property of an adaptive image sensor. If the linearity and uniformity assumption is invalid for a given motion region, the sensor is locally reconfigured to larger pixels that produce higher frame-rate samples with reduced blur. Once the appropriate configuration that gives rise to a valid motion assumption is applied, highly accurate PSFs are estimated, resulting in an improved SR reconstruction quality.
    Image Processing (ICIP), 2009 16th IEEE International Conference on; 12/2009
  • ABSTRACT: This paper presents a methodology for estimating and optimising FPGA routing fabrics using high-level modelling and convex optimisation techniques. Experimental methods for exploring design spaces suffer from expensive computation time, which is exacerbated by increased dimensionality due to the larger number of architectural parameters. In this paper we build on previously published work to describe a model of FPGA routing area. This model is used in conjunction with a form of optimisation known as geometric programming, in order to analytically derive optimised FPGA architectural parameters, demonstrating the power and accuracy of model-based approaches in configurable architecture design. We show that routing parameters such as connection and switch box flexibilities can be architected to save around 6% of area instead of using traditional "rules of thumb".
    Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on; 10/2009
  • P. Sedcole, E. Stott, P.Y.K. Cheung
    ABSTRACT: Two complementary techniques for reducing the effect of within-die variability on the critical path delay in FPGA circuits are reported. The first technique selects the best LUT mapping from a set of alternative mappings of a logic function for each LUT cluster in the FPGA. The second selects the best assignment of LUTs to physical locations within a cluster. The techniques can be used together, and are shown in Monte Carlo experiments to reduce both the mean and standard deviation of critical path delay.
    Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on; 10/2009
  • S.A. Fahmy, P.Y.K. Cheung, W. Luk
    ABSTRACT: Most effort in designing median filters has focused on two-dimensional filters with small window sizes, used for image processing. However, recent work on novel image processing algorithms, such as the trace transform, has highlighted the need for architectures that can compute the median and weighted median of large one-dimensional windows, to which the optimisations in the aforementioned architectures do not apply. A set of architectures for computing both the median and weighted median of large, flexibly sized windows through parallel cumulative histogram construction is presented. The architecture uses embedded memories to control the highly parallel bank of histogram nodes, and can implicitly determine window sizes for median and weighted median calculations. The architecture is shown to perform at 72 Msamples, and has been integrated within a trace transform architecture.
    IET Computers & Digital Techniques 08/2009; · 0.28 Impact Factor
  • ABSTRACT: Contemporary FPGA-based reconfigurable systems have been widely used to implement data-dominated applications. In these applications, data transfer and storage consume a large proportion of the system energy. Exploiting data-reuse can introduce significant power savings, but also introduces the extra requirement for on-chip memory. To aid data-reuse design exploration early during the design cycle, the authors present an optimisation approach to achieve a power-optimal design satisfying an on-chip memory constraint in a targeted FPGA-based platform. The data-reuse exploration problem is mathematically formulated and shown to be equivalent to the multiple-choice knapsack problem. The solution to this problem for an application code corresponds to the decision of which array references are to be buffered on-chip and where in the code the reused data of those array references is loaded into on-chip memory, in order to minimise power consumption for a fixed on-chip memory size. The authors also present an experimentally verified power model, capable of providing the relative power information between different data-reuse design options of an application, resulting in a fast and efficient design-space exploration. The experimental results demonstrate that the approach enables us to find the most power-efficient design for all the benchmark circuits tested.
    IET Computers & Digital Techniques 06/2009; · 0.28 Impact Factor

Publication Stats

2k Citations
36.32 Total Impact Points

Institutions

  • 2011
    • Newcastle University
      Newcastle-on-Tyne, England, United Kingdom
  • 2009
    • Trinity College Dublin
      Dublin, Leinster, Ireland
  • 1970–2009
    • Imperial College London
      • • Department of Electrical and Electronic Engineering
      • • Department of Computing
      London, ENG, United Kingdom
  • 2003–2005
    • Mahanakorn University of Technology
      Krung Thep, Bangkok, Thailand
  • 2001
    • Imperial Valley College
      Imperial, California, United States
  • 2000
    • Xilinx Inc.
      San Jose, California, United States
  • 1992–1996
    • University of London
      Londinium, England, United Kingdom
  • 1994
    • Higher Institute for Applied Science and Technology
      Dimashq, Damascus City, Syria