P.Y.K. Cheung

Imperial College London, London, England, United Kingdom


Publications (152) · 14.65 Total Impact

  • ABSTRACT: This paper explores the FPGA routing process to mitigate, and take advantage of, delay variability caused by process variation. A method called partial rerouting is proposed to improve timing performance in the presence of process variation while reducing execution time. By rerouting only a small number of critical and near-critical paths, partial rerouting achieves a timing improvement of about 6.3% and speeds up the routing process by a factor of 9 compared with full chip-wise routing over 100 target FPGAs (variation maps). Moreover, partial rerouting enables a trade-off between product yield and routing speed.
    Article · Jan 2014 · IEICE Electronics Express
  • J. Liu · C. Bouganis · P.Y.K. Cheung
    ABSTRACT: This paper presents an adaptive progressive image acquisition algorithm based on the concept of kernel construction. The algorithm takes the conventional route of blind progressive sampling to sample and reconstruct the ground truth image iteratively. During each iteration, an equivalent kernel is built for each unsampled pixel to capture the spatial structure of its local neighborhood. The kernel is normalized by the estimated sample strength in the local area and used as the projection of that pixel's influence onto the subsequent sampling procedure. The sampling priority of a candidate unsampled pixel is the sum of such projections from other unsampled pixels in the local area; pixel locations with the highest priority are sampled in the next iteration. The algorithm requires no pre-processing or compression of the ground truth image and can therefore be used in situations where such procedures are not possible. Experiments show that the proposed algorithm captures the local structure of images and achieves better reconstruction quality than existing methods.
    Article · Jan 2014
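The priority computation described in the abstract can be sketched in a few lines. This is an illustrative toy, not the authors' algorithm: the Gaussian kernel, the local-density estimate, and all parameter values (`radius`, `k`, the seed count) are assumptions standing in for the paper's equivalent-kernel construction.

```python
import numpy as np

def _local_sum(a, kernel):
    """Dense 'same'-size convolution with a small kernel (no SciPy needed)."""
    r = kernel.shape[0] // 2
    pad = np.pad(a.astype(float), r)
    out = np.zeros(a.shape, dtype=float)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += kernel[dy + r, dx + r] * pad[
                r + dy:r + dy + a.shape[0], r + dx:r + dx + a.shape[1]]
    return out

def progressive_sample(image, n_iters=4, k=32, radius=3, seed=0):
    """Pick sample locations iteratively: unsampled pixels in sparsely
    sampled neighbourhoods project more influence, so they and their
    neighbours are prioritised in the next sampling round."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    sampled = np.zeros((h, w), dtype=bool)
    # seed the process with k random samples
    sampled.ravel()[rng.choice(h * w, size=k, replace=False)] = True

    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * (radius / 2.0) ** 2))

    for _ in range(n_iters):
        density = _local_sum(sampled, kernel) + 1e-6       # local sample strength
        influence = np.where(~sampled, 1.0 / density, 0.0)  # normalised projection
        priority = _local_sum(influence, kernel)            # sum of neighbours' projections
        priority[sampled] = -np.inf                         # only unsampled candidates
        top = np.argpartition(priority.ravel(), -k)[-k:]    # k highest-priority pixels
        sampled.ravel()[top] = True
    return sampled
```

Running it on any 2-D array returns a boolean mask whose samples concentrate where previous samples are sparse, which is the behaviour the abstract describes.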
  • Edward Stott · Zhenyu Guan · J.M. Levine · J.S.J. Wong · P.Y.K. Cheung
    ABSTRACT: This paper focuses on variability and reliability issues in FPGAs, showing how they can be addressed using one of the most powerful features of FPGAs: their ability to be reconfigured. The paper also presents techniques for characterizing variability and degradation in these systems, and discusses the empirical approach adopted to understand and confront these problems in field-programmable gate arrays (FPGAs). An important application of on-chip measurement is to produce a map of intra-die variation across a device. This helps in understanding the statistical properties of the delay variation, such as its spatial correlation, and is the first step towards mitigating the problem.
    Article · Dec 2013 · IEEE Design and Test
  • ABSTRACT: This article presents a variation-aware post-placement-and-routing (P&R) retiming method to counteract process variation in FPGAs. Variation-aware retiming takes into account exact variation maps measured on FPGAs, as opposed to statistical static timing analysis (SSTA), which models process variation with statistical distributions. Experiments are conducted using variation maps measured from 100 Cyclone III FPGAs, with the retiming algorithm applied in MATLAB. For circuits with several retiming choices of equivalent logic depth, up to 30% delay improvement can be achieved for a variation coefficient of σ/μ = 0.3.
    Conference Paper · Jan 2013
  • J.M. Levine · E. Stott · G.A. Constantinides · P.Y.K. Cheung
    ABSTRACT: Shadow registers, driven by a variable-phase clock, can be used to extract useful timing information from a circuit during operation. This paper presents Slack Measurement Insertion (SMI), an automated tool flow for inserting shadow registers into an FPGA design to enable measurement of timing slack. The flow provides a parameterised level of circuit coverage and results in minimal timing and area overheads. We demonstrate the process through its application to three complex benchmark designs.
    Conference Paper · Jan 2013
  • Adam Powell · Christos-S. Bouganis · Peter Y.K. Cheung
    ABSTRACT: This paper presents a power and execution-time estimation framework for an FPGA-based soft processor implementing image compression techniques. Using the proposed framework, power consumption and execution time can be estimated quickly, early in the design phase, allowing system designers to assess these performance metrics without implementing the algorithm or generating all possible soft-processor architectures. The estimate uses both high-level algorithm parameters and soft-processor architecture parameters, enabling fast design-space exploration. The model can predict the execution time of an algorithm with, on average, 139% less relative error than predictions using only architecture parameters within the same framework.
    Conference Paper · Aug 2012
  • T.C.P. Chau · W. Luk · P.Y.K. Cheung · A. Eele · J. Maciejowski
    ABSTRACT: This paper presents an adaptive Sequential Monte Carlo approach for real-time applications. The Sequential Monte Carlo method estimates the states of dynamic systems using weighted particles. The proposed approach reduces run-time computational complexity by adapting the size of the particle set, and multiple processing elements on FPGAs are dynamically allocated for improved energy efficiency without violating real-time constraints. A robot localisation application is developed based on the proposed approach. Compared to a non-adaptive implementation, dynamic energy consumption is reduced by up to 70% without affecting the quality of solutions.
    Conference Paper · Jan 2012
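The idea of adapting the particle-set size can be sketched as below. This is a minimal one-dimensional software analogue, not the paper's FPGA design: tying the next set size to the effective sample size (ESS), and the Gaussian motion/measurement models, are my assumptions for illustration.

```python
import numpy as np

def adaptive_pf_step(particles, meas, meas_std=0.5, motion=0.0,
                     n_min=64, n_max=1024, rng=None):
    """One Sequential Monte Carlo update with an adaptive particle count.
    Weights come from a Gaussian measurement likelihood; the next set
    size follows the effective sample size, so a concentrated posterior
    is tracked with fewer particles (and hence less computation)."""
    rng = rng or np.random.default_rng(0)
    # predict: propagate through a trivial motion model plus noise
    particles = particles + motion + rng.normal(0.0, 0.1, particles.size)
    # update: weight each particle by its measurement likelihood
    w = np.exp(-0.5 * ((particles - meas) / meas_std) ** 2)
    w /= w.sum()
    # adapt: effective sample size sets the next particle count
    ess = 1.0 / np.sum(w ** 2)
    n_next = int(np.clip(2.0 * ess, n_min, n_max))
    # resample n_next particles in proportion to weight
    return rng.choice(particles, size=n_next, p=w)

# usage: localise a stationary state at 5.0 from a broad initial guess
rng = np.random.default_rng(1)
p = rng.uniform(0.0, 10.0, 1024)
for _ in range(5):
    p = adaptive_pf_step(p, meas=5.0, rng=rng)
```

After a few updates the particle cloud collapses around the measurement, and the set size stays within the configured bounds.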
  • ABSTRACT: Technology scaling causes increasing and unavoidable delay variability in FPGAs. This paper proposes a two-stage variation-aware placement method that approaches the optimality of full chip-wise (chip-by-chip) placement while requiring only a fraction of the total execution time for a large number of FPGAs with different variation patterns. By classifying variation maps into a finite number of classes, variation-aware placement needs to be executed only on the median map of each class, and the resulting placement is reused for the other FPGAs (variation maps) in that class. The proposed method is implemented in a modified version of VPR 5.0 and verified using variation maps measured from 129 DE0 boards equipped with Cyclone III FPGAs. A mean timing gain of 7.36% is observed across 20 MCNC benchmarks with 16 clusters, while execution time is reduced by a factor of 8 compared with full chip-wise placement.
    Conference Paper · Jan 2012
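The classification step described above can be sketched as follows. A plain k-medoids loop with Euclidean distance is my stand-in for whatever classifier the paper actually uses; the function name and parameters are hypothetical.

```python
import numpy as np

def classify_variation_maps(maps, n_classes=4, iters=10, seed=0):
    """Cluster per-chip delay-variation maps and return one medoid
    (most central member) per class, so a variation-aware placement
    can be run once per class instead of once per chip."""
    rng = np.random.default_rng(seed)
    x = np.asarray(maps, dtype=float).reshape(len(maps), -1)
    medoids = rng.choice(len(x), n_classes, replace=False)
    for _ in range(iters):
        # assign every map to its nearest medoid
        dist = np.linalg.norm(x[:, None] - x[medoids][None], axis=2)
        labels = dist.argmin(axis=1)
        # recentre each class on its most central member
        for c in range(n_classes):
            members = np.flatnonzero(labels == c)
            if members.size:
                intra = np.linalg.norm(
                    x[members][:, None] - x[members][None], axis=2).sum(axis=1)
                medoids[c] = members[intra.argmin()]
    dist = np.linalg.norm(x[:, None] - x[medoids][None], axis=2)
    return medoids, dist.argmin(axis=1)
```

Each returned medoid index names the representative map on which placement would actually be run; the labels say which pre-computed placement each remaining chip reuses.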
  • M.E. Angelopoulou · C.-S. Bouganis · P.Y.K. Cheung
    ABSTRACT: Restoration methods, such as super-resolution (SR), largely depend on the accuracy of the point spread function (PSF). PSF estimation is an ill-posed problem, and a linear and uniform motion is often assumed. In real-life systems, this may deviate significantly from the actual motion, impairing subsequent restoration. To address the above, this work proposes a dynamically configurable imaging system that combines algorithmic video enhancement, field programmable gate array (FPGA)-based video processing and adaptive image sensor technology. Specifically, a joint blur identification and validation (BIV) scheme is proposed, which validates the initial linear and uniform motion assumption. For the cases that significantly deviate from that assumption, the real-time reconfiguration property of an adaptive image sensor is utilised, and the sensor is locally reconfigured to larger pixels that produce higher frame-rate samples with reduced blur. Results demonstrate that once the sensor reconfiguration gives rise to a valid motion assumption, highly accurate PSFs are estimated, resulting in improved SR reconstruction quality. To enable real-time reconstruction, an FPGA-based BIV architecture is proposed. The system's throughput is significantly higher than 25 fps, for frame sizes up to 1024 × 1024, and its performance is robust to noise for signal-to-noise ratio (SNR) as low as 20 dB.
    Article · Aug 2011 · IET Computers & Digital Techniques
  • Yan Wu · P. Kuvinichkul · P.Y.K. Cheung · Y. Demiris
    ABSTRACT: The theremin is an electronic musical instrument considered among the most difficult to play: it requires the player's hands to have high precision and stability, as any position change within proximity of the instrument's antennae alters the pitch or volume. In a departure from previous theremin-playing robots, we propose a Humanoid Thereminist System that goes beyond a single degree of freedom, opening up the possibility for the robot to acquire more complex skills, such as aerial fingering, and to include musical expression in its playing. The proposed system consists of two phases, calibration and playing, which can be executed independently. During the playing phase, the system takes input from a MIDI file and performs path planning using a combination of a minimum-energy strategy in joint space and feedback error correction for the next note. Three experiments evaluate the system quantitatively and qualitatively on a selection of music files, demonstrating that it can effectively utilise multiple degrees of freedom while maintaining minimal pitch error margins.
    Conference Paper · Jan 2011
  • S. Lopez · R. Sarmiento · P.G. Potter · W. Luk · P.Y.K. Cheung
    ABSTRACT: Hardware sharing can be used to reduce the area and power dissipation of a design. This is of particular interest in image and video compression, where an encoder must deal with different design tradeoffs depending on the characteristics of the signal to be encoded and the constraints imposed by the users. This paper introduces a novel methodology for exploring the design space based on the amount of hardware sharing between different functional blocks, yielding a set of feasible solutions that spans a broad range of hardware costs and throughput capabilities. The proposed approach, inspired by the notion of a partition in set theory, has been applied to optimize and evaluate the sharing alternatives of a group of key image and video compression computational kernels mapped onto a Xilinx Virtex-5 FPGA.
    Conference Paper · Jan 2010
  • S.A. Fahmy · P.Y.K. Cheung · W. Luk
    ABSTRACT: Most effort in designing median filters has focused on two-dimensional filters with small window sizes, used for image processing. However, recent work on novel image processing algorithms, such as the trace transform, has highlighted the need for architectures that can compute the median and weighted median of large one-dimensional windows, to which the optimisations in those architectures do not apply. A set of architectures for computing both the median and weighted median of large, flexibly sized windows through parallel cumulative histogram construction is presented. The architecture uses embedded memories to control a highly parallel bank of histogram nodes, and can implicitly determine window sizes for median and weighted median calculations. It is shown to perform at 72 Msamples/s, and has been integrated within a trace transform architecture.
    Article · Aug 2009 · IET Computers & Digital Techniques
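The cumulative-histogram approach has a direct software analogue. The sketch below assumes 8-bit samples (256 histogram bins) and shows the core idea only: accumulate each sample's weight into its bin, then scan until the running total reaches half the total weight.

```python
def weighted_median(samples, weights, bins=256):
    """Weighted median via a cumulative histogram, mirroring the
    hardware scheme: each histogram node accumulates the weight of
    samples falling in its bin, and the median is the first bin whose
    running total reaches half the total weight. Assumes integer
    samples in [0, bins)."""
    hist = [0] * bins
    for s, w in zip(samples, weights):
        hist[s] += w                    # node s accumulates this sample's weight
    half = sum(weights) / 2.0
    running = 0
    for value, count in enumerate(hist):
        running += count                # cumulative histogram scan
        if running >= half:
            return value
    raise ValueError("empty window")

# unit weights reduce this to the plain median
print(weighted_median([3, 1, 2, 5, 4], [1] * 5))
```

In hardware the scan is what the parallel bank of histogram nodes replaces; here it is a simple loop, which also makes the implicit handling of window size visible: the window is just however many (sample, weight) pairs were accumulated.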
  • Q. Liu · G.A. Constantinides · K. Masselos · P.Y.K. Cheung
    ABSTRACT: Contemporary FPGA-based reconfigurable systems are widely used to implement data-dominated applications, in which data transfer and storage consume a large proportion of the system energy. Exploiting data reuse can yield significant power savings, but introduces an extra requirement for on-chip memory. To aid data-reuse design exploration early in the design cycle, the authors present an optimisation approach for achieving a power-optimal design that satisfies an on-chip memory constraint on a targeted FPGA-based platform. The data-reuse exploration problem is mathematically formulated and shown to be equivalent to the multiple-choice knapsack problem. For an application code, the solution determines which array references are buffered on-chip and where in the code the reused data of those references are loaded into on-chip memory, in order to minimise power consumption for a fixed on-chip memory size. The authors also present an experimentally verified power model capable of providing relative power information between different data-reuse design options of an application, resulting in fast and efficient design-space exploration. Experimental results demonstrate that the approach finds the most power-efficient design for all the benchmark circuits tested.
    Article · Jun 2009 · IET Computers & Digital Techniques
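The multiple-choice knapsack structure mentioned above can be made concrete with a small dynamic program: exactly one (power, memory) buffering option is chosen per array reference, minimising total power within a memory budget. The option figures are illustrative, not from the paper; the paper's power model is what would supply them in practice.

```python
def choose_reuse_options(references, mem_budget):
    """Multiple-choice knapsack by dynamic programming over a
    discretised on-chip memory budget.  `references` is a list of
    groups; each group lists (power, memory) options for one array
    reference, and exactly one option per group must be picked."""
    INF = float("inf")
    dp = [0.0] * (mem_budget + 1)      # min power after 0 groups, any budget
    trace = []                          # per-group best option at each budget
    for options in references:
        new = [INF] * (mem_budget + 1)
        choice = [None] * (mem_budget + 1)
        for m in range(mem_budget + 1):
            for i, (power, mem) in enumerate(options):
                if mem <= m and dp[m - mem] + power < new[m]:
                    new[m] = dp[m - mem] + power
                    choice[m] = i
        dp = new
        trace.append(choice)
    if dp[mem_budget] == INF:
        return None                     # no feasible assignment
    # backtrack the chosen option for each reference
    m, picks = mem_budget, []
    for options, choice in zip(reversed(references), reversed(trace)):
        i = choice[m]
        picks.append(i)
        m -= options[i][1]
    return dp[mem_budget], picks[::-1]
```

For example, with two references whose options are `[(10, 0), (4, 2), (2, 4)]` and `[(8, 0), (3, 3)]` (power, memory), a budget of 5 memory units buys the middle option of the first group and the buffered option of the second, for total power 7.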
  • Y. Liu · C.-S. Bouganis · P.Y.K. Cheung
    ABSTRACT: Computation of eigenvalues is essential in many applications in science and engineering. When an application requires high-throughput or real-time eigenvalue computation, a hardware implementation of an eigenvalue computation block is often employed. This work focuses on the eigenvalue computation of real symmetric matrices. For the general symmetric eigenvalue problem, an approximate Jacobi method is proposed; for the special case of a 3×3 symmetric matrix, an algebraic method is introduced. The proposed methods are compared with various other approaches reported in the literature. Results obtained by mapping the above architectures onto a field-programmable gate array device illustrate the advantages of the proposed methods over existing ones.
    Article · Feb 2009 · IET Computers & Digital Techniques
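For reference, the classic (exact) cyclic Jacobi iteration that the paper's approximate variant builds on looks like this; it is the textbook method, not the paper's approximate Jacobi or its 3×3 algebraic solution.

```python
import math

def jacobi_eigenvalues(A, sweeps=10):
    """Cyclic Jacobi iteration for a real symmetric matrix: repeatedly
    annihilate each off-diagonal pair A[p][q] with a plane rotation
    (tan 2θ = 2·A[p][q] / (A[q][q] − A[p][p])) until the matrix is
    numerically diagonal; the diagonal then holds the eigenvalues."""
    n = len(A)
    A = [row[:] for row in A]                # work on a copy
    for _ in range(sweeps):
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p][q]) < 1e-12:
                    continue                 # already (numerically) zero
                theta = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):           # rotate rows p and q
                    Apk, Aqk = A[p][k], A[q][k]
                    A[p][k] = c * Apk - s * Aqk
                    A[q][k] = s * Apk + c * Aqk
                for k in range(n):           # rotate columns p and q
                    Akp, Akq = A[k][p], A[k][q]
                    A[k][p] = c * Akp - s * Akq
                    A[k][q] = s * Akp + c * Akq
    return sorted(A[i][i] for i in range(n))
```

Each rotation is a similarity transform, so eigenvalues are preserved while off-diagonal energy shrinks; hardware versions approximate the angle computation, which is the expensive step this loop makes explicit.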
  • T. Mak · P. Sedcole · P.Y.K. Cheung · W. Luk
    ABSTRACT: On-FPGA communication is becoming more problematic as long-interconnect performance deteriorates with technology scaling. In this paper, we address this issue by presenting a new wave-pipelined signalling scheme to achieve high-throughput communication in FPGAs. The throughput and power consumption of a wave-pipelined link are derived analytically and compared with a conventional synchronous link. Two circuit designs are proposed to realise wave-pipelined links using FPGA fabrics, and the proposed approaches are compared with conventional synchronous and asynchronous pipelining techniques. It is shown that the wave-pipelined approach can achieve up to a 5.66-fold improvement in throughput over the synchronous link, and a 13% improvement in power consumption and a 35% improvement in delay over synchronous register pipelining. Trade-offs in power, speed and area between the proposed and conventional designs are also studied.
    Conference Paper · Jan 2009
  • P. Sedcole · J.S. Wong · P.Y.K. Cheung
    ABSTRACT: As integrated circuits are scaled down, it becomes difficult to maintain uniformity in process parameters across each individual die. To avoid significant performance loss through pessimistic over-design, new design strategies are required that are cognisant of within-die performance variability. This paper examines the effect of process variability on the clock resources in FPGA devices. A model of variation in clock skew in FPGA clock networks is presented. Techniques for reducing the impact of variations on the performance of implemented designs are proposed and analysed, demonstrating that skew variation can be reduced by 70% or more through a combination of phase adjustment and clock rerouting. Measurements on a Virtex-5 FPGA validate the feasibility and benefits of the proposed compensation strategies.
    Conference Paper · Jan 2009
  • P.G. Potter · W. Luk · P. Cheung
    ABSTRACT: This paper proposes a novel approach for design space exploration by characterizing hardware sharing based on the notion of a partition in set theory. Related designs with different degrees of hardware sharing can be captured concisely by a Hasse diagram, highlighting designs with shared building blocks. Hardware sharing can be implemented in various ways, such as component multiplexing, instruction-set processors, or run-time reconfiguration. We illustrate how the proposed approach can be applied to exploring the design space for FPGA implementations of JPEG image compression.
    Conference Paper · Jan 2009
  • Maria E. Angelopoulou · C.-S. Bouganis · Peter Y. K. Cheung
    ABSTRACT: The high-density pixel sensors of the latest imaging systems provide images with high resolution, but require long exposure times, which limits their applicability due to motion blur. Recent technological advances have led to image sensors that can combine several pixels in real time to form a larger pixel. Larger pixels require shorter exposure times and produce high-frame-rate samples with reduced motion blur. This work proposes ways of configuring such a sensor to maximise the raw information collected from the environment, and methods to process that information and enhance the final output. In particular, a super-resolution approach and a deconvolution-based approach to motion deblurring on an adaptive image sensor are proposed, compared and evaluated.
    Conference Paper · Nov 2008
  • Justin S. J. Wong · Peter Y. K. Cheung · Pete Sedcole
    ABSTRACT: The goal of this PhD project is to devise a way to combat the effect of process variation on propagation delays in modern FPGAs. Through our research, we have devised a novel measurement method capable of measuring the delays of components on FPGAs with picosecond timing resolution and fine spatial granularity. The method avoids the use of external test equipment and is able to measure stochastic delay variability, which is becoming increasingly significant. The aim is to test FPGA components exhaustively using this method and to use the results to optimise the placement and routing of circuits in FPGAs, maximising performance under the negative influence of process variation.
    Conference Paper · Oct 2008
  • ABSTRACT: A geometric programming framework is proposed to automate exploration of the design space consisting of data-reuse (buffering) exploitation and loop-level parallelization, in the context of FPGA-targeted hardware compilation. We expose the dependence between data reuse and data-level parallelization and explore both problems under an on-chip memory constraint for performance-optimal designs within a single optimization step. Results from applying this framework to several real benchmarks demonstrate that, for different constraints on on-chip memory utilization, the corresponding performance-optimal designs are determined automatically by the framework, with performance improvements of up to 4.7 times compared with the method that first explores data reuse and then performs parallelization.
    Conference Paper · Oct 2008

Publication Stats

2k Citations
14.65 Total Impact Points

Institutions

  • 1970-2013
    • Imperial College London
      • Department of Electrical and Electronic Engineering
      • Department of Computing
      London, England, United Kingdom
  • 2003
    • Mahanakorn University of Technology
      Bangkok, Thailand
  • 2001-2003
    • Imperial Valley College
      Imperial, California, United States
  • 1992-2000
    • University of London
      London, England, United Kingdom
  • 1994
    • Higher Institute for Applied Science and Technology
      Damascus, Syria