Ngai Wong’s research while affiliated with The University of Hong Kong and other places


Publications (317)


BDLUT: Blind image denoising with hardware‐optimized look‐up tables
  • Article

April 2025

Journal of the Society for Information Display

Boyu Li · Zhilin Ai · Baizhou Jiang · [...] · Ngai Wong

Denoising sensor‐captured images on edge display devices remains challenging due to deep neural networks' (DNNs) high computational overhead and synthetic noise training limitations. This work proposes BDLUT(‐D), a novel blind denoising method combining optimized lookup tables (LUTs) with hardware‐centric design. While BDLUT describes the LUT‐based network architecture, BDLUT‐D represents BDLUT trained with a specialized noise degradation model. Designed for edge deployment, BDLUT(‐D) eliminates neural processing units (NPUs) and functions as a standalone ASIC IP solution. Experimental results demonstrate that BDLUT‐D achieves up to 2.42 dB improvement over state‐of‐the‐art LUT methods on mixed‐noise‐intensity benchmarks while requiring only 66 KB of storage. FPGA implementation shows an over 10× reduction in logic resources and 75% less storage compared to DNN accelerators, while achieving 57% faster processing than traditional bilateral filtering methods. These optimizations enable practical integration into edge scenarios like low‐cost webcam enhancement and real‐time 4K‐to‐4K denoising without compromising resolution or latency. By enhancing silicon efficiency and removing external accelerator dependencies, BDLUT(‐D) establishes a new standard for practical edge imaging denoising. Implementation is available at https://github.com/HKU-LiBoyu/BDLUT .





DnLUT: Ultra-Efficient Color Image Denoising via Channel-Aware Lookup Tables

March 2025

While deep neural networks have revolutionized image denoising capabilities, their deployment on edge devices remains challenging due to substantial computational and memory requirements. To this end, we present DnLUT, an ultra-efficient lookup table-based framework that achieves high-quality color image denoising with minimal resource consumption. Our key innovation lies in two complementary components: a Pairwise Channel Mixer (PCM) that effectively captures inter-channel correlations and spatial dependencies in parallel, and a novel L-shaped convolution design that maximizes receptive field coverage while minimizing storage overhead. By converting these components into optimized lookup tables post-training, DnLUT achieves remarkable efficiency, requiring only 500 KB of storage and 0.1% of the energy consumption of its CNN counterpart DnCNN, while delivering 20× faster inference. Extensive experiments demonstrate that DnLUT outperforms all existing LUT-based methods by over 1 dB in PSNR, establishing a new state-of-the-art in resource-efficient color image denoising. The project is available at https://github.com/Stephen0808/DnLUT.


Valence-Arousal Disentangled Representation Learning for Emotion Recognition in SSVEP-Based BCIs

March 2025 · 4 Reads

IEEE Journal of Biomedical and Health Informatics

Steady state visually evoked potential (SSVEP)-based brain-computer interfaces (BCIs), which are widely used in rehabilitation and disability assistance, can benefit from real-time emotion recognition to enhance human–machine interaction. However, the learned discriminative latent representations in SSVEP-BCIs may generalize in an unintended direction, which can lead to reduced accuracy in detecting emotional states. In this paper, we introduce a Valence-Arousal Disentangled Representation Learning (VADL) method, drawing inspiration from the classical two-dimensional emotional model, to enhance the performance and generalization of emotion recognition within SSVEP-BCIs. VADL distinctly disentangles the latent variables of valence and arousal information to improve accuracy. It utilizes the structured state space duality model to thoroughly extract global emotional features. Additionally, we propose a Multisubject Gradient Blending training strategy that individually tailors the learning pace of reconstruction and discrimination tasks within VADL on-the-fly. To verify the feasibility of our method, we have developed a comprehensive database comprising 23 subjects, in which both the emotional states and SSVEPs were effectively elicited. Experimental results indicate that VADL surpasses existing state-of-the-art benchmark algorithms.



Fig. 2: The train and deploy stages for proposed HaLoRA. (a) During the training stage, HaLoRA incorporates an additional loss regularization term with sampled noise to enhance model robustness. (b) In the deploy stage, the finetuned LLM is mapped to a hybrid CIM architecture formed by RRAM and SRAM-based CIM macros, leveraging their respective advantages.
Fig. 3: The performance of HaLoRA with different values of µ and vanilla LoRA on the OBQA and SIQA datasets.
HaLoRA: Hardware-aware Low-Rank Adaptation for Large Language Models Based on Hybrid Compute-in-Memory Architecture
  • Preprint
  • File available

February 2025 · 8 Reads

Low-rank adaptation (LoRA) is a predominant parameter-efficient finetuning method to adapt large language models (LLMs) for downstream tasks. In this paper, we first propose deploying LoRA-finetuned LLMs on a hybrid compute-in-memory (CIM) architecture (i.e., pretrained weights onto RRAM and LoRA onto SRAM). To address performance degradation from RRAM's inherent noise, we design a novel Hardware-aware Low-rank Adaptation (HaLoRA) method, aiming to train a LoRA branch that is both robust and accurate by aligning the training objectives under both ideal and noisy conditions. Experiments finetuning LLaMA 3.2 1B and 3B demonstrate HaLoRA's effectiveness across multiple reasoning tasks, achieving up to 22.7 improvement in average score while maintaining robustness at various noise levels.
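The deployment split described in this abstract (pretrained weights on a noisy RRAM macro, the LoRA branch on SRAM) can be illustrated with a minimal sketch. This is not the HaLoRA training method; it only shows a forward pass in which noise perturbs the pretrained weights while the low-rank branch stays clean. The additive Gaussian noise model and the names `hybrid_lora_forward`, `noise_std`, `W`, `A` and `B` are assumptions made for illustration.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def hybrid_lora_forward(W, A, B, x, noise_std=0.0, rng=None):
    """Compute y = (W + noise) x + B (A x).

    W is the pretrained weight matrix (assumed to sit on the noisy RRAM part),
    A and B form the low-rank LoRA branch (assumed to sit on noise-free SRAM).
    The additive Gaussian noise is an illustrative assumption, not a device model.
    """
    rng = rng or random.Random(0)
    W_noisy = [
        [w + (rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0) for w in row]
        for row in W
    ]
    base = matvec(W_noisy, x)        # noisy pretrained path
    lora = matvec(B, matvec(A, x))   # clean low-rank path
    return [b + l for b, l in zip(base, lora)]
```

Calling the function with `noise_std=0.0` reproduces the ordinary LoRA output, so comparing it against a run with `noise_std > 0` gives a rough sense of how much error the noisy path alone contributes.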


Figure 1: Upper: The distribution of two types of attention. Lower: The heatmaps of two types of attention.
Figure 2: Overview of ParallelComp. Parallel Attention: The input sequence is split into multiple chunks based on the model's maximum context length. Each chunk undergoes local attention computation independently, and the self-information score of the query is calculated. Parallel KV Cache Eviction: Based on the self-information score, low-score tokens (marked in yellow, R_l) and attention bias tokens (marked in red, R_h) are selectively evicted to optimize memory usage and attention bias. Global Attention: The remaining KV caches are ranked by self-information, and less relevant chunks are discarded. The selected chunks are then concatenated, and a global attention operation is applied to ensure comprehensive information aggregation before the final autoregressive decoding stage.
The model's performance on the InfiniteBench dataset across different datasets.
ParallelComp: Parallel Long-Context Compressor for Length Extrapolation

February 2025 · 4 Reads

Efficiently handling long contexts is crucial for large language models (LLMs). While rotary position embeddings (RoPEs) enhance length generalization, effective length extrapolation remains challenging and often requires costly fine-tuning. In contrast, recent training-free approaches suffer from the attention sink phenomenon, leading to severe performance degradation. In this paper, we introduce ParallelComp, a novel training-free method for long-context extrapolation that extends LLMs' context length from 4K to 128K while maintaining high throughput and preserving perplexity, and integrates seamlessly with Flash Attention. Our analysis offers new insights into attention biases in parallel attention mechanisms and provides practical solutions to tackle these challenges. To mitigate the attention sink issue, we propose an attention calibration strategy that reduces biases, ensuring more stable long-range attention. Additionally, we introduce a chunk eviction strategy to efficiently manage ultra-long contexts on a single A100 80GB GPU. To further enhance efficiency, we propose a parallel KV cache eviction technique, which improves chunk throughput by 1.76x, thereby achieving a 23.50x acceleration in the prefilling stage with negligible performance loss due to attention calibration. Furthermore, ParallelComp achieves 91.17% of GPT-4's performance on long-context tasks using an 8B model trained on 8K-length context, outperforming powerful closed-source models such as Claude-2 and Kimi-Chat.


Schematic of BCI with brain–memristor decoder co-evolution
Enhanced BCI that integrates human brain intelligence with memristor-chip-based machine intelligence (memristor intelligence) is enabled by brain–memristor decoder co-evolution. The process involves three key signals: intention, feedback and response. Initially, the brain sends out a control intention through its activities. Subsequently, the memristor chip decodes it, transmitting the decoding result to the brain as feedback. After that, the brain responds to whether the decoding is correct, guiding the memristor chip update and decoding model transfer. The three sets of evolving multichannel brain signals indicated by red arrows illustrate the fluctuations in brain activities. By contrast, the evolving decoder maps indicated by green arrows illustrate the adaptive memristor decoders. The right panel shows an example of typical BCI-controlled devices, such as a drone, wheelchair, prosthesis and mouse.
Design of a memristor-chip-enabled BCI
a, Framework of the memristor-chip-enabled BCI. The pink, green and blue arrows represent the brain, decoder and hybrid paths, respectively. In the brain and decoder paths, the memristor chip decodes the brain’s SSVEP signals and then feeds the results back to the brain. In the hybrid path, ErrP responses from the brain are detected to accumulate new online training samples to update the decoder. SET/RESET pulses are then applied on the memristor chip to update the model. SL, source line; BL, bit line; WL, word line. b, Photograph of the memristor-chip-based decoding platform. c, Cross-sectional transmission electron microscopy image of the memristor chip with one-transistor–one-resistor cells. The inset shows the TiN/TaOx/HfO2/TiN material stack of the memristor device. d, One-step implementation of the SSVEP decoding algorithm on the memristor array by merging the original preprocessing (temporal filtering), feature extraction (spatial filtering) and pattern recognition (template matching) into one matrix computation. e, Comparison of the computational complexity for the three-step and one-step decoding algorithm implementations for an 8-channel and 12-command drone flight control task. f–h, Coefficient maps of the temporal-filtering (f), spatial-filtering (g) and template-matching (h) signals for the conventional three-step implementation in this task. i, Conductance maps for the concise one-step implementation. The top panel is the mapped differential conductance on the memristor array, and the lower panel shows the corresponding mapping error.
Real-time brain-controlled drone flight with a memristor-chip-based decoder
a, Schematic of the experimental setup. The left panel illustrates that the subject controls the drone flight through the memristor-chip-based decoder. The screen shows 12 encoded stimuli of flight commands along with a real-time video stream of the drone. The right panel shows the flight scene and trajectory in the online experiment. T1–T3 represent several key time points during the online drone flight. b, Key frames of the memristor-chip-based online decoding experiment. c, Differential conductance maps of the memristor decoder for five different subjects. d–g, Performance comparisons between the CPU and memristor chip, including accuracy (d), ITR (e), energy consumption (f) and normalized decoding speed (g). The dots in d–e indicate the corresponding results of different subjects. The error bar shows the mean ± s.e.m. (n = 5). h, Comparison of the decoding results from the CPU and memristor chip in the simulated online experiments, showing a linear relationship and high consistency. i, Confusion matrix of the memristor-chip-based decoding experiment. CW, clockwise; CCW, counterclockwise.
Online brain–memristor decoder co-evolution experiments with HLU
a, Decoding accuracies versus update cycles for BCI with (w/) and without (w/o) co-evolution for ten subjects. The error bar shows the mean ± s.e.m. (n = 10). Two-sided paired-sample t-test: **P < 0.01, ***P < 0.001. b, Evolution of the decoding accuracy for 12 different control commands. Each pixel represents the average result of ten subjects. c, Accuracies calculated using FBCCA after different updates. Each dot represents the average result of ten subjects, and the dashed line is the result of linear fitting (R² = 0.588, P < 0.01). d, Progressive enhancement in signal-to-noise ratio (SNR) for the task-related components at 10 Hz and its harmonics during evolution. e, Temporal waveforms of task-related components after the decoder update. f, Adaptive memristor decoder maps after each update cycle for eight recording channels. g, Correlation coefficient between the decoder maps after each update and the final one. Insets: correlation coefficient for each channel of the decoder. h, Evolution of the individual contribution of the brain (α/(α + β); red) and the decoder (β/(α + β); green) to the enhanced co-evolutional BCI performance.
A memristor-based adaptive neuromorphic decoder for brain–computer interfaces

February 2025 · 1,591 Reads · 2 Citations

Practical brain–computer interfaces should be able to decipher brain signals and dynamically adapt to brain fluctuations. This, however, requires a decoder capable of flexible updates with energy-efficient decoding capabilities. Here we report a neuromorphic and adaptive decoder for brain–computer interfaces, which is based on a 128k-cell memristor chip. Our approach features a hardware-efficient one-step memristor decoding strategy that allows the interface to achieve software-equivalent decoding performance. Furthermore, we show that the system can be used for the real-time control of a drone in four degrees of freedom. We also develop an interactive update framework that allows the memristor decoder and the changing brain signals to adapt to each other. We illustrate the capabilities of this co-evolution of the brain and memristor decoder over an extended interaction task involving ten participants, which leads to around 20% higher accuracy than an interface without co-evolution.


Citations (40)


... [2] However, traditional BCIs are limited by noise interference and the low throughput of electroencephalography (EEG) signals, which hinder improvements in decoding accuracy. [3] Although brain-inspired computing has advanced with models like spiking neural networks (SNNs), challenges remain in efficiency and simulation accuracy. Against this backdrop, large AI models, exemplified by DeepSeek, have demonstrated disruptive potential. ...

Reference:

DeepSeek or ChatGPT: Can brain‐computer interfaces/brain‐inspired computing achieve leapfrog development with large AI models?
A memristor-based adaptive neuromorphic decoder for brain–computer interfaces

... Currently, edge-exposed exfoliation using sticky tape has been demonstrated as a simple, scalable, and reliable method for mass-producing ultrathin and transferable diamond films. [29] However, because this is a newly developed three-dimensional (3D) film, post-manipulation of ultrathin diamond (e.g., on-surface nanofabrication) remains difficult when following the approaches used for two-dimensional (2D) materials and bulk materials via conventional methods (e.g., lithography and etching). The incompatibility of diamond film with conventional nanofabrication processes is mainly attributed to its special properties, including i) low surface electrical conductivity, which leads to proximity effects and charge accumulation, easily inducing the aggregation of electrons during electron-beam lithography (EBL) and diminishing the accuracy of the defined pattern; ii) fragility, which results in shattering and cracking due to stress when spin-coating directly on diamond film, especially for films thinner than one micron, undermining its integrity; and iii) surface geometric fluctuation when attached to curved and flexible surfaces, which can easily lead to uneven distribution of spin-coated resists and make sample focusing difficult during photo-/e-beam lithography. ...

Scalable production of ultraflat and ultraflexible diamond membrane

Nature

... Drawing inspiration from LoRA's intrinsic space [48], we propose subspace selective activation (SSA) without additional training to filter out irrelevant subspace outputs, ensuring alignment between responses and instructions. In particular, a vanilla LoRA can be decomposed as the product of two low-rank subspaces (i.e., A, B) and an intrinsic mixing matrix W: ...

Mixture-of-Subspaces in Low-Rank Adaptation
  • Citing Conference Paper
  • January 2024

... PTQ has seen diverse developments across several distinct strategies. GPTQ [Frantar et al., 2022] pioneered the use of Hessian-based updates to refine quantization errors, leading to derivatives such as QuantEase [Behdin et al., 2023], VPTQ, and APTQ [Guan et al., 2024], which achieve near-lossless performance at 3-bit quantization. QuIP [Chee et al., 2023] introduced rotation-based methods that leverage orthogonal transformations to mitigate outliers. ...

APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models
  • Citing Conference Paper
  • November 2024

... To manage storage requirements, the 4D index space is uniformly subsampled. Additionally, a rotation ensemble strategy rotates each 2 × 2 input patch four times, effectively expanding the receptive field to 3 × 3. Recent works have focused on increasing the effective receptive field through various techniques, including multiple LUTs [12,14,17], shift aggregations [17], and diverse kernel patterns [8,12,16]. ...

Hundred-Kilobyte Lookup Tables for Efficient Single-Image Super-Resolution
  • Citing Conference Paper
  • August 2024

... This could also help to generate synthetic data for machine learning models pretraining, which may be a next step of the approach described in [120]. Du et al [131] harnessed hyperdimensional computing, which is a brain-inspired framework motivated by the observation that cerebral cortex operates on high-dimensional data. Zhang et al [132] used a spiking neural network (SNN), which is a closer mimic of real brain than conventional deep neural network, to learn patterns from EEG; Gallou et al [133] developed a specialized hardware implementation of an SNN. ...

Hyperdimensional Computing with Multi-Scale Local Binary Patterns for Scalp EEG-Based Epileptic Seizure Detection
  • Citing Article
  • August 2024

IEEE Internet of Things Journal

... Methods combining transformer with CNN have made significant progress in the field of medical image segmentation. Qi et al. [46] proposed a cascade module that uses convolutional kernels of different sizes to construct multiple receptive fields and learns the global context through self-attentive layers. The authors used multiple layers with shift operations to be more resilient to recognizing shape, size, and boundary changes. ...

Hybrid Module with Multiple Receptive Fields and Self-Attention Layers for Medical Image Segmentation
  • Citing Conference Paper
  • April 2024

... Since backpropagation is not required during training, ZO methods reduce memory consumption significantly compared to a FO method. MeZO (Malladi et al., 2024) employed a memory-efficient ZO stochastic gradient descent (ZO-SGD) algorithm to efficiently fine-tune LLMs exceeding 60 billion parameters, leveraging parameter-efficient approaches (Yang et al., 2024b; Liu et al., 2021) like LoRA (Hu et al., 2021). Other ZO methods include ZO-SGD (Ghadimi and Lan, 2013) and ZO-Sign-SGD (Liu et al., 2018) using sign-based gradient estimation, the ZO-Adam (Chen et al., 2019) optimizer exploiting momentum information, and parameter-efficient methods like AdaZeta (Yang et al., 2024a). ...

LoRETTA: Low-Rank Economic Tensor-Train Adaptation for Ultra-Low-Parameter Fine-Tuning of Large Language Models
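The zeroth-order (ZO) idea mentioned in the excerpt above, updating parameters from loss evaluations alone without backpropagation, can be sketched with the standard two-point random-direction estimator. This is a generic illustration, not the implementation of MeZO or any other cited method; the function name `zo_sgd_step`, the step size `lr` and the perturbation scale `eps` are assumptions chosen for the example.

```python
import random


def zo_sgd_step(loss, theta, lr=0.05, eps=1e-3, rng=None):
    """One zeroth-order SGD step using two loss evaluations.

    The directional derivative along a random Gaussian direction z is
    approximated by (loss(theta + eps*z) - loss(theta - eps*z)) / (2*eps),
    and the parameters move against that direction. No gradients are stored.
    """
    rng = rng or random.Random(0)
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = loss([t + eps * zi for t, zi in zip(theta, z)])
    minus = loss([t - eps * zi for t, zi in zip(theta, z)])
    scale = (plus - minus) / (2.0 * eps)
    return [t - lr * scale * zi for t, zi in zip(theta, z)]


def quadratic(theta):
    """Toy objective used only to exercise the step function."""
    return sum(t * t for t in theta)
```

Repeatedly applying `zo_sgd_step` to `quadratic` with a shared random generator drives the toy loss down, which is the behaviour the excerpt relies on when it contrasts ZO with first-order (FO) training.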

... Implicit neural representation. There has been a recent surge of interest in implicit neural representation (INR) (Park et al., 2019; Gropp et al., 2020; Grattarola & Vandergheynst, 2022; Lindell et al., 2022; Xie et al., 2023; Li et al., 2023; Molaei et al., 2023; Li et al., 2024a;b) due to its ability to represent discrete signals continuously. Such representation typically is achieved by training an overparameterized MLP, which offers various practical benefits, including memory efficiency (Sitzmann et al., 2020b; Xie et al., 2023) and enhanced training efficiency for downstream computer vision tasks (Dupont et al., 2022; Chen et al., 2023). ...

Learning Spatially Collaged Fourier Bases for Implicit Neural Representation
  • Citing Article
  • March 2024

Proceedings of the AAAI Conference on Artificial Intelligence