Kah Phooi Seng’s research while affiliated with University of the Sunshine Coast and other places


Publications (230)


Figure 1. An emotional human-robot interaction system between robotics and humans supported by machine learning methods.
Figure 3. Illustration of skeletal graph.
Figure 4. Overall architecture of Trajectories-Aware and Skeleton-Graph-Aware Spatial-Temporal Transformer (TS-ST).
Figure 5. Process of trajectorial encoding.
Figure 6. The details of the temporal transformer, with dotted arrows representing residual connections.


Gait-To-Gait Emotional Human–Robot Interaction Utilizing Trajectories-Aware and Skeleton-Graph-Aware Spatial–Temporal Transformer
  • Article
  • Full-text available

January 2025 · 3 Reads

Chenghao Li · Kah Phooi Seng · Li-Minn Ang

Emotional responses from robots are crucial for raising the social intelligence of human–robot interaction (HRI). Advances in machine learning have stimulated extensive research on emotion recognition for robots. Our research focuses on emotional gaits, a simple modality that stores a series of joint coordinates and is easy for humanoid robots to execute. However, few studies investigate gait-based emotional HRI systems, leaving a gap in both human emotional gait recognition and robotic emotional gait response. To address this gap, we propose a Gait-to-Gait Emotional HRI system, with an emphasis on developing an innovative emotion classification model. In our system, the humanoid robot NAO recognizes emotions from human gaits through our Trajectories-Aware and Skeleton-Graph-Aware Spatial–Temporal Transformer (TS-ST) and responds with pre-set emotional gaits that reflect the same emotion the human displayed. Our TS-ST outperforms the current state-of-the-art human-gait emotion recognition model applied to robots on the Emotion-Gait dataset.


Customized Binary Convolutional Neural Networks and Neural Architecture Search on Hardware Technologies

January 2025 · 3 Reads

IEEE Nanotechnology Magazine

Customized binary convolutional neural network (BCNN) architectures implemented on hardware technologies offer significant advantages in computational efficiency and hardware acceleration. Deploying these customized BCNNs in real-time domains such as edge devices, embedded systems and other resource-constrained hardware platforms is becoming increasingly important. BCNN architectures, with their simplified representation of binarized weights and activations, significantly reduce computational and memory bandwidth requirements. On the one hand, straightforward binarization of full-precision CNN architectures simplifies the hardware. On the other hand, binarization reduces algorithm performance. It is therefore highly desirable to improve the computational efficiency of BCNN architectures through algorithmic and hardware-level optimizations without significantly affecting algorithm performance on hardware. Using neural architecture search (NAS) to optimize BCNN architectures is a promising approach. This paper proposes and illustrates efficient designs and customized BCNN architectures with examples for two edge applications (compressed sensing and image super-resolution). Our designs improve the computational efficiency of the BCNN architectures through algorithmic and hardware-level optimizations without significantly affecting the algorithm performance. In some cases, the NAS-optimized BCNN architectures perform better than the full-precision CNN architectures. Hardware analysis substantiates the computational effectiveness of the proposed architectures.
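To make concrete why binarized weights and activations cut computation and memory traffic, here is a minimal sketch in plain Python. It is not taken from the paper: the helper names (`binarize`, `pack_bits`, `binary_dot`) are hypothetical, and it assumes simple sign binarization, with each +1/-1 value stored as one bit so that a dot product reduces to an XNOR followed by a bit count.

```python
def binarize(values):
    """Map each real value to +1 or -1 by its sign (zero maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a +1/-1 vector into one integer, one bit per element (bit set = +1)."""
    word = 0
    for i, s in enumerate(signs):
        if s == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # bit is 1 wherever the two signs match
    matches = bin(agree).count("1")     # popcount
    return 2 * matches - n              # (matches) - (mismatches)


# Illustrative values only, not data from the paper.
weights = binarize([0.7, -1.2, 0.1, -0.4])
activations = binarize([1.5, 0.3, -0.2, -0.9])
result = binary_dot(pack_bits(weights), pack_bits(activations), len(weights))
```

The bitwise result equals the ordinary dot product of the two +1/-1 vectors, which is why hardware can replace multiply-accumulate units with XNOR gates and popcount logic. Practical BCNNs usually add scaling factors and other refinements that this sketch omits.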


Initial Seed Selection
UNSDipDECK
An example dataset to demonstrate UCN and UNS concepts

A deep embedded clustering technique using dip test and unique neighbourhood set

Neural Computing and Applications

In recent years, there has been a growing interest in deep learning-based clustering. A recently introduced technique called DipDECK has shown effective performance on large and high-dimensional datasets. DipDECK utilises Hartigan’s dip test, a statistical test, to merge small non-viable clusters. Notably, DipDECK was the first deep learning-based clustering technique to incorporate the dip test. However, DipDECK overestimates the initial number of clusters and then randomly selects the initial seeds used to produce the final clusters for a dataset. Therefore, in this paper, we present a technique called UNSDipDECK, which is an improved version of DipDECK and does not require user input for datasets with an unknown number of clusters. UNSDipDECK produces high-quality initial seeds and the initial number of clusters through a deterministic process. UNSDipDECK uses the unique closest neighbourhood and unique neighbourhood set approaches to determine high-quality initial seeds for a dataset. In our study, we compared the performance of UNSDipDECK with fifteen baseline clustering techniques, including DipDECK, using NMI and ARI metrics. The experimental results indicate that UNSDipDECK outperforms the baseline techniques, including DipDECK. Additionally, we demonstrated that the initial seed selection process significantly contributes to UNSDipDECK’s ability to produce high-quality clusters.


Multiple Kernel Learning for Multi-class Facial Emotion Classification

November 2024 · 1 Read

Model selection in most Multiple Kernel Learning (MKL) frameworks involves building several kernels with several parameters. Rather than optimizing a few relevant kernels, these frameworks build several kernels that share the same features but differ in parameters, which introduces redundancy. In this paper, we tackle the problem of selecting more informative basis kernels for the facial emotion classification task. This is achieved by enhancing the original optimization capability of the SimpleMKL method from the perspective of Kernel Target Alignment (KTA). The novelty lies in a low-cost model with high accuracy and minimal feature redundancy. Experimental results on the Cohn-Kanade dataset show significant improvements over other approaches.
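As a rough illustration of how Kernel Target Alignment can rank candidate basis kernels, the sketch below computes the standard alignment score between a kernel matrix and the ideal kernel built from the labels. It assumes binary labels in {-1, +1} and the textbook definition (Frobenius inner product, normalized). The abstract does not specify the paper's multi-class formulation or how it is combined with SimpleMKL, so the function names and setup here are assumptions.

```python
import math


def frobenius_inner(a, b):
    """Frobenius inner product of two equally sized matrices (lists of rows)."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def kernel_target_alignment(kernel, labels):
    """Alignment between a kernel matrix and the ideal kernel y y^T.

    Assumes labels are in {-1, +1}. The score lies in [-1, 1]; higher values
    mean the kernel's similarity structure agrees more with the labels.
    """
    target = [[yi * yj for yj in labels] for yi in labels]
    numerator = frobenius_inner(kernel, target)
    denominator = math.sqrt(
        frobenius_inner(kernel, kernel) * frobenius_inner(target, target)
    )
    return numerator / denominator


# Illustrative labels and kernels only, not data from the paper.
labels = [1, 1, -1]
ideal_kernel = [[yi * yj for yj in labels] for yi in labels]
identity_kernel = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

score_ideal = kernel_target_alignment(ideal_kernel, labels)
score_identity = kernel_target_alignment(identity_kernel, labels)
```

A kernel identical to the label outer product scores 1, while an uninformative identity kernel scores lower, so sorting candidates by this score is one plausible way to discard redundant kernels before the MKL optimization.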



CycleGAN*: Collaborative AI Learning With Improved Adversarial Neural Networks for Multimodalities Data

November 2024 · 7 Reads

IEEE Transactions on Artificial Intelligence

With the widespread adoption of Generative Adversarial Networks (GANs) for sample generation, this paper aims to enhance adversarial neural networks to facilitate collaborative Artificial Intelligence (AI) learning which has been specifically tailored to handle datasets containing multi-modalities. Currently, a significant portion of the literature is dedicated to sample generation using GANs, with the objective of enhancing the detection performance of machine learning (ML) classifiers through the incorporation of these generated data into the original training set via adversarial training. The quality of the generated adversarial samples is contingent upon the sufficiency of training data samples. However, in the multimodal domain, the scarcity of multimodal data poses a challenge due to resource constraints. In this paper, we address this challenge by proposing a new multimodal dataset generation approach based on the classical audiovisual speech recognition task, utilizing CycleGAN, DiscoGAN, and StyleGAN2 for exploration and performance comparison. Audiovisual Speech Recognition (AVSR) experiments are conducted using the LRS2 and LRS3 corpora. Our experiments reveal that CycleGAN, DiscoGAN, and StyleGAN2 do not effectively address the low-data state problem in AVSR classification. Consequently, we introduce an enhanced model, CycleGAN*, based on the original CycleGAN, which efficiently learns the original dataset features and generates high-quality multimodal data. Experimental results demonstrate that the multimodal datasets generated by our proposed CycleGAN* exhibit significant improvement in Word Error Rate (WER), indicating reduced errors. Notably, the images produced by CycleGAN* exhibit a marked enhancement in overall visual clarity, indicative of its superior generative capabilities. Furthermore, in contrast to traditional approaches, we underscore the significance of collaborative learning. 
We implement co-training with diverse multimodal data to facilitate information sharing and complementary learning across modalities. This collaborative approach enhances the model’s capability to integrate heterogeneous information, thereby boosting its performance in multimodal environments.
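The abstract reports results as Word Error Rate (WER). For readers unfamiliar with the metric, the sketch below shows the usual definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name and example sentences are illustrative assumptions, and the paper's exact scoring tool and text normalization are not stated in the abstract.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix handled so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences only: one substitution and one deletion in six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Lower WER means fewer recognition errors, which is why the abstract describes an improvement in WER as reduced errors.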



Deep Learning and Neural Architecture Search for Optimizing Binary Neural Network Image Super Resolution

June 2024 · 12 Reads

The evolution of super-resolution (SR) technology has seen significant advancements through the adoption of deep learning methods. However, the deployment of such models by resource-constrained devices necessitates models that not only perform efficiently, but also conserve computational resources. Binary neural networks (BNNs) offer a promising solution by minimizing the data precision to binary levels, thus reducing the computational complexity and memory requirements. However, for BNNs, an effective architecture is essential due to their inherent limitations in representing information. Designing such architectures traditionally requires extensive computational resources and time. With the advancement in neural architecture search (NAS), differentiable NAS has emerged as an attractive solution for efficiently crafting network structures. In this paper, we introduce a novel and efficient binary network search method tailored for image super-resolution tasks. We adapt the search space specifically for super resolution to ensure it is optimally suited for the requirements of such tasks. Furthermore, we incorporate Libra Parameter Binarization (Libra-PB) to maximize information retention during forward propagation. Our experimental results demonstrate that the network structures generated by our method require only a third of the parameters, compared to conventional methods, and yet deliver comparable performance.


Image to Label to Answer: An Efficient Framework for Enhanced Clinical Applications in Medical Visual Question Answering

June 2024 · 30 Reads · 3 Citations

Jianfeng Wang · Kah Phooi Seng · Yi Shen · [...] · Difeng Huang

Medical Visual Question Answering (Med-VQA) faces significant limitations in application development due to sparse and challenging data acquisition. Existing approaches focus on multi-modal learning to equip models with medical image inference and natural language understanding, but this worsens data scarcity in Med-VQA, hindering clinical application and advancement. This paper proposes the ITLTA framework for Med-VQA, designed based on field requirements. ITLTA combines multi-label learning of medical images with the language understanding and reasoning capabilities of large language models (LLMs) to achieve zero-shot learning, meeting natural language module needs without end-to-end training. This approach reduces deployment costs and training data requirements, allowing LLMs to function as flexible, plug-and-play modules. To enhance multi-label classification accuracy, the framework uses external medical image data for pretraining, integrated with a joint feature and label attention mechanism. This configuration ensures robust performance and applicability, even with limited data. Additionally, the framework clarifies the decision-making process for visual labels and question prompts, enhancing the interpretability of Med-VQA. Validated on the VQA-Med 2019 dataset, our method demonstrates superior effectiveness compared to existing methods, confirming its outstanding performance for enhanced clinical applications.


Figure 1. The proposed multimodal hierarchical graph-based situation awareness (MHGSA) system.
Figure 7. Fine- and coarse-grained categories of the extracted VGGSound-DR dataset.
Experiment on VGGSound-DR in visual-only mode.
Experiment on VGGSound-DR in audio-only mode.

AI-Empowered Multimodal Hierarchical Graph-Based Learning for Situation Awareness on Enhancing Disaster Responses

May 2024 · 36 Reads · 1 Citation

Situational awareness (SA) is crucial in disaster response, enhancing the understanding of the environment. Social media, with its extensive user base, offers valuable real-time information for such scenarios. Although SA systems excel in extracting disaster-related details from user-generated content, a common limitation in prior approaches is their emphasis on single-modal extraction rather than embracing multiple modalities. This paper proposes a multimodal hierarchical graph-based situational awareness (MHGSA) system for comprehensive disaster event classification. Specifically, the proposed multimodal hierarchical graph contains nodes representing different disaster events, and the features of the event nodes are extracted from the corresponding images and acoustic features. The proposed feature extraction modules, with multiple branches for vision and audio features, provide hierarchical node features for disaster events of different granularities, aiming to build a coarse-granularity classification task that constrains the model and enhances fine-granularity classification. The relationships between different disaster events across modalities are learned by graph convolutional neural networks to enhance the system’s ability to recognize disaster events, thus enabling the system to fuse complex features of vision and audio. Experimental results illustrate the effectiveness of the proposed visual and audio feature extraction modules in single-modal scenarios. Furthermore, the MHGSA successfully fuses visual and audio features, yielding promising results in disaster event classification tasks.


Citations (71)


... If multimodal systems are not thought out properly, they can overly load learners with information and create cognitive overload. Multimodal design involves a careful trade-off between giving us diverse sensory stimuli and also making sure that those stimuli don't clash with one another [1]. Our research examines the effects of multimodal technologies on cognitive load and attentional flow in CALL environments. ...

Reference:

Applications of Multimodal Technology in Computer-Assisted Language Learning: Impacts on Cognitive Load and Attention Distribution
Situation Awareness in AI-Based Technologies and Multimodal Systems: Architectures, Challenges and Applications

IEEE Access

... edge, which proves valuable when adapting them for various downstream medical tasks, including medical diagnosis. In this review, five studies perform text-only pretraining on the LLMs from Chen et al.68 and Wang et al.124 pretrained the model on VQA data, where Chen et al.68 used an out-of-shelf multi-model LLM to reformat image-text pairs from PubMed as VQA data points to train their LLM. To improve the quality of the image encoder, pretraining tasks ...

Image to Label to Answer: An Efficient Framework for Enhanced Clinical Applications in Medical Visual Question Answering

... Wu et al. [11] presented a new deep graph-based learning model. This method denotes a mixture of scalable graph representations embedded in the initial value of graph signal handling models. ...

Energy Efficient Graph-Based Hybrid Learning for Speech Emotion Recognition on Humanoid Robot

... Dynamic partitioning is challenging in SFL because client capabilities, network condition, and availability of computational resources are major factors that affect the dynamic partitioning by affecting the model components. For example, in a smart city deployment project [78], traffic prediction models are collaborated with edge computing component allowing for partitioning of model based on resource availability and real-time traffic pattern. Optimizing distribution model component continuously allows for efficient training in dynamic partitioning with accommodating changing condition in SFL [28]. ...

Multi-Level Split Federated Learning for Large-Scale AIoT System Based on Smart Cities

... After a comparison with the manual method, for a fair comparison Table 5 presents the computational comparison with other gradient-based architecture search method for super resolution. The table shows the reduced computational requirements for our SRBNAS method compared with two other methods (DLSR [48] and DNAS-EASR [49]). To visualize the model performance, Figures 8 and 9 display three different HR images. ...

Efficient FPGA Binary Neural Network Architecture for Image Super-Resolution

... Numerous optimization techniques have been proposed in recent years and have shown to be effective [36]. However, BNNs still suffer from accuracy degradation due to the severe information loss caused by parameter binarization, and the gap with their real-valued counterparts is still significant in several cases [37]. ...

Binary Neural Networks in FPGAs: Architectures, Tool Flows and Hardware Comparisons

... Furthermore, batch normalization and ReLU activation were integrated into DCGAN. As a result, deconvolution was used more frequently in GAN generator architecture, and DCGAN became more well-liked [26]. ...

Generative Adversarial Networks (GANs) for Audio-Visual Speech Recognition in Artificial Intelligence IoT

... The application of artificial intelligence (AI) technology in industrial production has also been widely concerned. Silver and Xu et al. (2023) proved the superior performance of deep neural network in complex strategy games through experiments, and further demonstrated the potential of AI technology [10]. Therefore, the purpose of this study is to put forward a brand-new real-time cost control model by comprehensively applying a variety of modern technologies to fill this gap in the existing research. ...

New Hybrid Graph Convolution Neural Network with Applications in Game Strategy

... The use of the ENF signal in digital media for forensic purposes is not limited to the time-of-recording detection or verification. By serving as a power signature, it also enables other practical applications, including geo-location estimation [24], [30], [31] (e.g., to identify the country of origin of a recording), multimedia synchronization [32], [33] (e.g., to temporally align videos taken by two cameras to merge their views into a single panoramic view), media authentication [34]-[36] (e.g., to determine if a video is original or tampered with), and camera characterization [37], [38] (e.g., to attribute the source camcorder of a video). ...

Exploiting the Rolling Shutter Read-Out Time for ENF-Based Camera Identification