About

67 Publications · 21,383 Reads · 2,245 Citations
Publications (67)
Generative adversarial networks (GANs) have been extensively studied in the past few years. Arguably their most significant impact has been in the area of computer vision where great advances have been made in challenges such as plausible image generation, image-to-image translation, facial attribute manipulation, and similar domains. Despite the s...
Convolution has been the core ingredient of modern neural networks, triggering the surge of deep learning in vision. In this work, we rethink the inherent principles of standard convolution for vision tasks, specifically spatial-agnostic and channel-specific. Instead, we present a novel atomic operation for deep neural networks by inverting the afo...
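As a rough illustration of the inverted design sketched above (spatial-specific, channel-shared kernels in place of spatial-agnostic, channel-specific ones), the following PyTorch snippet generates a small kernel at every spatial position from the local features and applies it across all channels. The class name InvertedConv2d, the kernel-generation network, and the hyperparameters are illustrative assumptions, not the operator defined in the paper.

```python
import torch
import torch.nn as nn

class InvertedConv2d(nn.Module):
    """Sketch of a spatial-specific, channel-agnostic operator: a tiny network
    predicts a KxK kernel at every position, shared across all channels."""
    def __init__(self, channels, kernel_size=3, reduction=4):
        super().__init__()
        self.k = kernel_size
        self.kernel_gen = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, kernel_size * kernel_size, 1),
        )
        self.unfold = nn.Unfold(kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        b, c, h, w = x.shape
        weights = self.kernel_gen(x).view(b, 1, self.k * self.k, h, w)  # one kernel per position
        patches = self.unfold(x).view(b, c, self.k * self.k, h, w)      # KxK neighbourhoods
        return (weights * patches).sum(dim=2)                           # kernel shared over channels

print(InvertedConv2d(16)(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])
```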
Spatial-temporal, channel-wise, and motion patterns are three complementary and crucial types of information for video action recognition. Conventional 2D CNNs are computationally cheap but cannot capture temporal relationships; 3D CNNs can achieve good performance but are computationally intensive. In this work, we tackle this dilemma by designing a...
Weakly supervised object localization (WSOL), adopting only image-level annotations to learn the pixel-level localization model, can greatly reduce the human effort required for annotation. Most one-stage WSOL methods learn the localization model with multi-instance learning, making them only activate discriminative object parts rather than the whole obj...
Weakly supervised object localization (WSOL) relaxes the requirement of dense annotations for object localization by using image-level annotation to supervise the learning process. However, most WSOL methods only focus on forcing the object classifier to produce a high activation score on object parts without considering the influence of background l...
Nowadays, convolutional neural networks (CNNs) have led the development of machine learning. However, most CNN architectures are obtained by manual design, which is empirical, time-consuming, and non-transparent. In this paper, we aim at offering better insight into CNN models from the perspective of optimization theory. We propose a unified frame...
Training deep neural networks (DNNs) typically requires massive computational power. Existing DNNs exhibit low time and storage efficiency due to the high degree of redundancy. In contrast to most existing DNNs, biological and social networks with vast numbers of connections are highly efficient and exhibit scale-free properties indicative of the p...
Studies of generative adversarial networks (GANs) have grown exponentially in the past few years. Their impact has been seen mainly in the computer vision field with realistic image and video manipulation, especially generation, making significant advancements. While these computer vision advances have garnered much attention, GAN applications have di...
Steerable models can provide very general and flexible equivariance by formulating equivariance requirements in the language of representation theory and feature fields, which has been recognized to be effective for many vision tasks. However, deriving steerable models for 3D rotations is much more difficult than that in the 2D case, due to more co...
Steerable models can provide very general and flexible equivariance by formulating equivariance requirements in the language of representation theory and feature fields, which has been recognized to be effective for many vision tasks. However, deriving steerable models for 3D rotations is much more difficult than that in the 2D case, due to more c...
Lifelong learning algorithms aim to enable robots to handle open-set and detrimental conditions, and yet there is a lack of adequate datasets with diverse factors for benchmarking. In this work, we constructed and released a lifelong learning robotic vision dataset, OpenLORIS-Object. This dataset was collected by an RGB-D camera capturing dynamic envi...
Weakly supervised object localization (WSOL) focuses on localizing objects only with the supervision of image-level classification masks. Most previous WSOL methods follow the classification activation map (CAM) that localizes objects based on the classification structure with the multi-instance learning (MIL) mechanism. However, the MIL mechanism...
Deep neural networks are able to memorize noisy labels easily with a softmax cross-entropy (CE) loss. Previous studies attempting to address this issue focus on incorporating a noise-robust loss function into the CE loss. However, the memorization issue is alleviated but still remains due to the non-robust CE loss. To address this issue, we focus on l...
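For context on the noise-robust losses mentioned in this summary, the sketch below implements one common example from prior work, the generalized cross-entropy loss of Zhang and Sabuncu (2018), in PyTorch. It is background for the prior approaches being discussed, not the method proposed in this paper, and the q value is simply a commonly cited default.

```python
import torch
import torch.nn.functional as F

def generalized_cross_entropy(logits, targets, q=0.7):
    """Generalized cross-entropy: interpolates between CE (q -> 0) and MAE (q = 1),
    trading goodness of fit for robustness to noisy labels."""
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1)   # probability of the given label
    return ((1.0 - p_true.clamp_min(1e-7) ** q) / q).mean()

logits = torch.randn(8, 10, requires_grad=True)
targets = torch.randint(0, 10, (8,))
loss = generalized_cross_entropy(logits, targets)
loss.backward()
print(loss.item())
```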
Weakly supervised object localization (WSOL) relaxes the requirement of dense annotations for object localization by using image-level classification masks to supervise its learning process. However, current WSOL methods suffer from excessive activation of background locations and need post-processing to obtain the localization mask. This paper att...
Semi-supervised video action recognition aims to enable deep neural networks to achieve remarkable performance even with very limited labeled data. However, existing methods are mainly transferred from current image-based methods (e.g., FixMatch). Without specifically utilizing the temporal dynamics and inherent multimodal attributes, their result...
In the last few years, we have witnessed a renewed and fast-growing interest in continual learning with deep neural networks with the shared objective of making current AI systems more adaptive, efficient and autonomous. However, despite the significant and undoubted progress of the field in addressing the issue of catastrophic forgetting, benchmar...
Most existing video action recognition models ingest raw RGB frames. However, the raw video stream requires enormous storage and contains significant temporal redundancy. Video compression (e.g., H.264, MPEG-4) reduces superfluous information by representing the raw video stream using the concept of Group of Pictures (GOP). Each GOP is composed...
In this paper, we show our solution to the Google Landmark Recognition 2021 Competition. Firstly, embeddings of images are extracted via various architectures (i.e. CNN-, Transformer- and hybrid-based), which are optimized by ArcFace loss. Then we apply an efficient pipeline to re-rank predictions by adjusting the retrieval score with classificatio...
In this paper, we propose MINE to perform novel view synthesis and depth estimation via dense 3D reconstruction from a single image. Our approach is a continuous depth generalization of the Multiplane Images (MPI) by introducing the NEural radiance fields (NeRF). Given a single image as input, MINE predicts a 4-channel image (RGB and volume density...
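The planes produced by MPI-style methods such as the one summarized above are typically rendered with standard front-to-back "over" compositing. The NumPy sketch below shows only that compositing step; the function name composite_mpi and the toy plane shapes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def composite_mpi(rgb_planes, alpha_planes):
    """Front-to-back 'over' compositing of multiplane images.
    rgb_planes: (D, H, W, 3), alpha_planes: (D, H, W, 1), plane 0 is nearest."""
    out = np.zeros(rgb_planes.shape[1:])
    transmittance = np.ones(alpha_planes.shape[1:])
    for rgb, alpha in zip(rgb_planes, alpha_planes):
        out += transmittance * alpha * rgb      # light surviving to this plane
        transmittance *= (1.0 - alpha)          # attenuate what passes through
    return out

rng = np.random.default_rng(0)
rgb = rng.random((4, 8, 8, 3))
alpha = rng.random((4, 8, 8, 1))
print(composite_mpi(rgb, alpha).shape)  # (8, 8, 3)
```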
Retrieving occlusion relations among objects in a single image is challenging due to the sparsity of boundaries in the image. We observe two key issues in existing works: firstly, the lack of an architecture which can exploit the limited amount of coupling in the decoder stage between the two subtasks, namely occlusion boundary extraction and occlusion orientat...
The nonlocal-based blocks are designed for capturing long-range spatial-temporal dependencies in computer vision tasks. Although having shown excellent performance, they still lack the mechanism to encode the rich, structured information among elements in an image or video. In this paper, to theoretically analyze the property of these nonlocal-base...
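For reference, the standard embedded-Gaussian non-local block that this line of work analyzes can be written in a few lines of PyTorch. The sketch below is the baseline formulation of Wang et al. (2018), not the structured variant proposed here; the channel-reduction ratio is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block: every position attends to every other."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        inter = channels // reduction
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.out = nn.Conv2d(inter, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, inter)
        k = self.phi(x).flatten(2)                     # (b, inter, hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, inter)
        attn = torch.softmax(q @ k, dim=-1)            # pairwise affinities over positions
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return x + self.out(y)                         # residual connection

print(NonLocalBlock(32)(torch.randn(2, 32, 14, 14)).shape)  # torch.Size([2, 32, 14, 14])
```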
Studies of generative adversarial networks (GANs) have grown exponentially in the past few years. Their impact has been seen mainly in the computer vision field with realistic image and video manipulation, especially generation, making significant advancements. While these computer vision advances have garnered much attention, GAN applications have di...
Contrastive learning applied to self-supervised representation learning has seen a resurgence in deep models. In this paper, we find that existing contrastive learning based solutions for self-supervised video recognition focus on inter-variance encoding but ignore the intra-variance existing in clips within the same video. We thus propose to learn...
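The inter-instance (inter-variance) contrastive objective that such methods typically build on is the InfoNCE / NT-Xent loss; a minimal PyTorch sketch is given below, assuming paired clip embeddings from two augmented views. It does not include the intra-variance modeling that this summary proposes.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.1):
    """InfoNCE: row i of z_a should match row i of z_b and repel every other row."""
    z_a = F.normalize(z_a, dim=1)
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature        # cosine similarities as logits
    targets = torch.arange(z_a.size(0))         # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

clip_view_1 = torch.randn(16, 128)   # embeddings of two augmented views of 16 clips
clip_view_2 = torch.randn(16, 128)
print(info_nce(clip_view_1, clip_view_2).item())
```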
Accurate statistical models of neural spike responses can characterize the information carried by neural populations. But the limited samples of spike counts during recording usually result in model overfitting. Moreover, current models assume spike counts to be Poisson-distributed, which ignores the fact that many neurons demonstrate over-dispersed...
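The over-dispersion issue raised here can be illustrated with a short NumPy/SciPy check: for counts whose variance exceeds their mean, a method-of-moments negative binomial fit attains a higher log-likelihood than a Poisson fit. The toy data and moment estimators below are illustrative, not the model proposed in the paper.

```python
import numpy as np
from scipy import stats

# toy over-dispersed spike counts: a negative binomial has variance > mean,
# whereas a Poisson forces variance == mean
rng = np.random.default_rng(0)
counts = rng.negative_binomial(n=2, p=0.3, size=1000)

mean, var = counts.mean(), counts.var()
print(f"mean={mean:.2f}, var={var:.2f}, Fano factor={var / mean:.2f}")  # >> 1 means over-dispersed

# method-of-moments negative binomial fit: var = mu + mu^2 / r
r_hat = mean ** 2 / (var - mean)
p_hat = r_hat / (r_hat + mean)
print("Poisson log-likelihood:", stats.poisson(mean).logpmf(counts).sum())
print("NegBin  log-likelihood:", stats.nbinom(r_hat, p_hat).logpmf(counts).sum())
```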
Learning continually from non-stationary data streams is a long-standing goal and a challenging problem in machine learning. Recently, we have witnessed a renewed and fast-growing interest in continual learning, especially within the deep learning community. However, algorithmic solutions are often difficult to re-implement, evaluate and port acros...
In this paper, we propose an approach to perform novel view synthesis and depth estimation via dense 3D reconstruction from a single image. Our NeMI unifies Neural radiance fields (NeRF) with Multiplane Images (MPI). Specifically, our NeMI is a general two-dimensional and image-conditioned extension of NeRF, and a continuous depth generalization of...
Superpixels are generated by automatically clustering the pixels of an image into hundreds of compact partitions, and are widely used to perceive object contours thanks to their excellent contour adherence. Although some works use a Convolutional Neural Network (CNN) to generate high-quality superpixels, we challenge the design principles of these networks,...
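As a point of reference for the CNN-based superpixel methods discussed above, classical superpixels can be generated with SLIC, which clusters pixels in a joint color-and-position space. The scikit-image call below is a baseline sketch with assumed parameter values, not the network this paper studies.

```python
from skimage import data, segmentation

# classical SLIC superpixels: k-means-style clustering in colour + position space
image = data.astronaut()
labels = segmentation.slic(image, n_segments=300, compactness=10, start_label=0)
overlay = segmentation.mark_boundaries(image, labels)
print(labels.max() + 1, "superpixels; boundary overlay shape:", overlay.shape)
```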
In the last few years, we have witnessed a renewed and fast-growing interest in continual learning with deep neural networks with the shared objective of making current AI systems more adaptive, efficient and autonomous. However, despite the significant and undoubted progress of the field in addressing the issue of catastrophic forgetting, benchmar...
With the expansion of the global robot market, robotics is moving from the robot 3.0 era to the robot 4.0 era. In the robot 4.0 era, robots should not only have the capability of perception and collaboration, but also have the capability of understanding the environment and making decisions by themselves, just like human beings. Then they can provide ser...
Humans have a remarkable ability to learn continuously from the environment and inner experience. One of the grand goals of robots is to build an artificial "lifelong learning" agent that can shape a cultivated understanding of the world from the current scene and previous knowledge via an autonomous lifelong development. It is challenging for the...
Generative adversarial networks (GANs) are increasingly attracting attention in the computer vision, natural language processing, speech synthesis and similar domains. Arguably the most striking results have been in the area of image synthesis. However, evaluating the performance of GANs is still an open and challenging problem. Existing evaluation...
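One widely used example of such a dissimilarity-style metric is the Fréchet Inception Distance, which compares Gaussian fits to real and generated feature vectors. The NumPy/SciPy sketch below assumes the features have already been extracted (e.g., by an Inception network) and is offered as background, not as the evaluation approach proposed in this work.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_fake):
    """FID-style distance: ||mu_r - mu_f||^2 + Tr(S_r + S_f - 2 (S_r S_f)^{1/2})."""
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real          # drop tiny imaginary parts from numerical error
    return np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2.0 * covmean)

rng = np.random.default_rng(0)
real = rng.standard_normal((512, 64))
fake = rng.standard_normal((512, 64)) + 0.5   # shifted distribution -> larger distance
print(frechet_distance(real, real[::-1].copy()), frechet_distance(real, fake))
```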
This report summarizes the IROS 2019 Lifelong Robotic Vision Competition (Lifelong Object Recognition Challenge) with methods and results from the top 8 finalists (out of over 150 teams). The competition dataset, (L)ifel(O)ng (R)obotic V(IS)ion (OpenLORIS) - Object Recognition (OpenLORIS-object), is designed for driving lifelong/continual learning re...
Egocentric gestures are the most natural form of communication for humans to interact with wearable devices such as VR/AR helmets and glasses. A major issue in such scenarios for real-world applications is that it may easily become necessary to add new gestures to the system, e.g., a proper VR system should allow users to customize gestures incremental...
Generative adversarial networks (GANs) are increasingly attracting attention in the computer vision, natural language processing, speech synthesis and similar domains. However, evaluating the performance of GANs is still an open and challenging problem. Existing evaluation metrics primarily measure the dissimilarity between real and generated image...
The recent breakthroughs in computer vision have benefited from the availability of large representative datasets (e.g. ImageNet and COCO) for training. Yet, robotic vision poses unique challenges for applying visual algorithms developed from these standard computer vision datasets due to their implicit assumption over non-varying distributions for...
Recent breakthroughs in computer vision areas, ranging from detection, segmentation, to classification, rely on the availability of large-scale representative training datasets. Yet, robotic vision poses new challenges towards applying visual algorithms developed from these datasets because the latter implicitly assume a fixed set of categories and...
Service robots should be able to operate autonomously in dynamic and daily changing environments over an extended period of time. While Simultaneous Localization And Mapping (SLAM) is one of the most fundamental problems for robotic autonomy, most existing SLAM works are evaluated with data sequences that are recorded in a short period of time. In...
The nonlocal network is designed for capturing long-range spatial-temporal dependencies in several computer vision tasks. Although it has shown excellent performance, it needs elaborate preparation for both the number and position of the building blocks. In this paper, we propose a new formulation of the nonlocal block and interpret it from the...
Pedestrian trajectory prediction is a challenging task because of the complexity of real-world human social behaviors and the uncertainty of future motion. For the first issue, existing methods adopt a fully connected topology for modeling the social behaviors, while ignoring non-symmetric pairwise relationships. To effectively capture social behavio...
Latent dynamics discovery is challenging in extracting complex dynamics from high-dimensional noisy neural data. Many dimensionality reduction methods have been widely adopted to extract low-dimensional, smooth and time-evolving latent trajectories. However, simple state transition structures, linear embedding assumptions, or inflexible inference n...
Generative adversarial networks (GANs) have been extensively studied in the past few years. Arguably the most revolutionary techniques are in the area of computer vision, such as plausible image generation, image-to-image translation, facial attribute manipulation, and similar domains. Despite the significant success achieved in the computer vision field, app...
Generative adversarial networks (GANs) are increasingly attracting attention in the computer vision, natural language processing, speech synthesis and similar domains. Arguably the most striking results have been in the area of image synthesis. However, evaluating the performance of GANs is still an open and challenging problem. Existing evaluation...
Current experimental techniques impose spatial limits on the number of neuronal units that can be recorded in vivo. To model the neuronal dynamics utilizing these sampled data, Latent Variable Models (LVMs) have been proposed to study the common unobserved processes within the system that drive neuronal activities, through an implicit network with...
Current experimental techniques impose spatial limits on the number of neuronal units that can be recorded in vivo. To model the neural dynamics utilizing these sampled data, Latent Variable Models (LVMs) have been proposed to study the common unobserved processes within the system that drive neural activities, through an implicit network with hid...
Linear Dynamical Systems (LDS) are widely used to study the underlying patterns of multivariate time series. A basic assumption of these models is that high-dimensional time series can be characterized by some underlying, low-dimensional and time-varying latent states. However, existing approaches to LDS modeling mostly learn the latent space with a pres...
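The LDS assumption described above, a low-dimensional latent state evolving linearly and observed through a linear map, can be made concrete with a short NumPy simulation; the dimensions, noise levels, and rotational dynamics below are arbitrary illustrative choices, not the structure learned in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_latent, d_obs, T = 2, 50, 200

theta = 0.1
A = 0.99 * np.array([[np.cos(theta), -np.sin(theta)],   # slowly rotating, slightly damped dynamics
                     [np.sin(theta),  np.cos(theta)]])
C = rng.standard_normal((d_obs, d_latent))               # latent-to-observation loading matrix

x = np.zeros((T, d_latent))
y = np.zeros((T, d_obs))
x[0] = rng.standard_normal(d_latent)
for t in range(1, T):
    x[t] = A @ x[t - 1] + 0.05 * rng.standard_normal(d_latent)   # x_t = A x_{t-1} + w_t
    y[t] = C @ x[t] + 0.1 * rng.standard_normal(d_obs)           # y_t = C x_t     + v_t

# PCA on y shows most variance concentrated in a 2-D subspace, as the LDS assumes
_, s, _ = np.linalg.svd(y - y.mean(0), full_matrices=False)
print((s ** 2 / (s ** 2).sum())[:4])
```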
The brain encodes information by neural spiking activities, which can be described by time series data as spike counts. Latent Variable Models (LVMs) are widely used to study the unknown factors (i.e. the latent states) that are dependent in a network structure to modulate neural spiking activities. Yet, challenges in performing experiments to re...
In this paper, we present an efficient framework to study the directional interactions within the multiple-input multiple-output (MIMO) biological neural network from spike-train data. We used an efficient generalized linear model (GLM) with Laguerre basis functions to model a MIMO neural system, and developed an Effective Connectivity Matrix (ECM)...
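A minimal sketch of a Laguerre-basis GLM for spike-train data is given below, assuming Python with NumPy and scikit-learn. The helpers laguerre_basis and history_design, the basis scale, and the toy data are illustrative assumptions, not the paper's exact model or its Effective Connectivity Matrix.

```python
import numpy as np
from numpy.polynomial import laguerre
from sklearn.linear_model import PoissonRegressor

def laguerre_basis(n_funcs, length, scale=5.0):
    """Laguerre functions e^(-x/2) L_j(x) sampled on a discrete lag axis."""
    lags = np.arange(length) / scale
    basis = np.empty((n_funcs, length))
    for j in range(n_funcs):
        coeffs = np.zeros(j + 1)
        coeffs[j] = 1.0
        basis[j] = np.exp(-lags / 2.0) * laguerre.lagval(lags, coeffs)
    return basis

def history_design(spike_trains, basis):
    """Convolve each input spike train with every basis function (causal lags)."""
    cols = [np.convolve(s, b, mode="full")[: len(s)] for s in spike_trains for b in basis]
    return np.column_stack(cols)

# toy example: one output neuron driven mainly by the first of two input neurons
rng = np.random.default_rng(0)
inputs = rng.binomial(1, 0.1, size=(2, 2000)).astype(float)
rate = 0.2 + 0.5 * np.convolve(inputs[0], np.ones(5), mode="full")[:2000]
output = rng.poisson(rate)

X = history_design(inputs, laguerre_basis(3, 50))
model = PoissonRegressor(alpha=1e-3).fit(X, output)
print(model.coef_.reshape(2, 3))   # per-input basis weights hint at the connectivity
```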
This paper presents an investigation into the cortico-muscular relationship during a grasping task by evaluating the information transfer between EEG and EMG signals. Information transfer was computed via a non-linear model-free measure, transfer entropy (TE). To examine the cross-frequency interaction, TEs were computed after the time series were...
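Transfer entropy can be estimated in several ways; the sketch below uses a simple plug-in (binned) estimator with order-1 histories to convey what is being computed, and it is not necessarily the estimator used in this paper.

```python
import numpy as np

def transfer_entropy(x, y, bins=4):
    """Plug-in (binned) estimate of TE from x to y with one-step histories."""
    edges = lambda s: np.quantile(s, np.linspace(0, 1, bins + 1))[1:-1]
    xd, yd = np.digitize(x, edges(x)), np.digitize(y, edges(y))
    yn, yp, xp = yd[1:], yd[:-1], xd[:-1]           # y_{t+1}, y_t, x_t

    p = np.zeros((bins, bins, bins))
    np.add.at(p, (yn, yp, xp), 1)
    p /= p.sum()                                    # joint p(y_{t+1}, y_t, x_t)

    p_ypxp = p.sum(axis=0, keepdims=True)           # p(y_t, x_t)
    p_ynyp = p.sum(axis=2, keepdims=True)           # p(y_{t+1}, y_t)
    p_yp = p.sum(axis=(0, 2), keepdims=True)        # p(y_t)

    with np.errstate(divide="ignore", invalid="ignore"):
        log_ratio = np.log2((p * p_yp) / (p_ypxp * p_ynyp))
        terms = np.where(p > 0, p * log_ratio, 0.0)
    return terms.sum()

# toy check: y is a noisy, lagged copy of x, so TE(x -> y) should exceed TE(y -> x)
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
y = np.roll(x, 1) + 0.5 * rng.standard_normal(5000)
print(transfer_entropy(x, y), transfer_entropy(y, x))
```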
In this letter, a Hierarchical Parametric Empirical Bayes (HPEB) model is proposed to fit spike count data. We have integrated Generalized Linear Models and empirical Bayes theory to simultaneously solve three problems: (1) over-dispersion of spike count values; (2) biased estimation of the maximum likelihood method and (3) difficulty in sampling f...
The amount of publicly accessible experimental data has gradually increased in recent years, which makes it possible to reconsider many longstanding questions in neuroscience. In this paper, an efficient framework is presented for reconstructing functional connectivity using experimental spike-train data. A modified generalized linear model (GLM) w...
As the amount of experimental data made publicly accessible has gradually increased in recent years, it is now possible to reconsider many of the longstanding questions in neuroscience. In this paper, we present an efficient framework for reconstructing the functional connectivity from the spike train data curated from the Collaborative Research in...
As the amount of experimental data made publicly accessible has gradually increased in recent years, it is now possible to reconsider many of the longstanding questions in neuroscience. In this paper, we present an efficient framework for reconstructing the functional connectivity from the spike train data curated from the Collaborative Research i...
Video object segmentation entails selecting and extracting objects of interest from a video sequence. Video Segmentation of Objects (VSO) is a critical task with many applications, such as video editing, video decomposition, and object recognition. The core of a VSO system consists of two major problems of computer vision, namely object segmentation...