Gim Hee Lee

Gim Hee Lee
National University of Singapore | NUS · Department of Computer Science

PhD in Computer Science

About

132
Publications
60,142
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
4,886
Citations
Additional affiliations
June 2014 - July 2015
Mitsubishi Electric Research Laboratories
Position
  • Member of Research Staff
January 2009 - June 2014
ETH Zurich
Position
  • Research Assistant

Publications

Publications (132)
Preprint
Full-text available
Current supervised cross-domain image retrieval methods can achieve excellent performance. However, the cost of data collection and labeling imposes an intractable barrier to practical deployment in real applications. In this paper, we investigate the unsupervised cross-domain image retrieval task, where class labels and pairing annotations are no...
Preprint
Full-text available
Since Intersection-over-Union (IoU) based optimization maintains the consistency of the final IoU prediction metric and losses, it has been widely used in both regression and classification branches of single-stage 2D object detectors. Recently, several 3D object detection methods adopt IoU-based optimization and directly replace the 2D IoU with 3D...
Preprint
In this paper, we consider the problem of domain generalization in semantic segmentation, which aims to learn a robust model using only labeled synthetic (source) data. The model is expected to perform well on unseen real (target) domains. Our study finds that the image style variation can largely influence the model's performance and the style fea...
Article
Deep learning-based approaches have shown remarkable performance in the 3D object detection task. However, they suffer from a catastrophic performance drop on the originally trained classes when incrementally learning new classes without revisiting the old data. This "catastrophic forgetting" phenomenon impedes the deployment of 3D object detection...
Preprint
Recent works on unsupervised domain adaptation (UDA) focus on the selection of good pseudo-labels as surrogates for the missing labels in the target data. However, source domain bias that deteriorates the pseudo-labels can still exist since the shared network of the source and target domains are typically used for the pseudo-label selections. The s...
Preprint
Content-based Video Retrieval (CBVR) is used on media-sharing platforms for applications such as video recommendation and filtering. To manage databases that scale to billions of videos, video-level approaches that use fixed-size embeddings are preferred due to their efficiency. In this paper, we introduce Video Region Attention Graph Networks (VRA...
Preprint
Incremental few-shot object detection aims at detecting novel classes without forgetting knowledge of the base classes with only a few labeled training data from the novel classes. Most related prior works are on incremental object detection that rely on the availability of abundant training samples per novel class that substantially limits the sca...
Preprint
In this paper, we study the task of synthetic-to-real domain generalized semantic segmentation, which aims to learn a model that is robust to unseen real-world scenes using only synthetic data. The large domain shift between synthetic and real-world data, including the limited source environmental variations and the large distribution gap between s...
Preprint
Despite recent success in incorporating learning into point cloud registration, many works focus on learning feature descriptors and continue to rely on nearest-neighbor feature matching and outlier filtering through RANSAC to obtain the final set of correspondences for pose estimation. In this work, we conjecture that attention mechanisms can repl...
Preprint
Full-text available
State-of-the-art approaches for 6D object pose estimation require large amounts of labeled data to train the deep networks. However, the acquisition of 6D object pose annotations is tedious and labor-intensive in large quantity. To alleviate this problem, we propose a weakly supervised 6D object pose estimation approach based on 2D keypoint detecti...
Preprint
Deep learning-based approaches have shown remarkable performance in the 3D object detection task. However, they suffer from a catastrophic performance drop on the originally trained classes when incrementally learning new classes without revisiting the old data. This "catastrophic forgetting" phenomenon impedes the deployment of 3D object detection...
Preprint
We introduce a new setting of Novel Class Discovery in Semantic Segmentation (NCDSS), which aims at segmenting unlabeled images containing new classes given prior knowledge from a labeled set of disjoint classes. In contrast to existing approaches that look at novel class discovery in image classification, we focus on the more challenging semantic...
Preprint
Full-text available
The environment of most real-world scenarios such as malls and supermarkets changes at all times. A pre-built map that does not account for these changes becomes out-of-date easily. Therefore, it is necessary to have an up-to-date model of the environment to facilitate long-term operation of a robot. To this end, this paper presents a general lifel...
Preprint
Most existing animal pose and shape estimation approaches reconstruct animal meshes with a parametric SMAL model. This is because the low-dimensional pose and shape parameters of the SMAL model makes it easier for deep networks to learn the high-dimensional animal meshes. However, the SMAL model is learned from scans of toy animals with limited pos...
Preprint
Transformation Synchronization is the problem of recovering absolute transformations from a given set of pairwise relative motions. Despite its usefulness, the problem remains challenging due to the influences from noisy and outlier relative motions, and the difficulty to model analytically and suppress them with high fidelity. In this work, we avo...
Preprint
Deep networks have shown remarkable results in the task of object detection. However, their performance suffers critical drops when they are subsequently trained on novel classes without any sample from the base classes originally used to train the model. This phenomenon is known as catastrophic forgetting. Recently, several incremental learning me...
Preprint
Deep learning technique has yielded significant improvements in point cloud completion with the aim of completing missing object shapes from partial inputs. However, most existing methods fail to recover realistic structures due to over-smoothing of fine-grained details. In this paper, we develop a voxel-based network for point cloud completion by...
Preprint
Full-text available
In this paper, we propose MINE to perform novel view synthesis and depth estimation via dense 3D reconstruction from a single image. Our approach is a continuous depth generalization of the Multiplane Images (MPI) by introducing the NEural radiance fields (NeRF). Given a single image as input, MINE predicts a 4-channel image (RGB and volume density...
Preprint
In this work, we introduce a new concept, named source-free open compound domain adaptation (SF-OCDA), and study it in semantic segmentation. SF-OCDA is more challenging than the traditional domain adaptation but it is more practical. It jointly considers (1) the issues of data privacy and data storage and (2) the scenario of multiple target domain...
Preprint
Full-text available
Steerable CNN imposes the prior knowledge of transformation invariance or equivariance in the network architecture to enhance the the network robustness on geometry transformation of data and reduce overfitting. It has been an intuitive and widely used technique to construct a steerable filter by augmenting a filter with its transformed copies in t...
Preprint
Full-text available
This paper presents DeepI2P: a novel approach for cross-modality registration between an image and a point cloud. Given an image (e.g. from a rgb-camera) and a general point cloud (e.g. from a 3D Lidar scanner) captured at different locations in the same scene, our method estimates the relative rigid transformation between the coordinate frames of...
Preprint
Bottom-up approaches for image-based multi-person pose estimation consist of two stages: (1) keypoint detection and (2) grouping of the detected keypoints to form person instances. Current grouping approaches rely on learned embedding from only visual features that completely ignore the spatial configuration of human poses. In this work, we formula...
Preprint
Existing approaches for multi-view multi-person 3D pose estimation explicitly establish cross-view correspondences to group 2D pose detections from multiple camera views and solve for the 3D pose estimation for each person. Establishing cross-view correspondences is challenging in multi-person scenes, and incorrect correspondences will lead to sub-...
Preprint
Animal pose estimation is an important field that has received increasing attention in the recent years. The main challenge for this task is the lack of labeled data. Existing works circumvent this problem with pseudo labels generated from data of other easily accessible domains such as synthetic data. However, these pseudo labels are noisy even wi...
Preprint
We propose a method for detecting structural changes in a city using images captured from vehicular mounted cameras over traversals at two different times. We first generate 3D point clouds for each traversal from the images and approximate GNSS/INS readings using Structure-from-Motion (SfM). A direct comparison of the two point clouds for change d...
Chapter
Current works on multi-person 3D pose estimation mainly focus on the estimation of the 3D joint locations relative to the root joint and ignore the absolute locations of each pose. In this paper, we propose the Human Depth Estimation Network (HDNet), an end-to-end framework for absolute root joint localization in the camera coordinate space. Our HD...
Chapter
Epipolar constraints are at the core of feature matching and depth estimation in current multi-person multi-camera 3D human pose estimation methods. Despite the satisfactory performance of this formulation in sparser crowd scenes, its effectiveness is frequently challenged under denser crowd circumstances mainly due to two sources of ambiguity. The...
Chapter
We present a novel learning approach to recover the 6D poses and sizes of unseen object instances from an RGB-D image. To handle the intra-class shape variation, we propose a deep network to reconstruct the 3D object model by explicitly modeling the deformation from a pre-learned categorical shape prior. Additionally, our network infers the dense c...
Chapter
The \(\mathrm {SE}(3)\) invariants of a pose include its rotation angle and screw translation. In this paper, we present a complete comprehensive study of the relative pose estimation problem for a calibrated camera constrained by known \(\mathrm {SE}(3)\) invariant, which involves 5 minimal problems in total. These problems reduces the minimal num...
Preprint
Point clouds are often sparse and incomplete, which imposes difficulties for real-world applications, such as 3D object classification, detection and segmentation. Existing shape completion methods tend to generate coarse shapes of objects without fine-grained details. Moreover, current approaches require fully-complete ground truth, which are diff...
Preprint
3D human pose estimation from a single image is an inverse problem due to the inherent ambiguity of the missing depth. Several previous works addressed the inverse problem by generating multiple hypotheses. However, these works are strongly supervised and require ground truth 2D-to-3D correspondences which can be difficult to obtain. In this paper,...
Preprint
In view of the difficulty in reconstructing object details in point cloud completion, we propose a shape prior learning method for object completion. The shape priors include geometric information in both complete and the partial point clouds. We design a feature alignment strategy to learn the shape prior from complete points, and a coarse to fine...
Preprint
Epipolar constraints are at the core of feature matching and depth estimation in current multi-person multi-camera 3D human pose estimation methods. Despite the satisfactory performance of this formulation in sparser crowd scenes, its effectiveness is frequently challenged under denser crowd circumstances mainly due to two sources of ambiguity. The...
Preprint
Current works on multi-person 3D pose estimation mainly focus on the estimation of the 3D joint locations relative to the root joint and ignore the absolute locations of each pose. In this paper, we propose the Human Depth Estimation Network (HDNet), an end-to-end framework for absolute root joint localization in the camera coordinate space. Our HD...
Preprint
We present a novel learning approach to recover the 6D poses and sizes of unseen object instances from an RGB-D image. To handle the intra-class shape variation, we propose a deep network to reconstruct the 3D object model by explicitly modeling the deformation from a pre-learned categorical shape prior. Additionally, our network infers the dense c...
Preprint
Full-text available
The $\mathrm{SE}(3)$ invariants of a pose include its rotation angle and screw translation. In this paper, we present a complete comprehensive study of the relative pose estimation problem for a calibrated camera constrained by known $\mathrm{SE}(3)$ invariant, which involves 5 minimal problems in total. These problems reduces the minimal number of...
Preprint
Many existing approaches for point cloud semantic segmentation are strongly supervised. These strongly supervised approaches heavily rely on a large amount of labeled training data that is difficult to obtain and suffer from poor generalization to new classes. To mitigate these limitations, we propose a novel attention-aware multi-prototype transdu...
Article
Full-text available
The problem of localization on a geo-referenced satellite map given a query ground view image is useful yet remains challenging due to the drastic change in viewpoint. To this end, in this paper we work on the extension of our earlier work on the Cross-View Matching Network (CVM-Net) (Hu et al. in IEEE conference on computer vision and pattern reco...
Preprint
Point cloud analysis has received much attention recently; and segmentation is one of the most important tasks. The success of existing approaches is attributed to deep network design and large amount of labelled training data, where the latter is assumed to be always available. However, obtaining 3d point cloud segmentation labels is often very co...
Preprint
Point clouds are often sparse and incomplete. Existing shape completion methods are incapable of generating details of objects or learning the complex point distributions. To this end, we propose a cascaded refinement network together with a coarse-to-fine strategy to synthesize the detailed object shapes. Considering the local details of partial i...
Preprint
Full-text available
Iterative Closest Point (ICP) solves the rigid point cloud registration problem iteratively in two steps: (1) make hard assignments of spatially closest point correspondences, and then (2) find the least-squares rigid transformation. The hard assignments of closest point correspondences based on spatial distances are sensitive to the initial rigid...
Preprint
Full-text available
Accurate 6D object pose estimation is fundamental to robotic manipulation and grasping. Previous methods follow a local optimization approach which minimizes the distance between closest point pairs to handle the rotation ambiguity of symmetric objects. In this work, we propose a novel discrete-continuous formulation for rotation regression to reso...
Preprint
The performance of existing point cloud-based 3D object detection methods heavily relies on large-scale high-quality 3D annotations. However, such annotations are often tedious and expensive to collect. Semi-supervised learning is a good alternative to mitigate the data annotation issue, but has remained largely unexplored in 3D object detection. I...
Conference Paper
Full-text available
Self-calibration of camera intrinsics and radial distortion has a long history of research in the computer vision community. However, it remains rare to see real applications of such techniques to modern Simultaneous Localization And Mapping (SLAM) systems, especially in driving scenarios. In this paper, we revisit the geometric approach to this pr...
Preprint
Identifiability, or recovery of the true latent representations from which the observed data originates, is a fundamental goal of representation learning. However, most deep generative models do not address the question of identifiability, and cannot recover the true latent sources that generate the observations. Recent work proposed identifiable g...
Preprint
Existing deep learning approaches on 3d human pose estimation for videos are either based on Recurrent or Convolutional Neural Networks (RNNs or CNNs). However, RNN-based frameworks can only tackle sequences with limited frames because sequential models are sensitive to bad frames and tend to drift over long sequences. Although existing CNN-based t...
Preprint
In this paper, we present the PS^2-Net -- a locally and globally aware deep learning framework for semantic segmentation on 3D scene-level point clouds. In order to deeply incorporate local structures and global context to support 3D scene segmentation, our network is built on four repeatedly stacked encoders, where each encoder has two basic compo...
Preprint
Full-text available
Self-calibration of camera intrinsics and radial distortion has a long history of research in the computer vision community. However, it remains rare to see real applications of such techniques to modern Simultaneous Localization And Mapping (SLAM) systems, especially in driving scenarios. In this paper, we revisit the geometric approach to this pr...
Preprint
Full-text available
Motivated by the success of encoding multi-scale contextual information for image analysis, we present our PointAtrousGraph (PAG) - a deep permutation-invariant hierarchical encoder-decoder architecture for learning multi-scale edge features in unorganized 3D points. Our PAG is constructed by several novel modules, such as point atrous convolution,...
Conference Paper
Full-text available
Existing deep learning approaches on 3d human pose estimation for videos are either based on Recurrent or Convolutional Neural Networks (RNNs or CNNs). However, RNN-based frameworks can only tackle sequences with limited frames because sequential models are sensitive to bad frames and tend to drift over long sequences. Although existing CNN-based t...
Preprint
Outlier feature matches and loop-closures that survived front-end data association can lead to catastrophic failures in the back-end optimization of large-scale point cloud based 3D reconstruction. To alleviate this problem, we propose a probabilistic approach for robust back-end optimization in the presence of outliers. More specifically, we model...
Conference Paper
Full-text available
Large-scale point cloud generated from 3D sensors is more accurate than its image-based counterpart. However, it is seldom used in visual pose estimation due to the difficulty in obtaining 2D-3D image to point cloud correspondences. In this paper, we propose the 2D3D-MatchNet-an end-to-end deep network architecture to jointly learn the descriptors...
Preprint
Full-text available
We investigate the direction of training a 3D object detector for new object classes from only 2D bounding box labels of these new classes, while simultaneously transferring information from 3D bounding box labels of the existing classes. To this end, we propose a transferable semi-supervised 3D object detection model that learns a 3D object detect...
Preprint
Large-scale point cloud generated from 3D sensors is more accurate than its image-based counterpart. However, it is seldom used in visual pose estimation due to the difficulty in obtaining 2D-3D image to point cloud correspondences. In this paper, we propose the 2D3D-MatchNet - an end-to-end deep network architecture to jointly learn the descriptor...
Preprint
3D human pose estimation from a monocular image or 2D joints is an ill-posed problem because of depth ambiguity and occluded joints. We argue that 3D human pose estimation from a monocular input is an inverse problem where multiple feasible solutions can exist. In this paper, we propose a novel approach to generate multiple feasible hypotheses of t...
Conference Paper
Full-text available
Outlier feature matches and loop-closures that survived front-end data association can lead to catastrophic failures in the back-end optimization of large-scale point cloud based 3D reconstruction. To alleviate this problem, we propose a probabilistic approach for robust back-end optimization in the presence of outliers. More specifically, we model...
Conference Paper
Full-text available
3D human pose estimation from a monocular image or 2D joints is an ill-posed problem because of depth ambiguity and occluded joints. We argue that 3D human pose estimation from a monocular input is an inverse problem where multiple feasible solutions can exist. In this paper, we propose a novel approach to generate multiple feasible hypotheses of t...
Preprint
Full-text available
Despite the recent active research on processing point clouds with deep networks, few attention has been on the sensitivity of the networks to rotations. In this paper, we propose a deep learning architecture that achieves discrete $\mathbf{SO}(2)$/$\mathbf{SO}(3)$ rotation equivariance for point cloud recognition. Specifically, the rotation of an...
Preprint
Full-text available
In this paper, we propose the USIP detector: an Unsupervised Stable Interest Point detector that can detect highly repeatable and accurately localized keypoints from 3D point clouds under arbitrary transformations without the need for any ground truth training data. Our USIP detector consists of a feature proposal network that learns stable keypoin...
Preprint
Full-text available
In this paper, we develop a modified differential Structure from Motion (SfM) algorithm that can estimate relative pose from two consecutive frames despite of Rolling Shutter (RS) artifacts. In particular, we show that under constant velocity assumption, the errors induced by the rolling shutter effect can be easily rectified by a linear scaling op...
Preprint
Full-text available
The problem of localization on a geo-referenced satellite map given a query ground view image is useful yet remains challenging due to the drastic change in viewpoint. To this end, in this paper we work on the extension of our earlier work on the Cross-View Matching Network (CVM-Net) \cite{Hu2018} for the ground-to-aerial image matching task since...
Preprint
Full-text available
Many existing translation averaging algorithms are either sensitive to disparate camera baselines and have to rely on extensive preprocessing to improve the observed Epipolar Geometry graph, or if they are robust against disparate camera baselines, require complicated optimization to minimize the highly nonlinear angular error objective. In this pa...