Source publication
This work investigates the problem of 6-Degrees-Of-Freedom (6-DOF) object tracking from RGB-D images, where the object is rigid and a 3D model of the object is known. As in many previous works, we utilize a Particle Filter (PF) framework. In order to have a fast tracker, the key aspect is to design a clever proposal distribution which works reliably...
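The abstract is truncated above, but the particle-filter skeleton such trackers build on is standard. Below is a minimal sketch of one 6-DOF filtering step, assuming a generic Gaussian random-walk proposal and a user-supplied likelihood; the paper's contribution, a cleverer proposal distribution, is deliberately not reproduced here.

```python
import numpy as np

def resample(particles, weights, rng):
    """Systematic resampling: draw N particles proportional to weight."""
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    idx = np.searchsorted(np.cumsum(weights), positions)
    return particles[np.clip(idx, 0, n - 1)]

def pf_step(particles, observation, likelihood, rng,
            noise_rot=0.02, noise_trans=0.005):
    """One update of a vanilla 6-DOF particle filter.

    particles  : (N, 6) array, rows = (rx, ry, rz, tx, ty, tz),
                 axis-angle rotation + translation.
    likelihood : callable(pose, observation) -> unnormalised score
                 (hypothetical placeholder supplied by the caller).
    """
    # Proposal: a plain Gaussian random walk. The paper's key idea is a
    # smarter, observation-driven proposal, not reproduced here.
    noise = rng.normal(size=particles.shape)
    noise[:, :3] *= noise_rot
    noise[:, 3:] *= noise_trans
    particles = particles + noise

    # Weight each particle by the observation likelihood and normalise.
    weights = np.array([likelihood(p, observation) for p in particles])
    weights = weights / weights.sum()

    # Pose estimate = weighted mean; resample to avoid degeneracy.
    estimate = weights @ particles
    return resample(particles, weights, rng), estimate
```

In practice, the likelihood would score how well a rendering of the known 3D model under a particle's pose matches the observed RGB-D frame.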
Citations
... Zhong et al. [56] used the Rigid Pose dataset for their evaluation, together with the ACCV14 dataset [57], an RGB-D dataset. The Princeton dataset [41], another RGB-D dataset, was used by Rasoulidanesh et al. [40] to evaluate their method for object tracking with depth. ...
Object tracking is one of the most important problems in computer vision applications such as robotics, autonomous driving, and pedestrian movement analysis. There has been significant development in camera hardware, with researchers experimenting with the fusion of different sensors and developing image processing algorithms to track objects. Image processing and deep learning methods have progressed significantly in the last few decades, and different data association methods accompanied by them are becoming crucial in object tracking tasks. The data requirements of deep learning methods have led to different public datasets that allow researchers to benchmark their methods. While object tracking methods, technology, and the availability of annotated object tracking datasets have improved, there is still scope for improvement. This review contributes by systematically identifying different sensor equipment, datasets, methods, and applications, providing a taxonomy of the literature together with the strengths and limitations of different approaches, and thereby offering guidelines for selecting equipment, methods, and applications. Research questions and future directions to address unresolved issues in the object tracking field are also presented.
... The dense 2D-3D correspondence is established by predicting the 3D coordinates of each object pixel or by predicting a dense UV map. Prior to deep learning, some methods used random forests to predict the dense coordinates of objects [7,31,32]. [33] extended the standard random forest to a contextual regression framework to iteratively reduce the uncertainty of the predicted object coordinates. However, this approach has poor performance because it can only handle limited features in simple scenes. ...
The current challenges in learning a robust 6D pose lie in the noise in RGB/RGB-D images, the sparsity of point clouds, and severe occlusion. To tackle these problems, object geometric information is critical. In this work, we present a novel pipeline for 6DoF object pose estimation. Unlike previous methods that directly regress pose parameters or predict keypoints, we tackle this challenging task with a point-pair based approach and leverage geometric information as much as possible. Specifically, at the representation learning stage, we build a point cloud network with a locally modeling CNN to encode the point cloud, which extracts effective geometric features while the point cloud is projected into a high-dimensional space. Moreover, we design a coordinate conversion network to regress the point cloud into the object coordinate system in a decoded way. The pose can then be computed with a point-pair matching algorithm. Experimental results show that our method achieves state-of-the-art performance on several datasets.
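The final step above, computing the pose from matched point pairs, can be illustrated with the classic Kabsch/SVD alignment of corresponding 3D points; this is a generic stand-in, not necessarily the paper's exact point-pair matching algorithm.

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) with dst ≈ R @ src + t.

    src : (N, 3) points in the object coordinate system
          (e.g. regressed by a coordinate conversion network).
    dst : (N, 3) corresponding points in the camera frame.
    """
    src_c = src - src.mean(axis=0)
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c                      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force det(R) = +1 so R is a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t
```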
... Besides single-frame pose estimation, many recent works focus on the temporal tracking of object poses. Instance-level object pose tracking approaches include optimization [66,81,106,85], filtering [96,15,46,41,16], and direct regression of inter-frame pose change [94]. Recent works on category-level object pose tracking can discover category-level keypoints [88] without a known CAD model at test time, refine coarse poses from keypoint registration via pose graph optimization [93], and learn inter-frame pose change from canonicalized point clouds [95]. ...
In this work, we tackle the challenging task of jointly tracking hand and object poses and reconstructing their shapes from depth point cloud sequences in the wild, given the initial poses at frame 0. We propose, for the first time, a point cloud based hand joint tracking network, HandTrackNet, to estimate the inter-frame hand joint motion. HandTrackNet introduces a novel hand pose canonicalization module to ease the tracking task, yielding accurate and robust hand joint tracking. Our pipeline then reconstructs the full hand by converting the predicted hand joints into the template-based parametric hand model MANO. For object tracking, we devise a simple yet effective module that estimates the object SDF from the first frame and performs optimization-based tracking. Finally, a joint optimization step performs joint hand-object reasoning, which alleviates occlusion-induced ambiguity and further refines the hand pose. During training, the whole pipeline sees only purely synthetic data, synthesized with sufficient variation and with depth simulation for ease of generalization. The pipeline is robust to the generalization gap and thus directly transferable to real in-the-wild data. We evaluate our method on two real hand-object interaction datasets, HO3D and DexYCB, without any fine-tuning. Our experiments demonstrate that the proposed method significantly outperforms previous state-of-the-art depth-based hand and object pose estimation and tracking methods, running at a frame rate of 9 FPS.
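As a toy illustration of the optimization-based SDF tracking module mentioned above, the following sketch fits a pose so that observed depth points fall on the object's zero level set. It uses an analytic sphere SDF and SciPy's least-squares solver purely for demonstration; the names and parameterization are illustrative, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def sphere_sdf(p, radius=0.05):
    """Signed distance to a sphere at the origin. A real tracker would
    interpolate an SDF grid estimated from the first frame."""
    return np.linalg.norm(p, axis=-1) - radius

def residuals(pose6, points_cam):
    """SDF values of observed points mapped into the object frame.
    pose6 = (rx, ry, rz, tx, ty, tz): object-to-camera pose."""
    R = Rotation.from_rotvec(pose6[:3]).as_matrix()
    t = pose6[3:]
    points_obj = (points_cam - t) @ R        # inverse rigid transform
    return sphere_sdf(points_obj)            # zero on the surface

def track_frame(prev_pose6, points_cam):
    """Refine the previous pose so the observed depth points lie on
    the object surface (optimization-based tracking)."""
    return least_squares(residuals, prev_pose6, args=(points_cam,)).x
```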
... However, the annotations are mainly focused on the hand joint positions, and only a small proportion of the objects are provided with their meshes and 3D poses. Probably most related to ours is the dataset of [28], which consists of three objects recorded with Kinect sensors. Our dataset differs in three main aspects. First, we obtain the reference 6D pose from an HTC Vive controller attached to the object, whereas [28] used manual annotation. Second, [28] provides 3,187 images in total, whereas our whole dataset contains more than 100,000 images. Third, [28] lacks variability across several subjects, left and right hands, different camera views, tasks, or clutter in the scene. ...
This paper introduces a dataset for training and evaluating methods for 6D pose estimation of hand-held tools in task demonstrations captured by a standard RGB camera. Despite the significant progress of 6D pose estimation methods, their performance is usually limited for heavily occluded objects, which is a common case in imitation learning where the object is typically partially occluded by the manipulating hand. Currently, there is a lack of datasets that would enable the development of robust 6D pose estimation methods for these conditions. To overcome this problem, we collect a new dataset (Imitrob) aimed at 6D pose estimation in imitation learning and other applications where a human holds a tool and performs a task. The dataset contains image sequences of three different tools and six manipulation tasks with two camera viewpoints, four human subjects, and left/right hand. Each image is accompanied by an accurate ground truth measurement of the 6D object pose, obtained by the HTC Vive motion tracking device. The use of the dataset is demonstrated by training and evaluating a recent 6D object pose estimation method (DOPE) in various setups. The dataset and code are publicly available at http://imitrob.ciirc.cvut.cz/imitrobdataset.php.
... Instead of relying on a single image for absolute pose estimation, tracking methods exploit temporal information. While earlier methods were prone to fail in the presence of heavy occlusion and clutter [23,7,1], data-driven methods have been proposed to learn more robust features by using Random Forests [22,43,44]. This problem has been recently formulated under a deep learning framework, where a network is trained to regress the pose difference between image pairs extracted from RGB [8,4] or RGB-D videos [12,13,58,49]. ...
Estimating the relative pose of a new object without prior knowledge is a hard problem, yet this ability is much needed in robotics and Augmented Reality. We present a method for tracking the 6D motion of objects in RGB video sequences when neither the training images nor the 3D geometry of the objects are available. In contrast to previous works, our method can therefore consider unknown objects in the open world instantly, without requiring any prior information or a specific training phase. We consider two architectures, one based on two frames and the other relying on a Transformer Encoder, which can exploit an arbitrary number of past frames. We train our architectures using only synthetic renderings with domain randomization. Our results on challenging datasets are on par with previous works that require much more information (training images of the target objects, 3D models, and/or depth data). Our source code is available at https://github.com/nv-nguyen/pizza
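For intuition, the two-frame variant can be sketched as a network that stacks consecutive frames and regresses a pose-change vector. This is a hypothetical toy architecture; the actual network and output parameterization are design choices of the paper and are not reproduced here.

```python
import torch
import torch.nn as nn

class TwoFramePoseDelta(nn.Module):
    """Minimal two-frame network: stack consecutive RGB frames along
    the channel axis and regress a pose-change vector (dimension left
    generic; real methods pick a specific rotation/translation
    parameterization)."""
    def __init__(self, out_dim=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, out_dim)

    def forward(self, frame_prev, frame_curr):
        # (B, 3, H, W) + (B, 3, H, W) -> (B, 6, H, W)
        x = torch.cat([frame_prev, frame_curr], dim=1)
        return self.head(self.encoder(x).flatten(1))
```

Such a network would be trained, as the abstract notes, purely on synthetic renderings with domain randomization, supervised by the known inter-frame pose change.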
... Thus, exclusive evaluation on such datasets cannot entirely reflect the attributes of a 6D object pose tracking approach. Other datasets [37,145] collected video sequences where objects are manipulated by a human hand. Nevertheless, human arm and hand motions can greatly vary from those of robots. ...
This thesis deals with object pose estimation and tracking to solve robot manipulation tasks. It aims to address uncertainty due to dynamics and to generalize to novel object instances by reducing the dependency on either instance- or category-level 3D models. Robot object manipulation often requires reasoning about object poses given visual data. For instance, pose estimation can be used to initiate pick-and-drop manipulation and has been studied extensively. Purposeful manipulation, however, such as precise assembly or within-hand re-orientation, requires sustained reasoning about an object's state, since dynamic effects due to contacts and slippage may alter the relative configuration between the object and the robotic hand. This motivates the temporal tracking of object poses over image sequences, which reduces computational latency while maintaining or even enhancing pose quality relative to single-shot pose estimation. Most existing techniques in this domain assume instance-level 3D models. This complicates generalization to novel, unseen instances and thus hinders deployment to novel environments. Even if instance-level 3D models are unavailable, however, it may be possible to access category-level models. Thus, it is desirable to learn category-level priors, which can be used for the visual understanding of novel, unknown object instances. In the most general case, where the robot has to deal with out-of-distribution instances or cannot access category-level priors, object-agnostic perception methods are needed. Given this context, this thesis proposes a category-level representation, called NUNOCS, to unify the representation of various intra-class object instances and facilitate the transfer of category-level knowledge across such instances. This work also integrates the strengths of both modern deep learning and pose graph optimization to achieve generalizable object tracking in the SE(3) space without needing either instance- or category-level 3D models. When instance-level object models are available, a synthetic data generation pipeline is developed to learn the relative motion along manifolds by reasoning over image residuals. This achieves state-of-the-art SE(3) pose tracking results while circumventing manual effort in data collection and annotation. The thesis also demonstrates that the developed object-tracking solutions efficiently address multiple manipulation challenges. Specifically, it starts from a single-image object pose estimation approach that deals with severe occlusions during manipulation. It then moves to long-term object pose tracking via reasoning over image residuals between consecutive frames, while training exclusively on synthetic data. In the case of object tracking along a video sequence, the dependency on either instance-level or category-level CAD models is reduced by leveraging multi-view consistency, in the form of a memory-augmented pose graph optimization, to achieve spatio-temporal consistency. For initializing pose estimates in video sequences involving novel unseen objects, category-level priors are extracted by taking advantage of easily accessible virtual 3D model databases. Following these ideas, frameworks for category-level, task-relevant grasping and vision-based, closed-loop manipulation are developed, which solve complicated, high-precision tasks.
The learning process is scalable, as training is performed exclusively on synthetic data or through a robot's self-interaction process conducted solely in simulation. The proposed methods are evaluated first on public computer vision benchmarks, boosting the previous state-of-the-art tracking accuracy from 33.3% to 87.4% on the NOCS dataset, despite reducing the dependency on category-level 3D models for training. When applied to real robotic setups, they significantly improve category-level manipulation performance, validating their effectiveness and robustness. In addition, this thesis unlocks and demonstrates multiple complex manipulation skills in open-world environments, despite limited input assumptions such as training solely on synthetic data, dealing with novel unknown objects, or learning from a single visual demonstration.
... Another relatively new development is the availability of affordable depth cameras that measure the surface distance for each pixel. While purely depth-based object tracking is possible, most methods (Ren et al., 2017; Kehl et al., 2017; Tan et al., 2017; Krull et al., 2015; Krainin et al., 2011) combine information from both depth and RGB cameras. In general, this leads to superior results. ...
Region-based methods have become increasingly popular for model-based, monocular 3D tracking of texture-less objects in cluttered scenes. However, while they achieve state-of-the-art results, most methods are computationally expensive, requiring significant resources to run in real-time. In the following, we build on our previous work and develop SRT3D, a sparse region-based approach to 3D object tracking that bridges this gap in efficiency. Our method considers image information sparsely along so-called correspondence lines that model the probability of the object’s contour location. We thereby improve on the current state of the art and introduce smoothed step functions that consider a defined global and local uncertainty. For the resulting probabilistic formulation, a thorough analysis is provided. Finally, we use a pre-rendered sparse viewpoint model to create a joint posterior probability for the object pose. The function is maximized using second-order Newton optimization with Tikhonov regularization. During the pose estimation, we differentiate between global and local optimization, using a novel approximation for the first-order derivative employed in the Newton method. In multiple experiments, we demonstrate that the resulting algorithm improves the current state of the art both in terms of runtime and quality, performing particularly well for noisy and cluttered images encountered in the real world.
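The smoothed step functions mentioned above can be illustrated with a tanh-based sketch, in which a slope parameter models local uncertainty of the contour location and an amplitude parameter models global uncertainty. The parameterization below is illustrative, not the exact functions from the paper.

```python
import numpy as np

def smoothed_step(d, s=1.0, alpha=0.99):
    """Probability that a pixel at signed contour distance d (in
    pixels, positive = inside the object) belongs to the foreground.

    s     : slope, models *local* uncertainty of the contour location.
    alpha : amplitude < 1, keeps probabilities away from 0/1 to model
            *global* uncertainty (illustrative parameterization).
    """
    step = 0.5 + 0.5 * np.tanh(d / (2.0 * s))
    return 0.5 + alpha * (step - 0.5)

def pixel_likelihood(d, pf, pb, s=1.0, alpha=0.99):
    """Likelihood of one pixel on a correspondence line: mix the
    color-based foreground/background probabilities pf, pb (e.g. from
    per-object color histograms) by the smoothed step membership."""
    hf = smoothed_step(d, s, alpha)
    return hf * pf + (1.0 - hf) * pb
```

Multiplying such per-pixel terms along each correspondence line yields a posterior over the contour location, which a tracker of this kind maximizes with respect to the pose.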
... The dense 2D-3D correspondences are obtained by predicting the 3D object coordinate of each object pixel or by predicting dense UV maps. Before deep learning became popular, early works usually used random forests to predict object coordinates [6,74,108]. Brachmann et al. [7] extended the standard random forest to an auto-context regression framework, which iteratively reduces the uncertainty of the predicted object coordinates. ...
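Given such dense 2D-3D correspondences, the pose is conventionally recovered with RANSAC-PnP, for example via OpenCV. A minimal sketch, assuming arrays of matched pixel locations and predicted object coordinates:

```python
import cv2
import numpy as np

def pose_from_object_coordinates(pixels_2d, coords_3d, K):
    """Recover the 6D pose from dense 2D-3D correspondences.

    pixels_2d : (N, 2) pixel locations of object pixels.
    coords_3d : (N, 3) predicted 3D object coordinates
                (e.g. output of a random forest or CNN).
    K         : (3, 3) camera intrinsic matrix.
    """
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        coords_3d.astype(np.float32),
        pixels_2d.astype(np.float32),
        K.astype(np.float32),
        distCoeffs=None,
        reprojectionError=3.0,   # inlier threshold in pixels
        iterationsCount=200,
    )
    if not ok:
        raise RuntimeError("PnP failed to find a pose")
    R, _ = cv2.Rodrigues(rvec)   # axis-angle -> rotation matrix
    return R, tvec.ravel()
```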
Object pose detection and tracking has recently attracted increasing attention due to its wide applications in many areas, such as autonomous driving, robotics, and augmented reality. Among methods for object pose detection and tracking, deep learning is the most promising, having shown better performance than the others. However, a survey of the latest developments in deep learning-based methods is lacking. Therefore, this study presents a comprehensive review of recent progress in object pose detection and tracking along the deep learning technical route. For a more thorough introduction, the scope of this study is limited to methods taking monocular RGB/RGB-D data as input and covers three major tasks: instance-level monocular object pose detection, category-level monocular object pose detection, and monocular object pose tracking. In our work, metrics, datasets, and methods for both detection and tracking are presented in detail. Comparative results of current state-of-the-art methods on several publicly available datasets are also presented, together with insightful observations and inspiring future research directions.
... Apart from correspondence points and ICP, methods that utilize signed distance functions are often used [18,47,48,54]. In addition, approaches that employ particle filters [10,11,30,69] or robust Gaussian filters [27] instead of gradient-based optimization are also very popular. ...
Tracking objects in 3D space and predicting their 6DoF pose is an essential task in computer vision. State-of-the-art approaches often rely on object texture to tackle this problem. However, while they achieve impressive results, many objects do not contain sufficient texture, violating the main underlying assumption. In the following, we thus propose ICG, a novel probabilistic tracker that fuses region and depth information and only requires the object geometry. Our method deploys correspondence lines and points to iteratively refine the pose. We also implement robust occlusion handling to improve performance in real-world settings. Experiments on the YCB-Video, OPT, and Choi datasets demonstrate that, even for textured objects, our approach outperforms the current state of the art with respect to accuracy and robustness. At the same time, ICG shows fast convergence and outstanding efficiency, requiring only 1.3 ms per frame on a single CPU core. Finally, we analyze the influence of individual components and discuss our performance compared to deep learning-based methods. The source code of our tracker is publicly available.
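While ICG's probabilistic fusion of region and depth cues is more involved, the depth side of such refinements commonly builds on point-to-plane residuals. A generic single Gauss-Newton step is shown below purely as background, not as ICG's actual formulation.

```python
import numpy as np

def point_to_plane_step(src, dst, normals):
    """One Gauss-Newton step of point-to-plane alignment.

    src     : (N, 3) model points under the current pose estimate.
    dst     : (N, 3) corresponding measured depth points.
    normals : (N, 3) unit surface normals at dst.

    Returns a small twist (wx, wy, wz, tx, ty, tz) that updates the
    pose so src moves toward the tangent planes at dst.
    """
    r = np.einsum('ij,ij->i', dst - src, normals)     # signed distances
    J = np.hstack([np.cross(src, normals), normals])  # (N, 6) Jacobian
    # Normal equations with slight damping for numerical robustness.
    H = J.T @ J + 1e-6 * np.eye(6)
    return np.linalg.solve(H, J.T @ r)
```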