Fig 6 - uploaded by Henning Tjaden
The two depth map types used within our approach, where brighter pixels are closer to the camera. Left: The usual depth map $I_d$ corresponding to the closest surface points. Right: The reverse depth map $I_d^r$ corresponding to the most distant surface points.
Source publication
We propose an algorithm for real-time 6DOF pose tracking of rigid 3D objects using a monocular RGB camera. The key idea is to derive a region-based cost function using temporally consistent local color histograms. While such region-based cost functions are commonly optimized using first-order gradient descent techniques, we systematically derive a...
Context in source publication
Context 1
... [26] it has been shown that it is beneficial not only to consider the points on the surface closest to the camera but also the most distant ones (on the backside of the object) for pose optimization. In order to obtain the respective coordinates for each pixel, we compute an additional reverse depth map $I_d^r$, for which we simply invert the OpenGL depth check used to compute the corresponding Z-buffer (see Figure 6). Given $I_d^r$, the farthest surface point $X(x, I_d^r)$ corresponding to a pixel $x$ is also recovered as X(x, ...
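The idea of keeping the closest and the most distant surface point per pixel can be sketched without OpenGL. The snippet below is only an illustration under stated assumptions: `depth_maps` and `back_project` are hypothetical helper names, the input is a list of already projected surface samples `(u, v, z)`, and the pinhole back-projection with intrinsics `fx, fy, cx, cy` is a standard stand-in, since the recovery formula in the excerpt above is truncated.

```python
def depth_maps(samples, width, height):
    """Return (nearest, farthest) depth maps from projected surface samples.

    samples: iterable of (u, v, z) with integer pixel coordinates and depth z.
    Pixels that no sample covers stay None.
    """
    nearest = [[None] * width for _ in range(height)]
    farthest = [[None] * width for _ in range(height)]
    for u, v, z in samples:
        if not (0 <= u < width and 0 <= v < height):
            continue  # sample falls outside the image
        # Usual depth test: keep the sample if it is closer (cf. GL_LESS).
        if nearest[v][u] is None or z < nearest[v][u]:
            nearest[v][u] = z
        # Inverted depth test: keep the sample if it is farther (cf. GL_GREATER).
        if farthest[v][u] is None or z > farthest[v][u]:
            farthest[v][u] = z
    return nearest, farthest


def back_project(u, v, z, fx, fy, cx, cy):
    """Assumed pinhole model: map pixel (u, v) with depth z to camera coordinates."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)


# Two samples fall on pixel (1, 1): the front side at z=2 and the back side at z=5.
samples = [(1, 1, 2.0), (1, 1, 5.0), (0, 0, 3.0)]
near, far = depth_maps(samples, width=2, height=2)
front_point = back_project(1, 1, near[1][1], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
back_point = back_project(1, 1, far[1][1], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
```

In a real renderer the same effect would come from the depth comparison function of the graphics API rather than from a Python loop; the loop only makes the two comparisons explicit.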
Similar publications
For detection and tracking of a near space hypersonic maneuvering target in a heavy clutter environment, a novel transform domain method called Variable-Diameter-Arc-Helix Radon Transform (VDAH-RT) is proposed as a tool to integrate observations along the maneuvering path. Considering the 3D maneuvering trajectory is like a variable diameter arc he...
Citations
... Finally, the intermediate pose is refined using a region-based method. Evaluation experiments, using both synthetic and real images in comparison to recently published representative methods of [3][4][5], were performed extensively. The results indicate that the proposed method is able to achieve performance superior to state-of-the-art methods, especially for objects with large pose shifts. ...
... Tjaden et al. [23] introduced a novel localized model using the temporally consistent local color histograms to preserve temporal consistency. In [3], the authors summarized their previous work [21,23] and introduced a novel iteratively reweighted Gauss-Newton optimization method. Region-based methods with localized models [3,5,9,21,23-25] only use the pixels within a limited band along the projected object contour, and are therefore prone to failure when tracking symmetrical objects. ...
... In [3], the authors summarized their previous work [21,23] and introduced a novel iteratively reweighted Gauss-Newton optimization method. Region-based methods with localized models [3,5,9,21,23-25] only use the pixels within a limited band along the projected object contour, and are therefore prone to failure when tracking symmetrical objects. Zhong et al. [9] introduced an approach combining direct and region-based methods by utilizing the pixels of the foreground's interior. ...
Monocular object pose tracking has been a key technology in autonomous rendezvous of two moving platforms. However, rapid relative motion between platforms causes large interframe pose shifts, which leads to pose tracking failure. Based on the derivation of the region-based pose tracking method and the theory of rigid body kinematics, we put forward that the stability of the color segmentation model and linearization in pose optimization are the key to region-based monocular object pose tracking. A reliable metric named VoI is designed to measure interframe pose shifts, based on which we argue that motion continuity recovery is a promising way to tackle the translation-dominant large pose shift issue. Then, a 2D tracking method is adopted to bridge the interframe motion continuity gap. For texture-rich objects, the motion continuity can be recovered through localized region-based pose transferring, which is performed by solving a PnP (Perspective-n-Point) problem within the tracked 2D bounding boxes of two adjacent frames. Moreover, for texture-less objects, a direct translation approach is introduced to estimate an intermediate pose of the frame. Finally, a region-based pose refinement is exploited to obtain the final tracked pose. Experimental results on synthetic and real image sequences indicate that the proposed method achieves superior performance to state-of-the-art methods in tracking objects with large pose shifts.
... Here, the feature maps x are extracted by the backbone network; w 1 and w 2 denote the parameters of the convolution layers of D; * denotes the standard multi-channel convolution; w 1 is used for dimensionality reduction to simplify the calculation and w 2 is used for calculating the classification confidence scores. To regress the score of a potential object, the L2 segmentation loss is used to establish the objective function based on the fast-converging Gauss-Newton [27,36,38] optimizer as: ...
Most video object segmentation networks have difficulties in balancing accuracy and speed, leading them to fail to meet the requirements of applications. In this paper, we propose a lightweight online-trained video object segmentation network. Specifically, to force the network to focus on the potential object, we propose a new way to guide the encoder module by a classification score map, and integrate a cross-dimension attention into the refinement segmentation module. Meanwhile, to reduce the negative influence of unreliable samples, we use two indexes to adaptively choose templates for the memory module. Experiments were conducted on three popular benchmarks, and our approach has achieved a good trade-off between accuracy and speed.
... It inspired many subsequent approaches that proposed various modifications. For example, to better differentiate between foreground and background, methods that localize the statistical modeling were developed [19], [20], [21]. Also, contour constraints were suggested that explicitly deal with partial occlusions and color ambiguities [22]. ...
... Also, contour constraints were suggested that explicitly deal with partial occlusions and color ambiguities [22]. With respect to optimization, different techniques, including Levenberg-Marquardt [23], Gauss-Newton [20], and Newton with Tikhonov regularization [24], were considered. The relatively poor efficiency of region-based methods was addressed with the development of SRT3D [24], [10]. ...
... It features a highly-efficient sparse formulation. For evaluation purposes, the OPT [18], RBOT [20], and BCOT [25] datasets are often used. Finally, combined approaches were also proposed. ...
In many applications of advanced robotic manipulation, six degrees of freedom (6DoF) object pose estimates are continuously required. In this work, we develop a multi-modality tracker that fuses information from visual appearance and geometry to estimate object poses. The algorithm extends our previous method ICG, which uses geometry, to additionally consider surface appearance. In general, object surfaces contain local characteristics from text, graphics, and patterns, as well as global differences from distinct materials and colors. To incorporate this visual information, two modalities are developed. For local characteristics, keypoint features are used to minimize distances between points from keyframes and the current image. For global differences, a novel region approach is developed that considers multiple regions on the object surface. In addition, it allows the modeling of external geometries. Experiments on the YCB-Video and OPT datasets demonstrate that our approach ICG+ performs best on both datasets, outperforming both conventional and deep learning-based methods. At the same time, the algorithm is highly efficient and runs at more than 300 Hz. The source code of our tracker is publicly available.
... Besides single frame pose estimation, many recent works focus on the temporal tracking of object poses. Instance-level object pose tracking approaches include optimization [66,81,106,85], filtering [96,15,46,41,16], and direct regression of inter-frame pose change [94]. Recent works on category-level object pose tracking can emerge category-level keypoints [88] without known CAD model in testing, refining coarse pose from keypoint registration by pose graph optimization [93], and learning inter-frame pose change from canonicalized point clouds [95]. ...
In this work, we tackle the challenging task of jointly tracking hand object pose and reconstructing their shapes from depth point cloud sequences in the wild, given the initial poses at frame 0. We for the first time propose a point cloud based hand joint tracking network, HandTrackNet, to estimate the inter-frame hand joint motion. Our HandTrackNet proposes a novel hand pose canonicalization module to ease the tracking task, yielding accurate and robust hand joint tracking. Our pipeline then reconstructs the full hand via converting the predicted hand joints into a template-based parametric hand model MANO. For object tracking, we devise a simple yet effective module that estimates the object SDF from the first frame and performs optimization-based tracking. Finally, a joint optimization step is adopted to perform joint hand and object reasoning, which alleviates the occlusion-induced ambiguity and further refines the hand pose. During training, the whole pipeline only sees purely synthetic data, which are synthesized with sufficient variations and by depth simulation for the ease of generalization. The whole pipeline is pertinent to the generalization gaps and thus directly transferable to real in-the-wild data. We evaluate our method on two real hand object interaction datasets, e.g. HO3D and DexYCB, without any finetuning. Our experiments demonstrate that the proposed method significantly outperforms the previous state-of-the-art depth-based hand and object pose estimation and tracking methods, running at a frame rate of 9 FPS.
... In order to achieve real-time speed, previous optimization-based 3D tracking methods search for only the local minima of the non-convex cost function. Note [22], RBGT [18], SRT3D [19]) would decrease fast with the increase of displacements (frame step S). (b) The proposed hybrid non-local optimization method. ...
... The coarse-to-fine search is commonly adopted in previous tracking methods for handling large displacements. For 3D tracking, it can be implemented by image pyramids [7,22] or by varying the length of search lines [18]. However, note that since the 3D rotation is independent of the object scale in image space, coarse-to-fine search in image space would have little effect on the rotation components. ...
... Zhong et al. [29] proposed to use polar coordinates for better handling occlusion. The recent works of Stoiber et al. [18,19] proposed a sparse probabilistic model and Gaussian approximations for the derivatives in optimization, achieving state-of-the-art accuracy on the RBOT dataset [22] while running at high speed. The above methods all do only local optimization, and thus are sensitive to large displacements. ...
Optimization-based 3D object tracking is known to be precise and fast, but sensitive to large inter-frame displacements. In this paper we propose a fast and effective non-local 3D tracking method. Based on the observation that erroneous local minima are mostly due to out-of-plane rotation, we propose a hybrid approach combining non-local and local optimizations for different parameters, resulting in efficient non-local search in the 6D pose space. In addition, a precomputed robust contour-based tracking method is proposed for the pose optimization. By using long search lines with multiple candidate correspondences, it can adapt to different frame displacements without the need of coarse-to-fine search. After the pre-computation, pose updates can be conducted very fast, enabling the non-local optimization to run in real time. Our method outperforms all previous methods for both small and large displacements. For large displacements, the accuracy is greatly improved (81.7% vs. 19.4%). At the same time, real-time speed (>50 fps) can be achieved using only the CPU. The source code is available at https://github.com/cvbubbles/nonlocal-3dtracking.
... Second row: Pose tracking results of the GOS method [21]. Third row: Pose tracking results of the RBOT method [24]. Fourth row: Pose tracking results of the proposed method. ...
... According to the construction of energy function given in Section 2.2, in order to solve the optimal pose, this paper adopts the Gauss-Newton pose optimization method to solve this complex nonlinear optimization problem by referring to [24]. First, Equation (17) is reconstructed into a nonlinear iterative reweighted least squares problem of the following form: ...
... We compared the proposed algorithm with the existing representative region-based pose tracking method RBOT algorithm [24] and the edge-based pose tracking method GOS algorithm [21] on the two image sequences, respectively. For the comparison algorithms, we adopted the default parameter settings suggested in their papers. ...
Due to its structural simplicity and its strong anti-electromagnetic ability, landing guidance based on airborne monocular vision has gained more and more attention. Monocular 6D pose tracking of the aircraft carrier is one of the key technologies in visual landing guidance. However, owing to the large range span in the process of carrier landing, the scale of the carrier target in the image varies greatly. There is still a lack of robust monocular pose tracking methods suitable for this scenario. To tackle this problem, a new aircraft carrier pose tracking algorithm based on scale-adaptive local regions is proposed in this paper. Firstly, the projected contour of the carrier target is uniformly sampled to establish local circular regions. Then, the local area radius is adjusted according to the pixel scale of the projected contour to build the optimal segmentation energy function. Finally, the 6D pose tracking of the carrier target is realized by iterative optimization. Experimental results on both synthetic and real image sequences show that the proposed method achieves robust and efficient 6D pose tracking of the carrier target under the condition of large distance span, which meets the application requirements of carrier landing guidance.
... Object tracking involves measuring the distance of the tracked object from the camera. This has been done using a monocular camera (Crivellaro et al., 2017), (Tjaden et al., 2018) using machine learning algorithms, or a stereo camera (Lin and Wang, 2010), (Issac et al., 2016) using triangulation methods as well as other machine learning methodologies. The current paper uses a depth camera (Lukezic et al., 2019) where the IR sensor in the depth camera is used to measure the distance of the object from the camera once it has been detected. ...
This research presents a novel bio-inspired framework for two robots interacting together for a cooperative package delivery task with a human-in-the-loop. It contributes to eliminating the need for network-based robot-robot interaction in constrained environments. An individual robot is instructed to move in specific shapes with a particular orientation at a certain speed for the other robot to infer using object detection (custom YOLOv4) and depth perception. The shape is identified by calculating the area occupied by the detected polygonal route. A metric for the area’s extent is calculated and empirically used to assign regions for specific shapes and gives an overall accuracy of 93.3% in simulations and 90% in a physical setup. Additionally, gestures are analyzed for their accuracy of intended direction, distance, and the target coordinates in the map. The system gives an average positional RMSE of 0.349 in simulation and 0.461 in a physical experiment. A video demonstration of the problem statement along with the simulations and experiments for real world applications has been given here and in Supplementary Material.
... Tracking, a number of methods proposed objective functions, which capture the discrepancy between the current observation and the previous state. They compute relative transformations based on the minima of the residual function [38,39,40,41,42]. In particular, methods combine the optical/AR flow and point-to-plane distance to solve tracking in a least-squares sense [39,38]. ...
... Estimating the task's state dynamically, such as the object's 6D pose [42,44,257], when onboard sensing is unavailable can be achieved through feedback from inexpensive, external alternatives, such as RGBD cameras [258,259]. Although requiring additional computation, this sensing modality is advantageous as it does not require invasive, bulky sensor suites on the robot, while also providing a wider "field of sensing" for perceiving an extended workspace. ...
This thesis deals with object pose estimation and tracking, and with solving robot manipulation tasks. It aims to address uncertainty due to dynamics and generalize to novel object instances by reducing the dependency on either instance or category level 3D models. Robot object manipulation often requires reasoning about object poses given visual data. For instance, pose estimation can be used to initiate pick-and-drop manipulation and has been studied extensively. Purposeful manipulation, however, such as precise assembly or within-hand re-orientation, requires sustained reasoning of an object's state, since dynamic effects due to contacts and slippage may alter the relative configuration between the object and the robotic hand. This motivates the temporal tracking of object poses over image sequences, which reduces computational latency, while maintaining or even enhancing pose quality relative to single-shot pose estimation. Most existing techniques in this domain assume instance-level 3D models. This complicates generalization to novel, unseen instances, and thus hinders deployment to novel environments. Even if instance-level 3D models are unavailable, however, it may be possible to access category-level models. Thus, it is desirable to learn category-level priors, which can be used for the visual understanding of novel, unknown object instances. In the most general case, where the robot has to deal with out-of-distribution instances or it cannot access category-level priors, object-agnostic perception methods are needed. Given this context, this thesis proposes a category-level representation, called NUNOCS, to unify the representation of various intra-class object instances and facilitate the transfer of category-level knowledge across such instances.
This work also integrates the strengths of both modern deep learning as well as pose graph optimization to achieve generalizable object tracking in the SE(3) space, without needing either instance or category level 3D models. When instance-level object models are available, a synthetic data generation pipeline is developed to learn the relative motion along manifolds by reasoning over image residuals. This allows to achieve state-of-art SE(3) pose tracking results, while circumventing manual efforts in data collection or annotation. It also demonstrates that the developed solutions for object tracking provide efficient solutions to multiple manipulation challenges. Specifically, this thesis starts from a single-image object pose estimation approach that deals with severe occlusions during manipulation. It then moves to long-term object pose tracking via reasoning over image residuals between consecutive frames, while training exclusively over synthetic data. In the case of object tracking along a video sequence, the dependency on either instance-level or category-level CAD models is reduced via leveraging multi-view consistency, in the form of a memory-augmented pose graph optimization, to achieve spatial-temporal consistency. For initializing pose estimates in video sequences involving novel unseen objects, category-level priors are extracted by taking advantage of easily accessible virtual 3D model databases. Following these ideas, frameworks for category-level, task-relevant grasping, and vision-based, closed-loop manipulation are developed, which resolve complicated and high precision tasks. The learning process is scalable as the training is performed exclusively over synthetic data or through a robot's self-interaction process conducted solely in simulation. 
The proposed methods are evaluated first over public computer vision benchmarks, boosting the previous state-of-art tracking accuracy from 33.3% to 87.4% on the NOCS dataset, despite reducing dependency on category-level 3D models for training. When applied to real robotic setups, they significantly improve category-level manipulation performance, validating their effectiveness and robustness. In addition, this thesis unlocks and demonstrates multiple complex manipulation skills in open world environments. This is despite limited input assumptions, such as training solely over synthetic data, dealing with novel unknown objects, or learning from a single visual demonstration.
... FRTM (Fast and Robust Target Models for VOS, Robinson et al. 2020) designs a discriminative linear model for generating target-specific predictions, which are then refined by a segmentation network. During inference, only the target model requires training, achieved by performing the Gauss-Newton-based optimisation (Tjaden et al. 2018) on the first frame annotation and subsequent frame predictions. Discussion: Due to the lightweight target model and efficient optimisation, FRTM performs SVOS faster than online fine-tuning-based methods. ...
As one of the fundamental problems in the field of video understanding, video object segmentation aims at segmenting objects of interest throughout the given video sequence. Recently, with the advancements of deep learning techniques, deep neural networks have shown outstanding performance improvements in many computer vision applications, with video object segmentation being one of the most advocated and intensively investigated. In this paper, we present a systematic review of the deep learning-based video segmentation literature, highlighting the pros and cons of each category of approaches. Concretely, we start by introducing the definition, background concepts and basic ideas of algorithms in this field. Subsequently, we summarise the datasets for training and testing a video object segmentation algorithm, as well as common challenges and evaluation metrics. Next, previous works are grouped and reviewed based on how they extract and use spatial and temporal features, where their architectures, contributions and the differences among each other are elaborated. At last, the quantitative and qualitative results of several representative methods on a dataset with many remaining challenges are provided and analysed, followed by further discussions on future research directions. This article is expected to serve as a tutorial and source of reference for learners intended to quickly grasp the current progress in this research area and practitioners interested in applying the video object segmentation methods to their problems. A public website is built to collect and track the related works in this field: https://github.com/gaomingqi/VOS-Review .
... Because of the discussed shortcomings, region-based techniques (Stoiber et al., 2020; Zhong et al., 2020b; Tjaden et al., 2018; Prisacariu and Reid, 2012) have become increasingly popular. The big advantage of such methods is that they are able to reliably track a wide variety of objects in cluttered scenes, using only a monocular RGB camera and a texture-less 3D model of the object. ...
... Later, Hexner and Hagege (2016) proposed the use of local appearance models that were inspired by the localized contours of Lankton and Tannenbaum (2008). The idea was further improved by Tjaden et al. (2018) with the development of temporally consistent local color histograms. Finally, Zhong et al. (2020b) proposed a method that introduces polar-based region partitioning and edge-based occlusion detection. ...
... Later, a hierarchical rendering approach that uses the Levenberg-Marquardt algorithm was developed by Prisacariu et al. (2015). Also, Tjaden et al. (2018) proposed the use of a Gauss-Newton method to improve convergence. In addition to optimization, another idea towards better efficiency is the use of simplified signed distance functions (Liu et al., 2020). ...
Region-based methods have become increasingly popular for model-based, monocular 3D tracking of texture-less objects in cluttered scenes. However, while they achieve state-of-the-art results, most methods are computationally expensive, requiring significant resources to run in real-time. In the following, we build on our previous work and develop SRT3D, a sparse region-based approach to 3D object tracking that bridges this gap in efficiency. Our method considers image information sparsely along so-called correspondence lines that model the probability of the object’s contour location. We thereby improve on the current state of the art and introduce smoothed step functions that consider a defined global and local uncertainty. For the resulting probabilistic formulation, a thorough analysis is provided. Finally, we use a pre-rendered sparse viewpoint model to create a joint posterior probability for the object pose. The function is maximized using second-order Newton optimization with Tikhonov regularization. During the pose estimation, we differentiate between global and local optimization, using a novel approximation for the first-order derivative employed in the Newton method. In multiple experiments, we demonstrate that the resulting algorithm improves the current state of the art both in terms of runtime and quality, performing particularly well for noisy and cluttered images encountered in the real world.
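As a rough illustration of what a Newton update with Tikhonov regularization looks like when maximizing a function, the sketch below works on a two-parameter toy problem. It is not the SRT3D implementation: the function name `regularized_newton_step`, the single scalar weight `lam` applied to both parameters, and the toy quadratic are all assumptions made for this example, since the abstract does not state the update formula.

```python
def regularized_newton_step(grad, hess, lam):
    """One update (-H + lam * I)^-1 g for maximizing a 2-parameter function.

    grad: gradient (g0, g1); hess: 2x2 Hessian as nested sequences;
    lam: non-negative Tikhonov weight (assumed scalar here).
    """
    # Build the regularized matrix A = -H + lam * I.
    a = -hess[0][0] + lam
    b = -hess[0][1]
    c = -hess[1][0]
    d = -hess[1][1] + lam
    det = a * d - b * c
    if det == 0:
        raise ValueError("regularized Hessian is singular")
    # Apply the closed-form 2x2 inverse of A to the gradient.
    return ((d * grad[0] - b * grad[1]) / det,
            (-c * grad[0] + a * grad[1]) / det)


# Toy objective f(x, y) = -(x - 1)^2 - 2 * (y + 0.5)^2, maximum at (1, -0.5).
# At the start point (0, 0): gradient (2, -2), Hessian diag(-2, -4).
g = (2.0, -2.0)
H = ((-2.0, 0.0), (0.0, -4.0))
plain_step = regularized_newton_step(g, H, lam=0.0)   # lands on the maximum
damped_step = regularized_newton_step(g, H, lam=2.0)  # shorter, more cautious step
```

With `lam = 0` this reduces to a plain Newton step; larger values shrink the step, which is the usual motivation for the regularization when the Hessian is poorly conditioned.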