Fig 6 - uploaded by Henning Tjaden
The two depth map types used within our approach, where brighter pixels are closer to the camera. Left: the usual depth map I_d, corresponding to the closest surface points. Right: the reverse depth map I_d^r, corresponding to the most distant surface points.
Source publication
We propose an algorithm for real-time 6DOF pose tracking of rigid 3D objects using a monocular RGB camera. The key idea is to derive a region-based cost function using temporally consistent local color histograms. While such region-based cost functions are commonly optimized using first-order gradient descent techniques, we systematically derive a...
Context in source publication
Context 1
... [26] it has been shown that it is beneficial not only to consider the points on the surface closest to the camera but also the most distant ones (on the backside of the object) for pose optimization. In order to obtain the respective coordinates for each pixel, we compute an additional reverse depth map I_d^r, for which we simply invert the OpenGL depth test used to compute the corresponding Z-buffer (see Figure 6). Given I_d^r, the farthest surface point X(x, I_d^r) corresponding to a pixel x is also recovered as X(x, ...
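The inverted depth test described above can be illustrated with a minimal sketch (plain NumPy rather than OpenGL; the fragment buffer and pixel coordinates are made up for illustration): the usual depth map I_d keeps the closest fragment per pixel, as a GL_LESS depth test would, while the reverse depth map I_d^r keeps the farthest, as the inverted GL_GREATER test would.

```python
import numpy as np

# Hypothetical fragment buffer: for each pixel, the candidate surface depths
# (one per intersected triangle). In OpenGL these are the fragments that
# compete in the depth test.
H, W = 2, 2
fragments = {
    (0, 0): [0.3, 0.8],  # front and back surface of the object
    (0, 1): [0.5],       # grazing pixel with a single surface
}

# Usual depth map I_d: keep the closest fragment (GL_LESS depth test).
# Reverse depth map I_d^r: keep the farthest fragment (GL_GREATER test).
I_d = np.full((H, W), np.inf)    # background stays at "infinitely far"
I_dr = np.full((H, W), -np.inf)
for (y, x), depths in fragments.items():
    I_d[y, x] = min(depths)
    I_dr[y, x] = max(depths)
```

In an actual renderer this amounts to switching the depth comparison from GL_LESS to GL_GREATER (and clearing the depth buffer to 0 instead of 1) in a second render pass.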
Similar publications
For detection and tracking of a near space hypersonic maneuvering target in a heavy clutter environment, a novel transform domain method called Variable-Diameter-Arc-Helix Radon Transform (VDAH-RT) is proposed as a tool to integrate observations along the maneuvering path. Considering the 3D maneuvering trajectory is like a variable diameter arc he...
Citations
... If the results for this scenario were not available, the AUC scores from the evaluation over all tests are reported instead. It is observed that the proposed algorithm attains better results than those achieved by PWP3D [68], UDP [69], ElasticFusion [70], RBOT [71] and a recently proposed edge-based algorithm [72]. In the FreeMotion scenario it achieves better average results than the edge-based method [73]. ...
Pose estimation methods for robotics should return a distribution of poses rather than just a single pose estimate. Motivated by this, in this work we investigate multi-modal pose representations for reliable 6-DoF object tracking. A neural network architecture for simultaneous object segmentation and estimation of fiducial points of the object on RGB images is proposed. Given an a priori probability distribution of object poses, a particle filter is employed to estimate the posterior probability distribution of object poses. An advanced observation model, relying on matching the projected 3D model with the segmented object and a distance transform-based object representation, is used to weight the samples representing the probability distribution. Afterwards, the object pose determined by the PnP algorithm is included in the probability distribution by replacing the particle with the smallest weight. Next, a k-means++ algorithm is executed to determine modes in the multi-modal probability distribution. A multi-swarm particle swarm optimization is then executed to determine the finest modes in the probability distribution. A subset of particles for final pose optimization is found in a multi-criteria analysis using the TOPSIS algorithm. They are verified using conflicting criteria determined on the basis of object keypoints, the segmented object, and the distance transform. On the challenging YCB-Video dataset, it outperforms recent algorithms for both object pose estimation and object pose tracking.
... However, these methods necessitate manual labeling of training data, which is time-consuming and does not promote diversity across various scenes and object categories [21]. Also, achieving real-time performance proves challenging for the majority of these works, even with the utilization of high-performance GPUs [40]. ...
Pose estimation and tracking of objects is a fundamental application in 3D vision. Event cameras possess remarkable attributes such as high dynamic range, low latency, and resilience against motion blur, which enables them to address challenging high dynamic range scenes or high-speed motion. These features make event cameras an ideal complement over standard cameras for object pose estimation. In this work, we propose a line-based robust pose estimation and tracking method for planar or non-planar objects using an event camera. Firstly, we extract object lines directly from events, then provide an initial pose using a globally-optimal Branch-and-Bound approach, where 2D-3D line correspondences are not known in advance. Subsequently, we utilize event-line matching to establish correspondences between 2D events and 3D models. Furthermore, object poses are refined and continuously tracked by minimizing event-line distances. Events are assigned different weights based on these distances, employing robust estimation algorithms. To evaluate the precision of the proposed methods in object pose estimation and tracking, we have devised and established an event-based moving object dataset. Compared against state-of-the-art methods, the robustness and accuracy of our methods have been validated both on synthetic experiments and the proposed dataset. The source code is available at https://github.com/Zibin6/LOPET.
... The PETS2009 dataset was used by Gennaro et al. [30] and Wang et al. [72] for pedestrian tracking applications. The region-based object tracking (RBOT) dataset [74] is a monocular RGB dataset developed for estimating the pose (translation and rotation) of known objects relative to the camera. ...
Object tracking is one of the most important problems in computer vision applications such as robotics, autonomous driving, and pedestrian movement. There has been significant development in camera hardware, where researchers are experimenting with the fusion of different sensors and developing image processing algorithms to track objects. Image processing and deep learning methods have progressed significantly in the last few decades. Different data association methods, accompanied by image processing and deep learning, are becoming crucial in object tracking tasks. The data requirement for deep learning methods has led to different public datasets that allow researchers to benchmark their methods. While there has been an improvement in object tracking methods, technology, and the availability of annotated object tracking datasets, there is still scope for improvement. This review contributes by systematically identifying different sensor equipment, datasets, methods, and applications, providing a taxonomy of the literature and the strengths and limitations of different approaches, thereby providing guidelines for selecting equipment, methods, and applications. Research questions and future scope to address the unresolved issues in the object tracking field are also presented, with research direction guidelines.
... When it comes to effectively benchmarking any proposed 6DoF pose detector, using annotated datasets as ground truth (GT) appears to be a common practice. Among the most renowned is RBOT [35], a semi-synthetic dataset for the evaluation of monocular object pose tracking algorithms. The YCB-Video dataset [21] provides video sequences with accurate 6DoF pose annotations. ...
We present TimberTool (TTool v2.1.1), a software tool designed for woodworking tasks assisted by augmented reality (AR), whose core function is the real-time localization of a tool head's pose within camera frames. The localization process, a fundamental aspect of AR-assisted tool operations, enables informed integration with contextual tracking, facilitating the computation of meaningful feedback for guiding users during tasks on the target object. In the context of timber construction, where object pose tracking has been predominantly explored in additive processes, TTool addresses a noticeable gap by focusing on subtractive tasks with manual tools. The proposed methodology utilizes a machine learning (ML) classifier to detect tool heads, offers users the capability to input a global pose, and employs an automatic pose refiner for final pose detection and model alignment. Notably, TTool boasts adaptability through a customizable platform tailored to specific tool sets, and its open accessibility encourages widespread utilization. To assess the effectiveness of TTool in AR-assisted woodworking, we conducted a preliminary experimental campaign using a set of tools commonly employed in timber carpentry. The findings suggest that TTool can effectively contribute to AR-assisted woodworking tasks by detecting the six-degrees-of-freedom (6DoF) pose of tool heads to a satisfactory level, with a positional error of 3.9 ± 1 mm and an angular error of 1.19 ± 0.6°, with considerable room for improvement.
... The selection of the type of tracking algorithm to use considered three categories: (1) region-based, (2) feature-based, and (3) SLAM-based. Region-based algorithms [13-15] are significantly limited when tracking objects with heterogeneous colors against dense backgrounds. Both conditions are present in the objects to be tracked in the packing operation, so they are not explored in this study. ...
Available solutions to assist human operators in cargo packing processes offer alternatives to maximize the spatial occupancy of containers used in intralogistics. However, these solutions consist of sequential instructions for picking each box and positioning it in the containers, making them challenging for an operator to interpret and requiring them to alternate between reading the instructions and executing the task. A potential solution to these issues lies in a tool that naturally communicates each box's initial and final location in the desired sequence to the operator. While 6D visual object tracking systems have demonstrated good performance, they have yet to be evaluated in real-world scenarios of manual box packing. They also need to use the available prior knowledge of the packing operation, such as the number of boxes, box size, and physical packing sequence. This study explores the inclusion of box size priors in 6D plane segment tracking systems driven by images from moving cameras and quantifies their contribution in terms of tracker performance when assessed in manual box packing operations. To do this, it compares the performance of a plane segment tracking system, considering variations in the tracking algorithm and camera speed (onboard the packing operator) during the mapping of a manual cargo packing process. The tracking algorithm varies at two levels: algorithm A_wpk, which integrates prior knowledge of box sizes in the scene, and algorithm A_woutpk, which assumes ignorance of box properties. Camera speed is also evaluated at two levels: low speed (S_low) and high speed (S_high). This study analyzes the impact of these factors on the precision, recall, and F1-score of the plane segment tracking system. ANOVA analysis was applied to the precision and F1-score results, which shows that neither the camera speed-algorithm interaction nor the camera speed has a significant effect on the precision of the tracking system.
The factor that presented a significant effect is the tracking algorithm. Tukey's pairwise comparisons concluded that the precision and F1-score of each algorithm level are significantly different, with algorithm A_wpk being superior in each evaluation. This superiority reaches its maximum in the tracking of top plane segments: 22 and 14 percentage points for the precision and F1-score metrics, respectively. However, the results on the recall metric remain similar with and without the addition of prior knowledge. The contribution of including prior knowledge of box sizes in 6D plane segment tracking algorithms lies in reducing false positives. This reduction is associated with significant increases in the tracking system's precision and F1-score metrics. Future work will investigate whether the identified benefits propagate to the tracking problem for objects composed of plane segments, such as cubes or boxes.
... Region-based methods use image statistics to model the probability that a pixel belongs to the object or to background areas in the environment [11]. The object pose and corresponding contour that best fit the image segmentation can then be found, and changes in the object's pose can be tracked. ...
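The pixel-wise probability model described in this excerpt can be sketched as a simple Bayesian posterior over foreground/background color histograms (a generic region-based formulation; the function name and histogram values are illustrative, not the cited method's exact model):

```python
import numpy as np

def pixel_posterior(hist_fg, hist_bg, pixel_bin, prior_fg=0.5):
    """Posterior probability that a pixel belongs to the foreground,
    given normalized foreground/background color histograms and the
    histogram bin the pixel's color falls into."""
    p_fg = hist_fg[pixel_bin]  # likelihood of the color under the object model
    p_bg = hist_bg[pixel_bin]  # likelihood under the background model
    num = prior_fg * p_fg
    den = num + (1.0 - prior_fg) * p_bg
    return num / den if den > 0 else prior_fg
```

A pose is then sought whose projected contour maximizes the foreground posterior inside the silhouette and the background posterior outside it.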
The assembly and maintenance of products in the aviation industry constitute a crucial aspect of the product life cycle, with numerous tasks still reliant on manual operations. In order to solve the problem of narrow operation spaces and blind areas in the processes of manual assembly and maintenance, we proposed an augmented reality (AR) assistant guidance method specifically designed for such scenarios. By employing a multi-modality anti-occlusion tracking algorithm, pose data of assembly parts can be obtained, upon which AR guidance information is displayed. Additionally, we proposed an assembly step identification method to alleviate user interaction pressure. We developed an AR visualization assistant guidance system and designed and conducted a user evaluation experiment to measure the learnability, usability, and mental effort required. The results demonstrate that our method significantly enhances training efficiency by 128.77%, as well as improving assembly and maintenance efficiency by 29.53% and 27.27% compared with traditional methods. Moreover, it has significant advantages in learnability, usability, and mental effort, providing a feasible and effective resolution for addressing blind areas during assembly and maintenance within the aviation industry.
... The method used here is sparse because it uses lines, not regions (as used by, e.g., [11]), when evolving the contour. ...
Remotely Operated Vehicles (ROVs) are essential instruments in most industrial applications of subsea Inspection, Maintenance, and Repair (IMR). In IMR applications, especially short-distance inspection and intervention operations, ROVs should be able to determine their position and orientation with respect to underwater platforms or specific objects on the platform. Current work-class ROVs are normally operated by two pilots. Improving situational awareness and sensing capabilities is essential for industrial actors in such a challenging environment. Automated visual interpretation shows promising results and provides essential tools toward a higher level of autonomy in IMR and ROV operations. Motivated by the large amount of information available in ROV cameras, this article presents a pipeline for object detection, 6D pose estimation, and tracking. The main contribution of the paper is its new detection procedure, in which the current tracker state is evaluated to accept or reject a new detection. The decomposed tasks of detection, 6D pose estimation, and tracking in the proposed pipeline are tested on an open underwater dataset featuring pose-annotated objects used for ROV intervention. The results show that the method is highly suitable for supervised autonomy and a step toward autonomous operations in underwater IMR applications.
... Finally, the intermediate pose is refined using a region-based method. Extensive evaluation experiments were performed on both synthetic and real images, in comparison to the recently published representative methods [3-5]. The results indicate that the proposed method achieves performance superior to state-of-the-art methods, especially for objects with large pose shifts. ...
... Tjaden et al. [23] introduced a novel localized model using temporally consistent local color histograms to preserve temporal consistency. In [3], the authors summarized their previous work [21,23] and introduced a novel iteratively reweighted Gauss-Newton optimization method. Region-based methods with localized models [3,5,9,21,23-25] only use the pixels within a limited band along the projected object contour, and are therefore prone to failure when tracking symmetrical objects. ...
... Zhong et al. [9] introduced an approach combining direct and region-based methods by utilizing the pixels of the foreground's interior. ...
Monocular object pose tracking has been a key technology in autonomous rendezvous of two moving platforms. However, rapid relative motion between platforms causes large interframe pose shifts, which leads to pose tracking failure. Based on the derivation of the region-based pose tracking method and the theory of rigid body kinematics, we put forward that the stability of the color segmentation model and linearization in pose optimization are the key to region-based monocular object pose tracking. A reliable metric named VoI is designed to measure interframe pose shifts, based on which we argue that motion continuity recovery is a promising way to tackle the translation-dominant large pose shift issue. Then, a 2D tracking method is adopted to bridge the interframe motion continuity gap. For texture-rich objects, the motion continuity can be recovered through localized region-based pose transferring, which is performed by solving a PnP (Perspective-n-Point) problem within the tracked 2D bounding boxes of two adjacent frames. Moreover, for texture-less objects, a direct translation approach is introduced to estimate an intermediate pose of the frame. Finally, a region-based pose refinement is exploited to obtain the final tracked pose. Experimental results on synthetic and real image sequences indicate that the proposed method achieves superior performance to state-of-the-art methods in tracking objects with large pose shifts.
... The semi-synthetic region-based object tracking (RBOT) dataset [12] consists of RGB sequences with four different difficulty levels, as shown in Fig 2. The 3D mesh models of 18 objects with perfect ground-truth trajectories are also provided. Although the objects are synthetically rendered, the background consists of moving real images that are highly cluttered, which makes tracking much more challenging. ...
... Similar to previous studies, we evaluate the tracking success rate as defined by Tjaden et al. [12]. The translation and rotation errors are calculated as e_t(k) = ||t_est(k) − t_gt(k)||_2 and e_r(k) = arccos((trace(R_gt(k)^T R_est(k)) − 1) / 2), where t_est(k) and R_est(k) are the estimated and t_gt(k) and R_gt(k) the ground-truth translation vector and rotation matrix for frame k ∈ {0, ..., 1000}. The tracking success rate is then defined as the percentage of estimated poses for which both e_t < 5 cm and e_r < 5°. ...
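The 5 cm / 5° success criterion commonly attributed to Tjaden et al. can be written as a short helper (a sketch; the function name is ours, and poses are assumed to be given in metres and as 3×3 rotation matrices):

```python
import numpy as np

def tracking_success(t_est, R_est, t_gt, R_gt):
    """True if the translation error is below 5 cm and the rotation
    error below 5 degrees (the usual RBOT-style success criterion)."""
    e_t = np.linalg.norm(t_est - t_gt)  # translation error in metres
    # Rotation error: angle of the relative rotation R_gt^T R_est.
    cos_e_r = (np.trace(R_gt.T @ R_est) - 1.0) / 2.0
    # Clip guards against values slightly outside [-1, 1] from rounding.
    e_r = np.degrees(np.arccos(np.clip(cos_e_r, -1.0, 1.0)))
    return e_t < 0.05 and e_r < 5.0
```

The success rate over a sequence is then simply the fraction of frames for which this predicate holds.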
... Table II shows the average runtime per frame, where recent methods such as ours, SRT3D, and ICG can achieve real-time tracking on the CPU alone due to the use of precomputed sparse viewpoint information. This avoids the costly process of rendering an image to extract the object's silhouette, which would take around 0.7 ms on a GPU for each iteration [12]. ...
... Here, the feature maps x are extracted by the backbone network; w_1 and w_2 denote the parameters of the convolution layers of D; * denotes the standard multi-channel convolution; w_1 is used for dimensionality reduction to simplify the calculation, and w_2 is used for calculating the classification confidence scores. Aiming to regress the score of a potential object, the L2 segmentation loss is used to establish the objective function based on the fast-converging Gauss-Newton [27,36,38] optimizer as: ...
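A generic Gauss-Newton update for a least-squares objective of this kind can be sketched as follows (a textbook step, not the cited paper's exact formulation; the residual vector and its Jacobian are supplied by the caller):

```python
import numpy as np

def gauss_newton_step(r, J, theta):
    """One Gauss-Newton update for the objective 0.5 * ||r(theta)||^2:
    theta <- theta - (J^T J)^{-1} J^T r, where r is the residual vector
    evaluated at theta and J is its Jacobian."""
    # Solve the normal equations rather than inverting J^T J explicitly.
    delta = np.linalg.solve(J.T @ J, J.T @ r)
    return theta - delta
```

For a linear residual r(θ) = Aθ − b the step solves the problem exactly in one iteration, which is why Gauss-Newton converges quickly near the optimum.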
Most video object segmentation networks have difficulty balancing accuracy and speed, which prevents them from meeting application requirements. In this paper, we propose a lightweight online-trained video object segmentation network. Specifically, to force the network to focus on the potential object, we propose a new way to guide the encoder module with a classification score map, and integrate cross-dimension attention into the refinement segmentation module. Meanwhile, to reduce the negative influence of unreliable samples, we use two indexes to adaptively choose templates for the memory module. Experiments were conducted on three popular benchmarks, and our approach achieves a good trade-off between accuracy and speed.