ES6D: A Computation Efficient and Symmetry-Aware 6D Pose Regression Framework



In this paper, a computationally efficient regression framework is presented for estimating the 6D pose of rigid objects from a single RGB-D image, and it is applicable to symmetric objects. The framework has a simple architecture that efficiently extracts point-wise features from RGB-D data using a fully convolutional network, called XYZNet, and directly regresses the 6D pose without any post-refinement. In the case of a symmetric object, one object has multiple ground-truth poses, and this one-to-many relationship may lead to estimation ambiguity. To resolve this ambiguity, we design a symmetry-invariant pose distance metric, called the average (maximum) grouped primitives distance, or A(M)GPD. The proposed A(M)GPD loss makes the regression network converge to the correct state, i.e., all minima in the A(M)GPD loss surface are mapped to correct poses. Extensive experiments on the YCB-Video and T-LESS datasets demonstrate the proposed framework's superior performance in both accuracy and computational cost.
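The role of a symmetry-invariant metric can be sketched in a few lines: for a symmetric object, compare the estimated pose against every ground-truth pose generated by the object's symmetry transforms and keep the smallest average model-point error. This is a minimal illustration of the general idea only, not the paper's exact A(M)GPD formulation; the point set and symmetry list are hypothetical inputs.

```python
import numpy as np

def sym_invariant_distance(R_est, t_est, R_gt, t_gt, points, sym_rotations):
    """Pose distance that is invariant to an object's proper symmetries:
    minimum average model-point error over all ground-truth poses generated
    by the symmetry transforms (illustrative sketch, not A(M)GPD itself)."""
    est = points @ R_est.T + t_est
    best = np.inf
    for S in sym_rotations:  # each S maps the model onto itself
        gt = points @ (R_gt @ S).T + t_gt
        best = min(best, np.linalg.norm(est - gt, axis=1).mean())
    return best
```

For an object with a 180-degree symmetry, a prediction rotated by that symmetry gets zero distance under this metric, whereas a naive point-wise distance would report a large error; this is exactly the one-to-many ambiguity the abstract describes.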

6-DoF object pose estimation from a single RGB image is a fundamental and long-standing problem in computer vision. Current leading approaches solve it by training deep networks either to regress both rotation and translation from the image directly or to construct 2D-3D correspondences and solve them via PnP indirectly. We argue that rotation and translation should be treated differently because of their significant difference. In this work, we propose a novel 6-DoF pose estimation approach: the Coordinates-based Disentangled Pose Network (CDPN), which disentangles the pose to predict rotation and translation separately and achieve highly accurate and robust pose estimation. Our method is flexible, efficient, highly accurate, and can deal with texture-less and occluded objects. Extensive experiments on the LINEMOD and Occlusion datasets demonstrate the superiority of our approach. Concretely, our approach significantly exceeds the state-of-the-art RGB-based methods on commonly used metrics.
The parameterization of rotations is a central topic in many theoretical and applied fields such as rigid body mechanics, multibody dynamics, robotics, spacecraft attitude dynamics, navigation, 3D image processing, computer graphics, etc. Nowadays, the main alternative to the use of rotation matrices, to represent rotations in $\mathbb{R}^3$, is the use of Euler parameters arranged in quaternion form. Whereas the passage from a set of Euler parameters to the corresponding rotation matrix is unique and straightforward, the passage from a rotation matrix to its corresponding Euler parameters has been revealed to be somewhat tricky if numerical aspects are considered. Since the map from quaternions to $3{\times}3$ rotation matrices is a 2-to-1 covering map, this map cannot be smoothly inverted. As a consequence, it is erroneously assumed that all inversions should necessarily contain singularities that arise in the form of quotients where the divisor can be arbitrarily small. This misconception is herein clarified. This paper reviews the most representative methods available in the literature, including a comparative analysis of their computational costs and error performances. The presented analysis leads to the conclusion that Cayley's factorization, a little-known method used to compute the double quaternion representation of rotations in four dimensions from $4{\times}4$ rotation matrices, is the most robust method when particularized to three dimensions.
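The numerical pitfall discussed above shows up in the common matrix-to-quaternion routine; the usual remedy is to branch on the largest of the four candidate divisors so that no division by a near-zero quantity occurs. Below is a minimal sketch of that four-branch (Shepperd-style) scheme, one of the methods such a survey compares, not Cayley's factorization itself:

```python
import numpy as np

def quat_from_matrix(R):
    """Rotation matrix -> unit quaternion (w, x, y, z), choosing the branch
    with the largest divisor to stay numerically stable (Shepperd-style)."""
    m00, m01, m02 = R[0]
    m10, m11, m12 = R[1]
    m20, m21, m22 = R[2]
    tr = m00 + m11 + m22
    if tr > max(m00, m11, m22):
        w = 0.5 * np.sqrt(1.0 + tr)
        s = 0.25 / w
        x, y, z = s * (m21 - m12), s * (m02 - m20), s * (m10 - m01)
    elif m00 >= max(m11, m22):
        x = 0.5 * np.sqrt(1.0 + m00 - m11 - m22)
        s = 0.25 / x
        w, y, z = s * (m21 - m12), s * (m01 + m10), s * (m02 + m20)
    elif m11 >= m22:
        y = 0.5 * np.sqrt(1.0 + m11 - m00 - m22)
        s = 0.25 / y
        w, x, z = s * (m02 - m20), s * (m01 + m10), s * (m12 + m21)
    else:
        z = 0.5 * np.sqrt(1.0 + m22 - m00 - m11)
        s = 0.25 / z
        w, x, y = s * (m10 - m01), s * (m02 + m20), s * (m12 + m21)
    return np.array([w, x, y, z])
```

Note that for a 180-degree rotation the trace is -1, so the trace-based branch alone would divide by a vanishing quantity; the branching is what makes the inversion robust, which is precisely the point the article clarifies.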
We present PPFNet - Point Pair Feature NETwork for deeply learning a globally informed 3D local feature descriptor to find correspondences in unorganized point clouds. PPFNet learns local descriptors on pure geometry and is highly aware of the global context, an important cue in deep learning. Our 3D representation is computed as a collection of point-pair-features combined with the points and normals within a local vicinity. Our permutation invariant network design is inspired by PointNet and sets PPFNet to be ordering-free. As opposed to voxelization, our method is able to consume raw point clouds to exploit the full sparsity. PPFNet uses a novel N-tuple loss and architecture injecting the global information naturally into the local descriptor. This shows that context awareness also boosts the local feature representation. Qualitative and quantitative evaluations of our network suggest increased recall, improved robustness and invariance, as well as a vital step forward in 3D descriptor extraction performance.
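The point-pair-features mentioned above are the classical four-dimensional descriptors of a pair of oriented points: the pairwise distance plus three angles relating the two normals and the difference vector. A minimal numpy version (the exact way PPFNet combines these with points and normals is beyond this sketch):

```python
import numpy as np

def point_pair_feature(p1, n1, p2, n2):
    """Classic 4D point pair feature for two oriented points:
    (||d||, angle(n1, d), angle(n2, d), angle(n1, n2)) with d = p2 - p1."""
    d = p2 - p1
    dist = np.linalg.norm(d)

    def angle(a, b):
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b)
        return np.arccos(np.clip(a @ b, -1.0, 1.0))

    return np.array([dist, angle(n1, d), angle(n2, d), angle(n1, n2)])
```

Because the feature depends only on relative geometry, it is invariant to rigid motions of the pair, which is what makes it a useful building block for pose-related correspondence learning.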
This article concerns the expressive power of depth in neural nets with ReLU activations. We prove that ReLU nets with width $2d+2$ can approximate any continuous scalar function on the $d$-dimensional cube $[0,1]^d$ arbitrarily well. We obtain quantitative depth estimates for such approximations. Our approach is based on the observation that ReLU nets are particularly well-suited for representing convex functions. Indeed, we give a constructive proof that ReLU nets with width $d+1$ can approximate any continuous convex function of $d$ variables arbitrarily well. Moreover, when approximating convex, piecewise affine functions by width $d+1$ ReLU nets, we obtain matching upper and lower bounds on the required depth, proving that our construction is essentially optimal.
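The convexity observation rests on the identity max(u, v) = u + ReLU(v - u), which lets a narrow ReLU net compute any maximum of affine functions, i.e., any convex piecewise-affine function, one ReLU layer per affine piece. A small numeric sketch of this mechanism (illustrative only, not the paper's construction):

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def max_affine(x, coeffs):
    """Evaluate f(x) = max_i (a_i * x + b_i) using only affine ops and ReLU,
    via the identity max(u, v) = u + relu(v - u) applied piece by piece."""
    a0, b0 = coeffs[0]
    out = a0 * x + b0
    for a, b in coeffs[1:]:
        out = out + relu(a * x + b - out)
    return out
```

For example, |x| = max(x, -x) is computed exactly by two affine pieces and one ReLU, matching the intuition that depth (number of stacked ReLU steps) buys additional affine pieces.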
We introduce T-LESS, a new public dataset for estimating the 6D pose, i.e. translation and rotation, of texture-less rigid objects. The dataset features thirty industry-relevant objects with no significant texture and no discriminative color or reflectance properties. The objects exhibit symmetries and mutual similarities in shape and/or size. Compared to other datasets, a unique property is that some of the objects are parts of others. The dataset includes training and test images that were captured with three synchronized sensors, specifically a structured-light and a time-of-flight RGB-D sensor and a high-resolution RGB camera. There are approximately 39K training and 10K test images from each sensor. Additionally, two types of 3D models are provided for each object, i.e. a manually created CAD model and a semi-automatically reconstructed one. Training images depict individual objects against a black background. Test images originate from twenty test scenes having varying complexity, which increases from simple scenes with several isolated objects to very challenging ones with multiple instances of several objects and with a high amount of clutter and occlusion. The images were captured from a systematically sampled view sphere around the object/scene, and are annotated with accurate ground truth 6D poses of all modeled objects. Initial evaluation results indicate that the state of the art in 6D object pose estimation has ample room for improvement, especially in difficult cases with significant occlusion. The T-LESS dataset is available online at
A pose of a rigid object is usually regarded as a rigid transformation, described by a translation and a rotation. In this article, we define a pose as a distinguishable static state of the considered object, and show that the usual identification of the pose space with the space of rigid transformations is abusive, as it is not adapted to objects with proper symmetries. Based solely on geometric considerations, we propose a frame-invariant metric on the pose space, valid for any physical object, and requiring no arbitrary tuning. This distance can be evaluated efficiently thanks to a representation of poses within a low-dimensional Euclidean space, and it enables efficient neighborhood queries, such as radius searches or k-nearest-neighbor searches, within a large set of poses using off-the-shelf methods. We lastly solve the problems of projection from the Euclidean space onto the pose space, and of pose averaging for this metric. The practical value of these theoretical developments is illustrated with an application to pose estimation of instances of a 3D rigid object given an input depth map, via a Mean Shift procedure.
In this technical demonstration, we will show our framework for automatic modeling, detection, and tracking of arbitrary texture-less 3D objects with a Kinect. The detection is mainly based on the recent template-based LINEMOD approach [1], while the automatic template learning from reconstructed 3D models, the fast pose estimation, and the quick and robust false positive removal are novel additions. In this demonstration, we will show each step of our pipeline, starting with the fast reconstruction of arbitrary 3D objects, followed by the automatic learning and the robust detection and pose estimation of the reconstructed objects in real-time. As we will show, this makes our framework suitable for object manipulation, e.g. in robotics applications.
We present MOPED, a framework for Multiple Object Pose Estimation and Detection that seamlessly integrates single-image and multi-image object recognition and pose estimation in one optimized, robust, and scalable framework. We address two main challenges in computer vision for robotics: robust performance in complex scenes, and low latency for real-time operation. We achieve robust performance with Iterative Clustering Estimation (ICE), a novel algorithm that iteratively combines feature clustering with robust pose estimation. Feature clustering quickly partitions the scene and produces object hypotheses. The hypotheses are used to further refine the feature clusters, and the two steps iterate until convergence. ICE is easy to parallelize, and easily integrates single- and multi-camera object recognition and pose estimation. We also introduce a novel object hypothesis scoring function based on M-estimator theory, and a novel pose clustering algorithm that robustly handles recognition outliers. We achieve scalability and low latency with an improved feature matching algorithm for large databases, a GPU/CPU hybrid architecture that exploits parallelism at all levels, and an optimized resource scheduler. We provide extensive experimental results demonstrating state-of-the-art performance in terms of recognition, scalability, and latency in real-world robotic applications.
Object 6D pose estimation is a fundamental task in many applications. Conventional methods solve the task by detecting and matching keypoints, then estimating the pose. Recent efforts bringing deep learning into the problem mainly overcome the vulnerability of conventional methods to environmental variation due to hand-crafted feature design. However, these methods cannot achieve end-to-end learning and good interpretability at the same time. In this letter, we propose REDE, a novel end-to-end object pose estimator using RGB-D data, which utilizes a network for keypoint regression and a differentiable geometric pose estimator for pose error back-propagation. In addition, to achieve better robustness when outlier keypoint predictions occur, we further propose a differentiable outlier elimination method that regresses the candidate result and the confidence simultaneously. Via confidence-weighted aggregation of multiple candidates, we can reduce the effect of outliers on the final estimate. Finally, following the conventional method, we apply a learnable refinement process to further improve the estimation. The experimental results on three benchmark datasets show that REDE slightly outperforms the state-of-the-art approaches and is more robust to object occlusion. Our code is available at .
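The confidence-weighted aggregation step described above can be sketched as a weighted average in which low-confidence (outlier-prone) candidates contribute little; this is an illustrative simplification, not REDE's exact differentiable scheme:

```python
import numpy as np

def aggregate_candidates(candidates, confidences):
    """Confidence-weighted average of candidate estimates (e.g., keypoint
    positions): normalize confidences to weights and blend the candidates,
    so outliers with low confidence barely move the result."""
    w = np.asarray(confidences, dtype=float)
    w = w / w.sum()
    return (np.asarray(candidates, dtype=float) * w[:, None]).sum(axis=0)
```

Because the aggregation is a smooth function of both candidates and confidences, gradients can flow through it during training, which is the property that lets this kind of outlier handling stay end-to-end learnable.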
Estimating the 6D pose of known objects is important for robots to interact with objects in the real world. The problem is challenging due to the variety of objects as well as the complexity of the scene caused by clutter and occlusion between objects. In this work, we introduce a new Convolutional Neural Network (CNN) for 6D object pose estimation named PoseCNN. PoseCNN estimates the 3D translation of an object by localizing its center in the image and predicting its distance from the camera. The 3D rotation of the object is estimated by regressing to a quaternion representation. PoseCNN is able to handle symmetric objects and is also robust to occlusion between objects. In addition, we contribute a large scale video dataset for 6D object pose estimation named the YCB-Video dataset. Our dataset provides accurate 6D poses of 21 objects from the YCB dataset observed in 92 videos with 133,827 frames. We conduct experiments on our YCB-Video dataset and the OccludedLINEMOD dataset to show that PoseCNN provides very good estimates using only color as input.
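The translation part of this scheme, recovering a 3D translation from a predicted image-space object center and a predicted distance from the camera, amounts to back-projection through the camera intrinsics. A minimal sketch (illustrative of the idea, not PoseCNN's exact implementation):

```python
import numpy as np

def translation_from_center(cx, cy, z, K):
    """Back-project a predicted 2D object center (cx, cy) and predicted
    depth z into a 3D translation: t = z * K^{-1} [cx, cy, 1]^T."""
    uv1 = np.array([cx, cy, 1.0])
    return z * (np.linalg.inv(K) @ uv1)
```

Decoupling translation into a 2D localization plus a scalar depth is what makes the translation estimate robust to where the object appears in the image, since the same depth prediction is valid at any image location.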
Augmented reality (AR) makes it possible to seamlessly insert virtual objects into an image sequence. In order to accomplish this goal, it is important that synthetic elements are rendered and aligned in the scene in an accurate and visually acceptable way. The solution of this problem can be related to a pose estimation or, equivalently, a camera localization process. This paper aims at presenting a brief but almost self-contained introduction to the most important approaches dedicated to vision-based camera localization, along with a survey of several extensions proposed in recent years. For most of the presented approaches, we also provide links to code for short examples. This should allow readers to easily bridge the gap between theoretical aspects and practical implementations.
Today, visual recognition systems are still rarely employed in robotics applications. Perhaps one of the main reasons for this is the lack of demanding benchmarks that mimic such scenarios. In this paper, we take advantage of our autonomous driving platform to develop novel challenging benchmarks for the tasks of stereo, optical flow, visual odometry/SLAM and 3D object detection. Our recording platform is equipped with four high resolution video cameras, a Velodyne laser scanner and a state-of-the-art localization system. Our benchmarks comprise 389 stereo and optical flow image pairs, stereo visual odometry sequences of 39.2 km length, and more than 200k 3D object annotations captured in cluttered scenarios (up to 15 cars and 30 pedestrians are visible per image). Results from state-of-the-art algorithms reveal that methods ranking high on established datasets such as Middlebury perform below average when being moved outside the laboratory to the real world. Our goal is to reduce this bias by providing challenging benchmarks with novel difficulties to the computer vision community. Our benchmarks are available online at:
We show that standard multilayer feedforward networks with as few as a single hidden layer and arbitrary bounded and nonconstant activation function are universal approximators with respect to Lp(μ) performance criteria, for arbitrary finite input environment measures μ, provided only that sufficiently many hidden units are available. If the activation function is continuous, bounded and nonconstant, then continuous mappings can be learned uniformly over compact input sets. We also give very general conditions ensuring that networks with sufficiently smooth activation functions are capable of arbitrarily accurate approximation to a function and its derivatives.
Tomáš Hodaň, Jiří Matas, and Štěpán Obdržálek. On evaluation of 6D object pose estimation. In European Conference on Computer Vision, pages 606-619. Springer, 2016.
Tomáš Hodaň, Martin Sundermeyer, Bertram Drost, Yann Labbé, Eric Brachmann, Frank Michel, Carsten Rother, and Jiří Matas. BOP challenge 2020 on 6D object localization. In European Conference on Computer Vision, pages 577-594. Springer, 2020.
Yann Labbé, Justin Carpentier, Mathieu Aubry, and Josef Sivic. CosyPose: Consistent multi-view multi-object 6D pose estimation. In European Conference on Computer Vision, pages 574-591. Springer, 2020.
Chi Li, Jin Bai, and Gregory D. Hager. A unified framework for multi-view multi-class object pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 254-269, 2018.
Zhijian Liu, Haotian Tang, Yujun Lin, and Song Han. Point-Voxel CNN for efficient 3D deep learning. Advances in Neural Information Processing Systems, 32, 2019.
Giorgia Pitteri, Michaël Ramamonjisoa, Slobodan Ilic, and Vincent Lepetit. On object symmetries and 6D pose estimation from images. In 2019 International Conference on 3D Vision (3DV), pages 614-622. IEEE, 2019.
Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652-660, 2017.
Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099-5108, 2017.
Myoungha Song, Jeongho Lee, and Donghwan Kim. PAM: Point-wise attention module for 6D object pose estimation. arXiv preprint arXiv:2008.05242, 2020.
Jonathan Tremblay, Thang To, Balakumar Sundaralingam, Yu Xiang, Dieter Fox, and Stan Birchfield. Deep object pose estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790, 2018.
Zelin Xu, Ke Chen, and Kui Jia. W-PoseNet: Dense correspondence regularized pixel pair pose regression. arXiv preprint arXiv:1912.11888, 2019.