Preprint

6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model


Abstract

We propose 6DGS to estimate the camera pose of a target RGB image given a 3D Gaussian Splatting (3DGS) model representing the scene. 6DGS avoids the iterative process typical of analysis-by-synthesis methods (e.g. iNeRF), which also require an initialization of the camera pose in order to converge. Instead, our method estimates a 6DoF pose by inverting the 3DGS rendering process. Starting from the object surface, we define a radiant Ellicell that uniformly generates rays departing from each of the ellipsoids that parameterize the 3DGS model. Each Ellicell ray is associated with the rendering parameters of its ellipsoid, which are in turn used to obtain the best bindings between the target image pixels and the cast rays. These pixel-ray bindings are then ranked to select the best-scoring bundle of rays, whose intersection provides the camera center and, in turn, the camera rotation. The proposed solution obviates the need for an "a priori" pose for initialization, and it solves 6DoF pose estimation in closed form, without iterations. Moreover, compared to existing Novel View Synthesis (NVS) baselines for pose estimation, 6DGS improves the overall average rotational accuracy by 12% and translation accuracy by 22% on real scenes, despite not requiring any initialization pose. At the same time, our method operates in near real-time, reaching 15 fps on consumer hardware.
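
The closed-form recovery of the camera center from the selected bundle of rays can be illustrated with a few lines of linear algebra. The sketch below is not the authors' implementation; it only shows the standard least-squares intersection of a ray bundle, assuming the best-scoring ray origins and directions have already been computed.

```python
# Minimal sketch (not the authors' code): recover a camera center as the
# least-squares intersection of a bundle of rays, once the best-scoring
# pixel-ray bindings have been selected. Inputs are hypothetical.
import numpy as np

def least_squares_ray_intersection(origins: np.ndarray, dirs: np.ndarray) -> np.ndarray:
    """Find the 3D point minimizing the sum of squared distances to rays.

    origins: (N, 3) ray origins (e.g. points on ellipsoid surfaces).
    dirs:    (N, 3) ray directions (normalized internally).
    Solves  sum_i (I - d_i d_i^T) p = sum_i (I - d_i d_i^T) o_i.
    """
    d = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    # Projection matrices onto each ray's orthogonal complement.
    P = np.eye(3)[None] - d[:, :, None] * d[:, None, :]   # (N, 3, 3)
    A = P.sum(axis=0)                                      # (3, 3)
    b = np.einsum('nij,nj->i', P, origins)                 # (3,)
    return np.linalg.solve(A, b)

# Toy usage: two rays from different origins both aimed at (0, 0, 5).
o = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
d = np.array([[-1.0, 0.0, 5.0], [1.0, 0.0, 5.0]])
print(least_squares_ray_intersection(o, d))   # ~ [0, 0, 5]
```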


References
Conference Paper
This document explains how to mesh the hemisphere with equal view factor elements. The main characteristic of the method is the definition of elements delimited by the two classical spherical coordinates (polar and azimuth angles), similar to the geographical longitude and latitude. This choice is very convenient for identifying the localization of the elements on the sphere; it also greatly simplifies the determination of rays for either deterministic or stratified-sampled Monte Carlo ray tracing. The generation of the mesh is very fast and consequently well suited for ray tracing methods. The quality of the resulting set of rays, which are spatially very well distributed, is fundamental to the reliability of the whole process.
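As a rough illustration of the idea (not the paper's exact construction), the following sketch partitions the hemisphere into polar bands of equal view factor and uniform azimuth sectors, emitting one ray direction per cell; the cell counts are arbitrary.

```python
# Hedged sketch: mesh the unit hemisphere into cells bounded by polar and
# azimuth angles so that every cell subtends the same *projected* solid
# angle (view factor), then emit one ray direction per cell center.
import numpy as np

def equal_view_factor_directions(n_rings: int, n_sectors: int) -> np.ndarray:
    """Return (n_rings * n_sectors, 3) unit directions, one per cell.

    The view factor of a polar band [t1, t2] is proportional to
    sin^2(t2) - sin^2(t1), so splitting sin^2(theta) uniformly gives
    bands of equal view factor; azimuth is split uniformly.
    """
    dirs = []
    for i in range(n_rings):
        s1, s2 = i / n_rings, (i + 1) / n_rings      # sin^2(theta) band edges
        theta = np.arcsin(np.sqrt(0.5 * (s1 + s2)))  # band midpoint
        for j in range(n_sectors):
            phi = 2.0 * np.pi * (j + 0.5) / n_sectors
            dirs.append([np.sin(theta) * np.cos(phi),
                         np.sin(theta) * np.sin(phi),
                         np.cos(theta)])
    return np.asarray(dirs)

rays = equal_view_factor_directions(n_rings=8, n_sectors=16)
print(rays.shape)   # (128, 3)
```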
Conference Paper
This paper deals with local 3D descriptors for surface matching. First, we categorize existing methods into two classes: Signatures and Histograms. Then, by discussion and experiments alike, we point out the key issues of uniqueness and repeatability of the local reference frame. Based on these observations, we formulate a novel comprehensive proposal for surface representation, which encompasses a new unique and repeatable local reference frame as well as a new 3D descriptor. The latter lies at the intersection between Signatures and Histograms, so as to possibly achieve a better balance between descriptiveness and robustness. Experiments on publicly available datasets as well as on range scans obtained with Spacetime Stereo provide a thorough validation of our proposal.
Article
We present VERF, a collection of two methods (VERF-PnP and VERF-Light) for providing runtime assurance on the correctness of a camera pose estimate of a monocular camera without relying on direct depth measurements. We leverage the ability of NeRF (Neural Radiance Fields) to render novel RGB perspectives of a scene. We only require as input the camera image whose pose is being estimated, an estimate of the camera pose we want to monitor, and a NeRF model containing the scene pictured by the camera. We can then predict if the pose estimate is within a desired distance from the ground truth and justify our prediction with a level of assurance. VERF-Light does this by rendering a viewpoint with NeRF at the estimated pose and estimating its relative offset to the sensor image up to scale. Since scene scale is unknown, the approach renders another auxiliary image and reasons over the consistency of the optical flows across the three images. VERF-PnP takes a different approach by rendering a stereo pair of images with NeRF and utilizing the Perspective-n-Point (PnP) algorithm. We evaluate both methods on the LLFF dataset, on data from a Unitree A1 quadruped robot, and on data collected from Blue Origin's sub-orbital New Shepard rocket to demonstrate the effectiveness of the proposed pose monitoring method across a range of scene scales. We also show monitoring can be completed in under half a second on a 3090 GPU.
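To make the PnP step concrete, the sketch below runs an OpenCV solvePnP round-trip on fabricated 3D-2D correspondences; obtaining real correspondences from NeRF-rendered stereo views is the substance of VERF-PnP and is not reproduced here.

```python
# Illustrative sketch only: the Perspective-n-Point step that VERF-PnP-style
# monitoring relies on. The 3D points and the camera intrinsics below are
# hypothetical placeholders; in VERF-PnP they would come from a NeRF-rendered
# stereo pair and the sensor's calibration.
import numpy as np
import cv2

object_points = np.random.rand(12, 3) * 2.0                 # placeholder 3D points
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])                             # assumed intrinsics

# Project with a known ground-truth pose to fabricate consistent 2D points.
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.0, 0.0, 4.0])
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, None)

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)
print(ok, rvec.ravel(), tvec.ravel())   # should recover rvec_gt, tvec_gt
```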
Article
Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks that are costly to train and render, while recent faster methods inevitably trade off speed for quality. For unbounded and complete scenes (rather than isolated objects) and 1080p resolution rendering, no current method can achieve real-time display rates. We introduce three key elements that allow us to achieve state-of-the-art visual quality while maintaining competitive training times and importantly allow high-quality real-time (≥ 30 fps) novel-view synthesis at 1080p resolution. First, starting from sparse points produced during camera calibration, we represent the scene with 3D Gaussians that preserve desirable properties of continuous volumetric radiance fields for scene optimization while avoiding unnecessary computation in empty space; Second, we perform interleaved optimization/density control of the 3D Gaussians, notably optimizing anisotropic covariance to achieve an accurate representation of the scene; Third, we develop a fast visibility-aware rendering algorithm that supports anisotropic splatting and both accelerates training and allows realtime rendering. We demonstrate state-of-the-art visual quality and real-time rendering on several established datasets.
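One core ingredient of this representation is projecting each anisotropic 3D Gaussian's covariance into screen space via a local affine approximation of the perspective projection. The sketch below illustrates that single step in isolation; it is a simplification, not the paper's CUDA rasterizer.

```python
# Minimal sketch of one 3DGS ingredient: the screen-space covariance of a
# 3D Gaussian under perspective projection (the EWA splatting step).
import numpy as np

def project_gaussian_cov(cov3d, mean_cam, fx, fy):
    """2D covariance of a 3D Gaussian under perspective projection.

    cov3d:    (3, 3) covariance already in camera coordinates.
    mean_cam: (3,) Gaussian center in camera coordinates (z > 0).
    Uses the local affine approximation Sigma' = J Sigma J^T, where J is
    the Jacobian of (x, y, z) -> (fx * x / z, fy * y / z).
    """
    x, y, z = mean_cam
    J = np.array([[fx / z, 0.0, -fx * x / z**2],
                  [0.0, fy / z, -fy * y / z**2]])
    return J @ cov3d @ J.T   # (2, 2) screen-space covariance

# An elongated Gaussian placed 4 units in front of the camera.
cov = np.diag([0.5, 0.05, 0.05])
print(project_gaussian_cov(cov, np.array([0.0, 0.0, 4.0]), fx=800.0, fy=800.0))
```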
Article
Object SLAM introduces the concept of objects into Simultaneous Localization and Mapping (SLAM) and helps understand indoor scenes for mobile robots and object-level interactive applications. State-of-the-art object SLAM systems face challenges such as partial observations, occlusions, and unobservable problems, which limit mapping accuracy and robustness. This letter proposes a novel monocular Semantic Object SLAM (SO-SLAM) system that introduces object spatial constraints. We explore three representative spatial constraints: a scale proportional constraint, a symmetrical texture constraint, and a plane supporting constraint. Based on these semantic constraints, we propose two new methods: a more robust object initialization method and an orientation fine optimization method. We have verified the performance of the algorithm on public datasets and an author-recorded mobile robot dataset, achieving a significant improvement in mapping quality. We will release the code.
Article
We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully connected (nonconvolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, ϕ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis.
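The volume rendering quadrature this describes is compact enough to sketch directly; the densities and colors below are placeholders standing in for the network's outputs.

```python
# Hedged sketch of the classic volume rendering quadrature used to composite
# per-sample densities and colors along a ray into a pixel color.
import numpy as np

def composite(sigmas, colors, ts):
    """Numerical quadrature of the volume rendering integral.

    sigmas: (N,) volume densities at the sampled depths.
    colors: (N, 3) view-dependent radiance at the samples.
    ts:     (N,) sample depths along the ray (ascending).
    Returns C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i.
    """
    deltas = np.diff(ts, append=ts[-1] + 1e10)      # last interval -> infinity
    alphas = 1.0 - np.exp(-sigmas * deltas)         # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # T_i
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)

ts = np.linspace(2.0, 6.0, 64)
sigmas = np.exp(-(ts - 4.0) ** 2 * 8.0) * 5.0       # a soft surface near t = 4
colors = np.tile([0.8, 0.3, 0.1], (64, 1))
print(composite(sigmas, colors, ts))
```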
Article
Herein, we present a novel approach for monocular dual quadric initialization that combines three-dimensional (3D) map points with two-dimensional (2D) object detection for forward-translating camera movements. The traditional approach using 2D detection bounding boxes in multiple views fails in straight vehicle motion scenarios, as object observation is limited to a few frames. Although single-image initialization is possible when multiple constraints are introduced, such initialization is based on strong assumptions. In this work, we incorporate constraints from 3D map points with single-view 2D object detection to robustly initialize the dual quadric. Constraints from 3D map points are converted to planar constraints from their convex hull. Together with the projective planar constraints from bounding boxes, the proposed method can infer accurate dual quadric parameters. Further, comparison studies with the state of the art (SOTA) show that the proposed approach achieves the same center localization accuracy but outperforms existing methods in shape estimation and initialization success ratio. The proposed method does not rely on assumptions about the dimensions and pose of 3D objects; hence, it is more generic and accurate. Based on the KITTI raw dataset, the initialization success ratio is up to 97.7% with an average position error of 1.58 m and a 2D IoU of 80% when the number of map points per object accumulates to 60. When applied to the TUM RGB-D dataset, the proposed approach yields an initialization success ratio of 92.7% when the number of map points per object accumulates to 30, a 16.2% increase compared with the SOTA using an RGB-D camera. Finally, we integrate the initialization method into a simultaneous localization and mapping system.
Article
Recent years have seen the emergence of very effective ConvNet-based object detectors that have reconfigured the computer vision landscape. As a consequence, new approaches have appeared that propose object-based reasoning to solve traditional problems such as camera pose estimation. In particular, these methods have shown that modelling 3D objects as ellipsoids and 2D detections as ellipses offers a convenient way to link 2D and 3D data. Following that promising direction, we propose a novel object-based pose estimation algorithm that requires no sensor other than an RGB camera. Our method operates from at least two object detections and is based on a new paradigm that makes it possible to decrease the degrees of freedom (DoF) of the pose estimation problem from six to three, while two simplifying yet realistic assumptions reduce the remaining DoF to only one. An exhaustive search is performed over the single unknown parameter to recover the full camera pose. Robust algorithms designed to deal with any number of objects, as well as a refinement step, are introduced. The effectiveness of the method has been assessed on the challenging T-LESS and Freiburg datasets.
Chapter
Recent approaches to visual scene understanding attempt to build a scene graph, a computational representation of objects and their pairwise relationships. Such a rich semantic representation is very appealing, yet difficult to obtain from a single image, especially when considering complex spatial arrangements in the scene. In contrast, an image sequence conveys useful information through the multi-view geometric relations arising from camera motion. Indeed, object relationships are naturally related to the 3D scene structure. To this end, this paper proposes a system that first computes the geometrical location of objects in a generic scene and then efficiently constructs scene graphs from video by embedding such geometrical reasoning. This compelling representation is obtained using a new model in which geometric and visual features are merged using an RNN framework. We report results on a dataset we created for the task of 3D scene graph generation in multiple views.
Chapter
Simultaneous Localization And Mapping (SLAM) is a fundamental problem in mobile robotics. While point-based SLAM methods provide accurate camera localization, the generated maps lack semantic information. On the other hand, state-of-the-art object detection methods provide rich information about entities present in the scene from a single image. This work marries the two and proposes a method for representing generic objects as quadrics, which allows object detections to be seamlessly integrated into a SLAM framework. For scene coverage, additional dominant planar structures are modeled as infinite planes. Experiments show that the proposed points-planes-quadrics representation can easily incorporate Manhattan and object affordance constraints, greatly improving camera localization and leading to semantically meaningful maps.
Article
In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-moment estimators requires memory equal to the number of parameters. For the case of neural network weight matrices, we propose maintaining only the per-row and per-column sums of these moving averages, and estimating the per-parameter second moments based on these sums. We demonstrate empirically that this method produces similar results to the baseline. Secondly, we show that adaptive methods can produce larger-than-desired updates when the decay rate of the second moment accumulator is too slow. We propose update clipping and a gradually increasing decay rate scheme as remedies. Combining these methods and dropping momentum, we achieve comparable results to the published Adam regime in training the Transformer model on the WMT 2014 English-German machine translation task, while using very little auxiliary storage in the optimizer. Finally, we propose scaling the parameter updates based on the scale of the parameters themselves.
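The factored estimator is simple to state: keep only row and column sums of the moving average of squared gradients and reconstruct per-parameter second moments from their outer product. Below is a single-step sketch of that idea; it is an illustration in the spirit of the method, not the full optimizer.

```python
# Sketch of the factored second-moment idea: store row and column moving
# averages of the squared gradient and rebuild a full-size estimate on the
# fly. Memory drops from n*m to n + m per weight matrix.
import numpy as np

def factored_second_moment(row_avg, col_avg, grad_sq, beta2=0.999):
    """Update row/column moving averages and return the rank-1 estimate.

    row_avg: (n,) running row sums; col_avg: (m,) running column sums.
    grad_sq: (n, m) squared gradient for this step.
    Per-parameter estimate: V_ij ~= R_i * C_j / sum(R).
    """
    row_avg = beta2 * row_avg + (1 - beta2) * grad_sq.sum(axis=1)
    col_avg = beta2 * col_avg + (1 - beta2) * grad_sq.sum(axis=0)
    v_hat = np.outer(row_avg, col_avg) / row_avg.sum()
    return row_avg, col_avg, v_hat

g = np.random.randn(4, 6)
r, c, v = factored_second_moment(np.zeros(4), np.zeros(6), g**2)
print(v.shape)   # (4, 6): full-size estimate rebuilt from 4 + 6 numbers
```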
Article
We present a benchmark for image-based 3D reconstruction. The benchmark sequences were acquired outside the lab, in realistic conditions. Ground-truth data was captured using an industrial laser scanner. The benchmark includes both outdoor scenes and indoor environments. High-resolution video sequences are provided as input, supporting the development of novel pipelines that take advantage of video input to increase reconstruction fidelity. We report the performance of many image-based 3D reconstruction pipelines on the new benchmark. The results point to exciting challenges and opportunities for future work.
Article
This paper addresses the computation of radiative exchange factors through Monte Carlo ray tracing, with the aim of reducing their computation time when dealing with the finite element method. Both direction and surface samplings are studied. The recently introduced isocell method for partitioning the unit disk is applied to the direction sampling and compared to other direction sampling methods. It is then combined with different surface sampling schemes, with either one or multiple rays traced per surface sampling point. Two promising approaches show better performance than standard spacecraft thermal analysis software. The first combines a Gauss surface sampling strategy with a local isocell direction sampling, whereas the second fires one ray per surface point using a global isocell direction sampling scheme. The advantages and limitations of the two methods are discussed, and they are benchmarked against a standard thermal analysis software using the entrance baffle of the Extreme Ultraviolet Imager instrument of the Solar Orbiter mission.
Article
The author (2) has shown that corresponding to each positive square matrix A (i.e. every a_ij > 0) there is a unique doubly stochastic matrix of the form D_1 A D_2, where the D_i are diagonal matrices with positive diagonals. This doubly stochastic matrix can be obtained as the limit of the iteration defined by alternately normalizing the rows and columns of A. In this paper, it is shown that, with the sacrifice of one diagonal D, it is still possible to obtain a stochastic matrix. Of course, it is necessary to modify the iteration somewhat. More precisely, it is shown that corresponding to each positive square matrix A there is a unique stochastic matrix of the form DAD, where D is a diagonal matrix with a positive diagonal. It is shown further how this stochastic matrix can be obtained as the limit of an iteration on A.
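The alternating normalization analyzed here is straightforward to state in code; a minimal sketch:

```python
# Sketch of Sinkhorn's iteration: for a strictly positive matrix A,
# alternately normalizing rows and columns converges to a doubly
# stochastic matrix of the form D1 A D2.
import numpy as np

def sinkhorn(A, n_iters=200):
    """Alternately normalize rows and columns of a positive matrix."""
    B = A.astype(float).copy()
    for _ in range(n_iters):
        B /= B.sum(axis=1, keepdims=True)   # make rows sum to 1
        B /= B.sum(axis=0, keepdims=True)   # make columns sum to 1
    return B

A = np.random.rand(5, 5) + 0.1             # strictly positive entries
B = sinkhorn(A)
print(B.sum(axis=0), B.sum(axis=1))        # both ~ all ones
```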
Article
Thesis (M.S.)--Dept. of Computer Science, University of Utah, 1988. Includes bibliographical references.
Conference Paper
An object recognition system has been developed that uses a new class of local image features. The features are invariant to image scaling, translation, and rotation, and partially invariant to illumination changes and affine or 3D projection. These features share similar properties with neurons in inferior temporal cortex that are used for object recognition in primate vision. Features are efficiently detected through a staged filtering approach that identifies stable points in scale space. Image keys are created that allow for local geometric deformations by representing blurred image gradients in multiple orientation planes and at multiple scales. The keys are used as input to a nearest-neighbor indexing method that identifies candidate object matches. Final verification of each match is achieved by finding a low-residual least squares solution for the unknown model parameters. Experimental results show that robust object recognition can be achieved in cluttered, partially occluded images with a computation time of under 2 seconds.
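A minimal usage sketch with OpenCV's SIFT implementation, in the spirit of the pipeline described above (the image paths are placeholders):

```python
# Illustrative sketch: detect and match scale/rotation-invariant local
# features with OpenCV's SIFT, then filter matches with Lowe's ratio test.
# 'model.png' and 'scene.png' are hypothetical input images.
import cv2

img1 = cv2.imread('model.png', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('scene.png', cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Nearest-neighbor matching with the ratio test to reject ambiguous keys.
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f'{len(good)} candidate correspondences')
```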
He, X., Sun, J., Wang, Y., Huang, D., Bao, H., Zhou, X.: OnePose++: Keypoint-free one-shot object pose estimation without CAD models. In: NeurIPS (2022)
Masset, L., Brüls, O., Kerschen, G.: Partition of the circle in cells of equal area and shape. Tech. rep., Structural Dynamics Research Group, Aerospace and Mechanical Engineering Department, Université de Liège, Institut de Mécanique et Génie Civil (B52/3) (2011)
Tancik, M., Srinivasan, P.P., Mildenhall, B., Fridovich-Keil, S., Raghavan, N., Singhal, U., Ramamoorthi, R., Barron, J.T., Ng, R.: Fourier features let networks learn high frequency functions in low dimensional domains. In: NeurIPS (2020)
Wang, A., Kortylewski, A., Yuille, A.: NeMo: Neural mesh models of contrastive features for robust 3D pose estimation. In: ICLR (2020)
Wang, A., Wang, P., Sun, J., Kortylewski, A., Yuille, A.: VoGE: A differentiable volume renderer using Gaussian ellipsoids for analysis-by-synthesis. In: ICLR (2022)