Conference Paper

IFFNeRF: Initialisation Free and Fast 6DoF pose estimation from a single image and a NeRF model

... The model IFFNeRF [57], published in 2024, is a real-time 6DoF camera pose estimation method that utilizes NeRF-based Metropolis-Hastings sampling and an attention-driven ray matching mechanism to estimate poses without an initial guess, demonstrating improved robustness and efficiency across synthetic and real-world datasets ...
... Moreover, the higher prevalence of single image models with RGB inputs suggests that these models strike a balance between performance and computational efficiency, making them suitable for real-time applications where speed is paramount. [Table 16, "Single Image Models and Input Types (Part III)": one row per model with its reported speed in fps and checkmarks for accepted input modalities, covering Pix2Pose [10], DeepIM [61], SO-Pose [23], PoET [43], and some sixty further models, including IFFNeRF [57] at 34 fps and SwinDePose [83].] ...
Article
Full-text available
Three-dimensional object recognition is crucial in modern applications, including robotics in manufacturing, household items, augmented and virtual reality, and autonomous driving. Extensive research and numerous surveys have been conducted in this field. This study aims to create a model selection guide by addressing the key questions that must be answered when selecting a 6D pose estimation model: inputs, modalities, real-time capabilities, hardware requirements, evaluation datasets, performance metrics, strengths, limitations, and special attributes such as symmetry or occlusion handling. By analyzing 84 models, including 62 new ones beyond previous surveys, and identifying 25 datasets, 14 of them newly introduced, we organized the results into comparison tables and standardized summarization templates. This structured approach facilitates easy model comparison and selection based on practical application needs. The focus of this study is on the practical aspects of utilizing 6D pose estimation models, providing a valuable resource for researchers and practitioners.
... These factors can limit their real-world applicability. Recently, IFFNeRF [6] inverted the NeRF model to re-render an image matching a target one. However, unlike our approach, it does not consider the specificities of 3DGS, which include ellipsoid elongation and rotation, and their non-uniform distribution across the scene surface. ...
... \(\frac{(ab)^{1.6} + (ac)^{1.6} + (bc)^{1.6}}{3}\) ...
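The fragment above appears to be Knud Thomsen's approximation for the surface area of an ellipsoid with semi-axes a, b, c (the exponent 1.6 is the giveaway), plausibly used to allocate rays in proportion to each ellipsoid's surface. A minimal sketch of that formula, under that assumption:

```python
import math

def ellipsoid_surface_area(a: float, b: float, c: float, p: float = 1.6) -> float:
    """Knud Thomsen's approximation: S ~ 4*pi*(((ab)^p + (ac)^p + (bc)^p) / 3)^(1/p).

    Relative error is about 1% with p ~ 1.6075; p = 1.6 matches the fragment above.
    """
    mean = ((a * b) ** p + (a * c) ** p + (b * c) ** p) / 3.0
    return 4.0 * math.pi * mean ** (1.0 / p)

# Sanity check: for a sphere (a = b = c = r) the formula is exact, 4*pi*r^2.
assert abs(ellipsoid_surface_area(1, 1, 1) - 4 * math.pi) < 1e-9
```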
Preprint
We propose 6DGS to estimate the camera pose of a target RGB image given a 3D Gaussian Splatting (3DGS) model representing the scene. 6DGS avoids the iterative process typical of analysis-by-synthesis methods (e.g. iNeRF), which also require an initialization of the camera pose in order to converge. Instead, our method estimates a 6DoF pose by inverting the 3DGS rendering process. Starting from the object surface, we define a radiant Ellicell that uniformly generates rays departing from each of the ellipsoids that parameterize the 3DGS model. Each Ellicell ray is associated with the rendering parameters of its ellipsoid, which are in turn used to obtain the best bindings between the target image pixels and the cast rays. These pixel-ray bindings are then ranked to select the best-scoring bundle of rays, whose intersection provides the camera center and, in turn, the camera rotation. The proposed solution obviates the necessity of an "a priori" pose for initialization, and it solves 6DoF pose estimation in closed form, without the need for iterations. Moreover, compared to the existing Novel View Synthesis (NVS) baselines for pose estimation, 6DGS can improve the overall average rotational accuracy by 12% and translation accuracy by 22% on real scenes, despite not requiring any initialization pose. At the same time, our method operates near real-time, reaching 15 fps on consumer hardware.
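One standard way to recover a camera center in closed form from a bundle of rays is a linear least-squares intersection; the sketch below shows that textbook construction, not necessarily the authors' exact formulation:

```python
import numpy as np

def least_squares_ray_intersection(origins: np.ndarray, dirs: np.ndarray) -> np.ndarray:
    """Closed-form point minimizing the sum of squared distances to a bundle of rays.

    origins: (N, 3) ray origins; dirs: (N, 3) ray directions (need not be unit).
    Solves sum_i (I - d_i d_i^T) (x - o_i) = 0 for x.
    """
    d = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    # Projector orthogonal to each ray direction: I - d d^T, stacked as (N, 3, 3).
    P = np.eye(3)[None] - d[:, :, None] * d[:, None, :]
    A = P.sum(axis=0)
    b = (P @ origins[:, :, None]).sum(axis=0).squeeze(-1)
    return np.linalg.solve(A, b)

# Toy check: rays aimed at (1, 2, 3) from random origins recover that point.
rng = np.random.default_rng(0)
target = np.array([1.0, 2.0, 3.0])
o = rng.normal(size=(20, 3))
print(least_squares_ray_intersection(o, target - o))  # ~ [1. 2. 3.]
```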
... CROSSFIRE [17] incorporates learned local features to mitigate local minima but still relies on accurate initial pose priors. IFFNeRF [18] proposes NeRF model inversion to re-render images matching a target view but overlooks unique 3DGS characteristics, such as ellipsoid elongation, rotation, and non-uniform spatial distribution, which our approach effectively addresses. [13] pioneers LiDAR-camera fused 3DGS mapping using KD-trees and 2D voxel grids, employing NCC for coarse alignment and PnP for pose refinement. ...
Preprint
Full-text available
6-DoF pose estimation is a fundamental task in computer vision with wide-ranging applications in augmented reality and robotics. Existing single RGB-based methods often compromise accuracy due to their reliance on initial pose estimates and susceptibility to rotational ambiguity, while approaches requiring depth sensors or multi-view setups incur significant deployment costs. To address these limitations, we introduce SplatPose, a novel framework that synergizes 3D Gaussian Splatting (3DGS) with a dual-branch neural architecture to achieve high-precision pose estimation using only a single RGB image. Central to our approach is the Dual-Attention Ray Scoring Network (DARS-Net), which innovatively decouples positional and angular alignment through geometry-domain attention mechanisms, explicitly modeling directional dependencies to mitigate rotational ambiguity. Additionally, a coarse-to-fine optimization pipeline progressively refines pose estimates by aligning dense 2D features between query images and 3DGS-synthesized views, effectively correcting feature misalignment and depth errors from sparse ray sampling. Experiments on three benchmark datasets demonstrate that SplatPose achieves state-of-the-art 6-DoF pose estimation accuracy in single RGB settings, rivaling approaches that depend on depth or multi-view images.
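As a loose illustration only (the published DARS-Net architecture is not reproduced here), decoupling positional and angular alignment could look like two separate attention-style scores over candidate rays that are fused afterwards; every name below is hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decoupled_ray_scores(q_pos, q_dir, k_pos, k_dir):
    """Score candidate rays against an image query, with positional and angular
    alignment scored by separate attention-style heads and then fused.

    q_pos, q_dir: (D,) query embeddings; k_pos, k_dir: (N, D) per-ray embeddings.
    """
    d = q_pos.shape[-1]
    s_pos = softmax(k_pos @ q_pos / np.sqrt(d))  # where the ray starts
    s_dir = softmax(k_dir @ q_dir / np.sqrt(d))  # which way it points
    return s_pos * s_dir                          # fused per-ray score

rng = np.random.default_rng(0)
q_pos, q_dir = rng.normal(size=64), rng.normal(size=64)
k_pos, k_dir = rng.normal(size=(100, 64)), rng.normal(size=(100, 64))
print(decoupled_ray_scores(q_pos, q_dir, k_pos, k_dir).argmax())
```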
... Various works that apply NFs for localization and pose estimation utilize NeRF's internal features to establish 2D-3D correspondences [111,112], remove the need for an initial pose estimate [113], augment the training set of the pose regressor with a few-shot NeRF [114], or apply a decoupled representation of pose along with an edge-based sampling strategy to enhance the learning signal [115]. They also address dynamic scenes by integrating geometric motion and segmentation for initial pose estimation, combined with static ray sampling to speed up view synthesis [116]. ...
Preprint
Full-text available
Neural Fields have emerged as a transformative approach for 3D scene representation in computer vision and robotics, enabling accurate inference of geometry, 3D semantics, and dynamics from posed 2D data. Leveraging differentiable rendering, Neural Fields encompass both continuous implicit and explicit neural representations enabling high-fidelity 3D reconstruction, integration of multi-modal sensor data, and generation of novel viewpoints. This survey explores their applications in robotics, emphasizing their potential to enhance perception, planning, and control. Their compactness, memory efficiency, and differentiability, along with seamless integration with foundation and generative models, make them ideal for real-time applications, improving robot adaptability and decision-making. This paper provides a thorough review of Neural Fields in robotics, categorizing applications across various domains and evaluating their strengths and limitations, based on over 200 papers. First, we present four key Neural Fields frameworks: Occupancy Networks, Signed Distance Fields, Neural Radiance Fields, and Gaussian Splatting. Second, we detail Neural Fields' applications in five major robotics domains: pose estimation, manipulation, navigation, physics, and autonomous driving, highlighting key works and discussing takeaways and open challenges. Finally, we outline the current limitations of Neural Fields in robotics and propose promising directions for future research. Project page: https://robonerf.github.io
Article
6D pose estimation from a monocular RGB sensor is essential for robotic assembly. While deep learning approaches leverage priors from synthetic and labeled real-world data to estimate 6D poses, their generalizability is often constrained by the limited scale and realism of training datasets. Moreover, scale ambiguity is an inherent issue when image datasets are captured without calibration or depth sensing. To address these limitations, we propose Phys-Field, a physics-aware neural surface framework for joint 6D pose and scale estimation in robotic assembly without requiring 3D models or depth sensing. We incorporate an isotropic scale factor into the 6D pose estimation process, formulating the task as an iterative optimization problem within the Sim(3) group. The framework integrates multiple differentiable neural surfaces to estimate corresponding Sim(3) transformations by minimizing photometric differences between rendered and observed images through rendering inversion. To achieve conflict-free compositional rendering and efficient collision detection, we introduce deep convex decomposition. Concurrently, Phys-Field leverages dynamic simulations with physics parameters derived from large language models to evaluate the plausibility of the estimated Sim(3) transformations, effectively resolving scale ambiguity and accelerating convergence. We validate our framework’s effectiveness in a real-world assembly task, with experimental results showing notable improvements in reconstruction fidelity, Sim(3) transformation estimation accuracy, and assembly success rate.
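For concreteness, a Sim(3) element acts on points as x ↦ sRx + t, and rendering inversion minimizes a photometric difference under that action. A minimal sketch, with `render_fn` standing in for a hypothetical differentiable renderer:

```python
import numpy as np

def apply_sim3(points: np.ndarray, s: float, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Sim(3) action on (N, 3) points: x -> s * R @ x + t (isotropic scale + rigid motion)."""
    return s * points @ R.T + t

def photometric_loss(render_fn, points, s, R, t, target_image):
    """Rendering-inversion objective: pixel difference between the model rendered
    under the current Sim(3) estimate and the observed image.
    `render_fn` is a placeholder for a differentiable renderer (hypothetical)."""
    return np.mean((render_fn(apply_sim3(points, s, R, t)) - target_image) ** 2)
```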
Article
Full-text available
The 6D pose estimation of an object from an image is a central problem in many domains of Computer Vision (CV) and researchers have struggled with this issue for several years. Traditional pose estimation methods (1) leveraged geometrical approaches, exploiting manually annotated local features, or (2) relied on 2D object representations from different points of view and their comparisons with the original image. The two methods mentioned above are also known as Feature-based and Template-based, respectively. With the diffusion of Deep Learning (DL), new Learning-based strategies have been introduced to achieve the 6D pose estimation, improving traditional methods by involving Convolutional Neural Networks (CNN). This review analyzed techniques belonging to different research fields and classified them into three main categories: Template-based methods, Feature-based methods, and Learning-based methods. In recent years, the research mainly focused on Learning-based methods, which allow the training of a neural network tailored for a specific task. For this reason, most of the analyzed methods belong to this category, and they have been in turn classified into three sub-categories: Bounding box prediction and Perspective-n-Point (PnP) algorithm-based methods, Classification-based methods, and Regression-based methods. This review aims to provide a general overview of the latest 6D pose recovery methods to underline the pros and cons and highlight the best-performing techniques for each group. The main goal is to supply the readers with helpful guidelines for implementing high-performing applications even under challenging circumstances such as auto-occlusions, symmetries, occlusions between multiple objects, and bad lighting conditions.
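For the PnP-based sub-category, the final step typically feeds predicted 2D keypoints (e.g., projected bounding-box corners) and their known 3D counterparts to a PnP solver. A self-contained OpenCV sketch with synthetic correspondences (the model, intrinsics, and pose below are placeholders):

```python
import numpy as np
import cv2

# 3D model points in the object frame: corners of a unit cube (placeholder model).
object_points = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)],
                         dtype=np.float64)
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])  # pinhole intrinsics

# Synthetic "detections": project the corners with a known ground-truth pose.
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.2, -0.1, 3.0])
image_points, _ = cv2.projectPoints(object_points, rvec_gt, tvec_gt, K, None)

# Recover the 6D pose from the 2D-3D correspondences.
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, None)
print(ok, rvec.ravel(), tvec.ravel())  # ~ rvec_gt, tvec_gt
```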
Conference Paper
Full-text available
This document explains how to mesh the hemisphere with equal view factor elements. The main characteristic of the method is the definition of elements delimited by the two classical spherical coordinates (polar and azimuth angles), similar to the geographical longitude and latitude. This choice is very convenient for identifying the localization of the elements on the sphere; it also greatly simplifies the determination of rays for either deterministic or stratified-sampled Monte Carlo ray tracing. The generation of the mesh is very fast and consequently well suited for ray tracing methods. A spatially well-distributed set of rays is fundamental to the reliability of the whole process.
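The equal view factor constraint has a convenient closed form: the cumulative view factor of a polar cap of half-angle θ (for a diffuse emitter at the hemisphere's base) is sin²θ, so cutting the polar angle at θᵢ = arcsin(√(i/N)) gives N bands of equal view factor. A sketch under that assumption (the paper's actual meshing may, e.g., vary the azimuth count per band):

```python
import math

def equal_view_factor_cells(n_bands: int, cells_per_band: int):
    """Mesh the hemisphere into elements bounded by polar/azimuth angles so that
    every element subtends the same diffuse view factor.

    The cumulative view factor of a polar cap of half-angle theta is sin^2(theta),
    so theta_i = arcsin(sqrt(i / n_bands)) yields equal-view-factor bands.
    """
    cells = []
    for i in range(n_bands):
        th0 = math.asin(math.sqrt(i / n_bands))
        th1 = math.asin(math.sqrt((i + 1) / n_bands))
        dphi = 2.0 * math.pi / cells_per_band
        for j in range(cells_per_band):
            cells.append((th0, th1, j * dphi, (j + 1) * dphi))
    return cells  # (theta_min, theta_max, phi_min, phi_max) per element
```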
Article
Neural Radiance Fields (NeRF) is a popular view synthesis technique that represents a scene as a continuous volumetric function, parameterized by multilayer perceptrons that provide the volume density and view-dependent emitted radiance at each location. While NeRF-based techniques excel at representing fine geometric structures with smoothly varying view-dependent appearance, they often fail to accurately capture and reproduce the appearance of glossy surfaces. We address this limitation by introducing Ref-NeRF, which replaces NeRF's parameterization of view-dependent outgoing radiance with a representation of reflected radiance and structures this function using a collection of spatially-varying scene properties. We show that together with a regularizer on normal vectors, our model significantly improves the realism and accuracy of specular reflections. Furthermore, we show that our model's internal representation of outgoing radiance is interpretable and useful for scene editing.
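The core reparameterization is the reflection of the view direction about the surface normal, ω_r = 2(ω_o·n)n − ω_o; a minimal sketch:

```python
import numpy as np

def reflect(w_o: np.ndarray, n: np.ndarray) -> np.ndarray:
    """Reflect the (unit) outgoing view direction w_o about the (unit) normal n:
    w_r = 2 (w_o . n) n - w_o. Ref-NeRF conditions its directional MLP on this
    reflected direction (plus scene properties) instead of the raw view direction."""
    return 2.0 * np.dot(w_o, n) * n - w_o
```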
Chapter
NeRF aims to learn a continuous neural scene representation by using a finite set of input images taken from various viewpoints. A well-known limitation of NeRF methods is their reliance on data: the fewer the viewpoints, the higher the likelihood of overfitting. This paper addresses this issue by introducing a novel method to generate geometrically consistent image transitions between viewpoints using View Morphing. Our VM-NeRF approach requires no prior knowledge about the scene structure, as View Morphing is based on the fundamental principles of projective geometry. VM-NeRF tightly integrates this geometric view generation process during the training procedure of standard NeRF approaches. Notably, our method significantly improves novel view synthesis, particularly when only a few views are available. Experimental evaluation reveals consistent improvement over current methods that handle sparse viewpoints in NeRF models. We report an increase in PSNR of up to 1.8 dB and 1.0 dB when training uses eight and four views, respectively. Source code: https://github.com/mbortolon97/VM-NeRF.
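The key View Morphing fact VM-NeRF builds on is that, for rectified (parallel) views, linearly interpolating corresponding image points produces a geometrically valid in-between perspective view; general viewpoints additionally need prewarp/postwarp homographies, omitted in this sketch:

```python
import numpy as np

def morph_points(x0: np.ndarray, x1: np.ndarray, s: float) -> np.ndarray:
    """Linear morph of corresponding image points between two *rectified* (parallel)
    views; per Seitz & Dyer, this equals the projection of the scene from a camera
    whose center is linearly interpolated between the two originals.
    x0, x1: (N, 2) matched pixel coordinates; s in [0, 1]."""
    return (1.0 - s) * x0 + s * x1
```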
Chapter
We present TensoRF, a novel approach to model and reconstruct radiance fields. Unlike NeRF that purely uses MLPs, we model the radiance field of a scene as a 4D tensor, which represents a 3D voxel grid with per-voxel multi-channel features. Our central idea is to factorize the 4D scene tensor into multiple compact low-rank tensor components. We demonstrate that applying traditional CANDECOMP/PARAFAC (CP) decomposition – that factorizes tensors into rank-one components with compact vectors – in our framework leads to improvements over vanilla NeRF. To further boost performance, we introduce a novel vector-matrix (VM) decomposition that relaxes the low-rank constraints for two modes of a tensor and factorizes tensors into compact vector and matrix factors. Beyond superior rendering quality, our models with CP and VM decompositions lead to a significantly lower memory footprint in comparison to previous and concurrent works that directly optimize per-voxel features. Experimentally, we demonstrate that TensoRF with CP decomposition achieves fast reconstruction (<30 min) with better rendering quality and even a smaller model size (<4 MB) compared to NeRF. Moreover, TensoRF with VM decomposition further boosts rendering quality and outperforms previous state-of-the-art methods, while reducing the reconstruction time (<10 min) and retaining a compact model size (<75 MB).
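As a sketch of the CP idea (shown per feature channel; TensoRF's scene tensor is 4D with an extra channel mode), a rank-R decomposition reconstructs the grid as a sum of vector outer products:

```python
import numpy as np

def cp_reconstruct(v1: np.ndarray, v2: np.ndarray, v3: np.ndarray) -> np.ndarray:
    """Reconstruct a 3D grid from a rank-R CP decomposition:
    T[i, j, k] = sum_r v1[r, i] * v2[r, j] * v3[r, k].
    v1: (R, X), v2: (R, Y), v3: (R, Z) -> grid of shape (X, Y, Z)."""
    return np.einsum('ri,rj,rk->ijk', v1, v2, v3)

# A rank-4 factorization of a 64^3 grid stores 3*4*64 numbers instead of 64^3.
R, N = 4, 64
grid = cp_reconstruct(*(np.random.rand(R, N) for _ in range(3)))
print(grid.shape)  # (64, 64, 64)
```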
Article
Neural graphics primitives, parameterized by fully connected neural networks, can be costly to train and evaluate. We reduce this cost with a versatile new input encoding that permits the use of a smaller network without sacrificing quality, thus significantly reducing the number of floating point and memory access operations: a small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through stochastic gradient descent. The multiresolution structure allows the network to disambiguate hash collisions, making for a simple architecture that is trivial to parallelize on modern GPUs. We leverage this parallelism by implementing the whole system using fully-fused CUDA kernels with a focus on minimizing wasted bandwidth and compute operations. We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds, and rendering in tens of milliseconds at a resolution of 1920×1080.
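The spatial hash used at each resolution level is h(x) = (⊕ᵢ xᵢπᵢ) mod T with π₁ = 1, π₂ = 2654435761, π₃ = 805459861; the sketch below shows only this lookup index (each level additionally gathers the 8 surrounding grid corners and interpolates their feature vectors trilinearly):

```python
import numpy as np

PRIMES = np.array([1, 2654435761, 805459861], dtype=np.uint64)

def spatial_hash(coords: np.ndarray, table_size: int) -> np.ndarray:
    """Instant-NGP style spatial hash of integer grid coordinates:
    h(x) = (x_0 * pi_0  XOR  x_1 * pi_1  XOR  x_2 * pi_2) mod T.
    coords: (N, 3) non-negative integer voxel corners -> (N,) table indices.
    uint64 multiplication wraps modulo 2^64, which is exactly what the hash wants."""
    c = coords.astype(np.uint64) * PRIMES
    return (c[:, 0] ^ c[:, 1] ^ c[:, 2]) % np.uint64(table_size)
```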
Chapter
We would like robots to achieve purposeful manipulation by placing any instance from a category of objects into a desired set of goal states. Existing manipulation pipelines typically specify the desired configuration as a target 6-DOF pose and rely on explicitly estimating the pose of the manipulated objects. However, representing an object with a parameterized transformation defined on a fixed template cannot capture large intra-category shape variation, and specifying a target pose at a category level can be physically infeasible or fail to accomplish the task – e.g. knowing the pose and size of a coffee mug relative to some canonical mug is not sufficient to successfully hang it on a rack by its handle. Hence we propose a novel formulation of category-level manipulation that uses semantic 3D keypoints as the object representation. This keypoint representation enables a simple and interpretable specification of the manipulation target as geometric costs and constraints on the keypoints, which flexibly generalizes existing pose-based manipulation methods. Using this formulation, we factor the manipulation policy into instance segmentation, 3D keypoint detection, optimization-based robot action planning and local dense-geometry-based action execution. This factorization allows us to leverage advances in these sub-problems and combine them into a general and effective perception-to-action manipulation pipeline. Our pipeline is robust to large intra-category shape variation and topology changes as the keypoint representation ignores task-irrelevant geometric details. Extensive hardware experiments demonstrate our method can reliably accomplish tasks with never-before-seen objects in a category, such as placing shoes and mugs with significant shape variation into category-level target configurations. The video, supplementary material and source code are available on our project page https://sites.google.com/view/kpam.
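When the keypoint costs reduce to plain squared distances between detected and goal keypoints, the optimal rigid action has the classic closed form (Arun/Kabsch); kPAM's general costs and constraints go through an optimizer instead, so this is only the simplest special case:

```python
import numpy as np

def rigid_align(src: np.ndarray, dst: np.ndarray):
    """Closed-form R, t minimizing sum_i || R @ src_i + t - dst_i ||^2 (Arun/Kabsch).
    src, dst: (N, 3) corresponding 3D keypoints (detected vs. goal)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)          # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])  # guard against reflections
    R = Vt.T @ D @ U.T
    return R, mu_d - R @ mu_s
```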
Article
We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully connected (nonconvolutional) deep network, whose input is a single continuous 5D coordinate (spatial location ( x , y , z ) and viewing direction ( θ, ϕ )) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis.
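The quadrature behind this projection step is the standard one: C = Σᵢ Tᵢ(1 − e^{−σᵢδᵢ})cᵢ with transmittance Tᵢ = e^{−Σ_{j<i} σⱼδⱼ}. A minimal per-ray sketch:

```python
import numpy as np

def render_ray(sigma: np.ndarray, rgb: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """Numerical quadrature of the volume rendering integral along one ray:
    C = sum_i T_i * (1 - exp(-sigma_i * delta_i)) * c_i,
    with T_i = exp(-sum_{j<i} sigma_j * delta_j).
    sigma: (N,) densities, rgb: (N, 3) colors, deltas: (N,) sample spacings."""
    alpha = 1.0 - np.exp(-sigma * deltas)                              # per-sample opacity
    T = np.exp(-np.concatenate([[0.0], np.cumsum(sigma * deltas)[:-1]]))  # exclusive cumsum
    weights = T * alpha
    return (weights[:, None] * rgb).sum(axis=0)
```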
Chapter
We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, ϕ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons.
Article
We present a benchmark for image-based 3D reconstruction. The benchmark sequences were acquired outside the lab, in realistic conditions. Ground-truth data was captured using an industrial laser scanner. The benchmark includes both outdoor scenes and indoor environments. High-resolution video sequences are provided as input, supporting the development of novel pipelines that take advantage of video input to increase reconstruction fidelity. We report the performance of many image-based 3D reconstruction pipelines on the new benchmark. The results point to exciting challenges and opportunities for future work.
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
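The building block is scaled dot-product attention, softmax(QKᵀ/√d_k)V; a minimal single-head sketch:

```python
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Q: (n, d_k) queries, K: (m, d_k) keys, V: (m, d_v) values -> (n, d_v)."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # attention weights per query
    return w @ V
```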
Article
This paper addresses the computation of radiative exchange factors through Monte Carlo ray tracing, with the aim of reducing their computation time when dealing with the finite element method. Both direction and surface samplings are studied. The recently introduced isocell method for partitioning the unit disk is applied to the direction sampling and compared to other direction sampling methods. It is then combined with different surface sampling schemes, with either one or multiple rays traced per surface sampling point. Two promising approaches perform better than standard spacecraft thermal analysis software. The first approach combines a Gauss surface sampling strategy with a local isocell direction sampling, whereas the second fires one ray per surface point using a global isocell direction sampling scheme. The advantages and limitations of the two methods are discussed, and they are benchmarked against standard thermal analysis software using the entrance baffle of the Extreme Ultraviolet Imager instrument of the Solar Orbiter mission.
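A sketch of the isocell idea under two assumptions (three central cells, i.e. n0 = 3, and one ray through each cell centre): rings of the unit disk hold cell counts proportional to 2i + 1 so that all cells have equal area, and lifting the disk points onto the hemisphere (the Nusselt analog) yields cosine-weighted directions for diffuse exchange factors:

```python
import math

def isocell_directions(n_rings: int, n0: int = 3):
    """One direction per cell of an isocell-style partition of the unit disk:
    ring i spans radii [i, i+1] / n_rings and holds n0 * (2i + 1) cells, so every
    cell has area pi / (n0 * n_rings^2). Disk points lifted to the hemisphere
    give cosine-weighted ray directions."""
    dirs = []
    for i in range(n_rings):
        r = math.sqrt((i**2 + (i + 1) ** 2) / 2.0) / n_rings  # equal-area mid radius
        n_cells = n0 * (2 * i + 1)
        for j in range(n_cells):
            phi = 2.0 * math.pi * (j + 0.5) / n_cells
            x, y = r * math.cos(phi), r * math.sin(phi)
            dirs.append((x, y, math.sqrt(1.0 - r * r)))       # lift onto hemisphere
    return dirs
```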
Article
A general method, suitable for fast computing machines, for investigating such properties as equations of state for substances consisting of interacting individual molecules is described. The method consists of a modified Monte Carlo integration over configuration space. Results for the two-dimensional rigid-sphere system have been obtained on the Los Alamos MANIAC and are presented here. These results are compared to the free volume equation of state and to a four-term virial coefficient expansion.
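The "modified Monte Carlo integration" described here is what is now called the Metropolis algorithm: propose a symmetric random move and accept it with probability min(1, p(x′)/p(x)). A minimal sketch on a toy target:

```python
import numpy as np

def metropolis(log_prob, x0, n_steps: int, step: float = 0.5, seed: int = 0):
    """Classic Metropolis sampling: propose a symmetric Gaussian move and accept
    it with probability min(1, p(x') / p(x)); otherwise keep the current state."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_prob(x)
    samples = []
    for _ in range(n_steps):
        prop = x + rng.normal(scale=step, size=x.shape)
        lp_prop = log_prob(prop)
        if np.log(rng.random()) < lp_prop - lp:  # accept/reject in log space
            x, lp = prop, lp_prop
        samples.append(x.copy())
    return np.array(samples)

# Example: sample a 2D standard Gaussian.
chain = metropolis(lambda x: -0.5 * np.dot(x, x), np.zeros(2), 10_000)
print(chain.mean(axis=0), chain.std(axis=0))  # ~ [0, 0], ~ [1, 1]
```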
Thesis
Thesis (M.S.)--Dept. of Computer Science, University of Utah, 1988. Includes bibliographical references.
Partition of the circle in cells of equal area and shape
  • L Masset
  • O Bruls
  • G Kerschen
Fourier features let networks learn high frequency functions in low dimensional domains
  • M Tancik
  • P P Srinivasan
  • B Mildenhall
  • S Fridovich-Keil
  • N Raghavan
  • U Singhal
DINOv2: Learning robust visual features without supervision
  • M Oquab
  • T Darcet
  • T Moutakanni
  • H Vo
  • M Szafraniec
  • V Khalidov
  • P Fernandez
  • D Haziza
  • F Massa
  • A El-Nouby
  • M Assran
  • N Ballas
  • W Galuba
  • R Howes
  • P.-Y Huang
  • S.-W Li
  • I Misra
  • M Rabbat
  • V Sharma
  • G Synnaeve
  • H Xu
  • H Jegou
  • J Mairal
  • P Labatut
  • A Joulin
  • P Bojanowski
Neural sparse voxel fields
  • L Liu
  • J Gu
  • K Z Lin
  • T.-S Chua
  • C Theobalt