OMNI-NERF: NEURAL RADIANCE FIELD FROM 360° IMAGE CAPTURES
Kai Gu, Thomas Maugey, Sebastian Knorr, and Christine Guillemot
INRIA Rennes - Bretagne Atlantique, France;
Ernst-Abbe University of Applied Sciences Jena, Germany; Technical University of Berlin, Germany.
{firstname.lastname}@inria.fr; {firstname.lastname}@eah-jena.de
ABSTRACT
This paper tackles the problem of novel view synthesis (NVS)
from 360° images with imperfect camera poses or intrinsic
parameters. We propose a novel end-to-end framework for
training Neural Radiance Field (NeRF) models given only
360° RGB images and their rough poses, which we refer to as
Omni-NeRF. We extend the pinhole camera model of NeRF to
a more general camera model that better fits omni-directional
fish-eye lenses. The approach jointly learns the scene geome-
try and optimizes the camera parameters without knowing the
fisheye projection. Link to the code: https://gitlab.inria.fr/kagu/omni-nerf
Index Terms: Light field, omni-directional imaging,
rendering, view synthesis, deep learning
1. INTRODUCTION
The concept of 6 degrees of freedom (6DOF) video content
has recently emerged with the goal of enabling immersive ex-
perience in terms of free roaming, i.e. allowing viewing the
scene from any viewpoint and direction in space [1]. How-
ever, no such real-life full 6DOF light field capturing solu-
tion exists so far. Light field cameras have been designed to
record orientations of light rays, hence to sample the plenop-
tic function in all directions, thus enabling view synthesis for
perspective shift and scene navigation [14]. Several camera
designs have been proposed for capturing light fields, going
from uniform arrays of pinholes placed in front of the sensor
[4] to arrays of micro-lenses placed between the main lens and
the sensor [11], arrays of cameras [17], and coded attenuation
masks [9]. However, these light field cameras have a limited
field of view. On the other hand, omni-directional cameras
allow capturing a panoramic scene with a 360° field of view
but do not record information on the orientation of light rays
emitted by the scene.
Neural Radiance Fields (NeRF) have been introduced in
[10] as an implicit scene representation that allows render-
ing all light field views with high quality. NeRF models
the scene as a continuous function, and is parameterized as
a multi-layer perceptron (MLP). The function represents the
mapping from the 5D spatial and angular coordinates of
light rays emitted by the scene to its three RGB color com-
ponents and a volume density measure. NeRF is capable of
modeling complex large-scale, and even unbounded, scenes.
With a proper parameterization of the coordinates and a well-
designed foreground-background architecture, NeRF++ [18]
is capable of modeling scenes having a large depth, with sat-
isfying resolution in both the near and far fields.
The first NeRF-based models need the camera pose param-
eters to map the pixel coordinates to the ray directions, and
use COLMAP [13, 12] for the camera parameter estimation.
An end-to-end framework, called NeRF–, is proposed in [16]
for training NeRF models without pre-computed camera pa-
rameters. The intrinsic and extrinsic camera parameters are
automatically discovered via joint optimisation during train-
ing of the NeRF model. The authors in [5] consider more
complex non-linear camera models and propose a new geo-
metric loss function to jointly learn the geometry of the scene
and the camera parameters. Solutions are also proposed in [2]
and [3] to enable faster inference as well as using a sparser set
of input views, and to enable generalization to new scenes.
Our motivation here is to be able to capture or reconstruct
light fields with a very large field of view, in particular 360°.
We focus on the question: how do we extract omni-directional
information and potentially benefit from it when reconstruct-
ing a spherical light field of a large-scale scene with a non-
converged camera setup?
To address this question, we propose a method for learn-
ing a 360° neural radiance field from omni-directional images
captured with fisheye cameras. We adapt the pinhole cam-
era model of NeRF to a general camera model that fits omni-
directional fisheye lenses. Finally, we extend our approach
to allow accurate camera parameter estimation (intrinsic and
extrinsic) for omni-directional fisheye lenses. To evaluate our
approach, we render photo-realistic panoramic fisheye views
from two Blender scenes: an indoor "Classroom" scene and a
larger-scale scene, "Lone Monk", of an atrium surrounded by
buildings. With these datasets, we first evaluate how spherical
sampling improves the performance of view synthesis com-
pared to planar sampling. We prove that our model can learn
the fisheye distortion from scratch with ground truth camera
pose. We also assess the camera extrinsic parameter esti-
mation, with a noisy initialization which simulates the case
where the camera parameters are imperfect or measured with
error. Finally, we train and evaluate our model on two real
scenes captured by fisheye cameras.

Fig. 1. Space sampling with 360° cameras (fisheye camera pairs) in a spatial grid: (a) 360° fisheye images, (b) spatial camera grid.
2. NERF: BACKGROUND
In this section we briefly introduce the method of synthesiz-
ing novel views of complex static scenes by representing the
scenes as Neural Radiance Fields (NeRF) [10].
In NeRF, a scene is represented with a Multi-Layer Per-
ceptron (MLP). Given the 3D spatial coordinate location X = (x, y, z) of a voxel and the viewing direction d = (θ, ϕ) in a scene, the MLP predicts the volume density σ and the view-dependent emitted radiance or color c = [R, G, B] at that spatial location for volumetric rendering. In the concrete implementation, X is first fed into the MLP, which outputs σ and intermediate features; these are then fed into an additional fully connected layer to predict the color c. The volume density is hence determined by the spatial location X alone, while the color c is determined by both X and the viewing direction d. The network can thus yield different illumination effects for different viewing directions. This procedure can be formulated as
$[R, G, B, \sigma] = F_\Theta(x, y, z, \theta, \phi),$  (1)
with Θ = {W_i, b_i} being the weights and biases of the MLP. The
network is trained by fitting the rendered (synthesized) views
with the reference (ground-truth) views via the minimization
of the total squared error between the rendered and true pixel
colors in the different views. NeRF uses a rendering function
based on classical volume rendering [6]. Rendering a view
from NeRF requires calculating the expected color C(r) by integrating the accumulated transmittance T along the camera ray r(t) = o + t d, with o being the origin of the ray cast from the desired virtual camera, and with near and far bounds t_n and t_f. This integral can be expressed as
$C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt,$  (2)
where $T(t) = \exp\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right)$. In practice, the integral is numerically approximated by sampling discretely along the ray.
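For illustration, this discrete approximation amounts to the standard alpha-compositing quadrature used in NeRF; below is a minimal NumPy sketch (variable names are ours, and the per-sample densities and colors are assumed to have already been predicted by the MLP):

```python
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Approximate Eq. (2) by alpha-compositing discrete samples.

    sigmas : (N,)   volume densities predicted along the ray
    colors : (N, 3) RGB values predicted along the ray
    t_vals : (N,)   sample positions between the near and far bounds
    """
    # Distances between adjacent samples; the last bin is treated as unbounded.
    deltas = np.diff(t_vals, append=1e10)
    # Segment opacity: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Accumulated transmittance: T_i = prod_{j < i} (1 - alpha_j)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1] + 1e-10]))
    weights = trans * alphas
    # Expected color C(r) = sum_i w_i c_i
    return (weights[:, None] * colors).sum(axis=0)

# Toy usage: 128 samples between assumed near and far bounds.
t = np.linspace(2.0, 6.0, 128)
rgb = render_ray(np.random.rand(128), np.random.rand(128, 3), t)
```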
While the original NeRF in [10] requires the knowledge of
the camera pose parameters, a pose-free solution has been in-
troduced in [16] which estimates intrinsic and extrinsic cam-
era parameters while training the NeRF model.
Assuming a pinhole camera model, the camera parameters can be expressed with the camera projection matrix
$P = K R\,[I \mid -t].$  (3)
I is the 3 × 3 identity matrix and K is the camera calibration matrix
$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix},$  (4)
with the intrinsic camera parameters f_x and f_y as the focal length, which are identical for square pixels, and c_x and c_y as the offsets of the principal point from the top-left corner of the image. The extrinsic camera parameters in Eq. (3) contain the rotation and translation of the camera with respect to a 3D world coordinate system and are expressed by the 3 × 3 rotation matrix R and the 3 × 1 translation vector t.
When these camera parameters are unknown, they need to
be estimated. For camera rotation, the axis-angle representa-
tion is adopted in NeRF–:
$\Phi := \alpha\,\omega, \quad \Phi \in \mathbb{R}^3,$  (5)
where α is the rotation angle and ω is the unit vector representing the rotation axis. Φ_i can be converted to the rotation matrix R using Rodrigues' formula [16]. With such a parametrization, the i-th camera extrinsics can be optimized by searching for the parameters Φ_i and t_i. To render the color of the m-th pixel p_{i,m} = (u, v) from the i-th camera, we cast the ray r_{i,m} from the image plane as
$r_{i,m}(t) = o + t\,d,$  (6)
where
$d = R_i^{-1} \begin{bmatrix} (u - c_x)/f_x \\ (v - c_y)/f_y \\ 1 \end{bmatrix}$  (7)
and o = t_i, using the current estimate of the camera parameters
$\pi_i = (f_x, f_y, c_x, c_y, \Phi_i, t_i).$  (8)
Then, the calculated coordinate and direction of the sam-
pled ray are fed into the NeRF model. The current estimated
color value of the pixel is rendered by Eq. (2) and can be
compared with the ground-truth color value via the squared error. Finally, the parameters Θ of the NeRF model and the camera parameters π_i are optimized jointly by min-
imising the photometric loss as described in [16] and [7].
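As a reference for Eqs. (5)-(8), the following is a minimal NumPy sketch of the axis-angle conversion (Rodrigues' formula) and the pixel-to-ray mapping under the pinhole model; the function names and parameter packing are illustrative and not taken from the authors' implementation:

```python
import numpy as np

def rodrigues(phi):
    """Axis-angle vector Phi = alpha * omega -> 3x3 rotation matrix (Eq. 5)."""
    alpha = np.linalg.norm(phi)
    if alpha < 1e-8:
        return np.eye(3)
    w = phi / alpha
    K = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])
    return np.eye(3) + np.sin(alpha) * K + (1.0 - np.cos(alpha)) * (K @ K)

def pinhole_ray(u, v, fx, fy, cx, cy, phi, t):
    """Return (origin, direction) of the ray through pixel (u, v), Eqs. (6)-(7)."""
    d_cam = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    R = rodrigues(phi)
    d = R.T @ d_cam   # R^{-1} = R^T for a rotation matrix
    o = t             # camera center used as ray origin
    return o, d

# Toy usage with an identity pose.
o, d = pinhole_ray(320.5, 240.5, fx=500.0, fy=500.0, cx=320.0, cy=240.0,
                   phi=np.zeros(3), t=np.zeros(3))
```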
3. OMNI-NERF: OMNI-DIRECTIONAL NERF
3.1. Fisheye projection
Here, we focus on panoramic fisheye lenses which usually
have a field of view (FOV) greater than 180°.

Fig. 2. Distorted image projection (object point, optical center, incoming angle θ, focal length, undistorted image point at radius r_u, distorted image point at radius r_d).

To adapt NeRF
to omni-directional images, the key is to recover the true ray direction for each pixel coordinate on the image plane. In order to
do so, we need to model the projection of the fisheye lens onto
the image plane. We can model these projections as relations
between r_d, the radial distance from the image center to the distorted image point, and θ, the incoming angle measured from the lens axis, as depicted in Fig. 2. For the specific "equisolid" projection, the radial distance can be calculated as
$r_d = 2f \cdot \sin\left(\frac{\theta}{2}\right).$  (9)
Different projections exist for panoramic fisheye lenses
[15]. In order to unify the description, we consider the sum of the first 4 terms of the infinite series as an odd-order polynomial representation of the incoming ray direction θ, defined as
$\theta = \theta_d + k_1\,\theta_d^3 + k_2\,\theta_d^5 + k_3\,\theta_d^7,$  (10)
where
$\theta_d = \arctan\left(\frac{r_d}{f}\right),$  (11)
f is the focal length and k_1, k_2 and k_3 are coefficients to fit the different fisheye projections. Given a pixel p with pixel coordinates (u, v) on the image plane, the radial distance r_d can then be expressed as
$r_d = \sqrt{(u - c_x)^2 + (v - c_y)^2},$  (12)
where (c_x, c_y) are the coordinates of the principal point. The actual direction of the ray of pixel p in the camera coordinate system can then be expressed as a vector d_c = (x, y, z)^T with
$x = \sin(\theta)\cdot(u - c_x)/r_d,$  (13)
$y = \sin(\theta)\cdot(v - c_y)/r_d,$  (14)
$z = \cos(\theta).$  (15)
By applying a rotation and translation using the extrinsic parameters, the ray direction in world coordinates is then defined as
$d = [R^{-1} \mid t]\,d_c.$  (16)
The rotation matrix R can be parameterized by the axis-angle representation Φ as described in Section 2.
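Putting Eqs. (10)-(15) together, the pixel-to-ray mapping of the fisheye model can be sketched as follows (a minimal NumPy sketch with illustrative names; the small guard at the principal point is our own addition):

```python
import numpy as np

def fisheye_ray_dir(u, v, f, cx, cy, k1, k2, k3):
    """Pixel (u, v) -> ray direction in camera coordinates, Eqs. (10)-(15)."""
    # Radial distance from the principal point (Eq. 12).
    r_d = max(np.sqrt((u - cx) ** 2 + (v - cy) ** 2), 1e-8)
    # Undistorted angle (Eq. 11) and odd-order polynomial model (Eq. 10).
    theta_d = np.arctan2(r_d, f)
    theta = theta_d + k1 * theta_d**3 + k2 * theta_d**5 + k3 * theta_d**7
    # Direction in camera coordinates (Eqs. 13-15); already a unit vector.
    return np.array([np.sin(theta) * (u - cx) / r_d,
                     np.sin(theta) * (v - cy) / r_d,
                     np.cos(theta)])

# With (k1, k2, k3) = (0, 0, 0) the mapping reduces to the pinhole model.
d_c = fisheye_ray_dir(100.0, 420.0, f=300.0, cx=300.0, cy=300.0,
                      k1=0.0, k2=0.0, k3=0.0)
```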
When these camera parameters are not known, e.g. when
using real-world captures, they need to be estimated.

Fig. 3. Planar sampling vs. spherical sampling (sample planes vs. sample spheres relative to the image plane and optical axis).

For this,
we follow the approach of NeRF– and SCNeRF [5], where the estimated parameters are
$\pi_i = (f_x, f_y, c_x, c_y, \Phi_i, t_i, k_1, k_2, k_3),$  (17)
i.e., extended by the parameters k_1, k_2 and k_3 which approximate the projection as in Eq. (10).
Assuming that the specific fisheye projection model is unknown, we initialize (k_1, k_2, k_3) to (0, 0, 0), i.e., the pinhole perspective camera model. We then optimize the coefficients of the polynomial to fit the specific fisheye projection by minimizing the photometric loss as mentioned in Section 2. With different combinations of (k_1, k_2, k_3), the polynomial distortion model can approximate fisheye lenses or mirrors with FOVs of up to 360°.
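As an offline sanity check (not the paper's optimization, which minimizes the photometric loss), one can verify that the odd-order polynomial of Eq. (10) is expressive enough to reproduce the equisolid projection of Eq. (9) over a 180° FOV by fitting (k_1, k_2, k_3) with least squares; a sketch, with an assumed sampling of the angle range:

```python
import numpy as np

f = 1.0                                    # focal length (arbitrary units)
theta = np.linspace(1e-3, np.pi / 2, 500)  # incoming angles up to a 180-degree FOV
r_d = 2.0 * f * np.sin(theta / 2.0)        # equisolid projection, Eq. (9)
theta_d = np.arctan2(r_d, f)               # Eq. (11)

# Fit theta ~= theta_d + k1*theta_d^3 + k2*theta_d^5 + k3*theta_d^7 (Eq. 10).
A = np.stack([theta_d**3, theta_d**5, theta_d**7], axis=1)
(k1, k2, k3), *_ = np.linalg.lstsq(A, theta - theta_d, rcond=None)

fit = theta_d + k1 * theta_d**3 + k2 * theta_d**5 + k3 * theta_d**7
print("k =", (k1, k2, k3), "max angular error (rad):", np.abs(fit - theta).max())
```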
3.2. Spherical sampling
In the original NeRF model, the pixels are rendered by sam-
pling 5D coordinates (location and viewing direction) along
the camera rays. Consider a camera ray r(t) = o + t d with near and far bounds t_n and t_f, where o is the origin of the ray and d is a vector giving the ray direction. The interval [t_n, t_f] is originally partitioned into N evenly-spaced bins, and one sample is then uniformly drawn at random from each bin. Such a sampling pattern is equivalent to placing sampling planes parallel to the image plane, as depicted in Fig. 3.
For fisheye lenses, however, the rays on the border are
more sparsely sampled than the rays close to the optical axis,
and the spacing tends to infinity when the angle of the incident ray approaches 90° (see Fig. 3). As the sampling is
critical in the neural representation of the radiance field, such
large bins have a high chance to skip thin objects in the scene
and cause artifacts, i.e., resulting in degraded image quality.
Hence, we use a sampling on a sphere instead of the sam-
pling on planes to resolve the above issue. For a ray direction d = (x, y, 1)^T, we define the normalized direction as
$d_n = \left(\frac{x}{\|d\|}, \frac{y}{\|d\|}, \frac{1}{\|d\|}\right)^T.$  (18)
With this spherical sampling scheme, the rays at the border
of the scene projected on an image have the same importance
as the rays in the center. Thus, the spherical sampling offers
a more uniform sampling of the whole scene. We therefore
define the near and far bounds to be concentric spheres with radii t_near and t_far centered at the projection center.
Fig. 4. Overview of all the scenes: "Classroom", "Lone Monk", FTV indoor, and Office.
Similarly, we partition [t_near, t_far] into N evenly-spaced bins. Hence, the bins for rays in all directions are equal, as shown in Fig. 3, and a ray is now expressed as
$r(t) = o + t\,d_n.$  (19)
Please note that the spherical sampling is not only pre-
ferred for fisheye lenses, but also improves the image quality
for any wide-angle lenses.
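A minimal sketch contrasting the two schemes (our own illustrative code, using NeRF's stratified one-sample-per-bin strategy):

```python
import numpy as np

def stratified_t(t_near, t_far, n_samples, rng):
    """One random sample per evenly-spaced bin in [t_near, t_far]."""
    edges = np.linspace(t_near, t_far, n_samples + 1)
    return edges[:-1] + rng.random(n_samples) * (edges[1:] - edges[:-1])

def sample_points(o, d, t_near, t_far, n_samples, spherical, rng):
    """Planar sampling uses d = (x, y, 1)^T directly, so constant-t surfaces
    are planes parallel to the image plane; spherical sampling uses the
    normalized direction d_n (Eq. 18), so constant-t surfaces are spheres."""
    t = stratified_t(t_near, t_far, n_samples, rng)
    d_used = d / np.linalg.norm(d) if spherical else d
    return o[None, :] + t[:, None] * d_used[None, :]   # Eq. (19)

rng = np.random.default_rng(0)
o = np.zeros(3)
d = np.array([3.0, 0.0, 1.0])   # an oblique ray far from the optical axis
pts_planar = sample_points(o, d, 0.5, 8.0, 128, spherical=False, rng=rng)
pts_sphere = sample_points(o, d, 0.5, 8.0, 128, spherical=True, rng=rng)
```

For such an oblique ray, the planar scheme spaces consecutive samples roughly ||d|| times further apart, which is the sparsity issue discussed above.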
4. EXPERIMENTAL RESULTS
4.1. Dataset
The dataset consists of four scenes: two synthetic scenes, the Blender demos "Classroom" (by Christophe Seux) and "Lone Monk" (by Carlo Bergonzini), and two real scenes, the FTV-indoor-sit dataset captured in a large meeting room [8] and a densely sampled scene in an office room, both captured with Samsung Gear 360 cameras. Fig. 4 shows screenshots of the
test scenes.
First, we deploy virtual fisheye and perspective cameras
in the synthetic 3D scenes to create a dataset that samples the
3D scenes from different viewpoints. A pair of virtual cam-
eras facing forward and backward are placed in a bounded
cuboid at each vertex of the spatial grid to obtain a full 360° FOV. By subdividing the edges of the cuboid, we obtain different samplings of the space and therefore different numbers of cameras. In particular, we investigated two sampling grids, namely 6x6x3 (108 viewpoints) and 9x9x3 (243 viewpoints), in each scene with fisheye and perspective camera pairs; these are our training datasets. Furthermore, we
render a smooth path of 400 intermediate views in the sam-
pled 3D space by varying rotation and translation for both
camera types, which are our test datasets. Both training and
test images have a resolution of 600x600 pixels. The fish-
eye views and perspective views are rendered with equisolid
projection of 180° FOV and perspective projection of 119° FOV, respectively.

Table 1. PSNR values obtained with different sampling methods and synthetic datasets. "FE" and "WA" denote fisheye and wide-angle perspective rendering, respectively.

sampling method (samples)   Classroom FE   Classroom WA   Lone Monk FE   Lone Monk WA
planar (128)                22.46          24.77          15.13          24.68
planar (256)                24.78          25.19          25.38          25.04
spherical (128)             28.69          25.18          28.45          25.03

We demonstrate the performance of Omni-
NeRF compared to SCNeRF using the synthetic datasets. Fi-
nally, we show that our model is capable of reconstructing
real scenes. For configuration details of the dataset and cam-
era setup, please refer to the provided supplementary material.
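For concreteness, the grid-based placement of forward/backward fisheye pairs described above can be sketched as follows (the cuboid dimensions used here are assumed placeholders; the actual extents are given in the supplementary material):

```python
import numpy as np

def camera_grid(nx, ny, nz, size):
    """Place a forward/backward fisheye pair at each vertex of an
    nx x ny x nz grid inside a bounded cuboid of the given size."""
    xs = np.linspace(0.0, size[0], nx)
    ys = np.linspace(0.0, size[1], ny)
    zs = np.linspace(0.0, size[2], nz)
    poses = []
    for x in xs:
        for y in ys:
            for z in zs:
                for yaw in (0.0, np.pi):   # forward- and backward-facing cameras
                    poses.append({"position": np.array([x, y, z]), "yaw": yaw})
    return poses

# A 6x6x3 grid gives 108 viewpoints (216 fisheye images), as in the training set.
poses = camera_grid(6, 6, 3, size=(4.0, 4.0, 1.5))   # cuboid size is assumed
print(len(poses) // 2, "viewpoints")
```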
4.2. Fisheye and Spherical Sampling
Our first experiment aims to evaluate how spherical sampling
improves the performance of view synthesis compared with
planar sampling. For this, we train Omni-NeRF with syn-
thetic data using ground truth camera parameters and different
sampling schemes.
We use the hierarchical sampling strategy of NeRF and
consider training and rendering with planar sampling of 128
and 256 coarse samples along the rays. For spherical sam-
pling we use 128 coarse samples. Then, we allocate 128 fine
samples biased towards the relevant parts of the volume for
all the cases. The PSNR values reported in Table 1 are averaged over the 400 test views of the fisheye and perspective cameras. The table shows that spherical sampling outperforms planar sampling for fisheye rendering with far fewer samples.
For wide-angle perspective rendering, the spherical sampling
achieves similar performance with only half the number of
samples.
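The hierarchical strategy mentioned above (coarse stratified samples followed by fine samples biased towards high-weight regions) relies on inverse-transform sampling of the coarse rendering weights; a minimal sketch, with illustrative names:

```python
import numpy as np

def sample_fine(t_coarse, weights, n_fine, rng):
    """Draw fine samples biased towards bins with large rendering weights,
    by inverse-transform sampling of the piecewise-constant coarse PDF."""
    pdf = weights + 1e-5                      # avoid an all-zero distribution
    pdf = pdf / pdf.sum()
    cdf = np.concatenate([[0.0], np.cumsum(pdf)])
    u = rng.random(n_fine)                    # uniform samples in [0, 1)
    idx = np.clip(np.searchsorted(cdf, u, side="right") - 1, 0, len(pdf) - 1)
    # Linearly place each sample inside its selected coarse bin.
    frac = (u - cdf[idx]) / np.maximum(cdf[idx + 1] - cdf[idx], 1e-8)
    frac = np.clip(frac, 0.0, 1.0)
    t_lo, t_hi = t_coarse[idx], t_coarse[idx + 1]
    return t_lo + frac * (t_hi - t_lo)

rng = np.random.default_rng(0)
t_coarse = np.linspace(2.0, 6.0, 128)
weights = rng.random(127)                     # one weight per coarse interval
t_fine = np.sort(sample_fine(t_coarse, weights, 128, rng))
```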
Fig. 5 shows small patches of rendered views with syn-
thetic data. The rendered views with planar sampling show
typical artifacts close to the border, while spherical sampling with the same or a lower number of samples resolves the issue.
4.3. Fisheye distortion estimation
In our second experiment, we show that our model can learn
the fisheye distortion from scratch using ground truth camera
pose. We also show that our model can optimize the camera parameters when a noisy camera pose is used as initialization. We train both our model and SCNeRF on the "Classroom" and "Lone Monk" datasets. We first investigate the estimation of radial distortion with the different models, so we fix the extrinsics and intrinsics to their ground truth values and initialize both models to correspond to pinhole projections. Fig. 6 shows how our model fits the fisheye projection.
Fig. 5. Rendered views with planar (P) and spherical (S) sampling with different numbers of samples, for the "Classroom" scene (panels: reference view GT, P 128, P 256, S 128; GT denotes ground truth, PSNR is shown on each sub-figure).

Fig. 6. We initialize the projection as a pinhole camera and start to optimize the projection model. The figures show how our model (first row) and SCNeRF (second row) fit the fisheye projection during training (2000, 4000 and 8000 iterations).

We evaluate our model with the average PSNR of the test
set and the mean absolute error (MAE) of the angle e_θ between the estimated ray directions d̂ and the ground-truth ray directions d over all valid pixels in the fisheye view, given by
$e_\theta = \arccos\left(\frac{\hat{d} \cdot d}{\|\hat{d}\|\,\|d\|}\right).$  (20)
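Both evaluation metrics, the average PSNR and the angular MAE of Eq. (20), are straightforward to compute; a minimal sketch assuming image values in [0, 1] and ray directions stored as (N, 3) arrays:

```python
import numpy as np

def psnr(rendered, reference):
    """PSNR in dB between a rendered and a reference image in [0, 1]."""
    mse = np.mean((rendered - reference) ** 2)
    return -10.0 * np.log10(mse)

def angular_mae(d_hat, d_gt):
    """Mean absolute angle (Eq. 20) between estimated and ground-truth
    ray directions, given as (N, 3) arrays."""
    cos = np.sum(d_hat * d_gt, axis=-1) / (
        np.linalg.norm(d_hat, axis=-1) * np.linalg.norm(d_gt, axis=-1))
    return np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))
```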
We then add uniformly distributed random noise to the ground
truth camera pose as initialization and optimize the noisy pose
jointly with the projection model. We add a random rotation angle θ ∼ U(−7.5, 7.5) (rad) about a random axis, and a translation t ∼ U(−0.075, 0.075) (meter) along the (x, y, z) axes for
each training view. We train our model and SCNeRF with
the same noisy initialization on the synthetic datasets. We re-
port the results for the different scenes using ground truth and noisy poses in Table 2 and visualize the projected ray error in Fig. 7. The results show that our model, with the proposed
polynomial representation of ray direction, can estimate the
fisheye distortion accurately, with ground truth or noisy cam-
era poses, while SCNeRF fails to learn the correct projection.
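The noisy initialization described above can be reproduced schematically as follows (our own sketch; the noise ranges are those stated in the text, and SciPy's rotation utilities stand in for the axis-angle conversion):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def perturb_pose(R, t, rng, max_angle=7.5, max_trans=0.075):
    """Perturb a ground-truth pose (R, t) with a random rotation about a
    random axis and a random translation, as for the noisy initialization."""
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = rng.uniform(-max_angle, max_angle)            # in radians, as stated
    R_noise = Rotation.from_rotvec(angle * axis).as_matrix()
    t_noise = rng.uniform(-max_trans, max_trans, size=3)  # in meters
    return R_noise @ R, t + t_noise

rng = np.random.default_rng(0)
R_init, t_init = perturb_pose(np.eye(3), np.zeros(3), rng)
```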
Table 2. Comparison between SCNeRF and Omni-NeRF in terms of fisheye distortion estimation. We report the average PSNR on the smooth-path test set and the mean absolute error of the estimated ray directions after the same number of iterations.

Method                 Classroom PSNR   Classroom MAE   Lone Monk PSNR   Lone Monk MAE
SCNeRF (gt. pose)      13.04            0.21            12.94            0.28
Ours (gt. pose)        27.12            0.001           26.55            0.001
SCNeRF (noisy pose)    13.74            0.13            13.92            0.21
Ours (noisy pose)      21.65            0.004           24.68            0.003

Fig. 7. Error of the estimated ray directions for (a) SCNeRF and (b) ours. Please note that the scales of the two figures are different.

We finally train our model on real captured 360° fisheye
data. The model has been initialized with imperfect camera
pose parameters and projection model. Fig. 8 shows exemplary reference views and rendered views obtained with our model for two different scenes.
4.4. Discussion
We found that NeRF– and other differentiable camera mod-
els have a limited search range of camera parameters. For
our non-planar and non-converging camera configurations,
a rough initialization is necessary. Due to the peculiarities of panoramic fisheye lenses, the camera usually captures some pixels of the camera mount or of the photographer, lead-
ing to problems of inconsistency between frames, which have
an impact on the scene reconstruction and parameter estima-
tion.
5. CONCLUSION
We proposed Omni-NeRF, a NeRF-based method that recon-
structs the scene with 360° information from panoramic fish-
eye imaging. We optimized the parameterization for the esti-
mation of wide-angle fisheye distortion and we stressed the
importance of sampling in this case. We have shown that
the spherical sampling improves the performance when train-
ing with panoramic fisheye or wide-angle perspective images.
Furthermore, we have shown that our parametrization is better suited than that of SCNeRF when using panoramic fish-
eye lenses. We have also shown that NeRF can be used
to reconstruct scenes with 360° panoramic information, us-
ing a non-converged camera setup.

Fig. 8. Reference views and rendered views with Omni-NeRF trained on the FTV-indoor-Sit [8] dataset and our office dataset. The model has been initialized with imperfect camera pose parameters and projection model.

Finally, we demonstrated
that our method is also applicable to data collected with real
360 cameras, and that our model successfully reconstructs the
scene. Since we are dealing with the problem of reconstruct-
ing large-scale scenes, the question of how to optimally de-
ploy cameras in 3D space is worth further exploration.
6. ACKNOWLEDGEMENT
This project has received funding from the European Union’s
Horizon 2020 research and innovation programme under the
Marie Skłodowska-Curie grant agreement No 956770.
7. REFERENCES
[1] M. Broxton, J. Flynn, R. Overbeck, D. Erickson, P. Hed-
man, M. DuVall, J. Dourgarian, J. Busch, M. Whalen,
and P. Debevec. Immersive light field video with a lay-
ered mesh representation. ACM Transactions on Graph-
ics, 39(4):86:1–86:15, 2020.
[2] A. Chen, Z. Xu, F. Zhao, X. Zhang, F. Xiang, J. Yu, and
H. Su. MVSNeRF: Fast Generalizable Radiance Field
Reconstruction from Multi-View Stereo. arXiv preprint
arXiv:2103.15595, 2021.
[3] J. Chibane, A. Bansal, V. Lazova, and G. Pons-Moll.
Stereo Radiance Fields (SRF): Learning View Synthesis
for Sparse Views of Novel Scenes. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR),
2021.
[4] H. E. Ives. Parallax panoramagrams made with a large
diameter lens. Journal of the Optical Society of Amer-
ica, 20(6):332–342, 1930.
[5] Y. Jeong, S. Ahn, C. Choy, A. Anandkumar, M. Cho,
and J. Park. Self-Calibrating Neural Radiance Fields.
In IEEE International Conference on Computer Vision
(ICCV), 2021.
[6] J. T. Kajiya and B. P. Von Herzen. Ray Tracing Vol-
ume Densities. ACM Computer Graphics, 18(3):165–
174, 1984.
[7] C.-H. Lin, W.-C. Ma, A. Torralba, and S. Lucey. BARF:
Bundle-Adjusting Neural Radiance Fields. In IEEE
International Conference on Computer Vision (ICCV),
2021.
[8] T. Maugey, L. Guillo, and C. L. Cam. Ftv360: a multi-
view 360° video dataset with calibration parameters. In
ACM Multimedia Systems Conference, 2019.
[9] E. Miandji, J. Unger, and C. Guillemot. Multi-shot sin-
gle sensor light field camera using a color coded mask.
In European Signal Processing Conference (EUSIPCO),
2018.
[10] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron,
R. Ramamoorthi, and R. Ng. Nerf: Representing scenes
as neural radiance fields for view synthesis. In European
Conference on Computer Vision (ECCV), 2020.
[11] R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz,
and P. Hanrahan. Light Field Photography with a Hand-
held Plenoptic Camera. Research Report CSTR 2005-
02, Stanford University, 2005.
[12] J. L. Schönberger, E. Zheng, J.-M. Frahm, and M. Polle-
feys. Pixelwise view selection for unstructured multi-
view stereo. In European Conference on Computer Vi-
sion (ECCV), 2016.
[13] J. L. Schönberger and J.-M. Frahm. Structure-from-
motion revisited. In IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR), 2016.
[14] J. Shi, X. Jiang, and C. Guillemot. Learning fused pixel
and feature-based view reconstructions for light fields.
In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2020.
[15] M. Thoby. Photographic lenses projections: com-
putational models, correction, conversion... Avail-
able: http://michel.thoby.free.fr/ Fisheye history short/
Projections/Fisheye projection-models.html., 2012.
[16] Z. Wang, S. Wu, W. Xie, M. Chen, and V. A. Prisacariu.
NeRF–: Neural Radiance Fields Without Known Cam-
era Parameters. arXiv preprint arXiv:2102.07064, 2021.
[17] B. Wilburn, N. Joshi, V. Vaish, E. V. Talvala, E. An-
tunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy.
High performance imaging using large camera arrays.
ACM Transactions on Graphics, 24(3):765–776, 2005.
[18] K. Zhang, G. Riegler, N. Snavely, and V. Koltun.
NeRF++: Analyzing and Improving Neural Radiance
Fields. ArXiv, abs/2010.07492, 2020.