OMNI-NERF: NEURAL RADIANCE FIELD FROM 360° IMAGE CAPTURES
Kai Gu∗, Thomas Maugey∗, Sebastian Knorr†,‡ and Christine Guillemot∗
∗INRIA Rennes - Bretagne Atlantique, France;
†Ernst-Abbe University of Applied Sciences Jena, Germany; ‡Technical University of Berlin, Germany.
∗{firstname.lastname}@inria.fr; †{firstname.lastname}@eah-jena.de
ABSTRACT
This paper tackles the problem of novel view synthesis (NVS)
from 360° images with imperfect camera poses or intrinsic
parameters. We propose a novel end-to-end framework for
training Neural Radiance Field (NeRF) models given only
360° RGB images and their rough poses, which we refer to as
Omni-NeRF. We extend the pinhole camera model of NeRF to
a more general camera model that better fits omni-directional
fish-eye lenses. The approach jointly learns the scene geome-
try and optimizes the camera parameters without knowing the
fisheye projection.
Index Terms—Light field, omni-directional imaging,
rendering, view synthesis, deep learning
1. INTRODUCTION
The concept of 6 degrees of freedom (6DOF) video content
has recently emerged with the goal of enabling immersive ex-
perience in terms of free roaming, i.e. allowing viewing the
scene from any viewpoint and direction in space [1]. How-
ever, no such real-life full 6DOF light field capturing solu-
tion exists so far. Light field cameras have been designed to
record orientations of light rays, hence to sample the plenop-
tic function in all directions, thus enabling view synthesis for
perspective shift and scene navigation [14]. Several camera
designs have been proposed for capturing light fields, going
from uniform arrays of pinholes placed in front of the sensor
[4] to arrays of micro-lenses placed between the main lens and
the sensor [11], arrays of cameras [17], and coded attenuation
masks [9]. However, these light field cameras have a limited
field of view. On the other hand, omni-directional cameras
allow capturing a panoramic scene with a 360° field of view
but do not record information on the orientation of light rays
emitted by the scene.
Neural Radiance Fields (NeRF) have been introduced in
[10] as an implicit scene representation that allows render-
ing all light field views with high quality. NeRF models
the scene as a continuous function, and is parameterized as
a multi-layer perceptron (MLP). The function represents the
mapping from the 5D spatial and angular coordinates of a light ray emitted by the scene to its three RGB color components and a volume density measure. NeRF is capable of
modeling complex large-scale, and even unbounded, scenes.
With a proper parameterization of the coordinates and a well-
designed foreground-background architecture, NeRF++ [18]
is capable of modeling scenes having a large depth, with sat-
isfying resolution in both the near and far fields.
Early NeRF-based models need the camera pose parameters to map pixel coordinates to ray directions, and use COLMAP [13, 12] for the camera parameter estimation.
An end-to-end framework, called NeRF–, is proposed in [16]
for training NeRF models without pre-computed camera pa-
rameters. The intrinsic and extrinsic camera parameters are
automatically discovered via joint optimisation during train-
ing of the NeRF model. The authors in [5] consider more
complex non-linear camera models and propose a new geo-
metric loss function to jointly learn the geometry of the scene
and the camera parameters. Solutions are also proposed in [2]
and [3] to enable faster inference as well as using a sparser set
of input views, and to enable generalization to new scenes.
Our motivation here is to be able to capture or reconstruct
light fields with a very large field of view, in particular 360°.
We focus on the question: how do we extract omni-directional
information and potentially benefit from it when reconstruct-
ing a spherical light field of a large-scale scene with a non-
converged camera setup?
To address this question, we propose a method for learn-
ing a 360° neural radiance field from omni-directional images
captured with fisheye cameras. We adapt the pinhole cam-
era model of NeRF to a general camera model that fits omni-
directional fisheye lenses. Finally, we extend our approach
to allow accurate camera parameter estimation (intrinsic and
extrinsic) for omni-directional fisheye lenses. To evaluate our
approach, we render photo-realistic panoramic fisheye views
from 2 Blender scenes: an indoor "Classroom" scene and a larger-scale scene, "Lone Monk", of an atrium surrounded by
buildings. With these datasets, we first evaluate how spherical
sampling improves the performance of view synthesis com-
pared to planar sampling. We prove that our model can learn
the fisheye distortion from scratch with ground truth camera
pose. We also assess the camera extrinsic parameter esti-
mation, with a noisy initialization which simulates the case
where the camera parameters are imperfect or measured with error. Finally, we train and evaluate our model on two real scenes captured by fisheye cameras.

Fig. 1. Space sampling with 360° cameras (fisheye camera pairs) in a spatial grid: (a) 360° fisheye images; (b) spatial camera grid.
2. NERF: BACKGROUND
In this section we briefly introduce the method of synthesiz-
ing novel views of complex static scenes by representing the
scenes as Neural Radiance Fields (NeRF) [10].
In NeRF, a scene is represented with a Multi-Layer Perceptron (MLP). Given the 3D spatial coordinate location X = (x, y, z) of a voxel and the viewing direction d = (θ, φ) in a scene, the MLP predicts the volume density σ and the view-dependent emitted radiance or color c = [R, G, B] at that spatial location for volumetric rendering. In the concrete implementation, X is first fed into the MLP, which outputs σ and intermediate features; these are then fed into an additional fully connected layer to predict the color c. The volume density is hence determined by the spatial location X alone, while the color c depends on both X and the viewing direction d. The network can thus yield different illumination effects for different viewing directions. This procedure can be formulated as
$[R, G, B, \sigma] = F_\Theta(x, y, z, \theta, \phi), \quad (1)$
with Θ = {W_i, b_i} being the weights and biases of the MLP. The network is trained by fitting the rendered (synthesized) views to the reference (ground-truth) views via the minimization of the total squared error between the rendered and true pixel colors in the different views. NeRF uses a rendering function based on classical volume rendering [6]. Rendering a view from NeRF requires calculating the expected color C(r) by integrating the accumulated transmittance T along the camera ray r(t) = o + td, with o being the origin of the ray cast from the desired virtual camera, and with near and far bounds t_n and t_f. This integral can be expressed as

$C(r) = \int_{t_n}^{t_f} T(t)\,\sigma(r(t))\,c(r(t), d)\,dt, \quad (2)$

where $T(t) = \exp\left(-\int_{t_n}^{t} \sigma(r(s))\,ds\right)$. In practice, the integral is numerically approximated by sampling discretely along the ray.
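For illustration, a minimal NumPy sketch of this discrete quadrature (following the discretization used in [10]) is given below; the function name and the inputs `sigmas`, `colors`, `t_vals` are placeholders standing for the MLP outputs and sample positions along one ray, not identifiers from the original implementation.

```python
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Discrete approximation of the volume rendering integral in Eq. (2).

    sigmas : (N,)   volume densities at the N samples along the ray
    colors : (N, 3) RGB values predicted at the N samples
    t_vals : (N,)   sample positions between t_n and t_f
    """
    # Distances between adjacent samples; the last segment is treated as unbounded.
    deltas = np.diff(t_vals, append=1e10)
    # Per-segment opacity: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Accumulated transmittance T_i = prod_{j < i} (1 - alpha_j)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    # Expected color C(r) = sum_i T_i * alpha_i * c_i
    return (weights[:, None] * colors).sum(axis=0)
```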
While the original NeRF in [10] requires the knowledge of
the camera pose parameters, a pose-free solution has been in-
troduced in [16] which estimates intrinsic and extrinsic cam-
era parameters while training the NeRF model.
Assuming a pinhole camera model, the camera parameters can be expressed with the camera projection matrix

$P = K R [I \mid -t]. \quad (3)$

Here, I is the 3×3 identity matrix and K is the camera calibration matrix

$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}, \quad (4)$

with the intrinsic camera parameters f_x and f_y as the focal length, which are identical for square pixels, and c_x and c_y as the offsets of the principal point from the top-left corner of the image. The extrinsic camera parameters in Eq. (3) contain the rotation and translation of the camera with respect to a 3-D world coordinate system and are expressed by the 3×3 rotation matrix R and the 3×1 translation vector t.
When these camera parameters are unknown, they need to be estimated. For the camera rotation, the axis-angle representation is adopted in NeRF–:

$\Phi := \alpha \omega, \quad \Phi \in \mathbb{R}^3, \quad (5)$

where α is the rotation angle and ω is the unit vector representing the rotation axis. Φ_i can be converted to the rotation matrix R using Rodrigues' formula [16]. With such a parametrization, the i-th camera extrinsics can be optimized by searching for the parameters Φ_i and t_i. To render the color of the m-th pixel p_{i,m} = (u, v) from the i-th camera, we cast the ray r_{i,m} from the image plane as

$r_{i,m}(t) = o + t d, \quad (6)$

where

$d = R_i^{-1} \begin{bmatrix} (u - c_x)/f_x \\ -(v - c_y)/f_y \\ -1 \end{bmatrix} \quad (7)$

and o = t_i, using the current estimate of the camera parameters

$\pi_i = (f_x, f_y, c_x, c_y, \Phi_i, t_i). \quad (8)$
Then, the calculated coordinates and direction of the sampled ray are fed into the NeRF model. The currently estimated color value of the pixel is rendered by Eq. (2) and can be compared with the ground-truth color value by computing the squared error. Finally, the parameters Θ of the NeRF model and the camera parameters π_i are optimized jointly by minimising the photometric loss as described in [16] and [7].
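As an illustration of Eqs. (5)-(8), the sketch below converts an axis-angle parameter Φ to a rotation matrix via Rodrigues' formula and casts the ray of a pixel under the pinhole model; the function names and the NumPy formulation are ours, not those of NeRF– or of the paper's implementation.

```python
import numpy as np

def axis_angle_to_R(phi):
    """Rodrigues' formula: convert Phi = alpha * omega (Eq. 5) to a 3x3 rotation."""
    alpha = np.linalg.norm(phi)
    if alpha < 1e-8:
        return np.eye(3)
    w = phi / alpha
    K = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])
    return np.eye(3) + np.sin(alpha) * K + (1.0 - np.cos(alpha)) * (K @ K)

def pinhole_ray(u, v, fx, fy, cx, cy, phi, t):
    """Ray origin and direction for pixel (u, v), following Eqs. (6)-(8)."""
    d_cam = np.array([(u - cx) / fx, -(v - cy) / fy, -1.0])
    R = axis_angle_to_R(phi)
    d = R.T @ d_cam          # R^{-1} = R^T for a rotation matrix
    o = np.asarray(t, dtype=float)
    return o, d
```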
3. OMNI-NERF: OMNI-DIRECTIONAL NERF
3.1. Fisheye projection
Here, we focus on panoramic fisheye lenses, which usually have a field of view (FOV) greater than 180°. To adapt NeRF
to omni-directional images, the key is to recover the true ray direction of the pixel coordinates on the image plane. In order to do so, we need to model the projection of the fisheye lens onto the image plane. We can model these projections as relations between r_d, the radial distance from the image center to the distorted image point, and θ, the incoming angle measured from the lens axis, as depicted in Fig. 2. For the specific "equisolid" projection, the radial distance can be calculated as

$r_d = 2 f \sin\left(\frac{\theta}{2}\right). \quad (9)$

Fig. 2. Distorted image projection: the incoming ray at angle θ from the optical axis passes through the optical center and maps to the distorted image point at radial distance r_d; r_u denotes the corresponding undistorted image point at the focal length.
Different projections exist for panoramic fisheye lenses [15]. In order to unify the description, we consider the sum of the first four terms of the infinite series as an odd-order polynomial representation of the incoming ray direction θ, defined as

$\theta = \theta_d + k_1 \theta_d^3 + k_2 \theta_d^5 + k_3 \theta_d^7, \quad (10)$

where

$\theta_d = \arctan\left(\frac{r_d}{f}\right). \quad (11)$
Here, f is the focal length and k_1, k_2 and k_3 are coefficients to fit the different fisheye projections. Given a pixel p with pixel coordinates (u, v) on the image plane, the radial distance r_d can then be expressed as

$r_d = \sqrt{(u - c_x)^2 + (v - c_y)^2}, \quad (12)$

where (c_x, c_y) are the coordinates of the principal point. The actual direction of the ray of pixel p in the camera coordinate system can then be expressed as a vector d_c = (x, y, z)^T with

$x = \sin(\theta) \cdot (u - c_x), \quad (13)$
$y = \sin(\theta) \cdot (v - c_y), \quad (14)$
$z = \cos(\theta). \quad (15)$

By applying a rotation and translation using the extrinsic parameters, the ray direction in world coordinates is then defined as

$d = [R^{-1} \mid t]\, d_c. \quad (16)$

The rotation matrix R can be parameterized by the axis-angle representation Φ as described in Section 2.
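The pixel-to-ray mapping of Eqs. (10)-(16) can be sketched as follows. This is a minimal reading of the equations, assuming a single focal length f = f_x; we also divide the in-plane offsets by r_d so that the returned unit vector makes the angle θ with the optical axis, a normalization left implicit in Eqs. (13)-(14). Names are illustrative.

```python
import numpy as np

def fisheye_ray(u, v, f, cx, cy, k1, k2, k3, R, t):
    """Map pixel (u, v) to a world-space ray for the polynomial fisheye model."""
    du, dv = u - cx, v - cy
    r_d = np.sqrt(du**2 + dv**2)                                      # Eq. (12)
    theta_d = np.arctan2(r_d, f)                                      # Eq. (11)
    theta = theta_d + k1*theta_d**3 + k2*theta_d**5 + k3*theta_d**7   # Eq. (10)
    # Ray direction in camera coordinates, Eqs. (13)-(15), with the offsets
    # normalized by r_d (guarded against the principal point, where r_d = 0).
    scale = np.sin(theta) / max(r_d, 1e-8)
    d_cam = np.array([scale * du, scale * dv, np.cos(theta)])
    # Rotate into world coordinates (Eq. (16)); the ray origin is the camera center t.
    return np.asarray(t, dtype=float), R.T @ d_cam
```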
When these camera parameters are not known, e.g. when
using real world captures, they need to be estimated. For this,
we follow the approach of NeRF– and SCNeRF [5], where the estimated parameters are

$\pi_i = (f_x, f_y, c_x, c_y, \Phi_i, t_i, k_1, k_2, k_3), \quad (17)$

i.e., extended by the parameters k_1, k_2 and k_3, which approximate the projection as in Eq. (10).

Fig. 3. Planar sampling vs. spherical sampling (sample planes parallel to the image plane vs. concentric sample spheres; the optical axis is indicated).
Assuming that the specific fisheye projection model is unknown, we initialize (k_1, k_2, k_3) to (0, 0, 0), i.e., the pinhole perspective camera model. We then optimize the coefficients of the polynomial to fit the specific fisheye projection by minimizing the photometric loss, as mentioned in Section 2. With different combinations of (k_1, k_2, k_3), the polynomial distortion model can approximate fisheye lenses or mirrors with FOVs of up to 360°.
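A schematic PyTorch sketch of this joint optimization is given below; `render_pixels` is a placeholder for the NeRF forward pass (which in practice also carries the MLP weights Θ and the other camera parameters in π_i), and all names are illustrative rather than taken from a released implementation.

```python
import torch

# Distortion coefficients initialized to the pinhole model, (k1, k2, k3) = (0, 0, 0).
k = torch.zeros(3, requires_grad=True)
optimizer = torch.optim.Adam([k], lr=1e-3)

def training_step(pixels, gt_colors, render_pixels):
    """One optimization step of the distortion coefficients.

    render_pixels is a placeholder for the rendering pipeline that casts rays
    with the current (k1, k2, k3) and evaluates the NeRF model along them.
    """
    pred = render_pixels(pixels, k1=k[0], k2=k[1], k3=k[2])
    loss = ((pred - gt_colors) ** 2).mean()   # photometric (squared error) loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```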
3.2. Spherical sampling
In the original NeRF model, the pixels are rendered by sampling 5D coordinates (location and viewing direction) along the camera rays. Consider a camera ray r(t) = o + td with near and far bounds t_n and t_f, where o is the origin of the ray and d is a vector giving the ray direction. The interval [t_n, t_f] is originally partitioned into N evenly spaced bins, and one sample is then drawn uniformly at random from each bin. Such a sampling pattern is equivalent to placing sampling planes parallel to the image plane, as depicted in Fig. 3.
For fisheye lenses, however, the rays on the border are
more sparsely sampled than the rays close to the optical axis,
and the spacing tends to infinity as the angle of the incident ray approaches 90° (see Fig. 3). As the sampling is critical in the neural representation of the radiance field, such large bins are likely to skip thin objects in the scene and cause artifacts, i.e., degraded image quality.
Hence, we sample on spheres instead of on planes to resolve the above issue. For a ray direction d = (x, −y, −1)^T, we define the normalized direction as

$d_n = \left( \frac{x}{\|d\|}, \frac{-y}{\|d\|}, \frac{-1}{\|d\|} \right)^T. \quad (18)$

With this spherical sampling scheme, the rays at the border of the projected image have the same importance as the rays in the center. Thus, spherical sampling offers a more uniform sampling of the whole scene. We therefore define the near and far bounds to be concentric spheres with radii t_near and t_far centered at the projection center.
Fig. 4. Overview of all the scenes: "Classroom", "Lone Monk", FTV indoor, and Office.
Similarly, we partition [t_near, t_far] into N evenly spaced bins. Hence, the bins for rays in all directions are equal, as shown in Fig. 3, and a ray is now expressed as

$r(t) = o + t d_n. \quad (19)$

Please note that spherical sampling is not only preferred for fisheye lenses, but also improves the image quality for any wide-angle lens.
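The two sampling schemes can be contrasted with the short sketch below, which draws one stratified sample per bin (as in NeRF's coarse sampling) and returns 3D points either on planes (Eq. (6) with the unnormalized direction) or on concentric spheres (Eqs. (18)-(19)); function and variable names are illustrative.

```python
import numpy as np

def stratified_t(near, far, n_bins, rng=None):
    """One uniform random sample inside each of N evenly spaced bins of [near, far]."""
    if rng is None:
        rng = np.random.default_rng()
    edges = np.linspace(near, far, n_bins + 1)
    return rng.uniform(edges[:-1], edges[1:])

def sample_points(o, d, near, far, n_bins, spherical=True, rng=None):
    """Sample 3D points along r(t) = o + t*d (planar) or r(t) = o + t*d_n (spherical)."""
    o = np.asarray(o, dtype=float)
    d = np.asarray(d, dtype=float)
    if spherical:
        d = d / np.linalg.norm(d)   # normalized direction d_n, Eq. (18)
    t = stratified_t(near, far, n_bins, rng)
    # With spherical=True, equal t-bins correspond to concentric spheres around o
    # instead of planes parallel to the image plane.
    return o[None, :] + t[:, None] * d[None, :]
```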
4. EXPERIMENTAL RESULTS
4.1. Dataset
The dataset consists of four scenes: two synthetic scenes from the Blender demos "Classroom" (by Christophe Seux) and "Lone Monk" (by Carlo Bergonzini), and two real scenes, the FTV-indoor-sit dataset captured in a large meeting room [8] and a densely sampled office room, both captured with Samsung Gear 360 cameras. Fig. 4 shows screenshots of the test scenes.
First, we deploy virtual fisheye and perspective cameras
in the synthetic 3D scenes to create a dataset that samples the
3D scenes from different viewpoints. A pair of virtual cam-
eras facing forward and backward is placed in a bounded cuboid at each vertex of a spatial grid to obtain a full 360° FOV. By subdividing the edges of the cuboid, we obtain different samplings of the space and therefore different numbers of cameras. In particular, we investigated two different sampling grids, namely 6x6x3 (108 viewpoints) and 9x9x3 (243 viewpoints) in each scene with fisheye and perspective cam-
era pairs, which are our training datasets. Furthermore, we
render a smooth path of 400 intermediate views in the sam-
pled 3D space by varying rotation and translation for both
camera types, which are our test datasets. Both training and
test images have a resolution of 600x600 pixels. The fish-
eye views and perspective views are rendered with equisolid
projection of 180◦FOV and perspective projection of 119◦
sample method Classroom Lone Monk
(samples) FE WA FE WA
planar(128) 22.46 24.77 15.13 24.68
planar(256) 24.78 25.19 25.38 25.04
spherical(128) 28.69 25.18 28.45 25.03
Table 1. PSNR values obtained with different sampling meth-
ods and synthetic datasets. ”FE” and ”WA” denote fisheye
and wide-angle perspective rendering, respectively.
FOV, respectively. We demonstrate the performance of Omni-
NeRF compared to SCNeRF using the synthetic datasets. Fi-
nally, we show that our model is capable of reconstructing
real scenes. For configuration details of the dataset and cam-
era setup, please refer to the provided supplementary material.
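As a rough sketch of the capture setup described above, the following generates camera positions on a regular grid with a forward/backward pair at every vertex; the cuboid extent is a placeholder, since the actual scene-dependent dimensions are given in the supplementary material.

```python
import numpy as np

def camera_grid(nx=6, ny=6, nz=3, extent=(4.0, 4.0, 2.0)):
    """Place forward/backward fisheye camera pairs on a regular spatial grid.

    Returns one position per grid vertex (nx*ny*nz in total) together with the
    two viewing directions (+z and -z) forming each 360-degree camera pair.
    """
    xs = np.linspace(0.0, extent[0], nx)
    ys = np.linspace(0.0, extent[1], ny)
    zs = np.linspace(0.0, extent[2], nz)
    positions = np.stack(np.meshgrid(xs, ys, zs, indexing="ij"), axis=-1).reshape(-1, 3)
    directions = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])  # forward / backward
    return positions, directions
```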
4.2. Fisheye and Spherical Sampling
Our first experiment aims to evaluate how spherical sampling
improves the performance of view synthesis compared with
planar sampling. For this, we train Omni-NeRF with syn-
thetic data using ground truth camera parameters and different
sampling schemes.
We use the hierarchical sampling strategy of NeRF and
consider training and rendering with planar sampling of 128
and 256 coarse samples along the rays. For spherical sam-
pling we use 128 coarse samples. Then, we allocate 128 fine
samples biased towards the relevant parts of the volume for
all the cases. The PSNR values reported in Table 1 are averages over the 400 test views of the fisheye and perspective cameras. The table shows that spherical sampling outperforms planar sampling for fisheye rendering with far fewer samples.
For wide-angle perspective rendering, the spherical sampling
achieves similar performance with only half the number of
samples.
Fig. 5 shows small patches of rendered views with syn-
thetic data. The rendered views with planar sampling show
typical artifacts close to the border, while spherical sampling with the same or a lower number of samples resolves the issue.
4.3. Fisheye distortion estimation
In our second experiment, we show that our model can learn
the fisheye distortion from scratch using ground truth camera
pose. We also show that our model can optimize the camera parameters when a noisy camera pose is used as initialization. We
train our model as well as SCNeRF on the "Classroom" and "Lone Monk" datasets. We first investigate the estimation of radial distortion with the different models: we fix the extrinsics and intrinsics to their ground-truth values and initialize both models to correspond to pinhole projections. Fig. 6 shows how our model fits the fisheye projection.
Fig. 5. Rendered views with planar (P) and spherical (S) sampling with different numbers of samples, for the "Classroom" scene (GT denotes ground truth; the PSNR is shown on each sub-figure). Columns: reference view (GT), P 128, P 256, S 128.

Fig. 6. We initialize the projection as a pinhole camera and start to optimize the projection model. The figures show how our model (first row) and SCNeRF (second row) fit the fisheye projection during training, after 2000, 4000 and 8000 iterations.

We evaluate our model with the average PSNR of the test set and the mean absolute error (MAE) of the angles e_θ between the estimated ray directions d̂ and the ground-truth ray directions d of all valid pixels in the fisheye view, given by

$e_\theta = \arccos\left( \frac{\hat{d} \cdot d}{\|\hat{d}\|\,\|d\|} \right). \quad (20)$
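For reference, the error metric of Eq. (20), averaged over all valid pixels, can be computed as in the small sketch below (function and variable names are ours):

```python
import numpy as np

def ray_angle_mae(d_est, d_gt):
    """Mean absolute angular error of Eq. (20).

    d_est, d_gt : (N, 3) arrays of estimated and ground-truth ray directions.
    """
    cos = np.sum(d_est * d_gt, axis=1) / (
        np.linalg.norm(d_est, axis=1) * np.linalg.norm(d_gt, axis=1))
    # Clip to the valid range of arccos to guard against rounding errors.
    return np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))
```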
We then add uniformly distributed random noise to the ground
truth camera pose as initialization and optimize the noisy pose
jointly with the projection model. Specifically, we add a random rotation angle ∆θ ∼ U(−7.5, 7.5) (rad) about a random axis and a translation ∆t ∼ U(−0.075, 0.075) (m) along the (x, y, z) axes for each training view. We train our model and SCNeRF with
the same noisy initialization on the synthetic datasets. We re-
port the results for the different scenes using ground-truth and noisy poses in Table 2 and visualize the projected ray error in Fig. 7. The results show that our model, with the proposed
polynomial representation of ray direction, can estimate the
fisheye distortion accurately, with ground truth or noisy cam-
era poses, while SCNeRF fails to learn the correct projection.
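The pose perturbation used for the noisy initialization can be sketched as follows; the composition order, the units and the helper axis_angle_to_R (from the sketch in Section 2) are our assumptions rather than details taken from the paper's implementation.

```python
import numpy as np

def perturb_pose(R, t, rng, max_angle=7.5, max_shift=0.075):
    """Simulate an imperfect initialization of one camera pose.

    A rotation by an angle ~ U(-max_angle, max_angle) about a random axis is
    composed with R, and a per-axis offset ~ U(-max_shift, max_shift) is added
    to the translation t (the default bounds follow the values reported above).
    """
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = rng.uniform(-max_angle, max_angle)
    R_noisy = axis_angle_to_R(angle * axis) @ R   # helper sketched in Section 2
    t_noisy = np.asarray(t, dtype=float) + rng.uniform(-max_shift, max_shift, size=3)
    return R_noisy, t_noisy
```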
Table 2. Comparison between SCNeRF and Omni-NeRF in terms of fisheye distortion estimation. We report the average PSNR on the smooth-path test set and the mean absolute error (MAE) of the estimated ray directions after the same number of iterations.

Method                  Classroom PSNR   Classroom MAE   Lone Monk PSNR   Lone Monk MAE
SCNeRF (gt. pose)            13.04            0.21            12.94            0.28
Ours (gt. pose)              27.12            0.001           26.55            0.001
SCNeRF (noisy pose)          13.74            0.13            13.92            0.21
Ours (noisy pose)            21.65            0.004           24.68            0.003

Fig. 7. Error of the estimated ray directions for (a) SCNeRF and (b) ours. Please note that the scales of the two figures are different.

We finally train our model on real captured 360° fisheye data. The model has been initialized with imperfect camera pose parameters and an imperfect projection model. Fig. 8 shows exemplary reference views and views rendered by our model for two different scenes.
4.4. Discussion
We found that NeRF– and other differentiable camera mod-
els have a limited search range of camera parameters. For
our non-planar and non-converging camera configurations,
a rough initialization is necessary. Due to the peculiarities of panoramic fisheye lenses, the camera usually captures some pixels of the camera mount or of the photographer, lead-
ing to problems of inconsistency between frames, which have
an impact on the scene reconstruction and parameter estima-
tion.
5. CONCLUSION
We proposed Omni-NeRF, a NeRF-based method that recon-
structs the scene with 360° information from panoramic fish-
eye imaging. We optimized the parameterization for the esti-
mation of wide-angle fisheye distortion and we stressed the
importance of sampling in this case. We have shown that
the spherical sampling improves the performance when train-
ing with panoramic fisheye or wide-angle perspective images.
Furthermore, we have shown that our parametrization is better suited than that of SCNeRF when using panoramic fish-
eye lenses. We have also shown that NeRF can be used
to reconstruct scenes with 360° panoramic information, us-
ing a non-converged camera setup. Finally, we demonstrated
that our method is also applicable to data collected with real 360° cameras, and that our model successfully reconstructs the scene. Since we are dealing with the problem of reconstructing large-scale scenes, the question of how to optimally deploy cameras in 3D space is worth further exploration.

Fig. 8. Reference views and views rendered with Omni-NeRF trained on the FTV-indoor-sit [8] dataset and our office dataset. The model has been initialized with imperfect camera pose parameters and projection model.
6. ACKNOWLEDGEMENT
This project has received funding from the European Union’s
Horizon 2020 research and innovation programme under the
Marie Skłodowska-Curie grant agreement No 956770.
7. REFERENCES
[1] M. Broxton, J. Flynn, R. Overbeck, D. Erickson, P. Hed-
man, M. DuVall, J. Dourgarian, J. Busch, M. Whalen,
and P. Debevec. Immersive light field video with a lay-
ered mesh representation. ACM Transactions on Graph-
ics, 39(4):86:1–86:15, 2020.
[2] A. Chen, Z. Xu, F. Zhao, X. Zhang, F. Xiang, J. Yu, and
H. Su. MVSNeRF: Fast Generalizable Radiance Field
Reconstruction from Multi-View Stereo. arXiv preprint
arXiv:2103.15595, 2021.
[3] J. Chibane, A. Bansal, V. Lazova, and G. Pons-Moll.
Stereo Radiance Fields (SRF): Learning View Synthesis
for Sparse Views of Novel Scenes. In IEEE Conference
on Computer Vision and Pattern Recognition (CVPR),
2021.
[4] H. E. Ives. Parallax panoramagrams made with a large
diameter lens. Journal of the Optical Society of Amer-
ica, 20(6):332–342, 1930.
[5] Y. Jeong, S. Ahn, C. Choy, A. Anandkumar, M. Cho,
and J. Park. Self-Calibrating Neural Radiance Fields.
In IEEE International Conference on Computer Vision
(ICCV), 2021.
[6] J. T. Kajiya and B. P. Von Herzen. Ray Tracing Vol-
ume Densities. ACM Computer Graphics, 18(3):165–
174, 1984.
[7] C.-H. Lin, W.-C. Ma, A. Torralba, and S. Lucey. BARF:
Bundle-Adjusting Neural Radiance Fields. In IEEE
International Conference on Computer Vision (ICCV),
2021.
[8] T. Maugey, L. Guillo, and C. L. Cam. Ftv360: a multi-
view 360° video dataset with calibration parameters. In
ACM Multimedia Systems Conference, 2019.
[9] E. Miandji, J. Unger, and C. Guillemot. Multi-shot sin-
gle sensor light field camera using a color coded mask.
In European Signal Processing Conference (EUSIPCO),
2018.
[10] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron,
R. Ramamoorthi, and R. Ng. Nerf: Representing scenes
as neural radiance fields for view synthesis. In European
Conference on Computer Vision (ECCV), 2020.
[11] R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz,
and P. Hanrahan. Light Field Photography with a Hand-
held Plenoptic Camera. Research Report CSTR 2005-
02, Stanford University, 2005.
[12] J. L. Schönberger, E. Zheng, J.-M. Frahm, and M. Polle-
feys. Pixelwise view selection for unstructured multi-
view stereo. In European Conference on Computer Vi-
sion (ECCV), 2016.
[13] J. L. Schönberger and J.-M. Frahm. Structure-from-
motion revisited. In IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR), 2016.
[14] J. Shi, X. Jiang, and C. Guillemot. Learning fused pixel
and feature-based view reconstructions for light fields.
In IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2020.
[15] M. Thoby. Photographic lenses projections: com-
putational models, correction, conversion... Avail-
able: http://michel.thoby.free.fr/ Fisheye history short/
Projections/Fisheye projection-models.html., 2012.
[16] Z. Wang, S. Wu, W. Xie, M. Chen, and V. A. Prisacariu.
NeRF–: Neural Radiance Fields Without Known Cam-
era Parameters. arXiv preprint arXiv:2102.07064, 2021.
[17] B. Wilburn, N. Joshi, V. Vaish, E. V. Talvala, E. An-
tunez, A. Barth, A. Adams, M. Horowitz, and M. Levoy.
High performance imaging using large camera arrays.
ACM Transactions on Graphics, 24(3):765–776, 2005.
[18] K. Zhang, G. Riegler, N. Snavely, and V. Koltun.
NeRF++: Analyzing and Improving Neural Radiance
Fields. ArXiv, abs/2010.07492, 2020.