360U-Former: HDR Illumination Estimation with
Panoramic Adapted Vision Transformers
Jack Hilliard1,2, Adrian Hilton1,3, and Jean-Yves Guillemaut1,4
1University of Surrey, Guildford, Surrey, UK
2jh00695@surrey.ac.uk
3a.hilton@surrey.ac.uk
4j.guillemaut@surrey.ac.uk
Abstract. Recent illumination estimation methods have focused on en-
hancing the resolution and improving the quality and diversity of the gen-
erated textures. However, few have explored tailoring the neural network
architecture to the Equirectangular Panorama (ERP) format utilised in
image-based lighting. Consequently, high dynamic range image (HDRI)
results usually exhibit a seam at the side borders and textures or ob-
jects that are warped at the poles. To address this shortcoming, we pro-
pose a novel architecture, 360U-Former, based on a U-Net style Vision-
Transformer which leverages the work of PanoSWIN, an adapted shifted
window attention tailored to the ERP format. To the best of our knowl-
edge, this is the first purely Vision-Transformer model used in the field
of illumination estimation. We train 360U-Former as a GAN to generate
HDRI from a limited field of view low dynamic range image (LDRI). We
evaluate our method using current illumination estimation evaluation
protocols and datasets, demonstrating that our approach outperforms
existing and state-of-the-art methods without the artefacts typically as-
sociated with the use of the ERP format.
Keywords: Illumination Estimation · Vision-Transformers · Equirectangular Panoramas
1 Introduction
To believably composite an object into a virtual scene it must be lit consis-
tently with the target scene’s illumination conditions. The field of illumination
estimation has sought to capture a scene’s lighting through the constraints of
a Limited Field of View (LFOV) Low Dynamic Range image (LDR(I)) such as
that taken from a mobile phone camera. Several methods have been utilised to
represent these conditions such as regression-based methods, like Spherical Har-
monics (SH) [22] and Spherical Gaussians (SG) [36], and Image Based Light-
ing (IBL) [11] methods that render lighting from an Equirectangular Panorama
(ERP) High Dynamic Range image (HDR(I)). IBL has become the leading way
to represent lighting conditions [10,26, 30,33, 39, 40] due to its ability to capture
high-frequency textures, as well as global illumination, meaning it can be used
(a) 360U-Former render (b) GT render (c) 360U-Former EM (d) GT EM
Fig. 1: Examples of an object with different surface properties being rendered with an
HDRI environment map (EM) of an indoor (Top) and outdoor (Bottom) scene, from
either the ground truth or generated by our network 360U-Former with PanoSWIN
attention blocks. We also include the EM for each scene and method for reference.
to light a range of surfaces from rough diffuse to mirror reflective. The current
trends in IBL and illumination estimation are to increase the resolution, details
and accuracy of the generated ERP images [10, 26, 39, 40].
Neural networks have used the ERP image format in various tasks, such as
panoramic outpainting and illumination estimation. Due to mapping the surface
of a sphere to a 2D image, warping occurs at the top and bottom of ERP images.
When using ERP images in a standard neural network this warping has to be
learnt by the network or accounted for by adapting the architecture. The sides
of the image also have to be considered connected, unlike regular LFOV images.
Not adapting a neural network architecture to work with the ERP format can
lead to artefacts such as obvious seams at the side borders of the image and badly
generated objects at the top and bottom (poles) of the ERP where it is most
warped. In the field of illumination estimation, only a few models have made
adjustments [2, 9, 10, 33]. These few approaches are typically limited to either
changes to the loss functions or changes to the image at network inference.
Vision Transformers (ViT) have shown their ability to understand relation-
ships of both global and local information in an image for a variety of image
processing tasks such as object classification [32], image restoration [12] and
outpainting [16]; their window attention scheme can also be adapted to better
process ERPs [32].
We propose 360U-Former, a U-Net style ViT adapted for the ERP format
by leveraging PanoSWIN [32] attention blocks. We use 360U-Former to generate
the ERP HDR environment map of a scene from a single LFOV LDRI. Our
model does not generate a border seam or any form of warping artefacts at
the poles of the output HDRI. It can recreate a variety of both indoor and
outdoor environments, outperforming the state of the art for both specialised
scene types. We compare our model against the current state of the art in LFOV
illumination estimation using the evaluation method outlined by Weber et al.
[40]. We also compare against previous methods from a dataset kept updated by
Dastjerdi et al. [10]. Our model outperforms state-of-the-art models at removing
ERP artefacts and illuminating objects with diffuse surfaces. To summarise our
contributions are as follows:
–First use of a purely Vision-Transformer network to approach the illumina-
tion estimation problem;
–The use of a Vision-Transformer network with global self attention creates
more accurate diffuse lighting than current state-of-the-art methods;
–Incorporation of PanoSWIN attention modules and circular padding to bet-
ter encode and generate ERPs by removing the warping artefacts that appear
at the poles of panoramic images.
2 Related Work
We first review the literature for methods that adapt neural networks to the ERP
image format. We then review inverse tone-mapping techniques, a fundamental
component that makes up the large majority of IBL illumination estimation
papers and models. As our proposed network is based on ViTs, we briefly re-
view transformer literature and focus on current Generative Adversarial Network
(GAN) ViT architectures. Lastly, we investigate the current state-of-the-art illu-
mination estimation papers and compare the benefits and compromises of each
method.
2.1 Equirectangular Panoramas and Neural Networks
Various techniques have been proposed to overcome the artefacts created by the
ERP format. Rotating the input image by 180°, effectively turning it inside-
out, has been used by [1, 29] to assist with panoramic outpainting by converting
the problem from a unidirectional outpainting task to a partly bidirectional in-
painting task. This also helps reduce the seam generated at the sides of the
panorama. To more reliably reduce the border and homogenise generation at
the sides of the ERP, Akimoto et al. [2] proposed a circular inference method
for their transformer, removing the border seam but increasing inference time.
Other methods have used circular padding either before and after or through-
out the network [25, 26, 37], allowing for homogeneous generation and continuity
at the sides of the ERP. Several approaches have been adopted to overcome
the warping at the top and bottom of the ERP. Conversion from ERP to cube
map is one approach to remove warping, as used by [8, 15, 21, 24]. However, as
noted by [15, 21], additional measures are needed to account for the seams
generated at the intersections of the cube's faces. Feng et al. [15] also observe
that while the cube map format does well with local details it does not per-
form as well at capturing global context. HEALPix [23], a framework to give
equal area weighting to spherical images, has been adopted by [8,42] in their
patch-embedding method to spherically encode the input image shape to a ViT.
PAVER [42] also uses a form of deformable patch embedding in its transformer
architecture. Although these methods have been shown to improve ViTs above
baseline for the ERP format, we found that they do not remove border and warping
artefacts in our tests. A few papers have adapted loss functions to work with
the ERP format. EnvMapNet [33] uses a spherical warping on the L1 loss and
Akimoto et al. [2] instead use the same spherical warping but on the Perceptual
Loss. Another approach is to train the discriminator to identify ERP artefacts:
by rotating the output ERP by 180°, the border seam can then be detected by
the discriminator. This has been implemented by [9, 10]. Other approaches have
aimed to change the architecture of neural networks to better handle ERP. Su
and Grauman [35] develop a Spherical Convolution that adapts traditional CNNs
to work with ERP. PanoSWIN [32], which outperformed Spherical Convolutions
in benchmark metrics, uses a ViT architecture and observes that attention win-
dows at the poles of the ERP should be considered connected. This changes the
shifted window attention layout and implements a pitch attention module to
further account for the ERP warping at the poles.
2.2 Inverse Tone Mapping
Inverse tone mapping is the process of converting an LDRI to an HDRI. It
is commonly used in illumination estimation methods as HDR is an accurate
way of capturing the lighting through an image. IBL illumination estimation
methods perform inverse tone mapping as part of the extrapolation, however,
some methods [6, 20] have used a separate network to do the LDR to HDR
conversion so that the main network can focus on field of view extrapolation.
Neural networks tend to be trained on LDRIs normalised to [-1, 1], a range that
matches the activation functions used, such as tanh with an output range of [-1, 1].
HDRIs have a much larger range (for example [0, 100000]) to capture the full range
of light intensities, so a method for handling this range without biasing the loss
functions towards the larger values needs to be implemented. Gardner et al. [18]
separate the lighting into LDR and light log intensity. Other methods [19, 44] use
gamma correction, with α = 1/30 and γ = 2.2, to map the HDR values visible in
an LDRI to the same scale as an LDR. DeepLight [30] uses the natural log space
to represent the illumination. Similarly, Li et al. [31] use the log(x+1) space.
Hilliard et al. [26] use a gamma compression with α = 1 and γ = 6.6, compressing
the HDR to [0, 2] so that it can be better used with a tanh activation function,
whilst increasing the range of the LDR values and decreasing the HDR values.
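As an illustration of the gamma-compression idea above, the sketch below (our own illustration, not code from any of the cited methods) applies a forward compression of the form y = α·x^(1/γ) and its inverse; the constants are those reported for [26], and the resulting output range depends on the value range of the HDR data.

```python
import numpy as np

def gamma_compress(hdr, alpha=1.0, gamma=6.6):
    # Forward mapping y = alpha * x**(1/gamma): compresses the large HDR
    # range towards a scale compatible with tanh-style activations.
    return alpha * np.power(np.maximum(hdr, 0.0), 1.0 / gamma)

def gamma_expand(compressed, alpha=1.0, gamma=6.6):
    # Inverse mapping, recovering linear HDR radiance from compressed values.
    return np.power(np.maximum(compressed, 0.0) / alpha, gamma)
```

The same form with α = 1/30 and γ = 2.2 approximates the gamma correction used by [19, 44], although the exact placement of α in those methods may differ.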
2.3 Vision Transformers
The transformer architecture was created to improve natural language processing
by focusing on the relationships between each word and its surrounding words
in a sentence. The Self-Attention and Feed Forward neural networks were later
applied to the field of image processing with the ViT [13], which used global
self-attention to learn the relationship between each section (called a window) of
the image and the image as a whole. Liu et al. proposed the SWIN Transformer,
which used a hierarchical design with shifted attention windows and provided
connections between them, allowing the network to use various scales and image
sizes. ViTs were initially used for image classification, object detection and se-
mantic segmentation but were later adapted into the GAN architecture for use
in image generation and manipulation tasks. VQGAN [14] used a CNN-based
vector quantised variational autoencoder to train a ViT to learn a codebook of
context-rich visual parts. RFormer [12] combined ViTs with the U-Net style of
architecture, using a ViT for the encoder and decoder, with shortcut connections
between them, for image restoration. Gao et al. [16] developed a similar U-Net
style ViT called U-Transformer for image outpainting. This method used masked
shortcut connections to connect only the known regions of the input image to
the decoder layers.
2.4 Illumination Estimation
Modern methods for illumination estimation extrapolate the lighting conditions
from an LFOV LDRI using deep learning. The format with which the
lighting conditions are represented can be put into two distinct categories: re-
gression based methods such as SH [7, 19, 20] and SG [3, 17, 31, 43] which use a
basis function to reduce the lighting to a set of coefficients, and, as more recent
methods [10, 26,30, 33, 39, 40] are opting to use, an IBL approach by generating
an ERP HDRI known as an environment map. Regression-based methods have
the advantage of using less memory and are more efficient at rendering. How-
ever, IBL methods have been favoured because of the key advantage that they
can be used to light mirror reflective surfaces due to their ability to represent
high-frequency textures. The recent developments in IBL methods have aimed
to create higher resolution and more plausible images that can generalise to a
variety of indoor and outdoor scenes, whilst retaining accurate light positions
and colour for lighting purposes. The illumination estimation methods that make
adaptations for ERP are [2, 9, 10, 33]. These methods have only focused on chang-
ing the loss functions rather than the architecture of the network. Consequently,
they are unable to generate the poles of the ERP without noticeable artefacts.
3 Methodology
We present 360U-Former, a U-Net style ViT-based architecture trained as a
GAN to generate HDR 360° ERP images from LFOV LDR images, applicable
to both indoor and outdoor scenes. By using PanoSWIN attention
layers, our model generates ERP images without warping at the poles or a
seam at the sides of the image. The architecture of the discriminator is based
on RaLSGAN [28]. The pipeline for the whole model is shown in Fig. 2.
3.1 Network Architecture
The input to the network is an LFOV LDR image converted to ERP format.
The output of the model is an HDR ERP image. The 360U-Former, Fig. 2, can
be split into three distinct parts: the encoder, the bottleneck and the decoder.
Fig. 2: Summary of the proposed model. Top: The overall flow of the model. The
generator (ICN) uses the masked LDR ERP as input to generate the HDR ERP en-
vironment map. This is trained as a GAN by the discriminator (D_ICN). Bottom: The
360U-Former architecture used by the ICN. The PanoSWIN attention blocks W-MSA,
PSW-MSA and PAM are described in Sec. 3.1 and Fig. 3.
(a) W-MSA (b) PSW-MSA (c) PAM
Fig. 3: The ERP rotations that are used as input by each of the three attention layers.
The encoder uses 4 PanoSWIN [32] blocks with layer sizes of 3, 3, 7 and 2. The
PanoSWIN block consists of a standard window multi-head self-attention layer
(W-MSA), whose input is shown in Fig. 3a, followed by a panoramically warped
shifted-window multi-head self-attention layer (PSW-MSA), which ensures that
regions along the top or bottom of the panorama are considered adjacent rather
than distant; Fig. 3b shows the rotations applied and the final input to this layer.
The final layer of each block, except the last block, is a Pitch Attention Module (PAM).
The PAM performs cross-attention between the windows from the default orientation
and the associated windows from the image pitched and rotated by 90°, Fig. 3c.
This method allows the network to learn the spatial distortion at the poles of
the ERP. For the bottleneck, we use a block of two PanoSWIN layers without
the PAM. We found that using the PAM in the bottleneck, the last block of the
encoder and the first block of the decoder would lead to artefacts at the sides of
the generated image. This is likely due to the size of the tensor at these blocks,
8 by 16 and 4 by 8, and the way they are interpolated, which affects the edges of
the output image.
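To make the pitch operation of Fig. 3c concrete, the sketch below (our own illustration, not the PanoSWIN or 360U-Former implementation) remaps an ERP by a 90° rotation about a horizontal axis using nearest-neighbour sampling; the actual PAM input may additionally include a yaw rotation, and the function name and axis choice here are ours.

```python
import numpy as np

def pitch_erp_90(erp: np.ndarray) -> np.ndarray:
    """Rotate an ERP by 90 degrees about a horizontal axis (nearest sampling)."""
    h, w = erp.shape[:2]
    # Spherical coordinates of every output pixel.
    i, j = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    lat = np.pi / 2 - np.pi * (i + 0.5) / h          # latitude
    lon = 2 * np.pi * (j + 0.5) / w - np.pi          # longitude
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    # 90-degree rotation about the y-axis (a pure pitch of the viewing sphere).
    xs, ys, zs = z, y, -x
    # Back to spherical coordinates and ERP pixel indices of the source image.
    lat_s = np.arcsin(np.clip(zs, -1.0, 1.0))
    lon_s = np.arctan2(ys, xs)
    row = np.clip(np.round((0.5 - lat_s / np.pi) * h - 0.5).astype(int), 0, h - 1)
    col = np.round((lon_s / (2 * np.pi) + 0.5) * w - 0.5).astype(int) % w
    return erp[row, col]
```

The same remapping, applied to the visualisations in Sec. 4, places the poles of the panorama at the image centre so that pole artefacts become visible.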
The decoder is similar to the encoder but uses upsampling instead of down-
sampling to increase the resolution and reduce the channel size. We incorporate
shortcut connections from the encoder by further upsampling the channels of
the tensor from the previous block and concatenating them with the associ-
ated output from the encoder. When reversing the patch embedding to get the
output image we incorporate circular padding on either side of the image and
reflection padding at the top and bottom instead of the default padding used by
convolutional layers. This further ensures that no border artefacts are produced.
3.2 Loss Functions
To train our network we use three loss functions. We choose L1 loss for pixel-
wise accuracy. To measure semantic similarity to the ground truth we use the
perceptual loss [46], L_perc:

    L_{perc} = \sum_{l} \frac{1}{P_{u,v}} \sum_{u,v} \left\| y^{l}_{uv} - x^{l}_{uv} \right\|,    (1)

where u and v are the positions on the feature map (of size H_l × W_l) in the l-th layer of
the VGG feature extractor.
We leverage RaLSGAN [28] to train our GAN because we found it to
generate higher-quality images with fewer artefacts.

    L_{adv} = \mathbb{E}[\log D(I_H)] + \mathbb{E}[\log(1 - D(G_{ICN}(I_L)))].    (2)
The overall ICN objective with weighted loss functions is

    L_{G_{ICN}} = \lambda_{L1} L_{L1} + \lambda_{perc} L_{perc} + \lambda_{adv} L_{adv},    (3)

where λ represents the loss weight for each loss. We found the optimal loss
weights to be λ_{L1} = 5, λ_{perc} = 5 and λ_{adv} = 0.2 by testing a range of values
and comparing the performance of the metrics and the appearance of the generated
output.
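The sketch below illustrates Eqs. (1) and (3) in PyTorch with the reported weights; the VGG layer choice and the adversarial term (passed in precomputed, following the relativistic formulation of [28]) are placeholders rather than the authors' exact configuration.

```python
import torch
import torch.nn.functional as F
import torchvision

# Frozen VGG16 features as a stand-in perceptual feature extractor; Eq. (1)
# sums over several layers, here a single truncation is used for brevity.
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def perceptual_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Mean feature-space distance between prediction and ground truth.
    return F.l1_loss(vgg(pred), vgg(target))

def generator_objective(pred_hdr, gt_hdr, adv_term,
                        lambda_l1=5.0, lambda_perc=5.0, lambda_adv=0.2):
    # Weighted combination of Eq. (3) with the reported loss weights.
    l1 = F.l1_loss(pred_hdr, gt_hdr)
    perc = perceptual_loss(pred_hdr, gt_hdr)
    return lambda_l1 * l1 + lambda_perc * perc + lambda_adv * adv_term
```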
3.3 Implementation Details
We train our network on both indoor and outdoor data. For the indoor dataset,
we use the Structured3D [47] synthetic dataset which contains 21,843 photo-
realistic panoramas. For the outdoor dataset, we use the 360 Sun Positions [5]
dataset which contains 19,093 street view images of both urban roads and rural
environments.
To create the HDR ground truth pairs for these LDR datasets, we convert
to HDR using an LDR-to-HDR network. This network is based on the network
design from Bolduc et al. [4] without the exposure and illuminance branch, using
the ERP LDRI as input and the ERP HDRI as output. It is trained on the Laval
Photometric Indoor HDR dataset [4], Laval Outdoor HDR dataset [27] and the
dataset from Cheng et al. [7]. Due to the difference in calibration between each
dataset, we base the values on the average mean and median values from the
Laval Photometric dataset scaled by a factor 0.01. This scaling factor is based on
the factor needed to create a plausible render when using the Cycles rendering
engine in Blender. Using this method, we applied a scaling factor of 340 to the Laval
Outdoor HDR dataset and found that the other dataset did not need to be scaled.
The network is trained on an image resolution of 256 by 512. We augment
both datasets by horizontally rotating each panorama 8 times at angles chosen randomly
between 20° and 340° at intervals of 40°, with a 20% chance of being vertically flipped.
After augmentation, our dataset consists of 368,370 image pairs, which we split
into a train/test ratio of 99:1. We ensure that all augmented versions of each pair
exist only in the train or the test subset to prevent over-fitting. Both networks
are trained with a range of LFOV sizes {40°, 60°, 90°, 120°} for the masked input.
The mask size is randomly chosen for each input image. The network is trained
for 50 epochs on an A100 GPU at 5.5 hours per epoch. There are a total of
220 million parameters in the generator network. We use the ADAM optimiser
with betas 0 and 0.9, weight decay of 0.0001 and learning rate of 0.0001 for the
generator and 0.0004 for the discriminator.
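A sketch of the augmentation and optimiser settings above, under the assumption that panoramas are stored as (H, W, C) arrays; horizontal rotation of an ERP reduces to a circular shift along the width axis, and only one of the eight rotated copies described above is produced per call here.

```python
import random
import numpy as np
import torch

def augment_erp_pair(ldr: np.ndarray, hdr: np.ndarray):
    # Horizontal ERP rotation = circular shift along the width axis,
    # at an angle drawn from 20..340 degrees in 40-degree steps.
    w = ldr.shape[1]
    angle = random.choice(range(20, 341, 40))
    shift = int(round(angle / 360.0 * w))
    ldr, hdr = np.roll(ldr, shift, axis=1), np.roll(hdr, shift, axis=1)
    if random.random() < 0.2:                  # 20% chance of a vertical flip
        ldr, hdr = ldr[::-1].copy(), hdr[::-1].copy()
    return ldr, hdr

def make_optimisers(generator: torch.nn.Module, discriminator: torch.nn.Module):
    # Adam with betas (0, 0.9) and weight decay 1e-4, as reported;
    # lr 1e-4 for the generator and 4e-4 for the discriminator.
    g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4,
                             betas=(0.0, 0.9), weight_decay=1e-4)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=4e-4,
                             betas=(0.0, 0.9), weight_decay=1e-4)
    return g_opt, d_opt
```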
4 Evaluation
We evaluate our model against the latest state-of-the-art illumination estimation
methods using the protocol first outlined by Weber et al. [40] and used again by
EverLight [10]. We compare quantitative results for indoor and outdoor methods
separately as some previous methods focused on one or the other. This also helps
to highlight the performance in different domains. To highlight the ability of our
network to adapt to the ERP format, we rotate the outputs to show the border
seams and the quality of the poles generated. We also conduct an ablation study
to compare the PanoSWIN against the SWIN attention blocks. The results of
the diffuse render tests have been generously contributed to the community by
Dastjerdi et al. [10] and can be found on the project website of EverLight
(https://lvsn.github.io/everlight/).
4.1 Indoor Images
Quantitative Results We follow the evaluation protocol of EverLight [10],
testing with a two-fold method. First, we evaluate the ability of the method to
produce accurate light positions, colours and intensities on a diffuse surface. This
is conducted using the test split of the Laval HDR Indoor dataset [18] and 10
extracted views of 50°. The generated environment maps are rendered to light
a scene with 9 diffuse spheres on a ground plane. We use RMSE, scale-invariant
RMSE (siRMSE), RGB angular error and PSNR metrics to measure the diffuse
(Fig. 4 rows: Input / Ground Truth, Gardner '17 [18], Weber '22 [40], StyleLight [39], EverLight [10], Ours)
Fig. 4: Indoor qualitative comparison of our generated ERPs in LDR with other meth-
ods. For each method and input LFOV image we show the LDR ERP rotated 180° to
show any potential border seams and the LDR ERP rotated 90° by 90° to compare the
generation at the poles of the ERP. We only include a selection of the methods from the
quantitative comparison; the remaining methods can be found in the supplementary
material. For the ground truth, we show the input to the network and include a dotted
box around that area in the panorama.
Table 1: Indoor and outdoor environment quantitative comparison with various il-
lumination estimation methods. The metrics si-RMSE, RMSE, RGB ang. and PSNR
are evaluated by rendering a diffuse scene and computing the differences between the
tonemapped renders. The FID score is calculated on the generated environment maps.
The best scores for each metric and environment are highlighted in bold and the second
best underlined.
Method                si-RMSE↓   RMSE↓   RGB ang.↓   PSNR↑    FID↓

INDOOR METHODS
Ours                   0.033     0.110     6.11°     11.68   119.91
Hilliard '23 [26]      0.112     0.300     6.50°     10.05   158.60
EverLight [10]         0.087     0.239     5.75°     10.04    65.50
Weber '22 [40]         0.079     0.196     4.08°     12.95   130.13
StyleLight [39]        0.130     0.261     7.05°     12.85   121.60
Gardner '19 (1) [17]   0.099     0.229     4.42°     12.21   410.12
Gardner '19 (3) [17]   0.105     0.507     4.59°     10.90   386.43
Gardner '17 [18]       0.123     0.628     8.29°     10.22   253.40
Garon '19 [19]         0.096     0.255     8.06°      9.73   324.51
Lighthouse [34]        0.121     0.254     4.56°      9.81   174.52
EMLight [43]           0.099     0.232     3.99°     10.34   135.97
EnvmapNet [33]         0.097     0.286     7.67°     11.74   221.85
ImmerseGAN [9]         0.091     0.215     7.89°     10.87    55.46

OUTDOOR METHODS
Ours                   0.049     0.161     4.00°     13.27   102.63
EverLight [10]         0.162     0.385     8.30°     11.01    61.49
ImmerseGAN [9]         0.175     0.341     9.56°     10.91    34.43
Zhang '19 [45]         0.225     1.058    11.80°     10.91   449.49
lighting accuracy. Second, the plausibility of the generated HDRIs is measured
with Fréchet Inception Distance (FID). We evaluate the FID score in the same
way as [10], using additional datasets to remove the potential bias of using just
one dataset. These are the 305 images from the Laval HDR Indoor test set [18]
and the 192 indoor images from [7]. However, as the 360Cities dataset is not
publicly available we calculate the FID score for our work without these images.
As with the rendered evaluation, we take 10 images with a field of view of 50°
extracted from the ground truth panorama as input to the network.
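For reference, the sketch below shows common formulations of the render metrics named above (scale-invariant RMSE, RGB angular error and PSNR); the actual evaluation follows the protocol of Weber et al. [40], so these should be read as illustrative rather than the official implementations.

```python
import numpy as np

def si_rmse(pred: np.ndarray, gt: np.ndarray) -> float:
    # Scale-invariant RMSE: fit a single scale factor to the prediction
    # before computing RMSE (one common formulation).
    alpha = (pred * gt).sum() / max((pred * pred).sum(), 1e-12)
    return float(np.sqrt(np.mean((alpha * pred - gt) ** 2)))

def rgb_angular_error(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    # Mean per-pixel angle (degrees) between predicted and ground-truth RGB
    # vectors; sensitive to colour direction, insensitive to intensity scale.
    p, g = pred.reshape(-1, 3), gt.reshape(-1, 3)
    cos = (p * g).sum(-1) / (np.linalg.norm(p, axis=-1)
                             * np.linalg.norm(g, axis=-1) + eps)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))).mean())

def psnr(pred: np.ndarray, gt: np.ndarray, peak: float = 1.0) -> float:
    return float(10.0 * np.log10(peak ** 2 / np.mean((pred - gt) ** 2)))
```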
Following the protocol, we evaluate against the following methods. Two ver-
sions of [17] are compared: the original (3) where 3 light sources are estimated,
and a version (1) trained to predict a single parametric light. We also compare
to Lighthouse [34], which expects a stereo pair as input; instead, a second image
is generated with a small baseline using [41] (visual inspection confirmed this
yields results comparable to the published work). For [19], the coordinates of the
image centre are selected as the object position. For [33], the proposed “Cluster
ID loss” and tonemapping are used with a pix2pixHD [38] network architecture.
We compare against EMLight [43], StyleLight [39], EverLight [10], [40] and [26].
Finally, we include [9] as a state-of-the-art (LDR) LFOV extrapolation method.
The results for the Indoor quantitative comparison are shown in Tab. 1. In
terms of si-RMSE and RMSE, our method outperforms all other methods, indi-
cating that the colours are more accurate. In terms of light position accuracy,
the methods that incorporate parametric or spherical Gaussian lighting as the
(Fig. 5 rows: Input / Ground Truth, ImmerseGAN [9], EverLight [10], Ours)
Fig. 5: Outdoor qualitative comparison of our generated ERPs in LDR with other
methods. For each method and input LFOV image we show the LDR ERP rotated
180° to show any potential border seams and the LDR ERP rotated 90° by 90° to
compare the generation at the poles of the ERP.
main form of lighting representation perform better. Our method performs com-
petitively in terms of PSNR and FID scores.
Qualitative Results In Fig. 4, we present a selection of panoramas rotated by
180° about the vertical axis to observe the side borders of the generated ERP.
We also feature a rotation of 90° by 90° to show the poles of the panorama side
by side. To be concise, we do not include all of the methods from the quantitative
comparison and base our choice on the type of network and the quality of the
output to prevent comparing against too many similar methods. These results
show that our model completely removes the side border and any warping at
the ERP’s poles. Other methods are not able to remove the side border and,
(Fig. 6 rows: Input, Ground Truth, 360U-Former, w/o PanoSWIN)
Fig. 6: Qualitative comparison of our generated ERPs in LDR from our network with
and without the PanoSWIN attention blocks. For each method and input LFOV image
we show the LDR ERP rotated 180° to show any potential border seams and the LDR
ERP rotated 90° by 90° to compare the generation at the poles of the ERP.
although it is not obvious, EverLight does not completely remove it. However,
the plausibility of our generated textures and structures does not compete with
the current state of the art. This reflects the quantitative results in Tab. 1. Our
method retains the information from the input LFOV LDRI as does EverLight.
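The 180° rotation used in Figs. 4 to 6 can be realised as a simple circular shift in longitude; the sketch below is our illustration of that visualisation, not the authors' exact code, and the pole views additionally require a pitch remapping similar to the one sketched earlier in Sec. 3.1.

```python
import numpy as np

def yaw_rotate_erp(erp: np.ndarray, degrees: float = 180.0) -> np.ndarray:
    # Rotating an ERP about the vertical axis is a circular shift along the
    # width axis; a 180-degree shift moves the left/right border to the image
    # centre, exposing any seam the generator has produced there.
    w = erp.shape[1]
    return np.roll(erp, int(round(degrees / 360.0 * w)), axis=1)
```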
4.2 Outdoor Images
Quantitative Results We conduct the quantitative comparison on the outdoor
scenes similarly to the indoor scenes but change the input to 3 perspective crops
with a 90° field of view at azimuths of {0°, 120°, 240°}. We use the 893 outdoor
panoramas from the dataset of [7], giving a total of 2,517 images for evaluation.
Unlike the indoor evaluation, all metrics are tested on this dataset. We compare
our results with the works of Zhang et al. [45], EverLight [10] and ImmerseGAN
[9]. It should be noted that Zhang et al. proposed a method for predicting the
parameters of an outdoor sun+sky light model.
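The perspective crops used as input can be obtained by a standard gnomonic (pinhole) projection from the ERP; the sketch below is our nearest-neighbour illustration of that remapping under the stated 90° field of view and azimuths, not the authors' preprocessing code.

```python
import numpy as np

def erp_to_perspective(erp, yaw_deg, fov_deg=90.0, out_size=256):
    """Extract a pinhole-camera crop from an ERP (nearest-neighbour sampling)."""
    h, w = erp.shape[:2]
    n = out_size
    f = (n / 2.0) / np.tan(np.radians(fov_deg) / 2.0)   # focal length in pixels

    yaw = np.radians(yaw_deg)
    # Camera basis in world coordinates, where a direction is
    # (cos(lat)cos(lon), cos(lat)sin(lon), sin(lat)).
    forward = np.array([np.cos(yaw), np.sin(yaw), 0.0])
    right = np.array([-np.sin(yaw), np.cos(yaw), 0.0])
    up = np.array([0.0, 0.0, 1.0])

    # A ray through every output pixel, expressed in world coordinates.
    u = np.arange(n) + 0.5 - n / 2.0
    v = np.arange(n) + 0.5 - n / 2.0
    uu, vv = np.meshgrid(u, v)
    d = (forward[None, None] * f
         + right[None, None] * uu[..., None]
         - up[None, None] * vv[..., None])
    d /= np.linalg.norm(d, axis=-1, keepdims=True)

    # Back to longitude/latitude, then to ERP pixel indices.
    lon = np.arctan2(d[..., 1], d[..., 0])
    lat = np.arcsin(np.clip(d[..., 2], -1.0, 1.0))
    col = np.round((lon / (2 * np.pi) + 0.5) * w - 0.5).astype(int) % w
    row = np.clip(np.round((0.5 - lat / np.pi) * h - 0.5).astype(int), 0, h - 1)
    return erp[row, col]

# Three crops at the azimuth spacing used in the outdoor evaluation:
# crops = [erp_to_perspective(erp, a) for a in (0.0, 120.0, 240.0)]
```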
The quantitative results are shown in Tab. 1. Similarly to the indoor com-
parison, our method outperforms all other methods when comparing the diffuse
Table 2: Ablation study to demonstrate the effectiveness of the PanoSWIN attention blocks
in improving the model's understanding of 360° ERP images. We highlight the best in
bold.
                          Si-RMSE↓   RMSE↓   RGB ang.↓   PSNR↑    FID↓
INDOOR METHODS
360U-Former                0.033     0.110     6.11°     11.68   119.91
360U-Former with SWIN      0.039     0.104     8.19°     12.22   125.70
OUTDOOR METHODS
360U-Former                0.049     0.161     4.00°     13.27   102.63
360U-Former with SWIN      0.047     0.187     4.84°     12.32   105.54
render, showing a better understanding of light position and colour. As with the
indoor comparison, our FID scores are worse than those of EverLight and ImmerseGAN,
showing that our method does not reproduce plausible textures.
Qualitative Results As with the indoor scenes we choose the same method to
compare the quality of our generated images. We display images from the outdoor
evaluation dataset from urban and natural scenes, similar to the ones in the 360
Sun Positions dataset. We also include some scene types from the evaluation
dataset that are not included in our dataset to demonstrate our model’s ability
to generalise. As with the indoor results our method removes the side border and
warped poles commonly seen in illumination estimation methods. EverLight and
ImmerseGAN have reduced the ERP warping to subtle artefacts only noticed
when zooming in. We can see that the quality of the colours and features are
similar to that of the ground truth. However, textures are not plausible and do
not compete with the current state of the art. It is also worth noting that the
model generates more plausible and accurate outdoor scenes with fewer artefacts
compared to the indoor scenes.
4.3 Ablation Study
We conduct an ablation study to demonstrate the effectiveness of our adaptations,
the PanoSWIN attention blocks and patch embedding with circular padding, at
removing the artefacts produced when using the ERP format. We use the same
methods of comparison as the indoor and outdoor quantitative and qualitative
studies. The results can be seen in Fig. 6 and Tab. 2 and show that not only
does making adaptations to the network architecture remove the artefacts caused
by using ERPs, it also improves the ability of a network to generate quality
images. Most notable is the improvement in the RGB angular error, suggesting
that adapting the network architecture has helped position light sources more
accurately. The results shown in Fig. 6 are from our test set. It should be noted
that there is a significant improvement in the quality of the generation when our
network uses images from the same dataset as the training set compared to the
datasets used for the benchmark. This could be a limitation of the datasets we
use to train the model, suggesting that there is not a diverse enough distribution
of data in the Structured3D [47] and 360 Sun Positions [5] datasets.
5 Conclusion
This paper proposes a method for removing artefacts caused by using the ERP
format and a Vision-Transformer architecture for estimating the illumination
conditions of indoor and outdoor scenes from an LFOV image. By utilising a
Vision-Transformer architecture with PanoSWIN attention layers the network
can account for the warping at the poles of the ERP and allow for seamless and
homogenised generation at the borders. We demonstrate this through a qual-
itative comparison that rotates the generated environment maps, highlighting
warping at the edges of the image. In general, the results of this paper could
be used as a guide when constructing any neural network that makes use of the
ERP image format to improve the quality of the generation and remove arte-
facts. Although the ERP artefacts have been addressed, the network lacks the
ability to generate plausible high-resolution textures competitive with that of
other state-of-the-art methods. This could potentially be resolved by increasing
the size of the dataset or using a different training method.
Further analysis could be carried out by mapping the attention and latent
space to understand how the PanoSWIN attention layers improve the ViT net-
work. We could also compare our model's complexity with other approaches.
Based on the direction in which the field of illumination estimation is heading,
we highlight three key areas that could extend the functionality of our method.
Firstly, it would be useful for a user to have a variety of mediums to edit the
generated output, such as providing a diffuse environment map to the ICN to
change the lighting conditions or by using a text prompt to change the details
of the environment map. Secondly, adapting the method to incorporate spatial
variance to shift the panorama so that it can accurately light objects not at its
centre. This feature would be integral to XR applications. Finally, a large-scale
standardised HDR panoramic dataset that has calibrated values for light sources
and a range of outdoor and indoor scenes with various textures.
Acknowledgements
This work was supported by the UKRI EPSRC Doctoral Training Partnership
Grants EP/N509772/1 and EP/R513350/1 (studentship reference 2437074).
References
1. Akimoto, N., Kasai, S., Hayashi, M., Aoki, Y.: 360-Degree Image Completion by
Two-Stage Conditional Gans. Proceedings - International Conference on Image
Processing, ICIP 2019-Septe, 4704–4708 (sep 2019). https://doi.org/10.1109/
ICIP.2019.8803435
2. Akimoto, N., Matsuo, Y., Aoki, Y.: Diverse Plausible 360-Degree Image Outpaint-
ing for Efficient 3DCG Background Creation. Proceedings of the IEEE Computer
Society Conference on Computer Vision and Pattern Recognition 2022-June,
11431–11440 (mar 2022). https://doi.org/10.1109/CVPR52688.2022.01115
3. Bai, J., Guo, J., Wang, C., Chen, Z., He, Z., Yang, S., Yu, P., Zhang, Y., Guo, Y.:
Deep graph learning for spatially-varying indoor lighting prediction. Science China
Information Sciences 66(3), 1–15 (mar 2023). https://doi.org/10.1007/s11432-
022-3576-9
4. Bolduc, C., Giroux, J., Hébert, M., Demers, C., Lalonde, J.F.: Beyond the Pixel:
a Photometrically Calibrated HDR Dataset for Luminance and Color Prediction.
International Conference on Computer Vision (ICCV), 2023 (apr 2023), https:
//arxiv.org/abs/2304.12372v3
5. Chang, S.H., Chiu, C.Y., Chang, C.S., Chen, K.W., Yao, C.Y., Lee, R.R., Chu,
H.K.: Generating 360 outdoor panorama dataset with reliable sun position esti-
mation. SIGGRAPH Asia 2018 Posters, SA 2018 22, 1 – 2 (dec 2018). https:
// doi . org / 10 . 1145/3283289.3283348,https: / / dl . acm . org/doi/10. 1145 /
3283289.3283348
6. Chen, Z., Wang, G., Liu, Z.: Text2Light: Zero-Shot Text-Driven HDR Panorama
Generation. ACM Transactions on Graphics 41(6) (sep 2022). https://doi.org/
10.1145/3550454.3555447
7. Cheng, D., Shi, J., Chen, Y., Deng, X., Zhang, X.: Learning Scene Illumination by
Pairwise Photos from Rear and Front Mobile Cameras. Computer Graphics Forum
37(7), 213–221 (oct 2018). https://doi.org/10.1111/cgf.13561
8. Chou, S.H., Chao, W.L., Lai, W.S., Sun, M., Yang, M.H.: Visual Question An-
swering on 360° Images. Proceedings - 2020 IEEE Winter Conference on
Applications of Computer Vision, WACV 2020 pp. 1596–1605 (jan 2020). https:
//doi.org/10.1109/WACV45572.2020.9093452
9. Dastjerdi, M.R.K., Hold-Geoffroy, Y., Eisenmann, J., Khodadadeh, S., Lalonde,
J.F.: Guided Co-Modulated GAN for 360° Field of View Extrapolation. Pro-
ceedings - 2022 International Conference on 3D Vision, 3DV 2022 pp. 475–485 (apr
2022). https://doi.org/10.1109/3dv57658.2022.00059
10. Dastjerdi, M.R.K., Hold-Geoffroy, Y., Eisenmann, J., Lalonde, J.F.: EverLight:
Indoor-Outdoor Editable HDR Lighting Estimation. In: Proceedings of the
IEEE/CVF International Conference on Computer Vision (ICCV). pp. 7420–7429
(October 2023)
11. Debevec, P.: Rendering synthetic objects into real scenes: Bridging traditional
and image-based graphics with global illumination and high dynamic range pho-
tography. In: Proceedings of the 25th Annual Conference on Computer Graph-
ics and Interactive Techniques, SIGGRAPH 1998. pp. 189–198 (1998). https:
//doi.org/10.1145/280814.280864,http://www.cs.berkeley.edu/
12. Deng, Z., Cai, Y., Chen, L., Gong, Z., Bao, Q., Yao, X., Fang, D., Yang, W.,
Zhang, S., Ma, L.: RFormer: Transformer-based Generative Adversarial Network
for Real Fundus Image Restoration on A New Clinical Benchmark. IEEE Journal
of Biomedical and Health Informatics 26(9), 4645–4655 (jan 2022). https://doi.
org/10.1109/JBHI.2022.3187103
13. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.:
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.
ArXiv abs/2010.11929 (oct 2020). https://doi.org/10.48550/arxiv.2010.11929, https://api.semanticscholar.org/CorpusID:225039882
14. Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image
synthesis. Proceedings of the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition pp. 12868–12878 (2021). https://doi.org/10.
1109/CVPR46437.2021.01268
15. Feng, Q., Shum, H.P., Morishima, S.: 360 Depth Estimation in the Wild – The
Depth360 Dataset and the SegFuse Network. Proceedings - 2022 IEEE Conference
on Virtual Reality and 3D User Interfaces, VR 2022 pp. 664–673 (feb 2022). https:
//doi.org/10.1109/VR51125.2022.00087
16. Gao, P., Yang, X., Zhang, R., Goulermas, J.Y., Geng, Y., Yan, Y., Huang, K.:
Generalized image outpainting with U-transformer. Neural Networks 162, 1–10
(jan 2023). https://doi.org/10.1016/j.neunet.2023.02.021
17. Gardner, M.A., Hold-Geoffroy, Y., Sunkavalli, K., Gagne, C., Lalonde, J.F.: Deep
parametric indoor lighting estimation. Proceedings of the IEEE International
Conference on Computer Vision 2019-Octob, 7174–7182 (oct 2019). https:
//doi.org/10.1109/ICCV.2019.00727
18. Gardner, M.A., Sunkavalli, K., Yumer, E., Shen, X., Gambaretto, E., Gagné, C.,
Lalonde, J.F.: Learning to predict indoor illumination from a single image. ACM
Transactions on Graphics 36(6), 1–14 (mar 2017). https:// doi.org/10.1145/
3130800.3130891
19. Garon, M., Sunkavalli, K., Hadap, S., Carr, N., Lalonde, J.F.: Fast spatially-varying
indoor lighting estimation. In: Proceedings of the IEEE Computer Society Confer-
ence on Computer Vision and Pattern Recognition. vol. 2019-June, pp. 6901–6910
(2019). https://doi.org/10.1109/CVPR.2019.00707
20. Gkitsas, V., Zioulis, N., Alvarez, F., Zarpalas, D., Daras, P.: Deep lighting envi-
ronment map estimation from spherical panoramas. In: IEEE Computer Society
Conference on Computer Vision and Pattern Recognition Workshops. vol. 2020-
June, pp. 2719–2728 (2020). https://doi.org/10.1109/CVPRW50498.2020.00328
21. Gond, M., Zerman, E., Knorr, S., Sjöström, M.: LFSphereNet: Real Time Spher-
ical Light Field Reconstruction from a Single Omnidirectional Image. Proceed-
ings - CVMP 2023: 20th ACM SIGGRAPH European Conference on Visual Me-
dia Production 23 (nov 2023). https://doi.org/10.1145/3626495.3626500,
https://dl.acm.org/doi/10.1145/3626495.3626500
22. Green, R.: Spherical Harmonic Lighting: The Gritty Details. Tech. rep. (2003),
https://www.cse.chalmers.se/~uffe/xjobb/Readings/GlobalIllumination/
Spherical%20Harmonic%20Lighting%20-%20the%20gritty%20details.pdf
23. Górski, K.M., Hivon, E., Banday, A.J., Wandelt, B.D., Hansen, F.K., Reinecke, M.,
Bartelmann, M.: Healpix: A framework for high-resolution discretization and fast
analysis of data distributed on the sphere. The Astrophysical Journal 622(2), 759
(apr 2005). https://doi.org/10.1086/427976,https://dx.doi.org/10.1086/
427976
24. Han, S.W., Suh, D.Y.: PIINET: A 360-degree Panoramic Image Inpainting Network
Using a Cube Map. Computers, Materials and Continua 66(1), 213–228 (oct 2020).
https://doi.org/10.32604/cmc.2020.012223
25. Hara, T., Harada, T.: Spherical Image Generation from a Single Normal Field of
View Image by Considering Scene Symmetry. Thirty-Fifth AAAI Conference
on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative
Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on
Educational Advances in Artificial Intelligence, EAAI 2021, Vir pp. 1513–1521
(jan 2020)
26. Hilliard, J.O., Hilton, A., Guillemaut, J.Y.: HDR Illumination Outpainting with
a Two-Stage GAN Model. Proceedings of the 20th ACM SIGGRAPH European
Conference on Visual Media Production pp. 1–9 (nov 2023). https://doi.org/
10.1145/3626495.3626510
27. Hold-Geoffroy, Y., Athawale, A., Lalonde, J.F.: Deep sky modeling for single image
outdoor lighting estimation. Proceedings of the IEEE Computer Society Conference
on Computer Vision and Pattern Recognition 2019-June, 6920–6928 (may 2019).
https://doi.org/10.1109/CVPR.2019.00709
28. Jolicoeur-Martineau, A.: The relativistic discriminator: a key element missing from
standard GAN. 7th International Conference on Learning Representations, ICLR
2019 (jul 2018)
29. Kim, K., Yun, Y., Kang, K.W., Kong, K., Lee, S., Kang, S.J.: Painting out-
side as inside: Edge guided image outpainting via bidirectional rearrangement
with progressive step learning. In: Proceedings - 2021 IEEE Winter Confer-
ence on Applications of Computer Vision, WACV 2021. pp. 2121–2129 (2021).
https://doi.org/10.1109/WACV48630.2021.00217
30. Legendre, C., Ma, W.C., Fyffe, G., Flynn, J., Charbonnel, L., Busch, J., Debevec,
P.: Deeplight: Learning illumination for unconstrained mobile mixed reality. In:
Proceedings of the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition. vol. 2019-June, pp. 5911–5921 (2019). https://doi.org/10.
1109/CVPR.2019.00607
31. Li, M., Guo, J., Cui, X., Pan, R., Guo, Y., Wang, C., Yu, P., Pan, F.: Deep spherical
Gaussian illumination estimation for indoor. 1st ACM International Conference
on Multimedia in Asia, MMAsia 2019 19 (dec 2019). https://doi.org/10.1145/
3338533.3366562
32. Ling, Z., Xing, Z., Zhou, X., Cao, M., Zhou, G.: Panoswin: a pano-style swin
transformer for panorama understanding. In: 2023 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR). pp. 17755–17764 (aug 2023).
https://doi.org/10.1109/CVPR52729.2023.01703
33. Somanath, G., Kurz, D.: HDR Environment Map Estimation for Real-Time
Augmented Reality. Proceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition pp. 11293–11301 (nov 2021). https:
//doi.org/10.1109/CVPR46437.2021.01114
34. Srinivasan, P.P., Mildenhall, B., Tancik, M., Barron, J.T., Tucker, R., Snavely, N.:
Lighthouse: Predicting Lighting Volumes for Spatially-Coherent Illumination. Pro-
ceedings of the IEEE Computer Society Conference on Computer Vision and Pat-
tern Recognition pp. 8077–8086 (2020). https://doi.org/10.1109/CVPR42600.
2020.00810
35. Su, Y.C., Grauman, K.: Learning Spherical Convolution for 360° Recognition.
IEEE Transactions on Pattern Analysis and Machine Intelligence 44(11), 8371–
8386 (nov 2022). https://doi.org/10.1109/TPAMI.2021.3113612
36. Tsai, Y.T., Shih, Z.C.: All-frequency precomputed radiance transfer using spherical
radial basis functions and clustered tensor approximation. ACM Transactions on
Graphics 25(3), 967–976 (jul 2006). https://doi.org/10.1145/1141911.1141981,
https://dl.acm.org/doi/abs/10.1145/1141911.1141981
37. Wang, H., Xiang, X., Fan, Y., Xue, J.H.: Customizing 360-Degree Panoramas
through Text-to-Image Diffusion Models. In: Proceedings of the IEEE/CVF Winter
Conference on Applications of Computer Vision (WACV). pp. 4933–4943 (January
2024)
38. Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-
Resolution Image Synthesis and Semantic Manipulation with Conditional GANs.
Proceedings of the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition pp. 8798–8807 (dec 2018). https://doi.org/10.1109/CVPR.
2018.00917
39. Wang, Z., Cun, X., Bao, J., Zhou, W., Liu, J., Li, H.: Uformer: A General U-
Shaped Transformer for Image Restoration. Proceedings of the IEEE Computer
Society Conference on Computer Vision and Pattern Recognition 2022-June,
17662–17672 (jun 2022). https://doi.org/10.1109/CVPR52688.2022.01716
40. Weber, H., Garon, M., Lalonde, J.F.: Editable Indoor Lighting Estimation. ECCV
(nov 2022). https://doi.org/10.1007/978-3-031- 20068-7_39
41. Wiles, O., Gkioxari, G., Szeliski, R., Johnson, J.: Synsin: End-to-end view synthesis
from a single image. Proceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition pp. 7465–7475 (2020). https://doi.
org/10.1109/CVPR42600.2020.00749
42. Yun, H., Lee, S., Kim, G.: Panoramic Vision Transformer for Saliency Detection
in 360° Videos. In: ECCV (2022), https://github.com/hs-yn/PAVER
43. Zhan, F., Zhang, C., Yu, Y., Chang, Y., Lu, S., Ma, F., Xie, X.: EMLight: Lighting
Estimation via Spherical Distribution Approximation. 35th AAAI Conference on
Artificial Intelligence, AAAI 2021 4B, 3287–3295 (dec 2021). https://doi.org/
10.1609/aaai.v35i4.16440,https://arxiv.org/abs/2012.11116v1
44. Zhang, J., Lalonde, J.F.: Learning High Dynamic Range from Outdoor Panora-
mas. Proceedings of the IEEE International Conference on Computer Vision 2017-
Octob, 4529–4538 (2017). https://doi.org/10.1109/ICCV.2017.484
45. Zhang, J., Sunkavalli, K., Hold-Geoffroy, Y., Hadap, S., Eisenman, J., Lalonde,
J.F.: All-weather deep outdoor lighting estimation. Proceedings of the IEEE Com-
puter Society Conference on Computer Vision and Pattern Recognition 2019-
June, 10150–10158 (jun 2019). https://doi.org/10.1109/CVPR.2019.01040
46. Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The Unreason-
able Effectiveness of Deep Features as a Perceptual Metric. Proceedings of the
IEEE Computer Society Conference on Computer Vision and Pattern Recog-
nition pp. 586–595 (jan 2018). https://doi.org/10.1109/CVPR.2018.00068,
https://arxiv.org/abs/1801.03924v2
47. Zheng, J., Zhang, J., Li, J., Tang, R., Gao, S., Zhou, Z.: Structured3D: A Large
Photo-realistic Dataset for Structured 3D Modeling. Lecture Notes in Computer
Science (including subseries Lecture Notes in Artificial Intelligence and Lecture
Notes in Bioinformatics) 12354 LNCS, 519–535 (aug 2019). https://doi.org/
10.1007/978-3-030-58545- 7_30