Highlights
Sphere2Vec: A General-Purpose Location Representation Learning over a Spherical Surface for
Large-Scale Geospatial Predictions
Gengchen Mai, Yao Xuan, Wenyun Zuo, Yutong He, Jiaming Song, Stefano Ermon, Krzysztof Janowicz, Ni Lao
•We propose a general-purpose spherical location encoder, Sphere2Vec, which, as far as we know, is the first location encoder that aims at preserving spherical distances.
•We provide a theoretical proof of the spherical-distance-preserving nature of Sphere2Vec.
•We provide theoretical proofs showing why previous 2D location encoders and NeRF-style 3D location encoders cannot model spherical distances correctly.
•We first construct 20 synthetic datasets based on mixtures of von Mises-Fisher (MvMF) distributions and show that Sphere2Vec outperforms all baseline models, including the state-of-the-art (SOTA) 2D location encoders and NeRF-style 3D location encoders, on all these datasets with up to a 30.8% error rate reduction.
•Next, we conduct extensive experiments on seven real-world datasets for three geo-aware image classification tasks.
Results show that Sphere2Vec outperforms all baseline models on all datasets.
•Further analysis shows that Sphere2Vec produces finer-grained and more compact spatial distributions and performs significantly better than 2D and 3D Euclidean location encoders in the polar regions and in areas with sparse training samples.
Sphere2Vec: A General-Purpose Location Representation Learning
over a Spherical Surface for Large-Scale Geospatial Predictions
Gengchen Mai a,d,e,f,∗,1, Yao Xuan b,1, Wenyun Zuo c, Yutong He d, Jiaming Song d, Stefano Ermon d, Krzysztof Janowicz e,f,g and Ni Lao h,2
aSpatially Explicit Artificial Intelligence Lab, Department of Geography, University of Georgia, Athens, Georgia, 30602, USA
bDepartment of Mathematics, University of California Santa Barbara, Santa Barbara, California, 93106, USA
cDepartment of Biology, Stanford University, Stanford, California, 94305, USA
dDepartment of Computer Science, Stanford University, Stanford, California, 94305, USA
eSTKO Lab, Department of Geography, University of California Santa Barbara, Santa Barbara, California, 93106, USA
fCenter for Spatial Studies, University of California Santa Barbara, Santa Barbara, California, 93106, USA
gDepartment of Geography and Regional Research, University of Vienna, Vienna, 1040, Austria
hGoogle, Mountain View, California, 94043, USA
ARTICLE INFO
Keywords:
Spherical Location Encoding
Spatially Explicit Artificial Intelligence
Map Projection Distortion
Geo-Aware Image Classification
Fine-grained Species Recognition
Remote Sensing Image Classification
ABSTRACT
Generating learning-friendly representations for points in space is a fundamental and long-standing
problem in machine learning. Recently, multi-scale encoding schemes (such as Space2Vec and NeRF)
were proposed to directly encode any point in 2D or 3D Euclidean space as a high-dimensional vector, and have been successfully applied to various (geo)spatial prediction and generative tasks. However,
all current 2D and 3D location encoders are designed to model point distances in Euclidean space. So
when applied to large-scale real-world GPS coordinate datasets (e.g., species or satellite images taken
all over the world), which require distance metric learning on the spherical surface, both types of
models can fail due to the map projection distortion problem (2D) and the spherical-to-Euclidean
distance approximation error (3D). To solve these problems, we propose a multi-scale location
encoder called Sphere2Vec which can preserve spherical distances when encoding point coordinates
on a spherical surface. We develop a unified view of distance-preserving encoding on spheres based
on the Double Fourier Sphere (DFS). We also provide theoretical proof that the Sphere2Vec encoding
preserves the spherical surface distance between any two points, while existing encoding schemes
such as Space2Vec and NeRF do not. Experiments on 20 synthetic datasets show that Sphere2Vec
can outperform all baseline models including the state-of-the-art (SOTA) 2D location encoder (i.e.,
Space2Vec) and 3D encoder NeRF on all these datasets with up to 30.8% error rate reduction. We then
apply Sphere2Vec to three geo-aware image classification tasks - fine-grained species recognition,
Flickr image recognition, and remote sensing image classification. Results on 7 real-world datasets
show the superiority of Sphere2Vec over multiple 2D and 3D location encoders on all three tasks.
Further analysis shows that Sphere2Vec outperforms other location encoder models, especially in the polar regions and in data-sparse areas, because of its spherical-surface-distance-preserving nature.
Code and data of this work are available at https://gengchenmai.github.io/sphere2vec-website/.
1. Introduction
The fact that the Earth is round, not planar, should surprise nobody (Chrisman, 2017). However, studying geospatial problems on a flat map with plane analytic geometry (Boyer, 2012) is still the common practice adopted by most of the geospatial community, and it is well supported by the software and technologies of geographic information systems (GIS). Moreover, over the years, certain programmers and researchers have blurred the distinction between a (spherical) geographic coordinate system and a (planar) projected coordinate system (Chrisman, 2017), and directly treated latitude-longitude pairs as 2D Cartesian
∗Corresponding author
gengchen.mai25@uga.edu (G. Mai); yxuan@ucsb.edu (Y. Xuan)
https://gengchenmai.github.io/ (G. Mai)
ORCID (s): 0000-0002-7818-7309 (G. Mai)
1Both authors contributed equally to this work.
2Work done while working at mosaix.ai
coordinates for analytical purposes. The resulting distorted pseudo-projection, the so-called Plate Carrée, although geometrically meaningless, has been used unknowingly in much scientific work across different disciplines. This blindness to the obviously round Earth and ignorance of the distortion introduced by various map projections have led to tremendous negative effects and major mistakes. For example, a typical mistake caused by the Mercator projection is that it leads people to believe that Greenland is the same size as Africa or that Alaska looms larger than Mexico (Sokol, 2021). In fact, Greenland is no bigger than the Democratic Republic of Congo (Morlin-Yron, 2017), and Alaska is smaller than Mexico. A more extreme case concerning France was documented by Harmel (2009) during the period of the single area payment scheme. After converting from the old national coordinate system (a Lambert conformal conic projection) to the new coordinate system (RGF 93), subsidies to the agriculture sector were reduced by 17 million euros because of the reduced scale error in the map projection.
Figure 1: Applying Sphere2Vec to the geo-aware image classification task. Here, we use fine-grained species recognition and remote sensing (RS) image classification as examples. Given a species image 𝐈, it is very difficult to decide whether it shows an Arctic fox or a gray fox based on appearance alone. However, if we know this image was taken in the Arctic area, then we have more confidence to say this is an Arctic fox. Similarly, overhead remote sensing images of factories and multi-unit residential buildings might look similar. However, they are located in neighborhoods with different land use types, which can be estimated as geographic priors by a location encoder. So the idea of geo-aware image classification is to combine (the red box) the predictions from an image encoder (the orange box) and a location encoder (the blue box). The image encoder (the orange box) can be a pretrained model such as an InceptionV3 network (Mac Aodha et al., 2019) for species recognition or a MoCo-V2+TP model (Ayush et al., 2020) for RS image classification. We can append a separate image classifier 𝐐 at the end of the image encoder 𝐅() and fine-tune the whole image classification model on the corresponding training dataset in a supervised manner to obtain the probability distribution of image labels for a given image 𝐈, i.e., 𝑃(𝑦 | 𝐈). The location encoder (the blue box) can be Sphere2Vec or any other inductive location encoder (Chu et al., 2019; Mac Aodha et al., 2019; Mai et al., 2020b; Mildenhall et al., 2020). Supervised training of the location encoder 𝐸𝑛𝑐() together with a location classifier 𝐓 yields the geographic prior distribution of image labels 𝑃(𝑦 | 𝐱). The predictions from both components are combined (multiplied) to make the final prediction (the red box). The dotted lines indicate that there is no back-propagation through these lines.
Subsequently, this practice of ignoring the round Earth
has been adopted by many recent geospatial artificial in-
telligence (GeoAI) (Hu et al.,2019;Janowicz et al.,2020)
research on problems such as climate extremes forecasting
(Ham et al.,2019), species distribution modeling (Berg et al.,
2014), location representation learning (Mai et al.,2020b),
and trajectory prediction (Rao et al.,2020). Due to the lack
of interpretability of these deep neural network models,
this issue has not attracted much attention from the geospatial community.
It is acceptable that projection errors might be negligible in small-scale (e.g., neighborhood-level or city-level) geospatial studies. However, they become non-negligible when we conduct research at a country or even global scale. Meanwhile, the demand for representation and prediction learning at a global scale is growing dramatically due to emerging global-scale issues, such as the transmission path of the latest pandemic (Chinazzi et al., 2020), the long-lasting issue of malaria (Caminade et al., 2014), threatened global biodiversity (Di Marco et al., 2019; Ceballos et al., 2020), and numerous ecosystem and social system responses to climate change (Hansen and Cramer, 2015). This trend
urgently calls for GeoAI models that can avoid map pro-
jection errors and directly perform calculation on a round
planet (Chrisman,2017). To achieve this goal, we need a
representation learning model which can directly encode
point coordinates on a spherical surface into the embedding
space such that the resulting location embeddings preserve
the spherical distances (e.g., great circle distance3) between
two points. With such a representation, existing neural
network architectures can operate on spherical-distance-kept location embeddings, enabling calculation on a round planet.
In fact, such location representation learning models
are usually termed location encoders which were originally
developed to handle 2D or 3D Cartesian coordinates (Chu
et al.,2019;Mac Aodha et al.,2019;Mai et al.,2020b;
Zhong et al.,2020;Mai et al.,2022b;Mildenhall et al.,2021;
Schwarz et al.,2020;Niemeyer and Geiger,2021;Barron
et al., 2021; Marí et al., 2022; Xiangli et al., 2022). Location encoders map a point in a 2D or 3D Euclidean space
(Zhong et al.,2020;Mildenhall et al.,2021;Schwarz et al.,
2020;Niemeyer and Geiger,2021) into a high dimensional
3https://en.wikipedia.org/wiki/Great-circle_distance
(a) Image (b) Arctic Fox (c) 𝑤𝑟𝑎𝑝 (d) 𝑔𝑟𝑖𝑑 (e) 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+
(f) Image (g) Bat-Eared Fox (h) 𝑤𝑟𝑎𝑝 (i) 𝑔𝑟𝑖𝑑 (j) 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+
(k) Image (l) Factory or powerplant (m) 𝑤𝑟𝑎𝑝 (n) 𝑔𝑟𝑖𝑑 (o) 𝑑𝑓𝑠
(p) Image (q) Multi-unit residential (r) 𝑤𝑟𝑎𝑝 (s) 𝑔𝑟𝑖𝑑 (t) 𝑑𝑓𝑠
Figure 2: Applying location encoders to differentiate two visually similar species ((a)-(j)) or two visually similar land use types ((k)-(t)). An Arctic fox and a bat-eared fox may look very similar visually, as shown in (a) and (f). However, they have different spatial distributions. (b) and (g) show their distinct patterns in species image locations. (c)-(e): The predicted distributions of the Arctic fox from different location encoders (without images as input). (h)-(j): The predicted distributions of the bat-eared fox. Similarly, it can be hard to differentiate factories/powerplants from multi-unit residential buildings based only on their overhead satellite imagery, as shown in (k) and (p). However, as shown in (l) and (q), they have very different global spatial distributions. (m)-(o) and (r)-(t) show the predicted spatial distributions of factories/powerplants and multi-unit residential buildings from different location encoders. We can see that while 𝑤𝑟𝑎𝑝 (Mac Aodha et al., 2019) produces an over-generalized spatial distribution, 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+ and 𝑑𝑓𝑠 (our models) produce more compact and fine-grained distributions in the polar regions and in data-sparse areas such as Africa (see Figures 2g-2j). 𝑔𝑟𝑖𝑑 (Mai et al., 2020b) is between the two. For more examples, please see Figures 13 and 14.
embedding such that the representations are more learning-
friendly for downstream machine learning models. For
example, Space2Vec (Mai et al.,2020b,a) was developed
for POI type classification, geo-aware image classification,
and geographic question answering which can accurately
model point distributions in a 2D Euclidean space. Recently,
several popular location/position encoders widely used in
the computer vision domain are also called neural implicit
functions (Anokhin et al.,2021a;He et al.,2021;Chen
et al.,2021;Niemeyer and Geiger,2021) which follow the
idea of Neural Radiance Fields (NeRF) (Mildenhall et al.,
2020) to map 2D or 3D point coordinates to visual signals
via a Fourier input mapping (Tancik et al.,2020;Anokhin
et al.,2021a;He et al.,2021), or so-called Fourier position
encoding (Mildenhall et al.,2020;Schwarz et al.,2020;
Niemeyer and Geiger,2021), followed by a Multi-Layer
Perceptron (MLP). To date, these 2D/3D Euclidean location encoders have shown promising performance
on multiple tasks across different domains including geo-
aware image classification (Chu et al.,2019;Mac Aodha
et al.,2019;Mai et al.,2020b), POI classification (Mai et al.,
2020b), trajectory prediction (Xu et al.,2018), geographic
question answering (Mai et al.,2020a), 2D image superres-
olution(Anokhin et al.,2021a;Chen et al.,2021;He et al.,
2021), 3D protein structure reconstruction (Zhong et al.,
2020), 3D scenes representation for view synthesis (Milden-
hall et al.,2020;Barron et al.,2021;Tancik et al.,2022;
Marí et al.,2022;Xiangli et al.,2022) and novel image/view
generation (Schwarz et al.,2020;Niemeyer and Geiger,
2021). However, similar to the above-mentioned France case,
when applying the state-of-the-art (SOTA) 2D Euclidean
location encoders (Mac Aodha et al.,2019;Mai et al.,2020b)
to large-scale real-world GPS coordinate datasets such as
remote sensing images taken all over the world which require
distance metric learning on the spherical surface, a map
projection distortion problem (Williamson and Browning,
1973;Chrisman,2017) emerges, especially in the polar
areas. On the other hand, the NeRF-style 3D Euclidean loca-
tion encoders (Mildenhall et al.,2020;Schwarz et al.,2020;
Niemeyer and Geiger,2021) are commonly used to model
point distances in the 3D Euclidean space, but not capable
of accurately modeling the distances on a complex manifold
such as spherical surfaces. Directly applying NeRF-style
models on these datasets means these models have to approx-
imate the spherical distances with 3D Euclidean distances
which leads to a distance metric approximation error. This
highlights the necessity of such a spherical location encoder
discussed above.
In this work, we propose a multi-scale spherical location
encoder, Sphere2Vec, which can directly encode spherical
coordinates while avoiding the map projection distortion and
spherical-to-Euclidean distance approximation error. The
multi-scale encoding method utilizes 2D Discrete Fourier
Transform4 basis (𝑂(𝑆²) terms) or a subset (𝑂(𝑆) terms)
of it while still being able to correctly measure the spherical
distance. Following previous work we use location encoding
to learn the geographic prior distribution of different image
labels so that given an image and its associated location, we
can combine the prediction of the location encoder and that
from the state-of-the-art image classification models, e.g.,
Inception V3 (Szegedy et al., 2016), to improve the image classification accuracy. Figure 1 illustrates the whole architecture. We demonstrate the effectiveness of Sphere2Vec on
geo-aware image classification tasks including fine-grained
species recognition (Chu et al.,2019;Mac Aodha et al.,
2019;Mai et al.,2020b), Flickr image recognition (Tang
et al.,2015;Mac Aodha et al.,2019), and remote sensing
image classification (Christie et al.,2018;Ayush et al.,
2020). Figure 2c-2e and 2h-2j show the predicted species
distributions of Arctic fox and bat-eared fox from three
different models. Figure 2m-2o and 2r-2t show the predicted
land use distributions of factory or powerplant and multi-
unit residential building from three different models. In
summary, the contributions of our work are:
1. We propose a spherical location encoder, Sphere2Vec,
which, as far as we know, is the first inductive location encoding scheme that aims at preserving spherical distances. We also developed a unified view of distance-preserving encoding methods on spheres based on the Double Fourier Sphere (DFS) (Merilees,
1973;Orszag,1974).
2. We provide theoretical proof that Sphere2Vec encod-
ings can preserve spherical surface distances between
points. As a comparison, we also prove that the 2D
location encoders (Gao et al.,2019;Mai et al.,2020b,
2023c) model latitude and longitude differences sepa-
rately, and NeRF-style 3D location encoders (Milden-
hall et al.,2020;Schwarz et al.,2020;Niemeyer and
Geiger,2021) model axis-wise differences between
two points in 3D Euclidean space separately – none
of them can correctly model spherical distances.
3. We first conduct experiments on 20 synthetic datasets
generated based on the mixture of von Mises–Fisher
distribution (MvMF). We show that Sphere2Vec is
able to outperform all baselines including the state-of-
the-art (SOTA) 2D location encoders and NeRF-style
3D location encoders on all 20 synthetic datasets with up to a 30.8% error rate reduction. Results show that 2D location encoders are more powerful than NeRF-style 3D location encoders on all synthetic datasets. Compared with those 2D location encoders, Sphere2Vec is more effective when the dataset has a large data bias toward the polar areas.
4. We also conduct extensive experiments on seven real-
world datasets for three geo-aware image classifica-
tion tasks. Results show that due to its spherical dis-
tance preserving ability, Sphere2Vec outperforms both
4http://fourier.eng.hmc.edu/e101/lectures/Image_Processing/node6.
html
the SOTA 2D location encoder models and NeRF-
style 3D location encoders.
5. Further analysis shows that compared with 2D loca-
tion encoders, Sphere2Vec is able to produce finer-
grained and compact spatial distributions, and does
significantly better in the polar regions and areas with
sparse training samples.
The rest of this paper is structured as follows. In Section
2, we motivate our work by highlighting the importance
of the idea of calculating on the round planet. Then, we
provide a formal problem formulation of spherical location
representation learning in Section 3. Next, we briefly sum-
marize the related work in Section 4. The main contribution - Sphere2Vec - is discussed in detail in Section 5. Then, Section 6 lists all baseline models we consider in this work. The theoretical limitations of the 2D location encoder 𝑔𝑟𝑖𝑑 as well as NeRF-style 3D location encoders are discussed in Section 7. Section 8 presents the experimental results on the synthetic datasets. Then, Section 9 presents our experimental results
on 7 real-world datasets for geo-aware image classification.
Finally, we conclude this paper in Section 10. Code and data
of this work are available at https://gengchenmai.github.io/
sphere2vec-website/.
2. Calculating on a Round Planet
The blindness to the round Earth or the inappropriate
usage of map projections can lead to tremendous and unex-
pected effects especially when we study a global scale prob-
lem since map projection distortion is unavoidable when
projecting spherical coordinates into 2D space.
No map projection can preserve distances in all directions. A so-called equidistant projection can only preserve distances along one direction, e.g., the longitude direction for the equirectangular projection (see Figure 3d), while conformal map projections (see Figure 3a) preserve directions (angles) at the cost of large distance distortion. For
a comprehensive overview of map projections and their
distortions, see Mulcahy and Clarke (2001).
When we estimate probability distributions at a global
scale (e.g., species distributions or land use types over
the world) with a neural network architecture, using 2D
Euclidean-based GeoAI models with projected spatial data
instead of directly modeling these distributions on a spher-
ical surface will lead to unavoidable map projection distor-
tions and suboptimal results. This highlights the importance of calculating on a round planet (Chrisman, 2017) and the necessity of a spherical distance-kept location encoder.
3. Problem Formulation
Distributed representation of point-features on the
spherical surface can be formulated as follows. Given a set
of points {𝐱𝑖} on the surface of a sphere 𝕊², e.g., locations of remote sensing images taken all over the world, where 𝐱𝑖 = (𝜆𝑖, 𝜙𝑖) ∈ 𝕊² indicates a point with longitude 𝜆𝑖 ∈ [−𝜋, 𝜋) and latitude 𝜙𝑖 ∈ [−𝜋∕2, 𝜋∕2], we define a
(a) Mercator (b) Miller (c) Sinusoidal (d) Equirectangular
Figure 3: An illustration of map projection distortion: (a)-(d): Tissot indicatrices for four projections. Equal-area circles are placed at different locations to show how the map distortion affects their shapes.
function Enc_θ(𝐱) : 𝕊² → ℝ^d, which is parameterized by θ and maps any coordinate 𝐱 on the spherical surface 𝕊² to a vector representation of dimension d. In the following, we use Enc(𝐱) as an abbreviation for Enc_θ(𝐱).
Let Enc(𝐱) = 𝐍𝐍(PE_S(𝐱)), where 𝐍𝐍() is a learnable multi-layer perceptron with h hidden layers and k neurons per layer. We want to find a position encoding function PE_S(𝐱) which is a one-to-one mapping from each point 𝐱𝑖 = (𝜆𝑖, 𝜙𝑖) ∈ 𝕊² to a multi-scale representation, with S being the total number of scales.
We expect to find a function 𝑃 𝐸𝑆(𝐱)such that the result-
ing multi-scale representation of 𝐱preserves the spherical
surface distance while it is more learning-friendly for the
downstream neural network model 𝐍𝐍(). More concretely,
we’d like to use position encoding functions which satisfy
the following requirement:
\langle PE_S(\mathbf{x}_1), PE_S(\mathbf{x}_2) \rangle = f(\Delta D), \quad \forall\, \mathbf{x}_1, \mathbf{x}_2 \in \mathbb{S}^2, \qquad (1)
where \langle \cdot, \cdot \rangle is the cosine similarity function between two embeddings, \Delta D \in [0, \pi R] is the spherical surface distance between \mathbf{x}_1 and \mathbf{x}_2, R is the radius of the sphere, and f(x) is a strictly monotonically decreasing function for x \in [0, \pi R].
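To make this formulation concrete, below is a minimal PyTorch sketch (our own illustration, not the authors' released implementation) of the Enc(𝐱) = 𝐍𝐍(PE_S(𝐱)) structure: a pluggable position encoder followed by a multi-layer perceptron with h hidden layers of k neurons each. The class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class LocationEncoder(nn.Module):
    """Enc(x) = NN(PE_S(x)): a position encoder followed by an MLP.
    `pos_encoder` is any callable mapping a (batch, 2) tensor of
    (lon, lat) in radians to a (batch, pe_dim) feature tensor."""
    def __init__(self, pos_encoder, pe_dim, h=1, k=512, out_dim=256):
        super().__init__()
        self.pos_encoder = pos_encoder
        layers, in_dim = [], pe_dim
        for _ in range(h):                       # h hidden layers of k neurons each
            layers += [nn.Linear(in_dim, k), nn.ReLU()]
            in_dim = k
        layers.append(nn.Linear(in_dim, out_dim))
        self.nn = nn.Sequential(*layers)

    def forward(self, lonlat):
        return self.nn(self.pos_encoder(lonlat))
```

Any of the position encoders introduced in Section 5 (or the baselines in Section 6) can be plugged in as `pos_encoder`, as long as `pe_dim` matches its output dimension.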
4. Related Work
4.1. Neural Implicit Functions and NeRF
As an increasingly popular family of models in the
computer vision domain, neural implicit functions (Anokhin
et al.,2021a;He et al.,2021;Chen et al.,2021;Niemeyer
and Geiger,2021) refer to the neural network architectures
that directly map 2D or 3D coordinates to visual signals
via a Fourier input mapping/position encoding (Tancik et al.,
2020;Anokhin et al.,2021a;He et al.,2021;Mildenhall
et al.,2020;Schwarz et al.,2020;Niemeyer and Geiger,
2021), followed by a Multi-Layer Perceptron (MLP).
A good example is Neural Radiance Fields (NeRF)
(Mildenhall et al.,2020), which combines neural implicit
functions and volume rendering for novel view synthesis for
3D complex scenes. The idea of NeRF becomes very pop-
ular and many follow-up works have been done to revise the
𝑁𝑒𝑅𝐹 model in order to achieve more accurate view syn-
thesis. For example, NeRF in the Wild (NeRF-W) (Martin-
Brualla et al.,2021) was proposed to learn separate transient
phenomena from each static scene to make the model robust
to radiometric variation and transient objects. Shadow NeRF
(S-NeRF) (Derksen and Izzo,2021) was proposed to exploit
the direction of solar rays to obtain a more realistic view
synthesis on multi-view satellite photogrammetry. Similarly,
Satellite NeRF (Sat-NeRF) (Marí et al.,2022) combines
NeRF with native satellite camera models to achieve robust-
ness to transient phenomena that cannot be explained by the
position of the sun to solve the same task. A more notable example is GIRAFFE (Niemeyer and Geiger, 2021), a NeRF-based deep generative model that achieves more controllable image synthesis. All these NeRF variations use the same NeRF Fourier position encoding, and they all use it for the same generative task – novel image synthesis. Moreover, although
S-NeRF and Sat-NeRF work on geospatial data, i.e., satellite
images, they focus on rather small geospatial scales, e.g., city
scales, in which map projection distortion can be ignored.
In contrast, we investigate the advantages and drawbacks of
various location encoders in large-scale (e.g., global-scale)
geospatial prediction tasks which are discriminative tasks.
We use NeRF position encoding as one of our baselines.
Several works have also discussed the possibility of revising the NeRF position encoding. The original encoding method
takes a single 3D point as input which ignores both the
relative footprint of the corresponding image pixel and the
length of the interval along the ray which leads to aliasing
artifacts when rendering novel camera trajectories (Tancik
et al.,2022). To fix this issue, Mip-NeRF (Barron et al.,
2021) proposed a new Fourier position encoding called
integrated positional encoding (IPE). Instead of encoding
one single 3D point, IPE encodes 3D conical frustums
approximated by multivariate Gaussian distributions which
are sampled along the ray based on the projected pixel
footprints. Block-NeRF (Tancik et al.,2022) adopted the
IPE idea and showed how to scale NeRF to render city-
scale scenes. Similarly, BungeeNeRF (Xiangli et al.,2022)
also used the IPE model to develop a progressive NeRF that
can do multi-scale rendering for satellite images in different
spatial scales. In this work, we focus on encoding a single
point on the spherical surface, not 3D conical frustums. So
IPE is not considered as one of the baselines.
Neural implicit functions are also popular for other com-
puter vision tasks such as image superresolution (Anokhin
et al.,2021a;Chen et al.,2021;He et al.,2021) and image
compression (Dupont et al.;Strümpler et al.,2022).
4.2. Location Encoder
Location encoders (Chu et al.,2019;Mac Aodha et al.,
2019;Mai et al.,2020b;Zhong et al.,2020;Mai et al.,
2023c) are neural network architectures which encode points
in low-dimensional (2D or 3D) spaces (Zhong et al., 2020) into high-dimensional embeddings. There has been much
research on developing inductive learning-based location
encoders. Most of them directly apply Multi-Layer Percep-
tron (MLP) to 2D coordinates to get a high dimensional
location embedding for downstream tasks such as pedestrian
trajectory prediction (Xu et al.,2018) and geo-aware image
classification (Chu et al., 2019). Recently, Mac Aodha et al. (2019) applied sinusoidal functions to encode the latitude and longitude of each image before feeding them into
MLPs. All of the above approaches deploy location encoding
at a single-scale.
Inspired by the position encoder in Transformer (Vaswani
et al.,2017) and Neuroscience research on grid cells (Banino
et al.,2018;Cueva and Wei,2018) of mammals, Mai et al.
(2020b) proposed to apply multi-scale sinusoid functions to
encode locations in 2D Euclidean space before feeding into
MLPs. The multi-scale representations have the advantage of
capturing spatial feature distributions with different char-
acteristics. Similarly, Zhong et al. (2020) utilized a multi-
scale location encoder for the position of proteins’ atoms
in 3D Euclidean space for protein structure reconstruction
with great success. Location encoders can be incorporated
into state-of-the-art models for many tasks to make them
spatially explicit (Yan et al.,2019a;Janowicz et al.,2020;
Mai et al.,2022a,2023c).
Compared with well-established kernel-based approaches
(Schölkopf, 2001; Xu et al., 2018) such as Radial Basis Functions (RBF), which require memorizing the training
examples as the kernel centers for a robust prediction,
inductive-learning-based location encoders (Chu et al.,2019;
Mac Aodha et al.,2019;Mai et al.,2020b;Zhong et al.,
2020) have many advantages: 1) They are more memory ef-
ficient since they do not need to memorize training samples;
2) Unlike RBF, the performance on unseen locations does
not depend on the number and distribution of kernels. More-
over, Gao et al. (2019) have shown that grid-like periodic
representation of locations can preserve absolute position
information, relative distance, and direction information in
2D Euclidean space. Mai et al. (2020b) further show that it
benefits the generalizability of down-stream models. For a
comprehensive survey of different location encoders, please
refer to Mai et al. (2022b).
Despite all these successes in location encoding research, none of these works considers location representation learning on a spherical surface, which is in fact critical for global-scale geospatial studies. Our work aims at filling this gap.
4.3. Machine Learning Models on Spheres
Recently, there has been an increasing amount of work
on designing machine learning models for prediction tasks
on spherical surfaces. For the omnidirectional image classi-
fication task, both Cohen et al. (2018) and Coors et al. (2018)
designed different spherical versions of the traditional con-
volutional neural network (CNN) models in which the CNN
filters explicitly consider map projection distortion. In terms
of image geolocalization (Izbicki et al.,2019a) and text
geolocalization (Izbicki et al.,2019b), a loss function based
on the mixture of von Mises-Fisher distributions (MvMF)– a
spherical analog of the Gaussian mixture model (GMM)– is
used to replace the traditional cross-entropy loss for geolo-
calization models (Izbicki et al.,2019a,b). All these works
are closely related to geometric deep learning (Bronstein
et al., 2017). They show the importance of considering the
spherical geometry instead of projecting it back to a 2D
plane, yet none of them considers representation learning of
spherical coordinates in the embedding space.
4.4. Spatially Explicit Artificial Intelligence
There has been much work in improving the perfor-
mance of current state-of-the-art artificial intelligence and
machine learning models by using spatial features or spatial
inductive bias – so-called spatially explicit artificial intelli-
gence (Yan et al.,2017;Mai et al.,2019;Yan et al.,2019a,b;
Janowicz et al.,2020;Li et al.,2021;Zhu et al.,2021;Janow-
icz et al.,2022;Liu and Biljecki,2022;Zhu et al.,2022;
Mai et al.,2022a,2023b;Huang et al.,2023), or SpEx-AI.
The spatial inductive bias in these models includes: spatial
dependency (Kejriwal and Szekely,2017;Yan et al.,2019a),
spatial heterogeneity (Berg et al.,2014;Chu et al.,2019;
Mac Aodha et al.,2019;Mai et al.,2020b;Zhu et al.,2021;
Gupta et al.,2021;Xie et al.,2021), map projection (Cohen
et al.,2018;Coors et al.,2018;Izbicki et al.,2019a,b), scale
effect (Weyand et al.,2016;Mai et al.,2020b), and so on.
4.5. Pseudospectral Methods on Spheres
Multiple studies have focused on numerical
solutions on spheres, for example, in weather prediction
(Orszag,1972,1974;Merilees,1973). The main idea is
so-called pseudospectral methods which leverage truncated
discrete Fourier transformation on spheres to achieve com-
putation efficiency while avoiding the error caused by map
projection distortion. The particular set of basis functions to
be used depends on the particular problem. However, they do
not aim at learning good representations in machine learning
models. In this study, we try to make connections to these
approaches and explore how their insights can be realized in
a deep learning model.
(a) 𝑑𝑓𝑠 (b) 𝑔𝑟𝑖𝑑 (c) 𝑠𝑝ℎ𝑒𝑟𝑒𝐶 (d) 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+ (e) 𝑠𝑝ℎ𝑒𝑟𝑒𝑀 (f) 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+
Figure 4: Patterns of different encoders. A blue point at (𝜆(𝑚), 𝜙(𝑛)) means that the interaction terms of the trigonometric functions of 𝜆(𝑚) and 𝜙(𝑛) are included in the encoder; the 𝜆 and 𝜙 axes correspond to single terms with no interactions.
5. Method
Our main contribution - the design of the spherical distance-kept location encoder 𝐸𝑛𝑐(𝐱), Sphere2Vec - will be presented in Section 5.1. We develop a unified view of distance-preserving encoding on spheres based on the Double Fourier Sphere (DFS) (Merilees, 1973; Orszag, 1974). The resulting
location embedding 𝐩[𝐱] = 𝐸𝑛𝑐 (𝐱)is a general-purpose
embedding which can be utilized in different decoder archi-
tectures for various tasks. In Section 5.2, we briefly show
how to utilize the proposed 𝐸𝑛𝑐(𝐱)in the geo-aware image
classification task.
5.1. Sphere2Vec
The multi-scale location encoder defined in Section 3
is in the form of 𝐸𝑛𝑐(𝐱) = 𝐍𝐍(𝑃 𝐸𝑆(𝐱)).𝑃 𝐸𝑆(𝐱)is a
concatenation of multi-scale spherical spatial features of 𝑆
levels. In the following, we call 𝐸𝑛𝑐(𝐱)location encoder and
its component 𝑃 𝐸𝑆(𝐱)position encoder.
𝑑𝑓𝑠 Double Fourier Sphere (DFS) (Merilees, 1973; Orszag, 1974) is a simple yet successful pseudospectral method, which is computationally efficient and has been applied to the analysis of large-scale phenomena such as weather (Sun et al., 2014) and black holes (Bartnik and Norton, 2000). Our first intuition is to use the basis functions of DFS, which preserve periodicity in both the longitude and latitude directions, to decompose 𝐱 = (𝜆, 𝜙) into a high-dimensional vector:
PE^{dfs}_S(\mathbf{x}) = \bigcup_{n=0}^{S-1} [\sin\phi^{(n)}, \cos\phi^{(n)}] \;\cup\; \bigcup_{m=0}^{S-1} [\sin\lambda^{(m)}, \cos\lambda^{(m)}] \;\cup\; \bigcup_{n=0}^{S-1} \bigcup_{m=0}^{S-1} [\cos\phi^{(n)}\cos\lambda^{(m)}, \cos\phi^{(n)}\sin\lambda^{(m)}, \sin\phi^{(n)}\cos\lambda^{(m)}, \sin\phi^{(n)}\sin\lambda^{(m)}], \qquad (2)
where λ^(m) = λ / r^(m) and φ^(n) = φ / r^(n); r^(m) and r^(n) are scaling factors controlled by the current scales m and n. Let r_min and r_max be the minimum and maximum scaling factors, and let g = r_max / r_min.⁵ Then r^(s) = r_min · g^(s/(S−1)), where s is either m or n. ∪ denotes vector concatenation, and the union over s = 0, ..., S−1 indicates vector concatenation across different scales. This encoding basically lets all S scales of the φ terms interact with all S scales of the λ terms. It introduces a position encoder with an O(S²)-dimensional output, which increases the memory
5In practice we fix 𝑟𝑚𝑎𝑥 = 1 meaning no scaling of 𝜆, 𝜙.
burden in training and hurts generalization. See Figure 4a for
an illustration of the used 𝑂(𝑆2)terms. An encoder might
achieve better results by only using a subset of these terms.
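As an illustration of Equation 2 and the scale factors defined above, the following NumPy sketch (our own; helper names such as `scales` and `pe_dfs` are hypothetical) computes the geometric scale grid r^(s) and the O(S²)-dimensional DFS features for a single (λ, φ) point given in radians:

```python
import numpy as np

def scales(S, r_min=1e-3, r_max=1.0):
    # r^(s) = r_min * g^(s/(S-1)) with g = r_max / r_min; assumes S >= 2
    # (for S = 1 the paper fixes r = r_max = 1)
    g = r_max / r_min
    return r_min * g ** (np.arange(S) / max(S - 1, 1))

def pe_dfs(lon, lat, S, r_min=1e-3):
    """Double Fourier Sphere features of Eq. (2); output dimension 4*S + 4*S^2."""
    r = scales(S, r_min)
    lam, phi = lon / r, lat / r                  # lambda^(m), phi^(n), each of shape (S,)
    single = np.concatenate([np.sin(phi), np.cos(phi), np.sin(lam), np.cos(lam)])
    cross = []
    for n in range(S):
        for m in range(S):
            cross += [np.cos(phi[n]) * np.cos(lam[m]), np.cos(phi[n]) * np.sin(lam[m]),
                      np.sin(phi[n]) * np.cos(lam[m]), np.sin(phi[n]) * np.sin(lam[m])]
    return np.concatenate([single, np.array(cross)])
```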
In comparison, the state-of-the-art 𝑔𝑟𝑖𝑑 (Mai et al.,
2020b) encoder defines its position encoder as:
PE^{grid}_S(\mathbf{x}) = \bigcup_{s=0}^{S-1} [\sin\phi^{(s)}, \cos\phi^{(s)}, \sin\lambda^{(s)}, \cos\lambda^{(s)}]. \qquad (3)
Here, 𝜆(𝑠)and 𝜙(𝑠)have similar definitions as 𝜆(𝑚)and
𝜙(𝑛)in Equation 2. Figure 4b illustrates the used terms of
𝑔𝑟𝑖𝑑. We can see that 𝑔𝑟𝑖𝑑 employs a subset of terms from
𝑑𝑓 𝑠. However, as we explained earlier, 𝑔𝑟𝑖𝑑 performs poorly
at a global scale due to its inability to preserve spherical
distances.
In the following we explore different subsets of DFS
terms while achieving two goals: 1) efficient representation
with 𝑂(𝑆)dimensions 2) preserving distance measures on a
spherical surface.
𝑠𝑝ℎ𝑒𝑟𝑒𝐶 Inspired by the fact that any point (x, y, z) in 3D Cartesian coordinates can be expressed by the sin and cos basis of spherical coordinates (λ, φ, plus radius)6, we define the basic form of Sphere2Vec, namely the 𝑠𝑝ℎ𝑒𝑟𝑒𝐶 encoder:
PE^{sphereC}_S(\mathbf{x}) = \bigcup_{s=0}^{S-1} [\sin\phi^{(s)}, \cos\phi^{(s)}\cos\lambda^{(s)}, \cos\phi^{(s)}\sin\lambda^{(s)}]. \qquad (4)
Figure 4c illustrates the used terms of 𝑠𝑝ℎ𝑒𝑟𝑒𝐶. To illus-
trate that 𝑠𝑝ℎ𝑒𝑟𝑒𝐶 is good at capturing spherical distance,
we take a close look at its basic case 𝑆= 1. When 𝑆= 1
and 𝑟𝑚𝑎𝑥 = 1, there is only one scale 𝑠=𝑆− 1 = 0 and
we define 𝑟(𝑠)=𝑟𝑚𝑖𝑛 ⋅𝑔𝑠∕(𝑆−1) =𝑟𝑚𝑎𝑥 = 1. The multi-scale
encoder degenerates to
PE^{sphereC}_1(\mathbf{x}) = [\sin\phi, \cos\phi\cos\lambda, \cos\phi\sin\lambda]. \qquad (5)
These three terms are included in the multi-scale version
(𝑆 > 1) and serve as the main terms at the largest scale
and also the lowest frequency (when 𝑠=𝑆− 1). The high
frequency terms are added to help the downstream neural network learn the point features more efficiently (Tancik et al., 2020). Interestingly, PE^{sphereC}_1 captures the spherical distance in a very explicit way:
6https://en.wikipedia.org/wiki/Spherical_coordinate_system
Mai et al.: Preprint submitted to Elsevier Page 7 of 29
Sphere2Vec
Theorem 1. Let \mathbf{x}_1, \mathbf{x}_2 be two points on the same sphere \mathbb{S}^2 with radius R. Then
\langle PE^{sphereC}_1(\mathbf{x}_1), PE^{sphereC}_1(\mathbf{x}_2) \rangle = \cos\Big(\frac{\Delta D}{R}\Big), \qquad (6)
where \Delta D is the great circle distance between \mathbf{x}_1 and \mathbf{x}_2. Under this metric,
\| PE^{sphereC}_1(\mathbf{x}_1) - PE^{sphereC}_1(\mathbf{x}_2) \| = 2 \sin\Big(\frac{\Delta D}{2R}\Big). \qquad (7)
Moreover, \| PE^{sphereC}_1(\mathbf{x}_1) - PE^{sphereC}_1(\mathbf{x}_2) \| \approx \frac{\Delta D}{R} when \Delta D is small w.r.t. R.
See the proof in Appendix A.1.
Since the central angle \Delta\delta = \Delta D / R \in [0, \pi] and \cos(x) is strictly monotonically decreasing for x \in [0, \pi], Theorem 1 shows that PE^{sphereC}_1(\mathbf{x}) directly satisfies our requirement in Equation 1 with f(x) = \cos(x/R).
𝑠𝑝ℎ𝑒𝑟𝑒𝑀 Considering the fact that many geographical
patterns are more sensitive to either latitude (e.g., tem-
perature, sunshine duration) or longitude (e.g., timezones,
geopolitical borderlines), we might want to focus on increas-
ing the resolution of either 𝜙or 𝜆while holding the other
relatively at a large scale. Therefore, we introduce a multi-
scale position encoder 𝑠𝑝ℎ𝑒𝑟𝑒𝑀, where interaction terms
between 𝜙and 𝜆always have one of them fixed at the top
scale:
PE^{sphereM}_S(\mathbf{x}) = \bigcup_{s=0}^{S-1} [\sin\phi^{(s)}, \cos\phi^{(s)}\cos\lambda, \cos\phi\cos\lambda^{(s)}, \cos\phi^{(s)}\sin\lambda, \cos\phi\sin\lambda^{(s)}]. \qquad (8)
This new encoder ensures that the φ term interacts with all scales of the λ terms (i.e., the λ^(s) terms) and that the λ term interacts with all scales of the φ terms (i.e., the φ^(s) terms). See Figure 4e for the used terms of 𝑠𝑝ℎ𝑒𝑟𝑒𝑀. Both PE^{sphereC}_S and PE^{sphereM}_S are multi-scale versions of a spherical distance-kept encoder (see Equation 5) and keep it as the main term in their multi-scale representations.
𝑠𝑝ℎ𝑒𝑟𝑒𝐶+ and 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ From the above analysis of the two proposed position encoders and the SOTA 𝑔𝑟𝑖𝑑 encoder, we know that 𝑔𝑟𝑖𝑑 pays more attention to the sum of cosines of the latitude and longitude differences, while our proposed encoders pay more attention to the spherical distances. In order to capture both types of information, we consider merging 𝑔𝑟𝑖𝑑 with each proposed encoder to get more powerful models that encode geographic information from different angles.
PE^{sphereC+}_S(\mathbf{x}) = PE^{sphereC}_S(\mathbf{x}) \cup PE^{grid}_S(\mathbf{x}), \qquad (9)
PE^{sphereM+}_S(\mathbf{x}) = PE^{sphereM}_S(\mathbf{x}) \cup PE^{grid}_S(\mathbf{x}). \qquad (10)
We hypothesize that encoding these terms in the multi-scale
representation would make the training of the encoder easier
and the order of the output dimension is still 𝑂(𝑆). See Figures 4d and 4f for the used terms of 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+ and 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+.
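To make Equations 3, 4, and 9 concrete, here is a minimal NumPy sketch (our own illustration; the helper names are hypothetical) of the multi-scale 𝑔𝑟𝑖𝑑, 𝑠𝑝ℎ𝑒𝑟𝑒𝐶, and 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+ position encoders for a single point given in radians:

```python
import numpy as np

def scales(S, r_min=1e-3, r_max=1.0):
    # geometric scale grid r^(s) = r_min * g^(s/(S-1)); assumes S >= 2
    g = r_max / r_min
    return r_min * g ** (np.arange(S) / max(S - 1, 1))

def pe_grid(lon, lat, S, r_min=1e-3):
    # Eq. (3): Space2Vec-style 2D Euclidean grid features, 4*S dimensions
    r = scales(S, r_min)
    lam, phi = lon / r, lat / r
    return np.concatenate([np.sin(phi), np.cos(phi), np.sin(lam), np.cos(lam)])

def pe_sphereC(lon, lat, S, r_min=1e-3):
    # Eq. (4): spherical-distance-aware terms, 3*S dimensions
    r = scales(S, r_min)
    lam, phi = lon / r, lat / r
    return np.concatenate([np.sin(phi), np.cos(phi) * np.cos(lam), np.cos(phi) * np.sin(lam)])

def pe_sphereC_plus(lon, lat, S, r_min=1e-3):
    # Eq. (9): simple concatenation of the sphereC and grid terms, 7*S dimensions
    return np.concatenate([pe_sphereC(lon, lat, S, r_min), pe_grid(lon, lat, S, r_min)])
```

𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ (Equation 10) follows the same concatenation pattern with PE^{sphereM}_S in place of PE^{sphereC}_S.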
In location encoding, the uniqueness of the encoding
results (i.e., no two different points on a sphere having the
same position encoding) is very important. PE_S(\mathbf{x}) in the five proposed methods is by design a one-to-one mapping.
Theorem 2. \forall * \in \{dfs, sphereC, sphereC+, sphereM, sphereM+\}, PE^{*}_S(\mathbf{x}) is an injective function.
See the proof in Appendix A.2.
5.2. Applying Sphere2Vec to Geo-Aware Image
Classification
Following the practice of Mac Aodha et al. (2019) and Mai et al. (2020b), we formulate the geo-aware image classification task (Chu et al., 2019; Mac Aodha et al., 2019) as follows: given an image 𝐈 taken at location/point 𝐱, we estimate which category 𝑦 it belongs to. If we assume that 𝐈 and 𝐱 are independent given 𝑦 and an even (uniform) prior 𝑃(𝑦), then we have
P(y \mid \mathbf{I}, \mathbf{x}) = \frac{P(\mathbf{I}, \mathbf{x} \mid y) P(y)}{P(\mathbf{I}, \mathbf{x})} = \frac{P(\mathbf{I} \mid y) P(\mathbf{x} \mid y) P(y)}{P(\mathbf{I}, \mathbf{x})} \qquad (11)
= \frac{P(y \mid \mathbf{I}) P(\mathbf{I})}{P(y)} \cdot \frac{P(y \mid \mathbf{x}) P(\mathbf{x})}{P(y)} \cdot \frac{P(y)}{P(\mathbf{I}, \mathbf{x})} \qquad (12)
= \frac{P(y \mid \mathbf{x}) P(y \mid \mathbf{I}) P(\mathbf{I}) P(\mathbf{x})}{P(y) P(\mathbf{I}, \mathbf{x})} \;\propto\; P(y \mid \mathbf{x}) P(y \mid \mathbf{I}) \qquad (13)
𝑃(𝑦𝐈)can be obtained by fine-tuning the state-of-the-art
image classification model for a specific task, such as a
pretrained InceptionV3 network (Mac Aodha et al.,2019) for
species recognition, or a pretrained MoCo-V2+TP (Ayush
et al.,2020) for RS image classification. To be more spe-
cific, we use a pretrained image encoder 𝐅() to extract the
embedding for each input image, i.e., 𝐅(𝐈). Then in order
to compute 𝑃(𝑦𝐈), we can either 1) fine-tune an image
classifier 𝐐based on these frozen image embeddings, or
2) fine-tune the whole image encoder architecture 𝐐(𝐅(𝐈)).
Here, 𝐐is a multilayer perceptron (MLP) followed by a
softmax activation function. Both Mac Aodha et al. (2019)
and Mai et al. (2020b) adopted the second approach which
fine-tunes the whole image classification architecture. We
also adopt the second approach to have a fair comparison
with all these previous methods. Please refer to Section 9.4.3
for an ablation study on this. The idea is illustrated in the
orange box in Figure 1.
In this work, we focus on the second component – estimating the geographic prior distribution of image label 𝑦 over the spherical surface, 𝑃(𝑦 | 𝐱) (the blue box in Figure 1). This probability distribution can be estimated by using a location encoder 𝐸𝑛𝑐(). We can use either our proposed Sphere2Vec or an existing 2D (Mai et al., 2020b; Mac Aodha et al., 2019; Chu et al., 2019) or 3D (Marí et al., 2022; Martin-Brualla et al., 2021) Euclidean location encoder. More concretely, we have 𝑃(𝑦 | 𝐱) ∝ 𝜎(𝐸𝑛𝑐(𝐱) 𝐓∶,𝑦), where 𝜎() is a sigmoid activation function. 𝐓 ∈ ℝ^{𝑑×𝑐} is a class embedding matrix (the location classifier in Figure 1) whose 𝑦-th column 𝐓∶,𝑦 ∈ ℝ^𝑑 indicates the class
embedding for class 𝑦. 𝑑 is the dimension of the location embedding 𝐩[𝐱] = 𝐸𝑛𝑐(𝐱), and 𝑐 is the total number of image classes.
The major objective is to learn 𝑃(𝑦 | 𝐱) ∝ 𝜎(𝐸𝑛𝑐(𝐱) 𝐓∶,𝑦) such that all observed species occurrences (all image locations 𝐱 as well as their associated species classes 𝑦) have maximum probability. Mac Aodha et al. (2019) used a loss function based on maximum likelihood estimation (MLE). Given a set of training samples - data points and their associated class labels 𝕏 = {(𝐱, 𝑦)} - the loss function \mathcal{L}_{image}(\mathbb{X}) is defined as:
\mathcal{L}_{image}(\mathbb{X}) = \sum_{(\mathbf{x}, y) \in \mathbb{X}} \sum_{\mathbf{x}^- \in \mathcal{N}(\mathbf{x})} \Big[ \beta \log\big(\sigma(Enc(\mathbf{x})\, \mathbf{T}_{:,y})\big) + \sum_{i=1, i \neq y}^{c} \log\big(1 - \sigma(Enc(\mathbf{x})\, \mathbf{T}_{:,i})\big) + \sum_{i=1}^{c} \log\big(1 - \sigma(Enc(\mathbf{x}^-)\, \mathbf{T}_{:,i})\big) \Big] \qquad (14)
Here, 𝛽 is a hyperparameter that increases the weight of positive samples, and \mathcal{N}(\mathbf{x}) denotes the negative sample set of point 𝐱, in which each \mathbf{x}^- \in \mathcal{N}(\mathbf{x}) is a negative sample uniformly generated from the spherical surface for the data point 𝐱.
Equation 14 can be seen as a modified version of the cross-
entropy loss used in binary classification. The first term is
the positive sample term weighted by 𝛽. The second term
is the normal negative term used in cross-entropy loss. The
third term is added to consider uniformly sampled locations
as negative samples.
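Below is a minimal PyTorch sketch of the objective in Equation 14, written as a loss to be minimized (i.e., the negated objective). The function and variable names are ours, the clamping is a numerical-stability detail not specified in the paper, and it assumes one uniformly sampled negative location per data point.

```python
import torch

def geo_prior_loss(loc_emb, neg_emb, labels, T, beta=10.0):
    """Negated objective of Eq. (14).
    loc_emb: (B, d) embeddings Enc(x) of observed locations
    neg_emb: (B, d) embeddings Enc(x^-) of uniformly sampled negative locations
    labels:  (B,)   long tensor of class indices y
    T:       (d, c) class embedding matrix (the location classifier)
    beta:    weight on the positive term
    """
    eps = 1e-7
    probs_pos = torch.sigmoid(loc_emb @ T)                    # sigma(Enc(x) T),   shape (B, c)
    probs_neg = torch.sigmoid(neg_emb @ T)                    # sigma(Enc(x^-) T), shape (B, c)
    log_p = torch.log(probs_pos.clamp(min=eps))
    log_not_p = torch.log((1.0 - probs_pos).clamp(min=eps))
    log_not_n = torch.log((1.0 - probs_neg).clamp(min=eps))

    pos_term = beta * log_p.gather(1, labels[:, None]).squeeze(1)   # beta * log sigma(Enc(x) T_:,y)
    mask = torch.ones_like(log_not_p).scatter_(1, labels[:, None], 0.0)
    neg_term = (mask * log_not_p).sum(dim=1)                  # sum over classes i != y
    unif_term = log_not_n.sum(dim=1)                          # sum over all classes for x^-
    return -(pos_term + neg_term + unif_term).mean()
```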
Figure 1 illustrates the whole workflow. During training, the image classification module (the orange box) and the location classification module (the blue box) are trained separately with supervision. At inference time, the probabilities 𝑃(𝑦 | 𝐈) and 𝑃(𝑦 | 𝐱) computed by these two modules are multiplied to yield the final prediction.
6. Baselines
In order to understand the advantage of spherical-distance-
kept location encoders, we compare different versions of
Sphere2Vec with multiple baselines:
•𝑡𝑖𝑙𝑒 divides the study area 𝐴(e.g., the earth’s surface)
into grids with equal intervals along the latitude and
longitude directions. Each grid cell has an embedding that is used as the encoding for every location 𝐱 falling into this cell. This is a common practice adopted by many
previous works when dealing with coordinate data
(Berg et al.,2014;Adams et al.,2015;Tang et al.,
2015).
•𝑤𝑟𝑎𝑝 is a location encoder model introduced by
Mac Aodha et al. (2019). Given a location 𝐱= (𝜆, 𝜙),
it uses a coordinate wrap mechanism to convert each
dimension of 𝐱 into 2 numbers:
PE^{wrap}_1(\mathbf{x}) = [\sin(\lambda), \cos(\lambda), \sin(2\phi), \cos(2\phi)]. \qquad (15)
Then the results are passed through a multi-layered
fully connected neural network 𝐍𝐍𝑤𝑟𝑎𝑝() which con-
sists of an initial fully connected layer, followed by
a series of ℎresidual blocks, each consisting of two
fully connected layers (𝑘hidden neurons) with a
dropout layer in between. We adopt the official code of
Mac Aodha et al. (2019)7for this implementation. We
can see that 𝑤𝑟𝑎𝑝 still follows our general definition
of location encoders 𝐸𝑛𝑐(𝐱) = 𝐍𝐍(𝑃 𝐸𝑆(𝐱)) where
𝑆= 1.
•𝑤𝑟𝑎𝑝 +𝑓𝑓𝑛 is similar to 𝑤𝑟𝑎𝑝 except that it replaces
𝐍𝐍𝑤𝑟𝑎𝑝() with 𝐍𝐍𝑓𝑓𝑛(), a simple learnable multi-
layer perceptron with ℎhidden layers and 𝑘neurons
per layer as that Sphere2Vec has. 𝑤𝑟𝑎𝑝 +𝑓𝑓𝑛 is
used to exclude the effect of different 𝐍𝐍() on the
performance of location encoders. In the following, all
location encoder baselines use 𝐍𝐍𝑓𝑓𝑛() as the learn-
able neural network component so that we can directly
compare the effect of different position encoding 𝑃 𝐸∗
𝑆
on the model performance.
•𝑥𝑦𝑧 first converts 𝐱𝑖= (𝜆𝑖, 𝜙𝑖) ∈ 𝕊2into 3D Cartesian
coordinates (𝑥, 𝑦, 𝑧)centered at the sphere center by
following Equation 16 before feeding into a multilayer
perceptron 𝐍𝐍(). Here, we let (x, y, z) lie on a unit sphere with radius R = 1. As we can see, 𝑥𝑦𝑧 is just a special case of 𝑠𝑝ℎ𝑒𝑟𝑒𝐶 with S = 1, i.e., PE^{sphereC}_1:
PE^{xyz}_S(\mathbf{x}) = [z, x, y] = PE^{sphereC}_1(\mathbf{x}) = [\sin\phi, \cos\phi\cos\lambda, \cos\phi\sin\lambda]. \qquad (16)
•𝑟𝑏𝑓 randomly samples M points from the training dataset as RBF anchor points \{\mathbf{x}^{anchor}_m, m = 1 \dots M\} and uses a Gaussian kernel \exp\big(-\frac{\|\mathbf{x}_i - \mathbf{x}^{anchor}_m\|^2}{2\sigma^2}\big) on each anchor point, where \sigma is the kernel size. Each input point \mathbf{x}_i is encoded as an M-dimensional RBF feature vector, i.e., PE^{rbf}_M(\mathbf{x}_i), which is fed into 𝐍𝐍_{ffn}() to obtain the location embedding. This is a strong baseline for representing floating-point features in machine learning models, used by Mai et al. (2020b).
•𝑟𝑓𝑓, i.e., Random Fourier Features (Rahimi and Recht, 2008; Nguyen et al., 2017), first encodes location 𝐱 into a D-dimensional vector PE^{rff}_D(\mathbf{x}) = \varphi(\mathbf{x}) = \sqrt{\frac{2}{D}} \big[\cos(\omega_i^T \mathbf{x} + b_i)\big]_{i=1}^{D}, where \omega_i \overset{i.i.d.}{\sim} \mathcal{N}(\mathbf{0}, \delta^2 I) is a direction vector whose dimensions are independently sampled from a normal distribution, b_i is uniformly sampled from [0, 2\pi], and I is an identity matrix. Each component of \varphi(\mathbf{x}) first projects 𝐱 onto a random direction \omega_i and makes a shift by b_i. Then it wraps this line onto the unit circle in ℝ² with the cosine function. Rahimi and Recht (2008) show that
7http://www.vision.caltech.edu/~macaodha/projects/geopriors/
\varphi(\mathbf{x})^T \varphi(\mathbf{x}') is an unbiased estimator of the Gaussian kernel K(\mathbf{x}, \mathbf{x}'). \varphi(\mathbf{x}) consists of D different estimates, which yields an approximation with lower variance. To make 𝑟𝑓𝑓 comparable to the other baselines, we feed \varphi(\mathbf{x}) into 𝐍𝐍_{ffn}() to produce the final location embedding.
•𝑔𝑟𝑖𝑑 is a multi-scale location encoder on 2D Eu-
clidean space proposed by Mai et al. (2020b). Here,
we simply treat 𝐱 = (𝜆, 𝜙) as a 2D coordinate. It first uses PE^{grid}_S(\mathbf{x}) shown in Equation 3 to encode location 𝐱 into a multi-scale representation and then feeds it into 𝐍𝐍_{ffn}() to produce the final location embedding.
•𝑡ℎ𝑒𝑜𝑟𝑦 is another multi-scale location encoder on 2D
Euclidean space proposed by Mai et al. (2020b). It uses the position encoder PE^{theory}_S(\mathbf{x}) shown in Equation 17. Here, \mathbf{x}^{(s)} = [\lambda^{(s)}, \phi^{(s)}] = [\frac{\lambda}{r^{(s)}}, \frac{\phi}{r^{(s)}}], and \mathbf{a}_1 = [1, 0]^T, \mathbf{a}_2 = [-1/2, \sqrt{3}/2]^T, \mathbf{a}_3 = [-1/2, -\sqrt{3}/2]^T \in \mathbb{R}^2 are three unit vectors oriented 2\pi/3 apart from each other. The encoding results are fed into 𝐍𝐍_{ffn}() to produce the final location embedding.
PE^{theory}_S(\mathbf{x}) = \bigcup_{s=0}^{S-1} \bigcup_{j=1}^{3} [\sin(\langle \mathbf{x}^{(s)}, \mathbf{a}_j \rangle), \cos(\langle \mathbf{x}^{(s)}, \mathbf{a}_j \rangle)]. \qquad (17)
•𝑁𝑒𝑅𝐹 indicates a multiscale location encoder adapted
from the positional encoder 𝑃 𝐸𝑁 𝑒𝑅𝐹
𝑆(𝐱)used by
Neural Radiance Fields (NeRF) (Mildenhall et al.,
2020) and many NeRF variations such as NeRF-
W (Martin-Brualla et al.,2021), S-NeRF (Derksen
and Izzo,2021), Sat-NeRF (Marí et al.,2022), GI-
RAFFE (Niemeyer and Geiger,2021), etc., which
was proposed for novel view synthesis for 3D scenes.
Here, 𝑁𝑒𝑅𝐹 can be treated as a multiscale version
of 𝑥𝑦𝑧. It first converts 𝐱= (𝜆, 𝜙) ∈ 𝕊2into 3D
Cartesian coordinates (𝑥, 𝑦, 𝑧)centered at the unit
sphere center. Here, (𝑥, 𝑦, 𝑧)are normalized to lie in
[−1,1], i.e., 𝑅= 1. Different from 𝑥𝑦𝑧, it uses NeRF-
style positional encoder 𝑃 𝐸𝑁 𝑒𝑅𝐹
𝑆(𝐱)in Equation 18
to process (𝑥, 𝑦, 𝑧)into a multiscale representation. To
make it comparable with other location encoders, we
further feed 𝑃 𝐸𝑁 𝑒𝑅𝐹
𝑆(𝐱)into 𝐍𝐍𝑓𝑓𝑛() to get the final
location embedding.
PE^{NeRF}_S(\mathbf{x}) = \bigcup_{s=0}^{S-1} \bigcup_{p \in \{z, x, y\}} [\sin(2^s \pi p), \cos(2^s \pi p)], \quad \text{where } [z, x, y] = [\sin\phi, \cos\phi\cos\lambda, \cos\phi\sin\lambda]. \qquad (18)
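A minimal NumPy sketch (our own; the helper name `pe_nerf` is hypothetical) of the NeRF-style position encoder in Equation 18 applied to a (lon, lat) point on the unit sphere:

```python
import numpy as np

def pe_nerf(lon, lat, S):
    # Eq. (18): sin/cos of 2^s * pi * p for p in [z, x, y] on the unit sphere
    zxy = np.array([np.sin(lat), np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon)])
    feats = []
    for s in range(S):
        feats.append(np.sin(2 ** s * np.pi * zxy))
        feats.append(np.cos(2 ** s * np.pi * zxy))
    return np.concatenate(feats)       # 6*S dimensions
```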
All types of Sphere2Vec as well as all baseline models
we compare, except 𝑡𝑖𝑙𝑒, share the same model setup -
𝐸𝑛𝑐(𝐱) = 𝐍𝐍(𝑃 𝐸𝑆(𝐱)). The main difference is the position
encoder 𝑃 𝐸𝑆(𝐱)used in different models. 𝑃 𝐸𝑆(𝐱)used by
𝑔𝑟𝑖𝑑,𝑡ℎ𝑒𝑜𝑟𝑦,𝑁 𝑒𝑅𝐹 , and different types of Sphere2Vec
encode the input coordinates in a multi-scale fashion by us-
ing different sinusoidal functions with different frequencies.
Many previous works call this practice “Fourier input map-
ping” (Rahaman et al.,2019;Tancik et al.,2020;Basri et al.,
2020;Anokhin et al.,2021b). The difference is that 𝑔𝑟𝑖𝑑 and
𝑡ℎ𝑒𝑜𝑟𝑦 use the Fourier features from 2D Euclidean space,
𝑁𝑒𝑅𝐹 uses the predefined Fourier scales to directly encode
the points in 3D Euclidean space, while our Sphere2Vec
uses all or the subset of Double Fourier Sphere Features to
take into account the spherical geometry and the distance
distortion it brings.
All models are implemented in PyTorch. We use the
original implementation of 𝑤𝑟𝑎𝑝 from Mac Aodha et al.
(2019) and the implementation of 𝑔𝑟𝑖𝑑 and 𝑡ℎ𝑒𝑜𝑟𝑦 from Mai
et al. (2020b). Since the original implementation of NeRF8
(Mildenhall et al.,2020) is in TensorFlow, we reimplement
𝑁𝑒𝑅𝐹 in the PyTorch framework by following their code. We train and evaluate each model on an Ubuntu machine with 2 Nvidia GeForce GTX GPUs, each of which has 10GB of memory.
7. Theoretical Limitations of 𝑔𝑟𝑖𝑑 and 𝑁 𝑒𝑅𝐹
7.1. Theoretical Limitations of 𝑔𝑟𝑖𝑑
We first provide mathematical proofs to demonstrate why 𝑔𝑟𝑖𝑑 is not suitable for modeling spherical distances.
Theorem 3. Let 𝐱1,𝐱2be two points on the same sphere 𝕊2
with radius 𝑅, then we have
\langle PE^{grid}_S(\mathbf{x}_1), PE^{grid}_S(\mathbf{x}_2) \rangle = \sum_{s=0}^{S-1} \Big[ \cos(\phi_1^{(s)} - \phi_2^{(s)}) + \cos(\lambda_1^{(s)} - \lambda_2^{(s)}) \Big] = \sum_{s=0}^{S-1} \Big[ \cos\Big(\frac{\phi_1 - \phi_2}{r^{(s)}}\Big) + \cos\Big(\frac{\lambda_1 - \lambda_2}{r^{(s)}}\Big) \Big]. \qquad (19)
When 𝑆= 1, we have
\langle PE^{grid}_1(\mathbf{x}_1), PE^{grid}_1(\mathbf{x}_2) \rangle = \cos(\phi_1 - \phi_2) + \cos(\lambda_1 - \lambda_2). \qquad (20)
Theorem 3 is easy to prove based on the angle difference formula, so we skip its proof. This result indicates that 𝑔𝑟𝑖𝑑 models the latitude and longitude differences of 𝐱1 and 𝐱2 independently rather than the spherical distance. This introduces problems when encoding locations in the polar areas. Consider a pair of points 𝐱1 = (𝜆1, 𝜙) and 𝐱2 = (𝜆2, 𝜙) with the same latitude; the distance between them in the output space of PE^{grid}_S is:
\| PE^{grid}_S(\mathbf{x}_1) - PE^{grid}_S(\mathbf{x}_2) \|^2 = \| PE^{grid}_S(\mathbf{x}_1) \|^2 + \| PE^{grid}_S(\mathbf{x}_2) \|^2 - 2 \langle PE^{grid}_S(\mathbf{x}_1), PE^{grid}_S(\mathbf{x}_2) \rangle = 2S - 2 \sum_{s=0}^{S-1} \cos\Big(\frac{\lambda_1 - \lambda_2}{r^{(s)}}\Big). \qquad (21)
8https://github.com/bmild/nerf
This distance stays constant for any value of 𝜙. However, when 𝜙 varies from −𝜋/2 to 𝜋/2, the actual spherical distance changes over a wide range: the actual distance between the pair of points at 𝜙 = −𝜋/2 (the South Pole) is 0, while the distance between the pair at 𝜙 = 0 (the Equator) reaches the maximum value. This problem in measuring distances also has a negative impact on 𝑔𝑟𝑖𝑑's ability to model distributions in areas with sparse sample points, because it is hard to learn the true spherical distances.
In fact, in our experiments (𝑆 > 1), we observe that 𝑔𝑟𝑖𝑑
reaches peak performance at much smaller 𝑟𝑚𝑖𝑛 than that
of Sphere2Vec encodings. Moreover, 𝑠𝑝ℎ𝑒𝑟𝑒𝐶 outperforms
𝑔𝑟𝑖𝑑 near polar regions where 𝑔𝑟𝑖𝑑 produces large distances
though the spherical distances are small (A, B in Figure 1).
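The following NumPy sketch (our own illustration) makes this limitation concrete for S = 1: the 𝑔𝑟𝑖𝑑 embedding distance for a fixed longitude difference stays constant as the latitude varies, while the true spherical distance shrinks toward the poles:

```python
import numpy as np

def pe_grid_1(lon, lat):
    # Eq. (3) with S = 1: [sin phi, cos phi, sin lambda, cos lambda]
    return np.array([np.sin(lat), np.cos(lat), np.sin(lon), np.cos(lon)])

d_lon = np.deg2rad(10.0)                         # fixed longitude difference
for lat_deg in [0.0, 45.0, 80.0, 89.0]:
    lat = np.deg2rad(lat_deg)
    emb_dist = np.linalg.norm(pe_grid_1(0.0, lat) - pe_grid_1(d_lon, lat))
    # true central angle between the two points (spherical law of cosines)
    sph_dist = np.arccos(np.sin(lat) ** 2 + np.cos(lat) ** 2 * np.cos(d_lon))
    print(f"lat={lat_deg:5.1f}  grid embedding dist={emb_dist:.4f}  spherical dist={sph_dist:.4f}")
```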
7.2. Theoretical Limitations of 𝑁 𝑒𝑅𝐹
Since 𝑁𝑒𝑅𝐹 is widely used for 3D representation
learning (Mildenhall et al.,2020;Niemeyer and Geiger,
2021), a natural question is why not just use 𝑁𝑒𝑅𝐹 for the
geographic prediction tasks on the spherical surface, which
can be embedded in the 3D space. In this section, we discuss
the theoretical limitations of 𝑁𝑒𝑅𝐹 3D multiscale encoding
in the scenario of spherical encoding.
Theorem 4. Let \mathbf{x}_1, \mathbf{x}_2 \in \mathbb{S}^2 be two points on the spherical surface. Given their 3D Euclidean representations \mathbf{x}_1 = (z_1, x_1, y_1) and \mathbf{x}_2 = (z_2, x_2, y_2), define \Delta\mathbf{x} = \mathbf{x}_1 - \mathbf{x}_2 = [z_1 - z_2, x_1 - x_2, y_1 - y_2] = [\Delta\mathbf{x}_z, \Delta\mathbf{x}_x, \Delta\mathbf{x}_y] as the difference between them in the 3D Euclidean space. Under the 𝑁𝑒𝑅𝐹 encoding (Equation 18), the distance between them satisfies
\| PE^{NeRF}_S(\mathbf{x}_1) - PE^{NeRF}_S(\mathbf{x}_2) \|^2 = \sum_{s=0}^{S-1} \Big[ 4\sin^2(2^{s-1}\pi\Delta\mathbf{x}_z) + 4\sin^2(2^{s-1}\pi\Delta\mathbf{x}_x) + 4\sin^2(2^{s-1}\pi\Delta\mathbf{x}_y) \Big] = \sum_{s=0}^{S-1} 4 \|\mathbf{Y}_s\|^2, \qquad (22)
where \mathbf{Y}_s = [\sin(2^{s-1}\pi\Delta\mathbf{x}_z), \sin(2^{s-1}\pi\Delta\mathbf{x}_x), \sin(2^{s-1}\pi\Delta\mathbf{x}_y)].
See the proof in Appendix A.2.
Theorem 5. 𝑁𝑒𝑅𝐹 is not an injective function.
Theorem 5 is easy to prove based on Theorem 4. Since 𝑁𝑒𝑅𝐹 requires R = 1, when \mathbf{x}_1 = (1, 0, 0) and \mathbf{x}_2 = (-1, 0, 0), i.e., the North and South Poles, we have \Delta\mathbf{x} = [2, 0, 0]. The distance between their multiscale 𝑁𝑒𝑅𝐹 encodings is
\| PE^{NeRF}_S(\mathbf{x}_1) - PE^{NeRF}_S(\mathbf{x}_2) \|^2 = \sum_{s=0}^{S-1} 4\sin^2(2^s\pi) = 0. \qquad (23)
Since Equation 22 is symmetric in the x, y, and z axes,
we will have the same problems when 𝐱1= (0,1,0),
𝐱2= (0,−1,0) or 𝐱1= (0,0,1),𝐱2= (0,0,−1). This
indicates that even though these three pairs of points have
the largest spherical distances, they have identical 𝑁𝑒𝑅𝐹
multiscale representations. This illustrates that 𝑁𝑒𝑅𝐹 is not
an injective function.
Theorem 4 shows that, unlike Sphere2Vec, the distance between two 𝑁𝑒𝑅𝐹 location embeddings is not a monotonically increasing function of Δ𝐷, but a non-monotonic function of the coordinates of Δ𝐱, the axis-wise differences between the two points in 3D Euclidean space. So 𝑁𝑒𝑅𝐹 does not preserve spherical distances for spherical points, but rather models Δ𝐱𝑧, Δ𝐱𝑥, and Δ𝐱𝑦 separately.
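The degeneracy in Equation 23 is easy to verify numerically. The following sketch (our own) shows that the North and South Poles, which are maximally far apart on the sphere, receive identical multiscale 𝑁𝑒𝑅𝐹 encodings:

```python
import numpy as np

def pe_nerf_3d(zxy, S=4):
    # Eq. (18) applied directly to 3D coordinates [z, x, y] on the unit sphere
    feats = []
    for s in range(S):
        feats.append(np.sin(2 ** s * np.pi * zxy))
        feats.append(np.cos(2 ** s * np.pi * zxy))
    return np.concatenate(feats)

north = np.array([1.0, 0.0, 0.0])    # z = sin(phi) = 1,  the North Pole
south = np.array([-1.0, 0.0, 0.0])   # z = sin(phi) = -1, the South Pole
# prints 0.0 even though the spherical distance between the poles is maximal
print(np.linalg.norm(pe_nerf_3d(north) - pe_nerf_3d(south)))
```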
8. Experiments with Synthetic Datasets
Theorems 1 and 2 provide theoretical guarantees of Sphere2Vec for spherical distance preservation. To empirically verify the effectiveness of Sphere2Vec in a controlled setting, we construct a set of synthetic datasets and evaluate Sphere2Vec and all baseline models on them. To make the task simpler, and different from the setting shown in Figure 1, we skip the image encoder component and focus only on location encoder training and evaluation. For each synthetic dataset, we simulate a set of spherical coordinates as the geo-locations of images to train different location encoders. In the evaluation step, the performance of different models is computed directly based on 𝑃(𝑦 | 𝐱) only, not 𝑃(𝑦 | 𝐱)𝑃(𝑦 | 𝐈).
8.1. Synthetic Dataset Generation
We utilize the von Mises–Fisher distribution (𝑣𝑀𝐹) (Izbicki et al., 2019a), an analog of the 2D Gaussian distribution on the spherical surface 𝕊², to generate synthetic data points9. The probability density function of 𝑣𝑀𝐹 is defined as
vMF(\mathbf{x}; \mu, \kappa) = \frac{\kappa}{2\pi\sinh(\kappa)} \exp(\kappa \mu^T \chi(\mathbf{x})), \qquad (24)
where 𝜒(𝐱) = [x, y, z] = [\cos\phi\cos\lambda, \cos\phi\sin\lambda, \sin\phi], which converts 𝐱 into a coordinate in the 3D Euclidean space on the surface of a unit sphere. A 𝑣𝑀𝐹 distribution is controlled by two parameters – the mean direction \mu \in \mathbb{R}^3 and the concentration parameter \kappa \in \mathbb{R}^+. \mu indicates the center of a 𝑣𝑀𝐹 distribution and is a 3D unit vector. \kappa is a positive real number which controls the concentration of the 𝑣𝑀𝐹. A higher \kappa indicates a more compact 𝑣𝑀𝐹 distribution, while \kappa = 1 corresponds to a 𝑣𝑀𝐹 distribution with a standard deviation covering half of the unit sphere.
To simulate multi-modal distributions, we generate spherical coordinates based on a mixture of von Mises–Fisher distributions (MvMF). We assume classes with an even prior, and each class follows a 𝑣𝑀𝐹 distribution. To create a dataset, we first sample the per-class parameters {(𝜇𝑖, 𝜅𝑖)} for
9https://www.tensorflow.org/probability/api_docs/python/tfp/
distributions/VonMisesFisher#sample
(a) U1 dataset in 2D degree Space (𝜅𝑚𝑎𝑥 = 16) (b) U2 dataset in 2D degree Space (𝜅𝑚𝑎𝑥 = 32)
(c) U3 dataset in 2D degree Space (𝜅𝑚𝑎𝑥 = 64) (d) U4 dataset in 2D degree Space (𝜅𝑚𝑎𝑥 = 128)
(e) U4 dataset in 3D space (𝜅𝑚𝑎𝑥 = 128)
Figure 5: The data distributions of four synthetic datasets (U1, U2, U3, and U4) generated from the uniform sampling method. (e) shows the U4 dataset in 3D Euclidean space. We can see that if we treat these datasets as 2D data points, as 𝑔𝑟𝑖𝑑 and 𝑡ℎ𝑒𝑜𝑟𝑦 do, the 𝑣𝑀𝐹 distributions in the polar areas will be stretched and look like 2D anisotropic multivariate Gaussian distributions. However, this kind of systematic bias can be avoided if we use a spherical location encoder such as 𝑆𝑝ℎ𝑒𝑟𝑒2𝑉𝑒𝑐.
Then we draw 100 samples, i.e., spherical coordinates, for each class. In total, each generated synthetic dataset therefore has 5000 data points over 50 balanced classes.
The concentration parameter 𝜅𝑖 is sampled by first drawing 𝑟 from a uniform distribution 𝑈(𝜅𝑚𝑖𝑛, 𝜅𝑚𝑎𝑥) and then taking the square 𝑟2. The square helps to avoid sampling many large 𝜅𝑖, which would yield very concentrated 𝑣𝑀𝐹 distributions that are rather easy to classify. We fix 𝜅𝑚𝑖𝑛 = 1 and vary 𝜅𝑚𝑎𝑥 in [16, 32, 64, 128].
For the center parameter 𝜇𝑖, we adopt two sampling approaches:
(a) S1.3 dataset in 2D degree Space (𝑁𝜇= 5) (b) S2.3 dataset in 2D degree Space (𝑁𝜇= 10)
(c) S3.3 dataset in 2D degree Space (𝑁𝜇= 25) (d) S4.3 dataset in 2D degree Space (𝑁𝜇= 50)
Figure 6: The data distributions of four synthetic datasets (S1.3, S2.3, S3.3, and S4.3) generated from the stratified sampling method with 𝜅𝑚𝑎𝑥 = 64. We can see that when 𝑁𝜇 increases, a finer-grained stratified sampling is carried out, and the resulting dataset has a larger data bias toward the polar areas.
1. Uniform Sampling: We uniformly sample centers (𝜇𝑖) from the surface of a unit sphere. We generate four synthetic datasets (one for each value of 𝜅𝑚𝑎𝑥) and denote them as U1, U2, U3, and U4. See Table 1 for the parameters we use to generate these datasets.
2. Stratified Sampling: We first evenly divide the latitude range [−𝜋∕2, 𝜋∕2] into 𝑁𝜇 intervals. Then we uniformly sample centers (𝜇𝑖) from the spherical surface band defined by each latitude interval. Since the latitude intervals in polar regions have smaller spherical surface area, this stratified sampling method has a higher center density in the polar regions. We keep the total number of centers fixed at 50 (i.e., 𝑁𝜇 times the number of centers per band equals 50) and vary 𝑁𝜇 in [5, 10, 25, 50]. Combined with the 4 𝜅𝑚𝑎𝑥 choices, this procedure yields 16 different synthetic datasets, denoted 𝑆𝑖.𝑗. See Table 1 for the parameters we use to generate these datasets. A minimal code sketch of this generation procedure is given below.
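To make the generation procedure concrete, below is a minimal sketch of the uniform sampling variant in Python/NumPy. It is an illustration under the assumptions stated in the comments (the paper samples 𝑣𝑀𝐹 points with TensorFlow Probability's VonMisesFisher distribution instead, see footnote 9), not the authors' data generation code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_vmf(mu, kappa, n):
    """Draw n points from a von Mises-Fisher distribution on the unit sphere S^2
    via inverse-CDF sampling of cos(angle to mu); valid for the 3D case only."""
    mu = mu / np.linalg.norm(mu)
    u = rng.random(n)
    w = 1.0 + np.log(u + (1.0 - u) * np.exp(-2.0 * kappa)) / kappa   # cosine of angle to mu
    theta = rng.uniform(0.0, 2.0 * np.pi, n)                         # tangential direction
    helper = np.array([1.0, 0.0, 0.0]) if abs(mu[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    b1 = helper - np.dot(helper, mu) * mu
    b1 /= np.linalg.norm(b1)
    b2 = np.cross(mu, b1)
    r = np.sqrt(np.clip(1.0 - w ** 2, 0.0, None))
    return (r * np.cos(theta))[:, None] * b1 + (r * np.sin(theta))[:, None] * b2 + w[:, None] * mu

def make_uniform_dataset(num_classes=50, n_per_class=100, kappa_min=1.0, kappa_max=16.0):
    """Uniform sampling setup: vMF centers drawn uniformly on the sphere;
    kappa_i = r^2 with r ~ U(kappa_min, kappa_max), following the text above."""
    xs, ys = [], []
    for c in range(num_classes):
        mu = rng.normal(size=3)
        mu /= np.linalg.norm(mu)                      # uniform direction on S^2
        kappa = rng.uniform(kappa_min, kappa_max) ** 2
        xs.append(sample_vmf(mu, kappa, n_per_class))
        ys.append(np.full(n_per_class, c))
    return np.concatenate(xs), np.concatenate(ys)

points, labels = make_uniform_dataset()  # 5000 points over 50 balanced classes
```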
Figures 5a-5d visualize the data point distributions of U1, U2, U3, and U4, which are derived from the uniform sampling method, in 2D space. Figure 5e visualizes the U4 dataset in 3D Euclidean space. We can see that when 𝜅𝑚𝑎𝑥 is larger, the variation of point density among different 𝑣𝑀𝐹 distributions becomes larger. Some 𝑣𝑀𝐹 distributions are very concentrated and the resulting data points are easier to classify. Moreover, if we treat these datasets as 2D data points as 𝑔𝑟𝑖𝑑 and 𝑡ℎ𝑒𝑜𝑟𝑦 do, 𝑣𝑀𝐹 distributions in the polar areas will be stretched into very elongated shapes, making model learning more difficult. However, this kind of systematic bias can be avoided if we use a spherical location encoder such as Sphere2Vec.
Figure 6 visualizes the data distributions of four synthetic datasets generated with the stratified sampling method. They have different 𝑁𝜇 but the same 𝜅𝑚𝑎𝑥. We can see that when 𝑁𝜇 increases, a finer-grained stratified sampling is carried out, and the resulting dataset has a larger data bias toward the polar areas.
8.2. Synthetic Dataset Evaluation Results
We evaluate all baseline models as well as 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ on the 20 synthetic datasets generated as described above. For each model, we do a grid search over its hyperparameters for each dataset, including the supervised learning rate 𝑙𝑟, the number of scales 𝑆, the minimum scaling factor 𝑟𝑚𝑖𝑛, and the number of hidden layers and number of neurons used in 𝐍𝐍𝑓𝑓𝑛(⋅) – ℎ and 𝑘. The best performance of each model is reported in Table 1. We use Top1 accuracy as the evaluation metric.
Table 1
Comparison of 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ to baselines on synthetic datasets. We use Top1 accuracy as the evaluation metric. U1 - U4 indicate the 4 synthetic datasets generated with the uniform sampling approach (see Section 8.1). S1.1 - S4.4 indicate the 16 synthetic datasets generated with the stratified sampling approach. All datasets have 50 classes and 100 samples per class. For each model, we perform a grid search on its hyperparameters for each dataset and report the best Top1 accuracy. The Δ𝑇𝑜𝑝1 column shows the absolute performance improvement of 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ over the best baseline model (bolded) for each dataset. The 𝐸𝑅 column shows the relative reduction of error compared to the best baseline model (bolded). We can see that 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ outperforms all other baseline models on all of these 20 synthetic datasets. The absolute Top1 accuracy improvement can be as much as 2.0% for datasets with lower precisions, and the error rate reduction can be as much as 30.8% for datasets with high precisions.
ID | Method | 𝑁𝜇 | #centers per band | 𝜅𝑚𝑖𝑛 | 𝜅𝑚𝑎𝑥 | 𝑥𝑦𝑧 | 𝑤𝑟𝑎𝑝 | 𝑤𝑟𝑎𝑝+ffn | 𝑟𝑓𝑓 | 𝑟𝑏𝑓 | 𝑔𝑟𝑖𝑑 | 𝑡ℎ𝑒𝑜𝑟𝑦 | 𝑁𝑒𝑅𝐹 | 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ | Δ𝑇𝑜𝑝1 | 𝐸𝑅
U1 | uniform | - | - | 1 | 16 | 67.2 | 67.0 | 66.9 | 66.8 | 46.6 | 68.6 | 67.8 | 62.7 | 69.2 | 0.6 | -1.9
U2 | uniform | - | - | 1 | 32 | 73.1 | 75.1 | 73.9 | 72.3 | 58.4 | 76.2 | 76.5 | 72.5 | 77.4 | 0.9 | -3.8
U3 | uniform | - | - | 1 | 64 | 86.1 | 90.1 | 88.3 | 89.0 | 91.7 | 92.3 | 92.7 | 90.1 | 93.3 | 0.6 | -8.2
U4 | uniform | - | - | 1 | 128 | 91.8 | 94.9 | 92.3 | 92.5 | 97.4 | 97.5 | 97.7 | 95.7 | 98.0 | 0.3 | -13.0
S1.1 | stratified | 5 | 10 | 1 | 16 | 68.7 | 69.7 | 68.8 | 68.6 | 70.5 | 69.5 | 69.4 | 66.5 | 72.3 | 1.8 | -6.1
S1.2 | stratified | 5 | 10 | 1 | 32 | 76.7 | 79.1 | 78.1 | 78.4 | 81.1 | 81.2 | 79.2 | 76.1 | 82.9 | 1.7 | -9.0
S1.3 | stratified | 5 | 10 | 1 | 64 | 91.2 | 92.5 | 92.9 | 92.6 | 94.7 | 94.8 | 94.9 | 92.1 | 95.4 | 0.5 | -9.8
S1.4 | stratified | 5 | 10 | 1 | 128 | 86.5 | 91.6 | 88.3 | 92.4 | 93.5 | 95.2 | 94.9 | 92.4 | 96.1 | 0.9 | -18.7
S2.1 | stratified | 10 | 5 | 1 | 16 | 70.5 | 71.3 | 70.7 | 70.4 | 46.6 | 72.0 | 70.7 | 67.0 | 74.0 | 2.0 | -7.1
S2.2 | stratified | 10 | 5 | 1 | 32 | 76.1 | 79.7 | 78.2 | 78.6 | 61.2 | 80.9 | 80.5 | 77.6 | 82.3 | 1.4 | -7.3
S2.3 | stratified | 10 | 5 | 1 | 64 | 88.0 | 89.9 | 88.2 | 88.5 | 80.0 | 92.5 | 91.9 | 89.0 | 93.3 | 0.8 | -10.7
S2.4 | stratified | 10 | 5 | 1 | 128 | 94.4 | 96.6 | 96.7 | 95.5 | 94.0 | 97.6 | 97.6 | 96.2 | 98.1 | 0.5 | -20.8
S3.1 | stratified | 25 | 2 | 1 | 16 | 66.2 | 66.3 | 64.7 | 65.6 | 67.1 | 66.7 | 66.7 | 61.3 | 68.3 | 1.2 | -3.6
S3.2 | stratified | 25 | 2 | 1 | 32 | 80.0 | 82.5 | 80.7 | 81.6 | 83.4 | 84.5 | 82.1 | 80.3 | 85.9 | 1.4 | -9.0
S3.3 | stratified | 25 | 2 | 1 | 64 | 85.4 | 86.0 | 85.7 | 86.2 | 89.1 | 89.6 | 88.6 | 86.1 | 91.0 | 1.4 | -13.5
S3.4 | stratified | 25 | 2 | 1 | 128 | 93.2 | 96.0 | 94.8 | 95.7 | 97.2 | 97.3 | 97.4 | 96.7 | 98.0 | 0.6 | -23.1
S4.1 | stratified | 50 | 1 | 1 | 16 | 64.8 | 67.4 | 66.0 | 66.3 | 66.9 | 67.1 | 64.5 | 62.9 | 68.4 | 1.0 | -3.1
S4.2 | stratified | 50 | 1 | 1 | 32 | 75.6 | 78.2 | 77.4 | 77.4 | 78.4 | 80.1 | 78.3 | 75.7 | 81.0 | 0.9 | -4.5
S4.3 | stratified | 50 | 1 | 1 | 64 | 91.3 | 93.9 | 93.7 | 93.8 | 95.0 | 95.2 | 94.0 | 92.5 | 96.1 | 0.9 | -18.7
S4.4 | stratified | 50 | 1 | 1 | 128 | 94.3 | 95.5 | 94.4 | 94.7 | 95.4 | 97.4 | 96.5 | 95.2 | 98.2 | 0.8 | -30.8
The Topk classification accuracy is defined as follows:
TOP_k = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\big(Rank(\mathbf{x}_i, y_i) \leq k\big) \quad (25)
where 𝒟 = {(𝐱𝑖, 𝑦𝑖)} is the set of location 𝐱𝑖 and label 𝑦𝑖 tuples that constitutes the whole validation or testing set, and 𝑁 denotes the total number of samples in 𝒟. 𝑅𝑎𝑛𝑘(𝐱𝑖, 𝑦𝑖) indicates the rank of the ground truth label 𝑦𝑖 in the ranked list of all classes based on the probability score 𝑃(𝑦𝑖|𝐱𝑖) given by a specific location encoder. A lower rank indicates a better model prediction. 𝟏(∗) is an indicator function that returns 1 when the condition ∗ is true and 0 otherwise. A higher Topk score indicates better performance.
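As an illustration of Equation 25, a minimal Top-k implementation could look as follows (a sketch, not the authors' evaluation code); it assumes the per-class probability scores 𝑃(𝑦|𝐱) are available as a matrix.

```python
import numpy as np

def topk_accuracy(prob_scores, labels, k=1):
    """Top-k accuracy as in Equation 25.
    prob_scores: (N, num_classes) array of P(y|x); labels: (N,) ground-truth class ids."""
    order = np.argsort(-prob_scores, axis=1)                  # classes sorted by descending score
    ranks = np.argmax(order == labels[:, None], axis=1) + 1   # rank of the true class (1 = best)
    return float(np.mean(ranks <= k))
```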
Some observations can be made from Table 1:
1. 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ is able to outperform all baselines on all 20 synthetic datasets. The absolute Top1 improvement can go up to 2% and the error rate reduction can go up to 30.8%. This shows the robustness of 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+.
2. When the dataset is fairly easy to classify (i.e., all
baseline models can produce 95+% Top1 accuracy),
𝑠𝑝ℎ𝑒𝑟𝑒𝑀+is still able to further improve the per-
formance and gives a very large error rate reduction
(up to 30.8%). This indicates that 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+is very
robust and reliable for datasets with different distribu-
tion characteristics.
3. Comparing the error rates across the different stratified-sampling datasets (S1.j - S4.j), we can see that when we keep 𝜅𝑚𝑎𝑥 fixed and increase 𝑁𝜇, the relative error reduction 𝐸𝑅 becomes larger. Increasing 𝑁𝜇 means we perform a finer-grained stratified sampling, so the resulting datasets sample more 𝑣𝑀𝐹 distributions in the polar regions. This phenomenon shows that when the dataset has a larger data bias towards the polar areas, we expect 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ to be more effective.
4. From Table 1, we can also see that among all the baseline methods, 𝑔𝑟𝑖𝑑 achieves the best performance on most datasets (12 out of 20), followed by 𝑡ℎ𝑒𝑜𝑟𝑦 (5 out of 20). This observation aligns with the experimental results of Mai et al. (2020b), which show the advantages of multiscale location representations over single-scale representations.
5. It is interesting to see that although 𝑁𝑒𝑅𝐹 is also a multiscale location encoding approach, it underperforms 𝑔𝑟𝑖𝑑 and 𝑡ℎ𝑒𝑜𝑟𝑦 on all synthetic datasets. We conjecture that the reasons are: 1) 𝑁𝑒𝑅𝐹 treats geo-coordinates as 3D Euclidean coordinates and ignores the fact that they all lie on the spherical surface, which yields more modeling freedom and makes it more difficult for 𝐍𝐍𝑓𝑓𝑛 to learn; 2) 𝑁𝑒𝑅𝐹 uses predefined Fourier scales, i.e., {2^0, 2^1, ..., 2^s, ..., 2^{S−1}}, while 𝑔𝑟𝑖𝑑, 𝑡ℎ𝑒𝑜𝑟𝑦, and 𝑆𝑝ℎ𝑒𝑟𝑒2𝑉𝑒𝑐 are more flexible in terms of Fourier scale choices, which are controlled by 𝑟𝑚𝑎𝑥 and 𝑟𝑚𝑖𝑛.
Table 2
Dataset statistics of the different geo-aware image classification datasets. The "Train", "Val", and "Test" columns indicate the number of data samples in each split. The "#Class" column indicates the total number of classes for each dataset.
Task Dataset Train Val Test #Class
Species Recog.
BirdSnap 19133 443 443 500
BirdSnap†42490 980 980 500
NABirds†22599 1100 1100 555
iNat2017 569465 93622 - 5089
iNat2018 436063 24343 - 8142
Flickr YFCC 66739 4449 4449 100
RS fMoW 363570 53040 - 62
9. Experiment with Geo-Aware Image
Classification
Next, we empirically evaluate the performances of our
Sphere2Vec as well as all 9 baseline methods on 7 real-world
datasets for the geo-aware image classification task.
9.1. Dataset
More specifically, we test the performances of different
location encoders on seven datasets from three different
problems: fine-grained species recognition, Flickr image
recognition, and remote sensing image classification. The
statistics of these seven datasets are shown in Table 2.
Figures 7 and 8 show the spatial distributions of the training and validation/testing data of these datasets.
Fine-Grained Species Recognition We use five widely
used fine-grained species recognition image datasets in
which each data sample is a tuple of an image 𝐈, a location
𝐱, and its ground truth class 𝑦:
1. BirdSnap: An image dataset about bird species based on the BirdSnap dataset (Berg et al., 2014), which consists of 500 bird species that are commonly found in North America. The original BirdSnap dataset (Berg et al., 2014) did not provide location metadata. Mac Aodha et al. (2019) recollected the images and location data based on the original image URLs.
2. BirdSnap†: An enriched BirdSnap dataset constructed by Mac Aodha et al. (2019) by simulating locations, dates, and photographers from the eBird dataset (Sullivan et al., 2009).
3. NABirds†: Another image dataset about North American bird species constructed by Mac Aodha et al. (2019) based on the NABirds dataset (Van Horn et al., 2015), in which the location metadata were also simulated from the eBird dataset (Sullivan et al., 2009).
4. iNat2017: The species recognition dataset used in the
iNaturalist 2017 challenges10 (Van Horn et al.,2018)
with 5089 unique categories.
10https://github.com/visipedia/inat_comp/tree/master/2017
5. iNat2018: The species recognition dataset used in the
iNaturalist 2018 challenges11 (Van Horn et al.,2018)
with 8142 unique categories.
Flickr Image Classification We use the Yahoo Flickr
Creative Commons 100M dataset12 (YFCC100M-GEO100
dataset) which is a set of geo-tagged Flickr photos collected
by Yahoo! Research. Here, we denote this dataset as YFCC.
YFCC has been used in Tang et al. (2015); Mac Aodha et al.
(2019) for geo-aware image classification. See Figure 8a and
8b for the spatial distributions of the training and test dataset
of YFCC.
Remote Sensing Image Classification We use the Functional Map of the World dataset (denoted as fMoW) (Christie et al., 2018) as one representative remote sensing (RS) image classification dataset. The fMoW dataset contains about 363K training and 53K validation remote sensing images which are classified into 62 different land use types. They are 4-band or 8-band multispectral remote sensing images: 4-band images are collected from the QuickBird-2 or GeoEye-1 satellite systems while 8-band images are from WorldView-2 or WorldView-3. We use the fMoW-rgb version of the fMoW dataset, which is a JPEG-compressed version of these remote sensing images with only the RGB bands. The reasons we pick fMoW are that 1) the fMoW dataset contains RS images with diverse land use types collected all over the world (see Figure 8c and 8d); and 2) it is a large RS image dataset with location metadata available. In contrast, the UC Merced dataset (Yang and Newsam, 2010) consists of RS images collected from only 20 US cities, the EuroSAT dataset (Helber et al., 2019) contains RS images collected from 30 European countries, and the location metadata of the RS images in these two datasets are not publicly available. Global coverage of the RS images is important in our experiment since we focus on studying how the map projection distortion problem and the spherical-to-Euclidean distance approximation error can be solved by Sphere2Vec on a global-scale geospatial problem. The reason we use the RGB version is that this dataset version has an existing pretrained image encoder – MoCo-V2+TP (Ayush et al., 2020) – available to use, so we do not need to train our own remote sensing image encoder.
9.2. Geo-Aware Image Classification
To test the effectiveness of Sphere2Vec, we conduct geo-
aware image classification experiments on seven large-scale
real-world datasets as we described in Section 9.1.
Besides the baselines described in Section 6, we also consider 𝑁𝑜 𝑃𝑟𝑖𝑜𝑟, which represents a fully supervised image classifier trained without using any location information, i.e., predicting image labels purely based on image information 𝑃(𝑦|𝐈).
11https://github.com/visipedia/inat_comp/tree/master/2018
12 https://yahooresearch.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images
(a) BirdSnap Train (b) BirdSnap†Train (c) NABirds†Train (d) iNat2017 Train (e) iNat2018 Train
(f) BirdSnap Test (g) BirdSnap†Test (h) NABirds†Test (i) iNat2017 Val (j) iNat2018 Val
Figure 7: Training, validation/testing locations of different fine-grained species recognition datasets. Different datasets use either
validation or testing dataset to evaluate model performance. So we plot their corresponding image geographic distributions.
(a) YFCC Train (b) YFCC Test (c) fMoW Train (d) fMoW Val
Figure 8: Training and validation/testing locations of Flickr image recognition (YFCC) and RS image classification (fMoW).
Table 3 compares the Top1 classification accuracy of five variants of Sphere2Vec models against that of the nine baseline models discussed in Section 6.
Similar to Equation 25, the Topk classification accuracy on the geo-aware image classification task is defined as follows:
TOP_k = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\big(Rank(\mathbf{x}_i, \mathbf{I}_i, y_i) \leq k\big) \quad (26)
Table 3
The Top1 classification accuracy of different geo-aware image classification models over three tasks: species recognition, Flickr image classification (YFCC), and remote sensing (RS) image classification (fMoW (Christie et al., 2018)). See Section 6 for the description of each baseline. 𝑡𝑖𝑙𝑒 indicates the results reported by Mac Aodha et al. (2019). 𝑤𝑟𝑎𝑝∗ indicates the original results reported by Mac Aodha et al. (2019) while 𝑤𝑟𝑎𝑝 is the best result we obtained by rerunning their code. Since the test sets for iNat2017, iNat2018, and fMoW are not open-sourced, we report results on the validation sets. The best performances of the baseline models and Sphere2Vec are highlighted in bold. All compared models use location only while ignoring time. The original result reported by Ayush et al. (2020) for No Prior on fMoW is 69.05; we obtain 69.84 by retraining their implementation. The "Avg" column indicates the average performance of each model on all five species recognition datasets. See Section 9.3 for hyperparameter tuning details.
Task Species Recognition Flickr RS
Dataset BirdSnap BirdSnap†NABirds†iNat2017 iNat2018 Avg YFCC fMOW
P(y|x) - Prior Type Test Test Test Val Val - Test Val
No Prior (i.e. image model) 70.07 70.07 76.08 63.27 60.20 67.94 50.15 69.84
𝑡𝑖𝑙𝑒 (Tang et al.,2015) 70.16 72.33 77.34 66.15 65.61 70.32 50.43 -
𝑥𝑦𝑧 71.85 78.97 81.20 69.39 71.75 74.63 50.75 70.18
𝑤𝑟𝑎𝑝 ∗(Mac Aodha et al.,2019) 71.66 78.65 81.15 69.34 72.41 74.64 50.70 -
𝑤𝑟𝑎𝑝 71.87 79.06 81.62 69.22 72.92 74.94 50.90 70.29
𝑤𝑟𝑎𝑝 +𝑓𝑓𝑛 71.99 79.21 81.36 69.40 71.95 74.78 50.76 70.28
𝑟𝑏𝑓 (Mai et al.,2020b) 71.78 79.40 81.32 68.52 71.35 74.47 51.09 70.65
𝑟𝑓 𝑓 (Rahimi et al.,2007) 71.92 79.16 81.30 69.36 71.80 74.71 50.67 70.27
Space2Vec-𝑔𝑟𝑖𝑑 (Mai et al.,2020b) 71.70 79.72 81.24 69.46 73.02 75.03 51.18 70.80
Space2Vec-𝑡ℎ𝑒𝑜𝑟𝑦 (Mai et al.,2020b) 71.88 79.75 81.30 69.47 73.03 75.09 51.16 70.81
𝑁𝑒𝑅𝐹 (Mildenhall et al.,2020) 71.66 79.66 81.32 69.45 73.00 75.02 50.97 70.64
Sphere2Vec-𝑠𝑝ℎ𝑒𝑟𝑒𝐶 72.11 79.80 81.88 69.68 73.29 75.35 51.34 71.00
Sphere2Vec-𝑠𝑝ℎ𝑒𝑟𝑒𝐶 +72.41 80.11 81.97 69.75 73.31 75.51 51.28 71.03
Sphere2Vec-𝑠𝑝ℎ𝑒𝑟𝑒𝑀 72.06 79.84 81.94 69.72 73.25 75.36 51.35 70.99
Sphere2Vec-𝑠𝑝ℎ𝑒𝑟𝑒𝑀 +72.24 80.57 81.94 69.67 73.80 75.64 51.24 71.10
Sphere2Vec-𝑑 𝑓 𝑠 71.75 79.18 81.39 69.65 73.24 75.04 51.15 71.46
where 𝒟 = {(𝐱𝑖, 𝐈𝑖, 𝑦𝑖)} is the set of location 𝐱𝑖, image 𝐈𝑖, and label 𝑦𝑖 tuples that constitutes the whole validation or testing set, and 𝑁 denotes the total number of samples in 𝒟. 𝑅𝑎𝑛𝑘(𝐱𝑖, 𝐈𝑖, 𝑦𝑖) indicates the rank of the ground truth label 𝑦𝑖 in the ranked list of all classes based on the probability score 𝑃(𝑦𝑖|𝐱𝑖)𝑃(𝑦𝑖|𝐈𝑖) given by a specific geo-aware image classification model. 𝟏(∗) is defined the same as in Equation 25.
From Table 3, we can see that the Sphere2Vec models outperform the baselines on all seven datasets, and the variants with a linear number of DFS terms (𝑠𝑝ℎ𝑒𝑟𝑒𝐶, 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+, 𝑠𝑝ℎ𝑒𝑟𝑒𝑀, and 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+) work as well as or even better than 𝑑𝑓𝑠. This clearly shows the advantages of Sphere2Vec for handling large-scale geographic datasets. On the five species recognition datasets, 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ achieves the best performance, while 𝑠𝑝ℎ𝑒𝑟𝑒𝑀 and 𝑑𝑓𝑠 achieve the best performance on YFCC and fMoW, respectively. Similar to our findings in the synthetic dataset experiments, 𝑔𝑟𝑖𝑑 and 𝑡ℎ𝑒𝑜𝑟𝑦 also outperform or are comparable to 𝑁𝑒𝑅𝐹 on all 7 real-world datasets.
9.3. Hyperparameter Analysis
In order to find the best hyperparameter combination for each model on each dataset, we use grid search for hyperparameter tuning, including the supervised training learning rate 𝑙𝑟 = [0.01, 0.005, 0.002, 0.001, 0.0005, 0.00005], the number of scales 𝑆 = [16, 32, 64], the minimum scaling factor 𝑟𝑚𝑖𝑛 = [0.1, 0.05, 0.02, 0.01, 0.005, 0.001, 0.0001], the number of hidden layers and number of neurons used in 𝐍𝐍𝑓𝑓𝑛(⋅) – ℎ = [1, 2, 3, 4] and 𝑘 = [256, 512, 1024], and the dropout rate in 𝐍𝐍𝑓𝑓𝑛(⋅) – 𝑑𝑟𝑜𝑝𝑜𝑢𝑡 = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]. We also test multiple options for the nonlinear activation function used in 𝐍𝐍𝑓𝑓𝑛(⋅), including ReLU, LeakyReLU, and Sigmoid. The maximum scaling factor 𝑟𝑚𝑎𝑥 can be determined based on the range of latitude 𝜙 and longitude 𝜆. For 𝑔𝑟𝑖𝑑 and 𝑡ℎ𝑒𝑜𝑟𝑦, we use 𝑟𝑚𝑎𝑥 = 360 and for all Sphere2Vec models, we use 𝑟𝑚𝑎𝑥 = 1. As for 𝑟𝑏𝑓 and 𝑟𝑓𝑓, we also tune their hyperparameters, including the kernel size 𝜎 = [0.5, 1, 2, 10] as well as the number of kernels 𝑀 = [100, 200, 500].
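The search space above can be enumerated with a straightforward grid search; the sketch below shows the idea, where train_and_evaluate is a hypothetical placeholder for training one model configuration and returning its validation Top1 accuracy.

```python
import itertools

# Hyperparameter grid as described in the text.
param_grid = {
    "lr":      [0.01, 0.005, 0.002, 0.001, 0.0005, 0.00005],
    "S":       [16, 32, 64],
    "r_min":   [0.1, 0.05, 0.02, 0.01, 0.005, 0.001, 0.0001],
    "h":       [1, 2, 3, 4],
    "k":       [256, 512, 1024],
    "dropout": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
}

best_score, best_params = -1.0, None
for values in itertools.product(*param_grid.values()):
    params = dict(zip(param_grid.keys(), values))
    score = train_and_evaluate(params)   # hypothetical: train one config, return val Top1
    if score > best_score:
        best_score, best_params = score, params
```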
Based on this hyperparameter tuning, we find that using 0.5 as the dropout rate and ReLU as the nonlinear activation function for 𝐍𝐍𝑓𝑓𝑛(⋅) works best for every location encoder. Moreover, we find that 𝑙𝑟 and 𝑟𝑚𝑖𝑛 are the most important hyperparameters. Table 4 shows the best hyperparameter combinations of different Sphere2Vec models on different geo-aware image classification datasets. We use a smaller 𝑆 for 𝑑𝑓𝑠 since it has 𝑂(𝑆2) terms while the other models have 𝑂(𝑆) terms; 𝑑𝑓𝑠 with 𝑆 = 8 yields a similar number of terms to the other models with 𝑆 = 32 (see Table 5). Interestingly, all five Sphere2Vec models (𝑠𝑝ℎ𝑒𝑟𝑒𝐶, 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+, 𝑠𝑝ℎ𝑒𝑟𝑒𝑀, 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+, and 𝑑𝑓𝑠) show the best performance on the first six datasets with the same hyperparameter combination. On the fMoW dataset, the five Sphere2Vec models achieve their best performances with different 𝑟𝑚𝑖𝑛 but share the other hyperparameters. This indicates that the proposed Sphere2Vec models behave consistently across different hyperparameter combinations.
Table 4
The best hyperparameter combinations of the Sphere2Vec models on different geo-aware image classification datasets. The best 𝑆 is 8 for 𝑑𝑓𝑠 and 32 for all others; we fix the maximum scale 𝑟𝑚𝑎𝑥 as 1. Here, 𝑟𝑚𝑖𝑛 indicates the minimum scale. ℎ and 𝑘 are the number of hidden layers and the number of neurons in 𝐍𝐍𝑓𝑓𝑛() respectively.
Dataset | Model | 𝑙𝑟 | 𝑟𝑚𝑖𝑛 | 𝑘
BirdSnap | All | 0.001 | 10^-6 | 512
BirdSnap† | All | 0.001 | 10^-4 | 1024
NABirds† | All | 0.001 | 10^-4 | 1024
iNat2017 | All | 0.0001 | 10^-2 | 1024
iNat2018 | All | 0.0005 | 10^-3 | 1024
YFCC | All | 0.001 | 5 × 10^-3 | 512
fMoW | sphereC | 0.01 | 10^-3 | 512
fMoW | sphereC+ | 0.01 | 10^-4 | 512
fMoW | sphereM | 0.01 | 10^-3 | 512
fMoW | sphereM+ | 0.01 | 5 × 10^-4 | 512
fMoW | dfs | 0.01 | 10^-4 | 512
Table 5
Dimension of the position encoding for different models in terms of the total number of scales 𝑆.
Model | 𝑠𝑝ℎ𝑒𝑟𝑒𝐶 | 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+ | 𝑠𝑝ℎ𝑒𝑟𝑒𝑀 | 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ | 𝑑𝑓𝑠
Dim. | 3𝑆 | 6𝑆 | 5𝑆 | 8𝑆 | 4𝑆^2 + 4𝑆
We also find that using a deeper MLP as 𝐍𝐍𝑓𝑓𝑛(⋅), i.e., a larger ℎ, does not necessarily lead to better classification accuracy. In many cases, one hidden layer – ℎ = 1 – achieves the best performance for many kinds of location encoders. We discuss this in detail in Section 9.4.2.
Based on the hyperparameter tuning, the best hyperparameter combinations are selected for different models on different datasets, and the best results are reported in Table 3. Note that each model has been run 5 times and its mean Top1 score is reported. Due to space limits, the standard deviation of each model's performance on each dataset is not included in Table 3. However, we report the standard deviations of all models' performances on three datasets in Section 9.4.2.
9.4. Model Performance Sensitivity Analysis
9.4.1. Model Performance Distribution Comparison
To better understand the performance difference between Sphere2Vec and all baseline models, we visualize the distributions/histograms of the Top1 accuracy scores of different models on the BirdSnap†, NABirds†, iNat2018, and YFCC datasets under different hyperparameter combinations. More specifically, after the hyperparameter tuning process described in Section 9.3, for each location encoder and each dataset we get a collection of trained models with different hyperparameter combinations. They correspond to a distribution/histogram of Top1 accuracy scores for this model on the respective dataset. Figure 9 compares the histograms of 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ and all baseline models on four datasets. We can see that the histogram of 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ is
(a) BirdSnap†dataset (b) NABirds†dataset
(c) iNat2018 dataset (d) YFCC dataset
Figure 9: The model performance (Top1 accuracy) distributions/histograms of different models under different hyperparameter
combinations on (a) the BirdSnap†dataset, (b) the NABirds†dataset, (c) the iNat2018 dataset, and (d) the YFCC dataset. X
axis: the Top1 accuracy scores of the respective model; Y axis: the frequency of different hyperparameter combinations of the
same model falling in the same Top1 Accuracy bin. In all four plots, each color indicate a Top1 accuracy histogram of one specific
model on a specific dataset. This histogram shows the model’s sensitivity towards different hyperparameter combinations. We
can see that in all four plots, the histogram of 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+(the blue histogram) are clearly different from all baseline models’
histograms. This shows the clear advantage of 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+over all baselines.
clearly above those of all baselines. This further demon-
strates the superiority of Sphere2Vec over all baselines.
9.4.2. Performance Sensitivity to the Depth of MLP
To further understand how the performances of different location encoders vary with the depth of the multi-layer perceptron 𝐍𝐍𝑓𝑓𝑛(), we conduct a performance sensitivity analysis. Table 6 complements Table 3 by comparing the performance of 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ with all baseline models on the geo-aware image classification task. The results on three datasets are shown here: BirdSnap†, NABirds†, and iNat2018. For each model, we vary the depth of its 𝐍𝐍𝑓𝑓𝑛(), i.e., ℎ = [1, 2, 3, 4]. The best evaluation results for each ℎ are reported. Moreover, we run each model with one specific ℎ 5 times and report the standard deviation of the Top1 accuracy, indicated in "()".
Several observations can be made based on Table 6:
1. Although the absolute performance improvement be-
tween 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+and the best baseline model is
not very large – 0.91%, 0.62%, and 0.77% for three
datasets respectively, these performance improve-
ments are statistically significant given the stan-
dard deviations of these Top1 scores.
2. These performance improvements are comparable
to those from the previous studies on the same
tasks. In other words, the small margin is due to
the nature of these datasets. For example, Mai et al. (2020b) showed that 𝑔𝑟𝑖𝑑 or 𝑡ℎ𝑒𝑜𝑟𝑦 has 0.79% and 0.44% absolute Top1 accuracy improvements on the BirdSnap† and NABirds† datasets respectively. Mac Aodha et al. (2019) showed that 𝑤𝑟𝑎𝑝 has 0.09%, 0%, and 0.04% absolute Top1 accuracy improvements on the BirdSnap, BirdSnap†, and NABirds† datasets. Here, we only consider the results of 𝑤𝑟𝑎𝑝 that use location information but not temporal information. Although Mac Aodha et al. (2019) showed that, compared with 𝑡𝑖𝑙𝑒 and nearest neighbor methods, 𝑤𝑟𝑎𝑝 has 3.19% and 3.71% performance improvements on the iNat2017 and iNat2018 datasets, these large margins are mainly because the baselines they used are rather weak. When we consider the typical 𝑟𝑏𝑓 and 𝑟𝑓𝑓 (Rahimi et al., 2007) used in our study, their performance improvements drop to -0.02% and 0.61%.
3. By comparing the performances of the same model with different depths of its 𝐍𝐍𝑓𝑓𝑛(), i.e., ℎ, we can see that the model performance is not sensitive to ℎ. In fact, in most cases, a one-layer 𝐍𝐍𝑓𝑓𝑛() achieves the best result. This indicates that the depth of the MLP does not significantly affect the model performance and a deeper MLP does not necessarily lead to better performance. In other words, the systematic bias (i.e., distance distortion) introduced by 𝑔𝑟𝑖𝑑, 𝑡ℎ𝑒𝑜𝑟𝑦, and 𝑁𝑒𝑅𝐹 cannot later be compensated for by a deeper MLP. This shows the importance of designing a spherical-distance-aware location encoder.
9.4.3. Ablation Studies on Approaches for Image and
Location Fusion
In Section 5.2, we discussed how we fuse the predictions from the image encoder and the location encoder for the final model prediction. However, there are other ways to fuse the image and location information. In this section, we conduct ablation studies on different image and location fusion approaches (a minimal code sketch of these variants follows the list below):
•Post Fusion is the method we adopt from Mac Aodha
et al. (2019) which is illustrated in Figure 1. The image
encoder 𝐅(⋅)and location encoder 𝐸𝑛𝑐(⋅)are trained
separately and their final predictions are combined.
•Concat (Img. Finetune) indicates a method in which
the image embedding 𝐅(𝐈)and the location embed-
ding 𝐸𝑛𝑐(𝐱)are concatenated together and fed into
a softmax layer for the final prediction. The whole
architecture is trained end-to-end.
•Concat (Img. Frozen) indicates the same model architecture as Concat (Img. Finetune). The only difference is that 𝐅(⋅) is initialized with pretrained weights and its learnable parameters are frozen during the joint image and location training.
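To illustrate the difference between these fusion variants, here is a minimal PyTorch-style sketch (an illustration only; module names, dimensions, and the exact architectures are assumptions rather than the authors' implementation).

```python
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """'Concat' fusion: image embedding F(I) and location embedding Enc(x) are
    concatenated and fed into one softmax classifier; trained end-to-end."""
    def __init__(self, image_encoder, loc_encoder, img_dim, loc_dim, num_classes,
                 freeze_image=False):
        super().__init__()
        self.image_encoder, self.loc_encoder = image_encoder, loc_encoder
        if freeze_image:  # the 'Concat (Img. Frozen)' variant
            for p in self.image_encoder.parameters():
                p.requires_grad = False
        self.classifier = nn.Linear(img_dim + loc_dim, num_classes)

    def forward(self, image, loc):
        z = torch.cat([self.image_encoder(image), self.loc_encoder(loc)], dim=-1)
        return self.classifier(z)  # class logits

def post_fusion_scores(image_logits, loc_logits):
    """'Post Fusion': the separately trained image and location models each produce a
    class distribution, and the final score is their product P(y|I) * P(y|x)."""
    return torch.softmax(image_logits, dim=-1) * torch.softmax(loc_logits, dim=-1)
```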
We conduct experiments on the iNat2018 dataset and the results are shown in Table 7. We can see that:
Table 6
The impact of the depth ℎ of the multi-layer perceptron 𝐍𝐍𝑓𝑓𝑛() on Top1 accuracy for various models. The numbers in "()" indicate the standard deviations estimated from 5 independent train/test runs. We find that the model performances are not very sensitive to the depth of 𝐍𝐍𝑓𝑓𝑛(), and, in most cases, a one-layer 𝐍𝐍𝑓𝑓𝑛() achieves the best result. In other words, the larger performance gaps in fact come from the different 𝑃𝐸𝑆(⋅) we use. Moreover, given the performance variance of each model, we can see that 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ outperforms the other baseline models on all three datasets and the margins are statistically significant. The same conclusion can be drawn from our experiments on the other datasets; here, we only show results on three datasets as an illustrative example.
Dataset BirdSnap†NABirds†iNat2018
ℎTest Test Val
𝑥𝑦𝑧
1 78.81 (0.10) 81.08 (0.05) 71.60 (0.08)
2 78.83 (0.10) 81.20 (0.09) 71.70 (0.02)
3 78.97 (0.06) 81.11 (0.06) 71.75 (0.04)
4 78.84 (0.09) 81.02 (0.03) 71.71 (0.03)
𝑤𝑟𝑎𝑝
1 79.04 (0.13) 81.60 (0.04) 72.89 (0.08)
2 78.94 (0.13) 81.62 (0.04) 72.84 (0.07)
3 79.08 (0.15) 81.53 (0.02) 72.92 (0.05)
4 79.06 (0.11) 81.51 (0.09) 72.77 (0.06)
𝑤𝑟𝑎𝑝 +𝑓𝑓𝑛
1 78.97 (0.09) 81.23 (0.06) 71.90 (0.05)
2 79.02 (0.15) 81.36 (0.04) 71.95 (0.05)
3 79.21 (0.14) 81.35 (0.05) 71.94 (0.04)
4 79.06 (0.09) 81.27 (0.13) 71.93 (0.04)
𝑟𝑏𝑓
1 79.40 (0.13) 81.32 (0.08) 71.02 (0.18)
2 79.38 (0.12) 81.22 (0.11) 71.29 (0.20)
3 79.40 (0.04) 81.31 (0.07) 71.35 (0.21)
4 79.25 (0.05) 81.30 (0.07) 71.21 (0.19)
𝑟𝑓 𝑓
1 78.96 (0.18) 81.27 (0.07) 71.76 (0.06)
2 78.97 (0.04) 81.28 (0.05) 71.71 (0.09)
3 79.07 (0.12) 81.30 (0.11) 71.80 (0.04)
4 79.16 (0.13) 81.22 (0.11) 71.46 (0.05)
𝑔𝑟𝑖𝑑
1 79.72 (0.07) 81.24 (0.06) 73.02 (0.02)
2 79.05 (0.06) 81.09 (0.07) 72.87 (0.05)
3 79.23 (0.12) 80.95 (0.14) 72.69 (0.05)
4 78.97 (0.10) 80.71 (0.10) 72.51 (0.07)
𝑡ℎ𝑒𝑜𝑟𝑦
1 79.75 (0.17) 81.23 (0.02) 73.03 (0.09)
2 79.08 (0.20) 81.30 (0.11) 72.70 (0.02)
3 78.94 (0.19) 81.00 (0.09) 72.49 (0.08)
4 79.07 (0.14) 80.64 (0.14) 72.35 (0.07)
𝑁𝑒𝑅𝐹
1 79.66 (0.00) 81.27 (0.00) 73.00 (0.01)
2 79.65 (0.02) 81.29 (0.00) 72.97 (0.03)
3 79.40 (0.05) 81.32 (0.01) 72.88 (0.02)
4 79.24 (0.04) 81.23 (0.00) 72.80 (0.02)
𝑠𝑝ℎ𝑒𝑟𝑒𝑀+
1 80.57 (0.08) 81.87 (0.02) 73.80 (0.05)
2 79.82 (0.14) 81.83 (0.04) 73.42 (0.06)
3 80.03 (0.08) 81.94 (0.04) 73.40 (0.05)
4 79.90 (0.15) 81.84 (0.09) 73.20 (0.04)
•Post Fusion, the method we adopt in our study, achieves the best Top1 score and outperforms both Concat approaches. This result aligns with the results of Chu et al. (2019).
Table 7
Ablation studies on different ways to combine image and location information on the iNat2018 dataset. The "Fusion" column indicates different methods to fuse image and location information. "𝐅(⋅)" and "𝐸𝑛𝑐(⋅)" indicate the type of image encoder and location encoder used for each model. "𝐅(⋅) Train" denotes different ways to train the image encoder. "Frozen" means we use an InceptionV3 network pre-trained on ImageNet as an image feature extractor and freeze its learnable parameters while only finetuning the last softmax layer. "Finetune" means we finetune the whole image encoder 𝐅(⋅).
Model Concat (Frozen) Concat (Finetune) Post Fusion
𝐅(⋅)InceptionV3 InceptionV3 InceptionV3
𝐅(⋅)Train Frozen Finetune Finetune
𝐸𝑛𝑐(⋅)𝑠𝑝ℎ𝑒𝑟𝑒𝑀+𝑠𝑝ℎ𝑒𝑟𝑒𝑀+𝑠𝑝ℎ𝑒𝑟𝑒𝑀+
Top1 48.74 73.35 73.72
•Concat (Img. Frozen) shows a significantly lower performance than Concat (Img. Finetune). This is understandable and consistent with the existing literature (Ayush et al., 2020), since a linear probing method such as Concat (Img. Frozen) usually underperforms a full fine-tuning method such as Concat (Img. Finetune).
•Although Post Fusion only shows a small margin over Concat (Img. Finetune), the training process of Post Fusion is much easier since we can separate the training of the image encoder 𝐅(⋅) and the location encoder 𝐸𝑛𝑐(⋅). In contrast, Concat (Img. Finetune) has to train a large network for which hyperparameter tuning is hard.
9.5. Understand the Superiority of Sphere2Vec
Based on our theoretical analysis of Sphere2Vec, we make two hypotheses to explain the superiority of Sphere2Vec over 2D Euclidean location encoders such as 𝑡ℎ𝑒𝑜𝑟𝑦 and 𝑔𝑟𝑖𝑑:
A: Our spherical-distance-kept Sphere2Vec has a significant advantage over 2D location encoders in the polar areas, where we expect a large map projection distortion.
B: Sphere2Vec outperforms 2D location encoders in areas with sparse sample points, because it is difficult for 𝑔𝑟𝑖𝑑 and 𝑡ℎ𝑒𝑜𝑟𝑦 to learn spherical distances in these areas with fewer samples, whereas Sphere2Vec can handle them due to its theoretical guarantee of spherical distance preservation.
To validate these two hypotheses, we use iNat2017 and fMoW to conduct multiple empirical analyses. Table 3 uses Top1 classification accuracy as the evaluation metric to be aligned with several previous works (Mac Aodha et al., 2019; Mai et al., 2020b; Ayush et al., 2020). However, Top1 only considers the samples whose ground truth labels are top-ranked while ignoring all the other samples' ranks. In contrast, the mean reciprocal rank (MRR) considers the ranks of all samples. Equation 27 shows the definition of MRR:
MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{Rank(\mathbf{x}_i, \mathbf{I}_i, y_i)}. \quad (27)
where 𝑁 and 𝑅𝑎𝑛𝑘(𝐱𝑖, 𝐈𝑖, 𝑦𝑖) have the same definitions as those in Equation 26. A higher MRR indicates better model performance. Because of this advantage of MRR, we use MRR as the evaluation metric to compare different models.
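A minimal MRR implementation mirroring Equation 27 is sketched below (assuming, as in Equation 26, that the per-class scores are the product 𝑃(𝑦|𝐱)𝑃(𝑦|𝐈)); it is an illustration, not the authors' evaluation code.

```python
import numpy as np

def mean_reciprocal_rank(scores, labels):
    """MRR as in Equation 27.
    scores: (N, num_classes) array of per-class scores, e.g. P(y|x) * P(y|I);
    labels: (N,) ground-truth class ids."""
    order = np.argsort(-scores, axis=1)
    ranks = np.argmax(order == labels[:, None], axis=1) + 1   # rank of the true class
    return float(np.mean(1.0 / ranks))
```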
9.5.1. Analysis on the iNat2017 Dataset
Figures 10 and 11 show the analysis results on the iNat2017 dataset. Figure 10a shows the image locations in the iNat2017 validation dataset. We split this dataset into different latitude bands as indicated by the black lines in Figure 10a. The numbers of samples in each latitude band for the training and validation datasets of iNat2017 are visualized in Figure 10b. We can see that more samples are available in the Northern Hemisphere, especially when 𝜙 > 10◦.
We compare the MRR scores of different models in different geographic regions to see how the differences in MRR change across space. We compute the MRR difference between 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+ and 𝑔𝑟𝑖𝑑, i.e., Δ𝑀𝑅𝑅 = 𝑀𝑅𝑅(𝑠𝑝ℎ𝑒𝑟𝑒𝐶+) − 𝑀𝑅𝑅(𝑔𝑟𝑖𝑑), in different latitude-longitude cells and visualize it in Figure 10c (a minimal sketch of this per-cell computation is given below). Here, the color of each cell encodes Δ𝑀𝑅𝑅: red and blue indicate positive and negative Δ𝑀𝑅𝑅, white indicates nearly zero Δ𝑀𝑅𝑅, and darker colors correspond to higher absolute Δ𝑀𝑅𝑅 values. Numbers in cells indicate the total number of validation samples in each cell. We can see that 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+ outperforms 𝑔𝑟𝑖𝑑 in almost all cells near the North Pole since all these cells are shown in red. This observation confirms our Hypothesis A. However, we also see two blue cells at the South Pole. Given that these cells only contain 5 and 7 samples, we attribute these two blue cells to the stochasticity involved in neural network training.
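The per-cell comparison can be reproduced with a simple binning of the validation samples into latitude-longitude cells; the sketch below shows one way to do it (the cell size and binning convention here are assumptions, not the paper's exact setup).

```python
import numpy as np

def delta_mrr_per_cell(lons, lats, ranks_a, ranks_b, cell=30.0):
    """Delta MRR = MRR(model A) - MRR(model B) aggregated per latitude-longitude cell.
    lons, lats: sample coordinates in degrees; ranks_a, ranks_b: per-sample ranks of the
    ground-truth class under the two models."""
    n_rows, n_cols = int(180 // cell), int(360 // cell)
    rows = np.clip(((lats + 90.0) // cell).astype(int), 0, n_rows - 1)
    cols = np.clip(((lons + 180.0) // cell).astype(int), 0, n_cols - 1)
    delta = np.full((n_rows, n_cols), np.nan)
    for r in range(n_rows):
        for c in range(n_cols):
            mask = (rows == r) & (cols == c)
            if mask.any():
                delta[r, c] = np.mean(1.0 / ranks_a[mask]) - np.mean(1.0 / ranks_b[mask])
    return delta
```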
To further validate Hypothesis A, we compute the MRR scores of different models in different latitude bands. The Δ𝑀𝑅𝑅 between each model and 𝑔𝑟𝑖𝑑 in different latitude bands is visualized in Figure 10d. We can clearly see that the four Sphere2Vec models have larger Δ𝑀𝑅𝑅 near the North Pole, which validates Hypothesis A. Moreover, Sphere2Vec has advantages in bands with fewer data samples, e.g., 𝜙 ∈ [−30◦, −20◦). This observation also confirms Hypothesis B.
To further understand the relation between model performance and the number of data samples in different geographic regions, we contrast the number of samples with Δ𝑀𝑅𝑅. Figure 11a contrasts the number of samples per cell with Δ𝑀𝑅𝑅 = 𝑀𝑅𝑅(𝑠𝑝ℎ𝑒𝑟𝑒𝐶+) − 𝑀𝑅𝑅(𝑔𝑟𝑖𝑑) per cell (denoted as blue dots). We classify the latitude-longitude cells into different groups based on the number of samples, and an average Δ𝑀𝑅𝑅 is computed for each group (denoted as the yellow dots). We can see that 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+ has more advantages over 𝑔𝑟𝑖𝑑 on cells with fewer data samples. This shows the robustness of 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+ in data-sparse areas. Similarly, Figure 11b contrasts the number of samples in each latitude band with the Δ𝑀𝑅𝑅 between different models and 𝑔𝑟𝑖𝑑 per
(a) Validation Locations (b) Samples per 𝜙band
(c) Δ𝑀𝑅𝑅 per cell (d) Δ𝑀𝑅𝑅 per 𝜙band
Figure 10: The data distribution of the iNat2017 dataset and model performance comparison on it: (a) Sample locations for the validation set of the iNat2017 dataset, where the dashed and solid lines indicate latitudes; (b) The number of training and validation samples in different latitude intervals. (c) Δ𝑀𝑅𝑅 = 𝑀𝑅𝑅(𝑠𝑝ℎ𝑒𝑟𝑒𝐶+) − 𝑀𝑅𝑅(𝑔𝑟𝑖𝑑) for each latitude-longitude cell. Red and blue colors indicate positive and negative Δ𝑀𝑅𝑅 while darker colors mean higher absolute values. The number on each cell indicates the number of validation data points, where "1K+" means there are more than 1K points in the cell. (d) Δ𝑀𝑅𝑅 between each model and the baseline 𝑔𝑟𝑖𝑑 on the validation dataset in different latitude bands.
band. We can see that the four Sphere2Vec models show advantages over 𝑔𝑟𝑖𝑑 in bands with fewer samples. 𝑟𝑏𝑓 is particularly bad in data-sparse bands, which is a typical drawback of kernel-based methods. The observations from Figure 11a and 11b confirm our Hypothesis B.
9.5.2. Analysis on the fMoW Dataset
Following the same practice as Figure 10, Figure 12 shows similar analysis results on the fMoW dataset. Figure 12a visualizes the sample locations in the fMoW validation dataset and Figure 12b shows the numbers of training and validation samples in each latitude band. Similar to the iNat2017 dataset, we can see that for the fMoW dataset more samples are available in the Northern Hemisphere, especially when 𝜙 > 20◦.
Similar to Figure 10c, Figure 12c shows Δ𝑀𝑅𝑅 = 𝑀𝑅𝑅(𝑑𝑓𝑠) − 𝑀𝑅𝑅(𝑔𝑟𝑖𝑑) for each latitude-longitude cell. Red and blue colors indicate positive and negative Δ𝑀𝑅𝑅. Observations similar to those in Figure 10c can be made: 𝑑𝑓𝑠 has advantages over 𝑔𝑟𝑖𝑑 in most cells near the North Pole and South Pole, and 𝑔𝑟𝑖𝑑 only wins in a few polar cells with small numbers of samples. This observation confirms our Hypothesis A.
Similar to Figure 10d, Figure 12d visualizes the Δ𝑀𝑅𝑅 between each model and 𝑔𝑟𝑖𝑑 in different latitude bands on the fMoW dataset. We can see that all Sphere2Vec models outperform 𝑔𝑟𝑖𝑑 on all latitude bands, and 𝑑𝑓𝑠 has a clear advantage over all the other models on all bands. Moreover, all Sphere2Vec models have clear advantages over 𝑔𝑟𝑖𝑑 near the North Pole and South Pole, which further confirms our Hypothesis A. In the latitude band 𝜙 ∈ [0◦, 10◦), where we have fewer training samples (see Figure 12b), 𝑑𝑓𝑠 has clear advantages over the other models, which confirms our Hypothesis B.
9.6. Visualize Estimated Spatial Distributions
To have a better understanding of how well different
location encoders model the geographic prior distributions
of different image labels, we use iNat2018 and fMoW data
as examples and plot the predicted spatial distributions of
different example species/land use types from different lo-
cation encoders, and compare them with the training sample
(a) Δ𝑀𝑅𝑅 per cell (b) Δ𝑀𝑅𝑅 per 𝜙band
Figure 11: The number of sample v.s. the model performance improvements on the iNat2017 dataset: (a) The number of
validation samples v.s. Δ𝑀𝑅𝑅 =𝑀𝑅𝑅(sphereC+) - MRR(𝑔𝑟𝑖𝑑)per latitude-longitude cell defined in Figure 10c. The orange
dots represent moving averages. (b) The number of validation samples v.s. Δ𝑀𝑅𝑅 per latitude band defined in Figure 10d.
locations of the corresponding species or land use types (see
Figure 13 and 14).
9.6.1. Predicted Species Distribution for iNat2018
From Figure 13, we can see that 𝑤𝑟𝑎𝑝 (Mac Aodha et al.,
2019) produces rather over-generalized species distributions
due to the fact that it is a single-scale location encoder.
𝑠𝑝ℎ𝑒𝑟𝑒𝐶+(our model) produces a more compact and fine-
grained distribution in each geographic region, especially in
the polar region and in data-sparse areas such as Africa and
Asia. The distributions produced by 𝑔𝑟𝑖𝑑 (Mai et al.,2020b)
are between these two. However, 𝑔𝑟𝑖𝑑 has limited spatial distribution modeling ability in the polar areas (e.g., Figure 13d and 13s) as well as in data-sparse regions.
For example, in the white-browed wagtail example, 𝑤𝑟𝑎𝑝 produces an over-generalized spatial distribution which covers India, eastern Saudi Arabia, and the southwest of China (see Figure 13m). However, according to the training sample locations (Figure 13l), white-browed wagtails only occur in India. 𝑔𝑟𝑖𝑑 is better than 𝑤𝑟𝑎𝑝 but still produces a distribution covering the southwest of China. 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+ produces the most compact distribution estimation. Similarly, for the red-striped leafwing, the sample locations are clustered in a small region in West Africa while 𝑤𝑟𝑎𝑝 produces an over-generalized distribution (see Figure 13ab). 𝑔𝑟𝑖𝑑 produces a better distribution estimation (see Figure 13ac) but it still has an over-generalization issue. Our 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+ produces the best estimation among these three models – a compact distribution estimation covering exactly the West Africa region (see Figure 13ad).
9.6.2. Predicted Land Use Distribution for fMoW
Similar visualizations are made for some example land use types in the fMoW dataset in Figure 14. Factories/powerplants (Figure 14a) might look similar to multi-unit residential buildings (Figure 14f) in overhead satellite imagery, but they have very different geographic distributions (Figure 14b and 14g). A similar situation can be seen for parks (Figure 14k and 14l) and archaeological sites (Figure 14p and 14q).
The estimated spatial distributions of these four land use types from three location encoders, i.e., 𝑤𝑟𝑎𝑝, 𝑔𝑟𝑖𝑑, and 𝑑𝑓𝑠, are visualized. Observations similar to those from Figure 13 can be made: 𝑤𝑟𝑎𝑝 usually produces over-generalized distributions, 𝑑𝑓𝑠 generates more compact and accurate distributions, and 𝑔𝑟𝑖𝑑 is between these two. We also find that 𝑔𝑟𝑖𝑑 generates some grid-like patterns due to the use of sinusoidal functions; 𝑑𝑓𝑠 suffers less from this and produces more accurate distributions.
9.7. Location Embedding Clustering
To show how the trained location encoders learn the image label distributions, we divide the globe into small latitude-longitude cells and use a location encoder (e.g., Sphere2Vec or a baseline location encoder) trained on the iNat2017 or iNat2018 dataset to produce a location embedding for the center of each cell. Then we perform agglomerative clustering13 on all these embeddings to produce a clustering map (a minimal sketch of this procedure is given below). Figures 15 and 16 show the clustering results for different models with different hyperparameters on the iNat2017 and iNat2018 datasets.
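A minimal sketch of this clustering procedure is given below; the cell size, the number of clusters, and the location_encoder callable (standing in for a trained Sphere2Vec or baseline encoder) are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

cell = 2.0  # cell size in degrees (an assumption)
lons, lats = np.meshgrid(np.arange(-180 + cell / 2, 180, cell),
                         np.arange(-90 + cell / 2, 90, cell))
coords = np.stack([lons.ravel(), lats.ravel()], axis=1)   # (num_cells, 2) lon/lat cell centers

# `location_encoder` maps (lon, lat) coordinates to embedding vectors.
embeddings = location_encoder(coords)                     # (num_cells, embed_dim)

clusters = AgglomerativeClustering(n_clusters=20).fit_predict(embeddings)
cluster_map = clusters.reshape(lats.shape)                # one cluster id per grid cell
```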
From Figure 15, we can see that:
1. In all these clustering maps, nearby locations are
clustered together which indicates their location em-
beddings are similar to each other. This confirms that
the learned location encoder can preserve distance
information.
2. In the 𝑟𝑏𝑓 clustering map shown in Figure 15d, except for North America, almost all the other regions are in the same cluster. This is because, compared with North America, all other regions have fewer training samples. This indicates that 𝑟𝑏𝑓 cannot generate a reliable spatial distribution estimation in data-sparse regions.
13 https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
(a) Validation Locations (b) Samples per 𝜙band
(c) Δ𝑀𝑅𝑅 per cell (d) Δ𝑀𝑅𝑅 per 𝜙band
Figure 12: The data distribution of the fMoW dataset and model performance comparison on it: (a) Sample locations for the validation set of the fMoW dataset; (b) The number of training and validation samples in different latitude intervals. (c) Δ𝑀𝑅𝑅 = 𝑀𝑅𝑅(𝑑𝑓𝑠) − 𝑀𝑅𝑅(𝑔𝑟𝑖𝑑) for each latitude-longitude cell. Red and blue colors indicate positive and negative Δ𝑀𝑅𝑅 while darker colors mean higher absolute values. The number on each cell indicates the number of validation data points, where "1K+" means there are more than 1K points in the cell. (d) Δ𝑀𝑅𝑅 between each model and the baseline 𝑔𝑟𝑖𝑑 on the validation dataset in different latitude bands.
3. The clustering maps of 𝑔𝑟𝑖𝑑 (Figures 15b and 15c) show horizontal strip-like clusters. More specifically, in Figure 15c, the boundaries of many clusters are parallel to the longitude and latitude lines. We hypothesize that these kinds of artifacts are created because 𝑔𝑟𝑖𝑑 measures the latitude and longitude differences separately (see Theorem 3), which cannot measure the spherical distance correctly.
4. 𝑤𝑟𝑎𝑝 (Figure 15a), 𝑠𝑝ℎ𝑒𝑟𝑒𝑀 (Figure 15h), 𝑠𝑝ℎ𝑒𝑟𝑒𝐶 (Figure 15j), 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+ (Figure 15k), 𝑠𝑝ℎ𝑒𝑟𝑒𝑀+ (Figure 15l), and 𝑑𝑓𝑠 (Figure 15m) show reasonable geographic clustering maps. Each cluster has rather natural-looking curvilinear boundaries instead of linear boundaries. We think this reflects the true mixture of different species distributions. However, as we showed in Section 9.6, the single-scale 𝑤𝑟𝑎𝑝 produces over-generalized distributions while Sphere2Vec can produce more compact distribution estimations.
Similar conclusions can be drawn from Figure 16. We
believe those figures visually demonstrate the superiority of
Sphere2Vec.
10. Conclusion
In this work, we propose a general-purpose multi-scale spherical location encoder - Sphere2Vec - which can encode any location on the spherical surface into a high-dimensional vector that is learning-friendly for downstream neural network models. We provide theoretical proof that Sphere2Vec is able to preserve the spherical surface distance between points. As a comparison, we also prove that 2D location encoders such as 𝑔𝑟𝑖𝑑 (Gao et al., 2019; Mai et al., 2020b) model the latitude and longitude differences of two points separately, and that NeRF-style 3D location encoders (Mildenhall et al., 2020; Schwarz et al., 2020; Niemeyer and Geiger, 2021) model the axis-wise differences between two points in 3D Euclidean space separately. Neither of them can model the true spherical distance. To verify the superiority of Sphere2Vec in a controlled setting, we generate 20 synthetic datasets and evaluate Sphere2Vec and all baselines on them. Results show that Sphere2Vec can outperform all baselines on all 20 synthetic datasets and the error rate reduction can go up to 30.8%. The results indicate that when the underlying dataset has a larger data bias towards the polar areas, we
(a) Image (b) Feather duster worm (c) 𝑤𝑟𝑎𝑝 (d) 𝑔𝑟𝑖𝑑 (e) 𝑠𝑝ℎ𝑒𝑟𝑒𝐶 +
(f) Image (g) African pied hornbill (h) 𝑤𝑟𝑎𝑝 (i) 𝑔𝑟𝑖𝑑 (j) 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+
(k) Image (l) White-browed wagtail (m) 𝑤𝑟𝑎𝑝 (n) 𝑔𝑟𝑖𝑑 (o) 𝑠𝑝ℎ𝑒𝑟𝑒𝐶 +
(p) Image (q) Arctic Fox (r) 𝑤𝑟𝑎𝑝 (s) 𝑔𝑟𝑖𝑑 (t) 𝑠𝑝ℎ𝑒𝑟𝑒𝐶 +
(u) Image (v) Bat-Eared Fox (w) 𝑤𝑟𝑎𝑝 (x) 𝑔𝑟𝑖𝑑 (y) 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+
(z) Image (aa) Red-striped leafwing (ab) 𝑤𝑟𝑎𝑝 (ac) 𝑔𝑟𝑖𝑑 (ad) 𝑠𝑝ℎ𝑒𝑟𝑒𝐶 +
(ae) Image (af) False Tiger Moth (ag) 𝑤𝑟𝑎𝑝 (ah) 𝑔𝑟𝑖𝑑 (ai) 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+
Figure 13: Comparison of the predicted spatial distributions of example species in the iNat2018 dataset from different location encoders. Each row corresponds to one specific species. We show one marine polychaete worm species, two bird species, two fox species, and two butterfly species. The first and second figures of each row show an example image as well as the data points of the species from the iNat2018 training data.
expect a bigger performance improvement of Sphere2Vec. We further conduct experiments on three geo-aware image classification tasks with 7 large-scale real-world datasets. Results show that Sphere2Vec can outperform the state-of-the-art 2D location encoders on all 7 datasets. Further analysis shows that Sphere2Vec especially excels in polar regions as well as in data-sparse areas.
Encoding point features on a spherical surface is a fundamental problem, especially in geoinformatics, geography, meteorology, oceanography, geoscience, and environmental science. Our proposed Sphere2Vec is a general-purpose spherical-distance-preserving encoding which realizes our idea of directly calculating on the round planet. It can be utilized in a wide range of geospatial prediction tasks. In this work, we only conduct experiments on geo-aware image classification and spatial distribution estimation. Beyond the tasks discussed above, potential applications include areas like public health, epidemiology, agriculture, economics, ecology, and environmental engineering, and research problems like large-scale human mobility and trajectory prediction (Xu et al., 2018), geographic question answering (Mai et al., 2020a), global biodiversity hotspot prediction (Myers et al., 2000; Di Marco et al., 2019; Ceballos et al., 2020), weather forecasting and climate change (Dupont et al., 2021; Ham et al., 2019), global pandemic study and its relation to air pollution (Wu et al., 2020), and so on. In general, we expect our proposed Sphere2Vec to benefit various AI for social good14 applications which involve predictive modeling at global scales. Moreover, Sphere2Vec can also contribute to the idea of developing a foundation
14 https://ai.google/social-good/
(a) Image (b) Factory or powerplant (c) 𝑤𝑟𝑎𝑝 (d) 𝑔𝑟𝑖𝑑 (e) 𝑑 𝑓 𝑠
(f) Image (g) Multi-unit residential (h) 𝑤𝑟𝑎𝑝 (i) 𝑔𝑟𝑖𝑑 (j) 𝑑 𝑓 𝑠
(k) Image (l) Park (m) 𝑤𝑟𝑎𝑝 (n) 𝑔𝑟𝑖𝑑 (o) 𝑑 𝑓 𝑠
(p) Image (q) Archaeological site (r) 𝑤𝑟𝑎𝑝 (s) 𝑔 𝑟𝑖𝑑 (t) 𝑑 𝑓 𝑠
Figure 14: Comparison of the predicted spatial distributions of example land use types in the fMoW dataset from different location encoders. Each row corresponds to one specific land use type. The first and second figures of each row show an example image as well as the data points of this land use type from the fMoW training data. As shown in Figures (a) and (f), although factories or powerplants and the multi-unit residential type look very similar in overhead satellite imagery, they have very distinct spatial distributions (Figures (b) and (g)). Similarly, parks and archaeological sites look similar in satellite imagery (Figures (k) and (p)) since both are usually covered by vegetation; however, they have very distinct spatial distributions (Figures (l) and (q)). We compare the predicted spatial distribution of each land use type from three different location encoders: 𝑤𝑟𝑎𝑝, 𝑔𝑟𝑖𝑑, and 𝑑𝑓𝑠.
model for geospatial artificial intelligence (Mai et al.,2022c,
2023a) in general.
Declaration of Competing Interest
The authors declare that they have no known competing
financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
Acknowledgement
We would like to thank Prof. Keith Clarke for his sug-
gestions on different map projection distortion errors and his
help on generating Figure 3.
This work is mainly funded by the National Science
Foundation under Grant No. 2033521 A1 – KnowWhere-
Graph: Enriching and Linking Cross-Domain Knowledge
Graphs using Spatially-Explicit AI Technologies. Gengchen
Mai acknowledges the support of UCSB Schmidt Summer
Research Accelerator Award, Microsoft AI for Earth Grant,
the Office of the Director of National Intelligence (ODNI),
Intelligence Advanced Research Projects Activity (IARPA)
via 2021-2011000004, and the Office of Research Inter-
nal Research Support Co-funding Grant at University of
Georgia. Stefano Ermon acknowledges support from NSF
(#1651565), AFOSR (FA95501910024), ARO (W911NF-
21-1-0125), Sloan Fellowship, and CZ Biohub. Any opin-
ions, findings, conclusions, or recommendations expressed
in this material are those of the authors and do not necessar-
ily reflect the views of the National Science Foundation.
References
Adams, B., McKenzie, G., Gahegan, M., 2015. Frankenplace: interactive
thematic mapping for ad hoc exploratory search, in: Proceedings of the
24th international conference on world wide web, International World
Wide Web Conferences Steering Committee. pp. 12–22.
Anokhin, I., Demochkin, K., Khakhulin, T., Sterkin, G., Lempitsky, V., Ko-
rzhenkov, D., 2021a. Image generators with conditionally-independent
pixel synthesis, in: Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pp. 14278–14287.
Anokhin, I., Demochkin, K., Khakhulin, T., Sterkin, G., Lempitsky, V., Ko-
rzhenkov, D., 2021b. Image generators with conditionally-independent
pixel synthesis, in: Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pp. 14278–14287.
Ayush, K., Uzkent, B., Meng, C., Tanmay, K., Burke, M., Lobell, D.,
Ermon, S., 2020. Geography-aware self-supervised learning. arXiv
preprint arXiv:2011.09980 .
Banino, A., Barry, C., Uria, B., Blundell, C., Lillicrap, T., Mirowski, P.,
Pritzel, A., Chadwick, M.J., Degris, T., Modayil, J., et al., 2018. Vector-
based navigation using grid-like representations in artificial agents.
Nature 557, 429.
Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla,
R., Srinivasan, P.P., 2021. Mip-nerf: A multiscale representation for
anti-aliasing neural radiance fields, in: Proceedings of the IEEE/CVF
International Conference on Computer Vision, pp. 5855–5864.
Bartnik, R., Norton, A., 2000. Numerical methods for the einstein equa-
tions in null quasi-spherical coordinates. SIAM Journal on Scientific
Computing 22, 917–950.
Basri, R., Galun, M., Geifman, A., Jacobs, D., Kasten, Y., Kritchman, S.,
2020. Frequency bias in neural networks for input of non-uniform
(a) 𝑤𝑟𝑎𝑝 (b) 𝑔𝑟𝑖𝑑 (𝑟𝑚𝑖𝑛 = 10−2 )(c) 𝑔𝑟𝑖𝑑 (𝑟𝑚𝑖𝑛 = 10−6 )(d) 𝑟𝑏𝑓 (𝜎= 1, 𝑚 = 200)
(e) 𝑡ℎ𝑒𝑜𝑟𝑦 (𝑟𝑚𝑖𝑛 = 10−2)(f) 𝑡ℎ𝑒𝑜𝑟𝑦 (𝑟𝑚𝑖𝑛 = 10−6 )(g) 𝑁𝑒𝑅𝐹 (𝑆= 32) (h) 𝑠𝑝ℎ𝑒𝑟𝑒𝑀 (𝑟𝑚𝑖𝑛 = 10−1 )
(i) 𝑠𝑝ℎ𝑒𝑟𝑒𝑀 (𝑟𝑚𝑖𝑛 = 10−2)(j) 𝑠𝑝ℎ𝑒𝑟𝑒𝐶 (𝑟𝑚𝑖𝑛 = 10−2 )(k) 𝑠𝑝ℎ𝑒𝑟𝑒𝐶+ (𝑟𝑚𝑖𝑛 = 10−2)(l) 𝑠𝑝ℎ𝑒𝑟𝑒𝑀 + (𝑟𝑚𝑖𝑛 = 10−2 )
(m) 𝑑𝑓 𝑠 (𝑟𝑚𝑖𝑛 = 10−2)
Figure 15: Embedding clusterings of different location encoders trained on the iNat2017 dataset. (a) 𝑤𝑟𝑎𝑝 with 4 hidden ReLU layers of 256 neurons; (d) 𝑟𝑏𝑓 with the best kernel size 𝜎 = 1 and number of anchor points 𝑚 = 200; (b)(c)(e)(f) are Space2Vec models (Mai et al., 2020b) with different min scales 𝑟𝑚𝑖𝑛 = {10−6, 10−2}.a (g) is 𝑁𝑒𝑅𝐹 with 𝑆 = 32 and 1 hidden ReLU layer of 512 neurons. (h)-(m) are different Sphere2Vec models.b
a They share the same best hyperparameters: 𝑆 = 64, 𝑟𝑚𝑎𝑥 = 1, and 1 hidden ReLU layer of 512 neurons.
b They share the same best hyperparameters: 𝑆 = 32, 𝑟𝑚𝑎𝑥 = 1, and 1 hidden ReLU layer of 1024 neurons.
density, in: International Conference on Machine Learning, PMLR. pp.
685–694.
Berg, T., Liu, J., Woo Lee, S., Alexander, M.L., Jacobs, D.W., Belhumeur,
P.N., 2014. BirdSnap: Large-scale fine-grained visual categorization of
birds, in: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 2011–2018.
Boyer, C.B., 2012. History of analytic geometry. Courier Corporation.
Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P., 2017.
Geometric deep learning: going beyond euclidean data. IEEE Signal
Processing Magazine 34, 18–42.
Caminade, C., Kovats, S., Rocklov, J., Tompkins, A.M., Morse, A.P.,
Colón-González, F.J., Stenlund, H., Martens, P., Lloyd, S.J., 2014.
Impact of climate change on global malaria distribution. Proceedings
of the National Academy of Sciences 111, 3286–3291. URL: https:
//www.pnas.org/content/111/9/3286, doi:10.1073/pnas.1302089111,
arXiv:https://www.pnas.org/content/111/9/3286.full.pdf.
Ceballos, G., Ehrlich, P.R., Raven, P.H., 2020. Vertebrates
on the brink as indicators of biological annihilation and
the sixth mass extinction. Proceedings of the National
Academy of Sciences URL: https://www.pnas.org/content/
early/2020/05/27/1922686117, doi:10.1073/pnas.1922686117,
arXiv:https://www.pnas.org/content/early/2020/05/27/1922686117.full.pdf.
Chen, Y., Liu, S., Wang, X., 2021. Learning continuous image representa-
tion with local implicit image function, in: Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition, pp. 8628–8638.
Chinazzi, M., Davis, J.T., Ajelli, M., Gioannini, C., Litvinova, M., Merler,
S., Pastore y Piontti, A., Mu, K., Rossi, L., Sun, K., Viboud, C., Xiong,
X., Yu, H., Halloran, M.E., Longini, I.M., Vespignani, A., 2020. The
effect of travel restrictions on the spread of the 2019 novel coronavirus
(covid-19) outbreak. Science 368, 395–400. URL: https://science.
sciencemag.org/content/368/6489/395, doi:10.1126/science.aba9757,
arXiv:https://science.sciencemag.org/content/368/6489/395.full.pdf.
Chrisman, N.R., 2017. Calculating on a round planet. International Journal
of Geographical Information Science 31, 637–657.
Christie, G., Fendley, N., Wilson, J., Mukherjee, R., 2018. Functional
map of the world, in: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 6172–6180.
Chu, G., Potetz, B., Wang, W., Howard, A., Song, Y., Brucher, F., Leung, T.,
Adam, H., 2019. Geo-aware networks for fine grained recognition, in:
Proceedings of the IEEE International Conference on Computer Vision
Workshops, pp. 0–0.
Cohen, T.S., Geiger, M., Köhler, J., Welling, M., 2018. Spherical CNNs,
in: Proceedings of ICLR 2018.
Coors, B., Paul Condurache, A., Geiger, A., 2018. SphereNet: Learning
spherical representations for detection and classification in omnidirec-
tional images, in: Proceedings of the European Conference on Computer
Vision (ECCV), pp. 518–533.
Cueva, C.J., Wei, X.X., 2018. Emergence of grid-like representations by
training recurrent neural networks to perform spatial localization, in:
International Conference on Learning Representations.
Derksen, D., Izzo, D., 2021. Shadow neural radiance fields for multi-view
satellite photogrammetry, in: Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 1152–1161.
Di Marco, M., Ferrier, S., Harwood, T.D., Hoskins, A.J., Watson, J.E.M.,
2019. Wilderness areas halve the extinction risk of terrestrial bio-
diversity. Nature 573, 582–585. URL: https://doi.org/10.1038/s41586-019-1567-7, doi:10.1038/s41586-019-1567-7.
Dupont, E., Golinski, A., Alizadeh, M., Teh, Y.W., Doucet, A., 2021. COIN: Compression with implicit neural representations, in: Neural Compression: From Information Theory to Applications – Workshop @ ICLR 2021.
Dupont, E., Teh, Y.W., Doucet, A., 2021. Generative models as distributions
of functions. arXiv preprint arXiv:2102.04776 .
Gao, R., Xie, J., Zhu, S.C., Wu, Y.N., 2019. Learning grid cells as vector
representation of self-position coupled with matrix representation of
self-motion, in: International Conference on Learning Representations.
Gupta, J., Molnar, C., Xie, Y., Knight, J., Shekhar, S., 2021. Spatial
variability aware deep neural networks (svann): A general approach.
ACM Transactions on Intelligent Systems and Technology (TIST) 12,
1–21.
Figure 16: Embedding clusterings of different location encoders trained on the iNat2018 dataset. Panels: (a) wrap; (b) grid (r_min = 10^-3); (c) grid (r_min = 10^-6); (d) rbf (σ = 1, m = 200); (e) theory (r_min = 10^-3); (f) theory (r_min = 10^-6); (g) NeRF (S = 32); (h) sphereM (r_min = 10^-1); (i) sphereM (r_min = 10^-3); (j) sphereC (r_min = 10^-3); (k) sphereC+ (r_min = 10^-3); (l) sphereM+ (r_min = 10^-3); (m) dfs (r_min = 10^-3). Here (a) wrap uses 4 hidden ReLU layers of 256 neurons; (d) rbf uses the best kernel size σ = 1 and m = 200 anchor points; (b)(c)(e)(f) are Space2Vec models (Mai et al., 2020b) with different min scales r_min ∈ {10^-6, 10^-3}, sharing the same best hyperparameters S = 64, r_max = 1, and 1 hidden ReLU layer of 512 neurons; (g) is NeRF with S = 32 and 1 hidden ReLU layer of 512 neurons; (h)-(m) are Sphere2Vec models with different min scale r_min, sharing the same best hyperparameters S = 32, r_max = 1, and 1 hidden ReLU layer of 1024 neurons.
Ham, Y.G., Kim, J.H., Luo, J.J., 2019. Deep learning for multi-year enso
forecasts. Nature 573, 568–572.
Hansen, G., Cramer, W., 2015. Global distribution of observed climate
change impacts. Nature Climate Change 5, 182–185. URL: https:
//doi.org/10.1038/nclimate2529, doi:10.1038/nclimate2529.
Harmel, A., 2009. Le nouveau système réglementaire lambert 93. Géoma-
tique Expert 68, 26–30.
He, Y., Wang, D., Lai, N., Zhang, W., Meng, C., Burke, M., Lobell,
D., Ermon, S., 2021. Spatial-temporal super-resolution of satellite
imagery via conditional pixel synthesis. Advances in Neural Information
Processing Systems 34, 27903–27915.
Helber, P., Bischke, B., Dengel, A., Borth, D., 2019. Eurosat: A novel
dataset and deep learning benchmark for land use and land cover classi-
fication. IEEE Journal of Selected Topics in Applied Earth Observations
and Remote Sensing 12, 2217–2226.
Hu, Y., Gao, S., Lunga, D., Li, W., Newsam, S., Bhaduri, B., 2019. Geoai
at acm sigspatial: progress, challenges, and future directions. Sigspatial
Special 11, 5–15.
Huang, W., Zhang, D., Mai, G., Guo, X., Cui, L., 2023. Learning urban
region representations with pois and hierarchical graph infomax. ISPRS
Journal of Photogrammetry and Remote Sensing 196, 134–145.
Izbicki, M., Papalexakis, E.E., Tsotras, V.J., 2019a. Exploiting the earth’s
spherical geometry to geolocate images, in: Joint European Conference
on Machine Learning and Knowledge Discovery in Databases, Springer.
pp. 3–19.
Izbicki, M., Papalexakis, V., Tsotras, V., 2019b. Geolocating tweets in any
language at any location, in: Proceedings of the 28th ACM International
Conference on Information and Knowledge Management, pp. 89–98.
Janowicz, K., Gao, S., McKenzie, G., Hu, Y., Bhaduri, B., 2020. GeoAI:
Spatially explicit artificial intelligence techniques for geographic knowl-
edge discovery and beyond.
Janowicz, K., Hitzler, P., Li, W., Rehberger, D., Schildhauer, M., Zhu,
R., Shimizu, C., Fisher, C.K., Cai, L., Mai, G., et al., 2022. Know,
know where, knowwheregraph: A densely connected, cross-domain
knowledge graph and geo-enrichment service stack for applications in
environmental intelligence. AI Magazine 43, 30–39.
Kejriwal, M., Szekely, P., 2017. Neural embeddings for populated geon-
ames locations, in: International Semantic Web Conference, Springer.
pp. 139–146.
Klocek, S., Maziarka, L., Wolczyk, M., Tabor, J., Nowak, J., Smieja, M.,
2019. Hypernetwork functional image representation, in: Tetko, I.V.,
Kurková, V., Karpov, P., Theis, F.J. (Eds.), Artificial Neural Networks
and Machine Learning - ICANN 2019 - 28th International Conference
on Artificial Neural Networks, Munich, Germany, September 17-19,
2019, Proceedings - Workshop and Special Sessions, Springer. pp. 496–
510.
Li, W., Hsu, C.Y., Hu, M., 2021. Tobler’s first law in geoai: A spatially
explicit deep learning model for terrain feature detection under weak
supervision. Annals of the American Association of Geographers 111,
1887–1905.
Liu, P., Biljecki, F., 2022. A review of spatially-explicit geoai applications
in urban geography. International Journal of Applied Earth Observation
and Geoinformation 112, 102936.
Mac Aodha, O., Cole, E., Perona, P., 2019. Presence-only geographical
priors for fine-grained image classification, in: Proceedings of the IEEE
International Conference on Computer Vision, pp. 9596–9606.
Mai, G., Hu, Y., Gao, S., Cai, L., Martins, B., Scholz, J., Gao, J., Janowicz,
K., 2022a. Symbolic and subsymbolic geoai: Geospatial knowledge
graphs and spatially explicit machine learning. Trans GIS 26, 3118–
3124.
Mai, G., Huang, W., Sun, J., Song, S., Mishra, D., Liu, N., Gao, S., Liu, T.,
Cong, G., Hu, Y., et al., 2023a. On the opportunities and challenges of
foundation models for geospatial artificial intelligence. arXiv preprint
arXiv:2304.06798 .
Mai, G., Janowicz, K., Cai, L., Zhu, R., Regalia, B., Yan, B., Shi, M., Lao,
N., 2020a. SE-KGE: A location-aware knowledge graph embedding
model for geographic question answering and spatial semantic lifting.
Transactions in GIS doi:10.1111/tgis.12629.
Mai, G., Janowicz, K., Hu, Y., Gao, S., Yan, B., Zhu, R., Cai, L., Lao,
N., 2022b. A review of location encoding for geoai: methods and
applications. International Journal of Geographical Information Science
36, 639–673.
Mai, G., Janowicz, K., Yan, B., Zhu, R., Cai, L., Lao, N., 2020b. Multi-scale
representation learning for spatial feature distributions using grid cells,
in: The Eighth International Conference on Learning Representations,
openreview.
Mai, G., Jiang, C., Sun, W., Zhu, R., Xuan, Y., Cai, L., Janowicz, K., Ermon,
S., Lao, N., 2023b. Towards general-purpose representation learning of
polygonal geometries. GeoInformatica 27, 289–340.
Mai, G., Lao, N., He, Y., Song, J., Ermon, S., 2023c. Csp: Self-supervised
contrastive spatial pre-training for geospatial-visual representations, in:
International Conference on Machine Learning, PMLR.
Mai, G., Yan, B., Janowicz, K., Zhu, R., 2019. Relaxing unanswerable
geographic questions using a spatially explicit knowledge graph embed-
ding model, in: AGILE: The 22nd Annual International Conference on
Geographic Information Science, Springer. pp. 21–39.
Mai, G.M., Cundy, C., Choi, K., Hu, Y., Lao, N., Ermon, S., 2022c. Towards
a foundation model for geospatial artificial intelligence, in: Proceedings
of the 30th SIGSPATIAL international conference on advances in geo-
graphic information systems. doi:10.1145/3557915.3561043.
Marí, R., Facciolo, G., Ehret, T., 2022. Sat-nerf: Learning multi-view
satellite photogrammetry with transient objects and shadow modeling
using rpc cameras, in: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 1311–1321.
Martin-Brualla, R., Radwan, N., Sajjadi, M.S., Barron, J.T., Dosovitskiy,
A., Duckworth, D., 2021. Nerf in the wild: Neural radiance fields
for unconstrained photo collections, in: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 7210–
7219.
Merilees, P.E., 1973. The pseudospectral approximation applied to the
shallow water equations on a sphere. Atmosphere 11, 13–20. doi:10.
1080/00046973.1973.9648342.
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi,
R., Ng, R., 2020. Nerf: Representing scenes as neural radiance fields for
view synthesis, in: ECCV.
Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi,
R., Ng, R., 2021. Nerf: Representing scenes as neural radiance fields for
view synthesis. Communications of the ACM 65, 99–106.
Morlin-Yron, S., 2017. What’s the real size of africa? how western states
used maps to downplay size of continent. CNN URL: https://www.cnn.
com/2016/08/18/africa/real-size-of-africa.
Mulcahy, K.A., Clarke, K.C., 2001. Symbolization of map projection
distortion: a review. Cartography and geographic information science
28, 167–182.
Myers, N., Mittermeier, R.A., Mittermeier, C.G., Da Fonseca, G.A., Kent,
J., 2000. Biodiversity hotspots for conservation priorities. Nature 403,
853.
Nguyen, T.D., Le, T., Bui, H., Phung, D., 2017. Large-scale online kernel
learning with random feature reparameterization, in: Proceedings of the
Twenty-Sixth International Joint Conference on Artificial Intelligence,
IJCAI-17, pp. 2543–2549. doi:10.24963/ijcai.2017/354 .
Niemeyer, M., Geiger, A., 2021. Giraffe: Representing scenes as composi-
tional generative neural feature fields, in: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pp. 11453–
11464.
Orszag, S.A., 1972. Comparison of pseudospectral and spectral approxi-
mation. Appl. Math. 51, 253–259.
Orszag, S.A., 1974. Fourier series on spheres. Mon. Wea. Rev. 102, 56–75.
Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F.,
Bengio, Y., Courville, A., 2019. On the spectral bias of neural networks,
in: International Conference on Machine Learning, PMLR. pp. 5301–
5310.
Rahimi, A., Recht, B., 2008. Random features for large-scale kernel
machines, in: Platt, J.C., Koller, D., Singer, Y., Roweis, S.T. (Eds.), Ad-
vances in Neural Information Processing Systems, Curran Associates,
Inc.. pp. 1177–1184.
Rahimi, A., Recht, B., et al., 2007. Random features for large-scale kernel
machines, in: NIPS, Citeseer. p. 5.
Rao, J., Gao, S., Kang, Y., Huang, Q., 2020. Lstm-trajgan: A deep
learning approach to trajectory privacy protection, in: 11th International
Conference on Geographic Information Science (GIScience 2021)-Part
I, Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
Schölkopf, B., 2001. The kernel trick for distances, in: Advances in Neural
Information Processing Systems, pp. 301–307.
Schwarz, K., Liao, Y., Niemeyer, M., Geiger, A., 2020. Graf: Generative
radiance fields for 3d-aware image synthesis. Advances in Neural
Information Processing Systems 33, 20154–20166.
Sokol, J., 2021. Can this new map fix our distorted views of the world?
New York Times URL: https://www.nytimes.com/2021/02/24/science/
new-world- map.html.
Strümpler, Y., Postels, J., Yang, R., Gool, L.V., Tombari, F., 2022. Implicit
neural representations for image compression, in: Computer Vision–
ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–
27, 2022, Proceedings, Part XXVI, Springer. pp. 74–91.
Sullivan, B.L., Wood, C.L., Iliff, M.J., Bonney, R.E., Fink, D., Kelling, S.,
2009. ebird: A citizen-based bird observation network in the biological
sciences. Biological conservation 142, 2282–2292.
Sun, C., Li, J., Jin, F.F., Xie, F., 2014. Contrasting meridional structures of
stratospheric and tropospheric planetary wave variability in the northern
hemisphere. Tellus A: Dynamic Meteorology and Oceanography 66,
25303.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z., 2016. Rethink-
ing the inception architecture for computer vision, in: Proceedings of
the IEEE conference on computer vision and pattern recognition, pp.
2818–2826.
Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall, B., Srinivasan,
P.P., Barron, J.T., Kretzschmar, H., 2022. Block-nerf: Scalable large
scene neural view synthesis, in: Proceedings of the IEEE/CVF Confer-
ence on Computer Vision and Pattern Recognition, pp. 8248–8258.
Tancik, M., Srinivasan, P., Mildenhall, B., Fridovich-Keil, S., Raghavan,
N., Singhal, U., Ramamoorthi, R., Barron, J., Ng, R., 2020. Fourier
features let networks learn high frequency functions in low dimensional
domains, in: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin,
H. (Eds.), Advances in Neural Information Processing Systems, Curran
Associates, Inc.. pp. 7537–7547.
Tang, K., Paluri, M., Fei-Fei, L., Fergus, R., Bourdev, L., 2015. Improving
image classification with location context, in: Proceedings of the IEEE
international conference on computer vision, pp. 1008–1016.
Van Horn, G., Branson, S., Farrell, R., Haber, S., Barry, J., Ipeirotis, P.,
Perona, P., Belongie, S., 2015. Building a bird recognition app and large
scale dataset with citizen scientists: The fine print in fine-grained dataset
collection, in: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 595–604.
Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A.,
Adam, H., Perona, P., Belongie, S., 2018. The inaturalist species clas-
sification and detection dataset, in: Proceedings of the IEEE conference
on computer vision and pattern recognition, pp. 8769–8778.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need, in: Advances
in Neural Information Processing Systems, pp. 5998–6008.
Weyand, T., Kostrikov, I., Philbin, J., 2016. Planet-photo geolocation with
convolutional neural networks, in: European Conference on Computer
Vision, Springer. pp. 37–55.
Williamson, D., Browning, G., 1973. Comparison of grids and difference
approximations for numerical weather prediction over a sphere. Journal
of Applied Meteorology 12, 264–274.
Wu, X., Nethery, R.C., Sabath, B.M., Braun, D., Dominici, F., 2020.
Exposure to air pollution and covid-19 mortality in the united states.
medRxiv .
Xiangli, Y., Xu, L., Pan, X., Zhao, N., Rao, A., Theobalt, C., Dai, B., Lin, D.,
2022. Bungeenerf: Progressive neural radiance field for extreme multi-
scale scene rendering, in: Computer Vision–ECCV 2022: 17th European
Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part
XXXII, Springer. pp. 106–122.
Xie, Y., He, E., Jia, X., Bao, H., Zhou, X., Ghosh, R., Ravirathinam, P.,
2021. A statistically-guided deep network transformation and mod-
eration framework for data with spatial heterogeneity, in: 2021 IEEE
International Conference on Data Mining (ICDM), IEEE. pp. 767–776.
Xu, Y., Piao, Z., Gao, S., 2018. Encoding crowd interaction with deep
neural network for pedestrian trajectory prediction, in: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp.
5275–5284.
Yan, B., Janowicz, K., Mai, G., Gao, S., 2017. From ITDL to Place2Vec:
Reasoning about place type similarity and relatedness by learning
embeddings from augmented spatial contexts, in: Proceedings of the
25th ACM SIGSPATIAL International Conference on Advances in
Geographic Information Systems, ACM. p. 35.
Yan, B., Janowicz, K., Mai, G., Zhu, R., 2019a. A spatially-explicit
reinforcement learning model for geographic knowledge graph summa-
rization. Transactions in GIS .
Yan, X., Ai, T., Yang, M., Yin, H., 2019b. A graph convolutional neural
network for classification of building patterns using spatial vector data.
ISPRS journal of photogrammetry and remote sensing 150, 259–273.
Yang, Y., Newsam, S., 2010. Bag-of-visual-words and spatial extensions
for land-use classification, in: Proceedings of the 18th SIGSPATIAL in-
ternational conference on advances in geographic information systems,
pp. 270–279.
Zhong, E.D., Bepler, T., Davis, J.H., Berger, B., 2020. Reconstructing
continuous distributions of 3d protein structure from cryo-em images,
in: International Conference on Learning Representations.
Zhu, D., Liu, Y., Yao, X., Fischer, M.M., 2021. Spatial regression graph
convolutional neural networks: A deep learning paradigm for spatial
multivariate distributions. GeoInformatica , 1–32.
Zhu, R., Janowicz, K., Cai, L., Mai, G., 2022. Reasoning over higher-
order qualitative spatial relations via spatially explicit neural networks.
International Journal of Geographical Information Science 36, 2194–
2225.
A. Theoretical Proofs of Each Theorem
A.1. Proof of Theorem 1
Proof. Given two points $\mathbf{x}_1 = (\lambda_1, \phi_1)$ and $\mathbf{x}_2 = (\lambda_2, \phi_2)$ on the same sphere $\mathbb{S}^2$ with radius $R$, we have $PE^{sphereC}_{1}(\mathbf{x}_i) = [\sin(\phi_i), \cos(\phi_i)\cos(\lambda_i), \cos(\phi_i)\sin(\lambda_i)]$ for $i = 1, 2$. The inner product is

$$
\begin{aligned}
\langle PE^{sphereC}_{1}(\mathbf{x}_1), PE^{sphereC}_{1}(\mathbf{x}_2) \rangle
&= \sin(\phi_1)\sin(\phi_2) + \cos(\phi_1)\cos(\lambda_1)\cos(\phi_2)\cos(\lambda_2) + \cos(\phi_1)\sin(\lambda_1)\cos(\phi_2)\sin(\lambda_2) \\
&= \sin(\phi_1)\sin(\phi_2) + \cos(\phi_1)\cos(\phi_2)\cos(\lambda_1 - \lambda_2) \\
&= \cos(\Delta\delta) = \cos(\Delta D / R),
\end{aligned}
\tag{28}
$$

where $\Delta\delta$ is the central angle between $\mathbf{x}_1$ and $\mathbf{x}_2$, and the spherical law of cosines is applied to derive the second-to-last equality. So,

$$
\begin{aligned}
\| PE^{sphereC}_{1}(\mathbf{x}_1) - PE^{sphereC}_{1}(\mathbf{x}_2) \|^2
&= \langle PE^{sphereC}_{1}(\mathbf{x}_1) - PE^{sphereC}_{1}(\mathbf{x}_2),\; PE^{sphereC}_{1}(\mathbf{x}_1) - PE^{sphereC}_{1}(\mathbf{x}_2) \rangle \\
&= 2 - 2\cos(\Delta D / R) \\
&= 4\sin^2(\Delta D / 2R).
\end{aligned}
\tag{29}
$$

Hence $\| PE^{sphereC}_{1}(\mathbf{x}_1) - PE^{sphereC}_{1}(\mathbf{x}_2) \| = 2\sin(\Delta D / 2R)$ since $\Delta D / 2R \in [0, \pi/2]$. By Taylor expansion, $\| PE^{sphereC}_{1}(\mathbf{x}_1) - PE^{sphereC}_{1}(\mathbf{x}_2) \| \approx \Delta D / R$ when $\Delta D$ is small w.r.t. $R$.
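This distance-preservation property is easy to check numerically. The following sketch (not part of the original paper; function names are illustrative and coordinates are assumed to be in radians) compares the Euclidean distance between the lowest-scale sphereC encodings of two points against the closed form $2\sin(\Delta D / 2R)$ from Equation 29 and the small-distance approximation $\Delta D / R$:

```python
import numpy as np

def sphere_c_s1(lam, phi):
    """Lowest-scale sphereC terms from Equation 28: [sin(phi), cos(phi)cos(lam), cos(phi)sin(lam)]."""
    return np.array([np.sin(phi), np.cos(phi) * np.cos(lam), np.cos(phi) * np.sin(lam)])

def central_angle(lam1, phi1, lam2, phi2):
    """Central angle via the spherical law of cosines, i.e., great-circle distance divided by R."""
    return np.arccos(np.sin(phi1) * np.sin(phi2)
                     + np.cos(phi1) * np.cos(phi2) * np.cos(lam1 - lam2))

rng = np.random.default_rng(0)
lam1, lam2 = rng.uniform(-np.pi, np.pi, 2)
phi1, phi2 = rng.uniform(-np.pi / 2, np.pi / 2, 2)

delta = central_angle(lam1, phi1, lam2, phi2)  # Delta_D / R
enc_dist = np.linalg.norm(sphere_c_s1(lam1, phi1) - sphere_c_s1(lam2, phi2))

# Exact identity from Equation 29: ||PE(x1) - PE(x2)|| = 2 sin(Delta_D / 2R).
assert np.isclose(enc_dist, 2 * np.sin(delta / 2))

# First-order approximation for nearby points: ||PE(x1) - PE(x2)|| ~= Delta_D / R.
lam2, phi2 = lam1 + 1e-4, phi1 + 1e-4
delta = central_angle(lam1, phi1, lam2, phi2)
enc_dist = np.linalg.norm(sphere_c_s1(lam1, phi1) - sphere_c_s1(lam2, phi2))
assert np.isclose(enc_dist, delta, rtol=1e-4)
```

The encoding distance therefore depends only on the spherical distance $\Delta D$, which is exactly the behavior Theorem 1 claims for the lowest scale.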
A.2. Proof of Theorem 2
Proof. For any $* \in \{sphereC, sphereC+, sphereM, sphereM+\}$, $PE^{*}_{S}(\mathbf{x}_1) = PE^{*}_{S}(\mathbf{x}_2)$ implies

$$
\sin(\phi_1) = \sin(\phi_2), \tag{30}
$$
$$
\cos(\phi_1)\sin(\lambda_1) = \cos(\phi_2)\sin(\lambda_2), \tag{31}
$$
$$
\cos(\phi_1)\cos(\lambda_1) = \cos(\phi_2)\cos(\lambda_2), \tag{32}
$$

from the $s = 0$ terms. Since $\sin(\phi)$ is monotonically increasing on $\phi \in [-\pi/2, \pi/2]$, Equation 30 gives $\phi_1 = \phi_2$. If $\phi_1 = \phi_2 = \pi/2$, both points are at the North Pole and $\lambda_1 = \lambda_2$ equals whatever longitude is defined at the North Pole; if $\phi_1 = \phi_2 = -\pi/2$, the same argument applies at the South Pole. When $\phi_1 = \phi_2 \in (-\pi/2, \pi/2)$, $\cos(\phi_1) = \cos(\phi_2) \neq 0$. Then Equations 31 and 32 give

$$
\sin(\lambda_1) = \sin(\lambda_2), \quad \cos(\lambda_1) = \cos(\lambda_2), \tag{33}
$$

which shows that $\lambda_1 = \lambda_2$. In summary, $\mathbf{x}_1 = \mathbf{x}_2$, so $PE^{*}_{S}$ is injective.
If $* = dfs$, $PE^{*}_{S}(\mathbf{x}_1) = PE^{*}_{S}(\mathbf{x}_2)$ implies $\sin(\phi_1) = \sin(\phi_2)$, $\cos(\phi_1) = \cos(\phi_2)$, $\sin(\lambda_1) = \sin(\lambda_2)$, and $\cos(\lambda_1) = \cos(\lambda_2)$, which proves $\mathbf{x}_1 = \mathbf{x}_2$ and $PE^{*}_{S}$ is injective directly.
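The injectivity argument can also be read constructively: the $s = 0$ terms already pin down the location, since $\phi$ is recovered with arcsin and $\lambda$ with atan2 whenever $\cos(\phi) \neq 0$. The sketch below (illustrative only; it assumes radians and the term ordering of Equation 28) inverts these three terms numerically:

```python
import numpy as np

def s0_terms(lam, phi):
    """The s = 0 terms shared by sphereC, sphereC+, sphereM, sphereM+."""
    return np.array([np.sin(phi), np.cos(phi) * np.cos(lam), np.cos(phi) * np.sin(lam)])

def invert_s0(e):
    """Recover (lam, phi); lam is arbitrary at the poles, where cos(phi) = 0."""
    phi = np.arcsin(np.clip(e[0], -1.0, 1.0))
    lam = np.arctan2(e[2], e[1])  # well-defined whenever cos(phi) != 0
    return lam, phi

lam, phi = 0.7, -1.1  # an arbitrary non-polar test location in radians
lam_rec, phi_rec = invert_s0(s0_terms(lam, phi))
assert np.isclose(lam, lam_rec) and np.isclose(phi, phi_rec)
```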
A.3. Proof of Theorem 4
Proof. According to the definition of the $NeRF$ encoder (Equation 18),

$$
\begin{aligned}
\| PE^{NeRF}_{S}(\mathbf{x}_1) - PE^{NeRF}_{S}(\mathbf{x}_2) \|^2
&= \sum_{s=0}^{S-1} \sum_{p \in \{z, x, y\}} \big(\sin(2^s\pi p_1) - \sin(2^s\pi p_2)\big)^2 + \big(\cos(2^s\pi p_1) - \cos(2^s\pi p_2)\big)^2 \\
&= \sum_{s=0}^{S-1} \sum_{p \in \{z, x, y\}} \sin^2(2^s\pi p_1) + \sin^2(2^s\pi p_2) - 2\sin(2^s\pi p_1)\sin(2^s\pi p_2) \\
&\qquad\qquad\qquad + \cos^2(2^s\pi p_1) + \cos^2(2^s\pi p_2) - 2\cos(2^s\pi p_1)\cos(2^s\pi p_2) \\
&= \sum_{s=0}^{S-1} \sum_{p \in \{z, x, y\}} 2 - 2\big(\sin(2^s\pi p_1)\sin(2^s\pi p_2) + \cos(2^s\pi p_1)\cos(2^s\pi p_2)\big) \\
&= \sum_{s=0}^{S-1} \sum_{p \in \{z, x, y\}} 2 - 2\cos\big(2^s\pi(p_1 - p_2)\big) \\
&= \sum_{s=0}^{S-1} \sum_{p \in \{z, x, y\}} 4\sin^2\big(2^{s-1}\pi(p_1 - p_2)\big) \\
&= \sum_{s=0}^{S-1} 4\sin^2(2^{s-1}\pi\Delta\mathbf{x}_z) + 4\sin^2(2^{s-1}\pi\Delta\mathbf{x}_x) + 4\sin^2(2^{s-1}\pi\Delta\mathbf{x}_y) \\
&= \sum_{s=0}^{S-1} 4\|\mathbf{Y}_s\|^2,
\end{aligned}
\tag{34}
$$

where $\mathbf{Y}_s = [\sin(2^{s-1}\pi\Delta\mathbf{x}_z), \sin(2^{s-1}\pi\Delta\mathbf{x}_x), \sin(2^{s-1}\pi\Delta\mathbf{x}_y)]$.
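Equation 34 can be checked numerically: for any pair of 3D inputs, the squared distance between their NeRF positional encodings equals the sum over scales of $4\|\mathbf{Y}_s\|^2$. The sketch below (not from the paper; it assumes inputs already lie in the normalized range expected by Equation 18 and uses a small number of scales for the check) verifies the identity:

```python
import numpy as np

def nerf_encode(x, S):
    """NeRF-style positional encoding of a 3D point: sin and cos at frequencies 2^s * pi, s = 0..S-1."""
    feats = []
    for s in range(S):
        feats.append(np.sin(2.0 ** s * np.pi * x))
        feats.append(np.cos(2.0 ** s * np.pi * x))
    return np.concatenate(feats)

rng = np.random.default_rng(1)
S = 8
x1, x2 = rng.uniform(-1, 1, 3), rng.uniform(-1, 1, 3)

lhs = np.sum((nerf_encode(x1, S) - nerf_encode(x2, S)) ** 2)

# Right-hand side of Equation 34: sum_s 4 * ||Y_s||^2 with
# Y_s = sin(2^(s-1) * pi * (x1 - x2)) applied elementwise.
rhs = sum(4.0 * np.sum(np.sin(2.0 ** (s - 1) * np.pi * (x1 - x2)) ** 2) for s in range(S))

assert np.isclose(lhs, rhs)
```

The encoding distance is thus a function of the coordinate-wise differences $\Delta\mathbf{x}$ in 3D Euclidean space, which is the starting point for analyzing the spherical-to-Euclidean approximation error of NeRF-style encoders.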