Learning Continuous Rotation Canonicalization
with Radial Beam Sampling
Johann Schmidt¹  Sebastian Stober¹
Abstract
Nearly all state-of-the-art vision models are sensitive to image rotations. Existing methods often compensate for missing inductive biases by using augmented training data to learn pseudo-invariances. Alongside the resource-demanding data inflation process, predictions often generalise poorly. The inductive biases inherent to convolutional neural networks allow translation equivariance through kernels acting parallel to the horizontal and vertical axes of the pixel grid. This inductive bias, however, does not allow for rotation equivariance. We propose a radial beam sampling strategy along with radial kernels operating on these beams to incorporate centre-rotation covariance. We present a radial beam-based image canonicalisation (BIC) model with an angle distance loss. Our model allows for continuous angle regression and canonicalises random centre-rotated input images. As a pre-processing method, this enables rotation-invariant vision pipelines with model-agnostic rotation-sensitive downstream predictions. We show that our angle regressor can predict continuous rotation angles and improves downstream performances on COIL100, LFW, and siscore. Our code is publicly available at github.com/johSchm/RadialBeams.
1. Introduction
Computers often struggle to detect the high-level semantic
similarity between rotated objects, a task trivial to humans.
During the rotation transformation, the object’s structure
(semantic similarity) is preserved. We say the rotation is a
symmetry of that object. Defying these symmetries would
¹AILab, Institute of Intelligent Cooperating Systems, Otto-von-Guericke University, Magdeburg, Germany. Correspondence to: Johann Schmidt <johann.schmidt@ovgu.de>.
Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Figure 1. Industrial BIC (Radial Beam-based Image Canonicalization) application setting: the viewing frustum of a robotic arm's end-effector yields an object detection frame, whose crop is passed through the proposed BIC (radial beam sampling, angle regressor, inverted rotation) to obtain a canonicalized image for the downstream task.
confront us with abundant semantically redundant informa-
tion. The human brain learns to discard these symmetry
transformations to free its immanent representations from
these redundancies, as shown by Leibo et al. (2011). From an information-theoretic point of view, avoiding redundancies allows for more efficient use of computing capacity. Both human cognition and mathematics thus argue for an increased leverage of symmetries in novel deep learning architectures. A logical
first step involves integrating a priori known symmetries
in our artificial systems. In this work, we narrow down the
scope to image processing. Convolutional Neural Networks
(CNNs) revolutionised the computer vision field and now
are indispensable in modern deep learning and its applica-
tions (Bexten et al., 2021). This is grounded on the reduced
parameter footprint through shared filter parameters and
spatially-independent feature detection, which introduced
translation equivariance to CNNs (see (Cohen & Welling,
2016) for a proof). Although translation-equivariant ar-
chitectures are a major step toward efficient vision mod-
els, many known visual symmetries are disregarded. In
this work, we aim to extend the symmetry scope of vision
pipelines by rotation-invariance. We focus on image classification problems, where arbitrary image rotations keep the
semantic label unchanged (see the siscore dataset (Djolonga
et al., 2021)). Generally, three options exist for how to inte-
grate in-/equivariances in deep learning models:
(i) The integration of an inductive bias to render the architec-
ture inherently invariant/equivariant to the target symmetry.
Equivariance limits the function space to valid models under
the rules of natural image formation (Worrall et al., 2017),
which supports learning and generalisation. For instance,
attempts by Cohen & Welling (2016), Zhou et al. (2020), and Ecker et al. (2019) to integrate rotation equivariance into CNNs are non-trivial, impose restrictive constraints on the filter space, and are limited to discrete rotation subgroups.
(ii) The augmentation of the training set inflates the dataset
by rotated versions of the vanilla image (Chen et al., 2019).
These augmented examples are sampled from the symmetry
orbits of the vanilla images (see Section 2). During training,
the network learns to approximate these orbits on the latent
manifold (pseudo-equivariance) or collapse them (pseudo-
invariance). While straightforward, this significantly in-
creases the required sample complexity to sufficiently cover
the symmetry space. However, due to the lack of alterna-
tives, the vast majority of modern vision models employed
in an industrial environment still rely purely on augmenta-
tion (Saeed et al., 2021). In Section 3, we provide a more thorough discussion of the related work.
(iii) Alternatively, a surjective projection during pre-
processing can be used to collapse the orbit of an image
to its canonical form. This canonicalisation process is em-
ployed in the human brain, where canonical orientations
of different objects are stored, as shown by (Harris et al.,
2001). In this work, we investigate the technical realisation
of this process using deep learning. We design a continuous
angle regressor to estimate the rotation angle of input im-
ages. This should answer the question: Given an arbitrary
centre-rotated image, what is its canonicalised orientation?
The model learns image canonicalisation and maps all pos-
sible rotations of an image, i.e., the image orbit, to a single
orientation. In practice, our model is placed between an
object detector and a rotation-sensitive downstream model,
as illustrated in Figure 1. Object detection acts as an addi-
tional symmetry reducer by pruning all but the relevant pixel
information. This also removes the centre bias posed by the
proposed radial beam sampling, which will be introduced in
Section 4. Our model does not pose any restriction in terms
of the downstream task or model used. All this is based on radial beam sampling, where beams of pixels radiate from the centre of the image to its edge region. This
introduces a covariance between center-rotations of the un-
derlying image and circular shifts of the sampled beams. We
propose an inductive bias, which integrates this covariance
into the encoder of our proposed BIC model. In contrast to
(Keller & Welling, 2021), this covariance does not need to
be learned. Our model approximates the entire rotational Lie group, rather than the limited discrete rotation subgroups of prior works (Ecker et al., 2019; Ustyuzhaninov et al., 2020). Compared to rotation-equivariant inductive biases as
in (Gens & Domingos, 2014), our approach does not bind
the hypothesis class and simultaneously lowers the sample
complexity of the downstream model. We provide more
methodological details in Section 4, which we thoroughly
evaluate in Section 5.
2. Symmetries and Group Theory
Mathematically, symmetries of an object are comprised in a
group, i.e., a set with an operation. A symmetry of an object
is a transformation that preserves the object’s structure, like
the 120° rotation of an equilateral triangle. Therefore, the rotations that preserve the structure of the triangle form a discrete group containing 0°, 120°, and 240°. Besides discrete symmetry groups, Lie groups are continuous symmetry groups whose elements form a smooth differentiable manifold, like the rotation group of a circle. Of utmost interest
in this paper are two-dimensional rotations described by the Special Orthogonal Group SO(2). This group comprises distance-preserving transformations of the Euclidean space, which is an essential property for image rotations. Formally, the group is defined by

$$\mathrm{SO}(2) = \left\{ g : \rho(g) = R_\theta = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{bmatrix} \right\}, \qquad (1)$$

where the rotation angle θ ∈ [0°, 360°) is expressed by the group element g using the homomorphism between SO(2) and angles θ as in (Hall, 2015). ρ(g) is called the representation of g, i.e., the rotation matrix. The group comes with the group operation +_{360°}, such that, for instance, 220° +_{360°} 150° = 10°. For an arbitrary group G, we define the orbit orb(·) as the set of all images under the group action with respect to a fixed element x, i.e., orb(x, G) = {ρ(g)x | g ∈ G}. A function f (e.g., modelled by a neural network), which follows the symmetry group G, is either equivariant, invariant, or covariant (Marcos et al., 2017). Out of these three categories, covariance is the most general one, where f(ρ(g)x) = ρ′(g)f(x), ∀g ∈ G, and both transformations might be different in the domain and co-domain, respectively. From this definition, we can derive the following two special cases: If the output stays unaltered after transforming the input, f(ρ(g)x) = f(x), ∀g ∈ G, we call f invariant. If the output is transformed equally, f(ρ(g)x) = ρ(g)f(x), ∀g ∈ G, we call f equivariant. For more details, please refer to (Bronstein et al., 2021). In this work, we strive for continuous rotation invariance following the Lie group SO(2). However, as shown in the related work, most rotation-symmetric vision networks respect only narrow subgroups of SO(2).
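To make the notation above concrete, the following minimal NumPy sketch builds the representation ρ(g) = R_θ, applies the group operation +_{360°}, and enumerates a discrete orbit; the function names (rotation_matrix, group_op, orbit) are our own illustrative choices and not part of the paper's implementation.

```python
import numpy as np

def rotation_matrix(theta_deg: float) -> np.ndarray:
    """Representation rho(g) = R_theta of a group element g of SO(2)."""
    t = np.deg2rad(theta_deg)
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

def group_op(theta1_deg: float, theta2_deg: float) -> float:
    """Group operation +_360: angles compose modulo a full turn."""
    return (theta1_deg + theta2_deg) % 360.0

def orbit(x: np.ndarray, angles_deg) -> list:
    """Orbit orb(x, G) = {rho(g) x | g in G} of a 2 x N point set x."""
    return [rotation_matrix(a) @ x for a in angles_deg]

# 220 deg +_360 150 deg = 10 deg, and the three-element orbit of a triangle
# under the discrete rotation group {0 deg, 120 deg, 240 deg}.
print(group_op(220.0, 150.0))                 # 10.0
triangle = np.array([[0.0, 1.0, -1.0],
                     [1.0, -0.5, -0.5]])
print(len(orbit(triangle, [0, 120, 240])))    # 3
```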
3. Related Work
Owing to its simplicity, augmented training is a prevalent strategy for learning more robust models. The
dataset is augmented with samples from the loss-preserving
symmetry orbit of a datapoint, like rotations of a base image.
During training, pseudo-invariances should be embedded
into the latent manifold by learning proximity regions of
pseudo-invariant data points in the latent space. (Goodfel-
low et al., 2009) show that the quality of pseudo-invariance
in the learned representation increases with the depth of
the network. Minor deformations and transformations, like
shifts and rotations, are attenuated by lower layers, while
higher layers progressively form invariances. As shown by
(Chen et al., 2019) and (Dao et al., 2019), augmentation acts
as a regulariser penalising model complexity by learning
to minimise the average over loss signals under the group
actions. Despite these benefits, modern networks depend
on massive training datasets full of semantic redundancies,
whose sample complexity is even enlarged by additional
augmentation processes.
Contrary to hard-won pseudo-invariances, inductive biases
open the gates for true invariances (Nabarro et al., 2021).
(Gens & Domingos, 2014) leverage the aforementioned
proximity regions of invariant datapoints by enforcing layers
to model the symmetry space of inputs deliberately. Hence,
points propagated through the network can be pooled with
their closest neighbours (i.e., their symmetry transformed
replicates) and mapped to the same output label. (Jader-
berg et al., 2015) ingrain jointly learned transformations to
CNNs, which project inputs to their canonical representa-
tions. The end-to-end trained transformer is limited to small
perturbations of inputs, and convergence to satisfactory local
minima is challenging. Both limitations are tackled by our
BIC model. (Hu et al., 2019) constrain filters to symmetric
weight matrices, i.e., isotropic filters. This also reduces the
memory footprint since only a sub-matrix must be stored.
(Cohen & Welling, 2016) expand the symmetry range of
CNNs to reflections and integer multiples of 90° rotations around the centre. (Zhou et al., 2020) decompose weight
matrices into learnable weight vectors and meta-learned
(binary) symmetry matrices. (Ecker et al., 2019) represent
kernels in a steerable basis modelled by 2D Hermite polyno-
mials to allow for weight sharing across kernel orientations.
Building upon these results, (Ustyuzhaninov et al., 2020)
canonicalise the kernel weights for different orientations.
Most proposed methods drastically constrain the filter space,
making arbitrary visual feature learning impossible. Fur-
thermore, due to the finite number of kernel matrices and
rotational artefacts, only a limited number of discrete rota-
tions is supported. Both drawbacks are eliminated with our
proposed method.
Figure 2. Beam set B on the padded image X with beams (grey), proximity/thickness ϵ (light grey), padded pixels (black), and the radial convolutional kernel (cyan).
4. Methodology
We present an orientation canonicaliser for image data. At
its heart, our model uses radial beams sampled from the input image, as illustrated in Figure 2. This is followed by an angle regressor, as shown in Figure 3. The predicted angle is used to canonicalize the input image using the inverted rotation. Let X ∈ 𝒳(Ω) be an image on the pixel grid domain Ω, sampled from the dataset 𝒳. We have an image tensor X ∈ R^{W×H×C} with width W, height H, and colour channels C. In practice, we use batched data, but we drop the batch dimension in this article for brevity.
Radial Beam Sampling Conventional convolution on image data uses horizontally and vertically aligned rectangular kernels w.r.t. Ω. This allows for translation equivariance when shifting the object along either of these axes. In this work, we are interested in centre-rotations of images rather than translations. We argue that one way to enable rotation equivariance is to alter the alignment of kernels. To this end, we utilize kernels operating along beams radiating from the centre to the edges of the image. Formally, a beam b is a fixed vector radiating from (W/2, H/2) with length D and a certain direction. We use a deterministic sampling strategy to obtain a radial beam set B, as illustrated in Figure 2. Since this is a rigid overlay mask on top of Ω, beam positions are agnostic to any image X. This allows us to check for intersections of B with the pixel grid Ω once. Afterwards, we only need to extract the colour information of image X at these positions to evaluate the beams. In practice, this can be implemented via the Bresenham (1965) algorithm. The cardinality |B| is a hyper-parameter, which is upper-bounded by |B_max| (see Lemma 4.2). Beams are spaced equally from one another to obtain a near-uniform receptive field and reduce bias in the sampling. We control the thickness of beams by ϵ ∈ N. To this end, we modify the classical
Figure 3. A schematic illustration of our angle regressor. The proposed Toeplitz prior for training utilizes a tuple containing image X_0 and its θ-rotated version, X_θ. First, radial beams B_0 and B_θ are sampled with proximity and embedded by a shared CNN encoder. A GNN encodes the neighbourhood information of adjacent beams. During training time, we impose a prior on the resulting latent manifold to improve its structure. To that end, a pairwise similarity matrix Ξ over embeddings of B_0 and B_θ is computed. We show that θ is mapped to diagonals in Ξ. Leveraging this insight, we can extract the probability distribution over discrete rotations from Ξ by using a Toeplitz extractor T conditioned on the angle matrix Θ. During inference, however, we are limited to a single input image X_θ. To find the rotation angle, we leverage the permutation of beam embeddings using an LSTM decoder. We use a complex unit vector regression by jointly predicting the real and imaginary parts.
Bresenham algorithm and add additional neighbourhood pixels, such that with ϵ = 1, the sampled line is enlarged by a border of one pixel on both sides, increasing the width to 2ϵ + 1. To further understand the information density extracted from X by B, we propose an approximation of the beam coverage and the resulting overlaps w.r.t. |B| and length D.
Lemma 4.1. The pixel coverage Γ_cover of radial beams sampled from image X is given by

$$\Gamma_{\mathrm{cover}}(|\mathcal{B}|, D) \approx |\mathcal{B}| \left[ \frac{4D^2}{|\mathcal{B}|} - \frac{\left(\frac{8D}{|\mathcal{B}|} - (2\epsilon+1)\right)^2}{2\tan\left(\frac{360°}{|\mathcal{B}|}\right)} \right]. \qquad (2)$$

Then, the pixel overlap Γ_over can be estimated by

$$\Gamma_{\mathrm{over}}(|\mathcal{B}|, D) \approx |\mathcal{B}| \left[ (2\epsilon+1)D - \frac{4D^2}{|\mathcal{B}|} + \frac{\left(\frac{8D}{|\mathcal{B}|} - (2\epsilon+1)\right)^2}{2\tan\left(\frac{360°}{|\mathcal{B}|}\right)} \right]. \qquad (3)$$

Lemma 4.2. The maximum number of beams is upper bounded by |B_max| ≤ 8D/(2ϵ+1).
The proofs are stated in Appendix B.1 and Appendix B.2,
respectively.
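A minimal NumPy sketch of the deterministic radial beam sampling described above, assuming a nearest-pixel line walk in place of the Bresenham (1965) rasteriser and handling the thickness ϵ by sampling 2ϵ + 1 parallel offsets; function and argument names are hypothetical, not the paper's code.

```python
import numpy as np

def radial_beams(image: np.ndarray, n_beams: int, length: int, eps: int = 1) -> np.ndarray:
    """Sample a beam set B from `image` (H x W x C): n_beams equally spaced
    directions radiating from the centre, each `length` pixels long and
    2*eps + 1 pixels thick. Returns an array of shape (n_beams, 2*eps+1, length, C).
    A nearest-pixel line walk stands in for the Bresenham rasteriser."""
    h, w, c = image.shape
    cy, cx = h / 2.0, w / 2.0
    beams = np.zeros((n_beams, 2 * eps + 1, length, c), dtype=image.dtype)
    for b in range(n_beams):
        ang = 2.0 * np.pi * b / n_beams
        d = np.array([np.sin(ang), np.cos(ang)])   # beam direction (dy, dx)
        p = np.array([-d[1], d[0]])                # perpendicular offset for thickness
        for t in range(-eps, eps + 1):             # proximity (2*eps + 1 parallel lines)
            for s in range(length):                # walk outwards from the centre
                y = int(round(cy + s * d[0] + t * p[0]))
                x = int(round(cx + s * d[1] + t * p[1]))
                if 0 <= y < h and 0 <= x < w:
                    beams[b, t + eps, s] = image[y, x]
    return beams

# The beam mask is rigid, so rotating the image by 360/|B| degrees
# corresponds to a circular shift of the sampled beam tensor along axis 0.
img = np.random.rand(64, 64, 3)
print(radial_beams(img, n_beams=32, length=24, eps=1).shape)   # (32, 3, 24, 3)
```

Because the overlay mask is rigid, the grid intersections could be precomputed once and reused for every image, as noted above.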
Shared Beam Encoder As outlined in the following, we embed the sampled beams using a shared beam encoder to mitigate coarsening issues. Through the trigonometric functions sin and cos in Equation (1), the rotation matrix R can take real values even for integer angles. Mapping the rotated pixel positions back to the coarse pixel grid introduces two problems: (i) Pixels might end up outside the pixel grid, where the corners of images are at risk. This can be circumvented by isotropic zero-padding, thus increasing the size to W + 2δ and H + 2δ.

Lemma 4.3. The optimal padding δ for any squared image X, which obviates any information loss when rotated under group transforms of SO(2), is given by δ = ⌈max(0, W(√2 − 1)/2)⌉.
The proof is stated in Appendix B.3. Note that the padding does not affect the prediction and is only used to prevent information loss during group transformations. (ii) Rotations also introduce rounding errors on Ω. Due to the real matrix R and the discrete pixel grid, an integer constraint must be enforced after rotation. Thus, destination locations of pixels might be allocated multiple times where others are missed entirely. We smoothen the rotation by bilinear interpolation between pixels to mitigate these artefacts. For further mitigation, we leverage the proximity around each pixel. The beam encoder embeds the proximity-aided beam evaluations in a latent representation, i.e., b: R^{(2ϵ+1)×D} → R^L, where L is the latent space dimensionality.
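As a small illustration of Lemma 4.3, the following sketch computes the padding margin δ and applies the isotropic zero-padding; a minimal example under the stated square-image assumption, not the paper's implementation.

```python
import numpy as np

def optimal_padding(width: int) -> int:
    """Lemma 4.3: delta = ceil(max(0, W * (sqrt(2) - 1) / 2)) for a squared W x W
    image, so that no pixel is rejected under any centre rotation in SO(2)."""
    return int(np.ceil(max(0.0, width * (np.sqrt(2.0) - 1.0) / 2.0)))

def pad_image(image: np.ndarray) -> np.ndarray:
    """Isotropic zero-padding of an H x W x C image (H = W) to (W + 2*delta)^2."""
    delta = optimal_padding(image.shape[1])
    return np.pad(image, ((delta, delta), (delta, delta), (0, 0)))

img = np.random.rand(128, 128, 3)
print(optimal_padding(128), pad_image(img).shape)   # 27 (182, 182, 3)
```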
Context Encoder The representation emerging from the
beam encoder ties spatial information together with colour
values. Since this is done individually for each beam, we
mitigate greedy decisions by incorporating the spatial con-
text through a Graph Neural Network (GNN). See Ap-
pendix A for more details on GNNs. A directed wheel
graph models the context encoder with
|B| + 1
nodes. The
graph topology encodes two essential properties of infor-
mation exchange during representation learning. First, we
leverage direct links to the centre node of the wheel graph
to aggregate messages from all beam nodes. Therefore,
the centre node representation will hold the full global con-
text. Secondly, we use directed links over undirected links
to avoid permutation invariant updates. The spatial infor-
mation of whether an adjacent node is above or below is
crucial to quantify rotations. This also allows us to mitigate
over-smoothing (Li et al., 2018) by holding back the global
context and adding it after the GNN. In Appendix D.1, we
give more technical details, for instance, regarding the mes-
sage passing strategy.
Toeplitz Prior The Toeplitz Prior represents an optional
analytical aid to improve the latent structure. We aim to
predict the rotation angle given the beam representations
obtained after both encoding steps. To this end, we utilize
the input tuple (X_0, X_θ) and learn clusters of similar beam embeddings, where X_θ = R_θ X_0 with R_θ ∈ C_{|B|}. C_{|B|} is a subgroup of SO(2),

$$C_{|\mathcal{B}|} = \left\{ R_\theta : \theta = \frac{k}{|\mathcal{B}|}\, 360°,\; k \in \{0, 1, \ldots, |\mathcal{B}| - 1\} \right\}. \qquad (4)$$
In other words, we rotate images during training only by angles between two beams (not necessarily adjacent ones). We encode B as a tensor; hence, each beam b ∈ B has a certain position in B. Without loss of generality, we set the first beam as the upper left one. In this light, rotating by R_θ ∈ C_{|B|} will shift/roll the positions of beams in B. This introduces a bijection between k from Equation (4) and θ, such that rotations in the ambient space map to circularly shifted beam embeddings. Hence, we need to quantify how many positions the beams are shifted to infer the image rotation. For instance, in Figure 3 we rotated the image by θ = 120° and sampled |B| = 3. According to Equation (4) we have k = (θ|B|)/360° = 1, so the rotation is mapped to a shift of beams, as illustrated by the beam order
in Figure 3. To find k, matching beam pairs of both images need to be identified (illustrated by equal colours in Figure 3). We compute a similarity matrix Ξ using all pairwise beam combinations with 1/(1 + ||·||₂) as the similarity measure. The hypothesis is that the similarity is highest for beam pairs with the target angle θ between them. As the Toeplitz prior in Figure 3 shows, this accumulates in a shifted diagonal in Ξ. We say diagonals in Ξ are coherent to circular shifts of beams in B. For each rotation in the finite group C_{|B|}, we have such a diagonal, which is formulated in the angle matrix Θ. We can quantify the rotation angle θ by finding the maximally activated diagonal. For this, we mask each diagonal by the Toeplitz extraction tensor T, a binary mask for matrix diagonals. By summing up all similarity scores we obtain the final logit score for k, i.e., 1ᵀ(Ξ ⊙ T_k)1, where 1 is a one-vector of length |B| and ⊙ denotes the Hadamard product. Finally, we estimate the probability distribution over angles by applying the softmax function over the logits. Due to the a priori defined T, this prior is fully differentiable.
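The following NumPy sketch illustrates the Toeplitz prior under the assumptions above: it computes the pairwise similarity matrix Ξ with 1/(1 + ‖·‖₂), masks each circularly shifted diagonal with a binary extractor T_k, and applies a softmax over the resulting logits. The name toeplitz_logits and the toy random embeddings are hypothetical; the paper applies this to learned beam embeddings.

```python
import numpy as np

def toeplitz_logits(emb0: np.ndarray, emb_rot: np.ndarray) -> np.ndarray:
    """emb0, emb_rot: (|B|, L) beam embeddings of X_0 and X_theta.
    Returns a probability distribution over the |B| discrete rotations in C_|B|."""
    n = emb0.shape[0]
    # Pairwise similarity Xi[i, j] = 1 / (1 + ||e0_i - erot_j||_2).
    diff = emb0[:, None, :] - emb_rot[None, :, :]
    xi = 1.0 / (1.0 + np.linalg.norm(diff, axis=-1))
    logits = np.empty(n)
    for k in range(n):
        # Binary Toeplitz extractor T_k: ones where j = i + k (mod |B|).
        t_k = np.roll(np.eye(n), k, axis=1)
        logits[k] = np.sum(xi * t_k)          # 1^T (Xi ⊙ T_k) 1
    e = np.exp(logits - logits.max())
    return e / e.sum()                        # softmax over shifts k

# A circular shift of identical embeddings is recovered exactly.
e0 = np.random.rand(8, 16)
probs = toeplitz_logits(e0, np.roll(e0, shift=2, axis=0))
print(probs.argmax())   # 2, i.e. theta = (2/8) * 360 = 90 degrees
```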
Decoding The Toeplitz prior enforces a deliberate struc-
ture on the latent manifold. This structure needs to be lever-
aged during inference to determine rotation angles of single
input images X_θ without X_0. We enforce an order on B by encoding it as a tensor. We build upon the assumption that during training, a structured latent manifold is learned on which the order of context-aware beams infers rotation information. During inference, we utilize this permutation of X_θ and decode the sequence of beams using a Long Short-Term Memory (LSTM). This is followed by linear transformations, such that the decoder maps R^{|B|×L} → R^L → R^2.
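A minimal PyTorch sketch of this decoding step, assuming a three-layer LSTM over the ordered beam embeddings followed by a small linear head and a unit-norm projection; the class name and layer sizes are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PermutationDecoder(nn.Module):
    """Decode ordered beam embeddings (|B|, L) into a unit vector z in R^2,
    i.e. R^{|B| x L} -> R^L -> R^2. Layer sizes are illustrative only."""
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=latent_dim, hidden_size=latent_dim,
                            num_layers=3, batch_first=True)
        self.head = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.3),
                                  nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.3),
                                  nn.Linear(latent_dim, 2))

    def forward(self, beam_emb: torch.Tensor) -> torch.Tensor:
        # beam_emb: (batch, |B|, L); use the final LSTM state as sequence summary.
        _, (h_n, _) = self.lstm(beam_emb)
        z = self.head(h_n[-1])                       # (batch, 2) = (re, im)
        return z / z.norm(dim=-1, keepdim=True)      # project onto the unit circle

z = PermutationDecoder()(torch.randn(4, 32, 128))
print(z.shape, z.norm(dim=-1))   # torch.Size([4, 2]), all norms equal to 1
```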
Unit Circle Loss Instead of predicting a real number for θ, we leverage the specific structure of angles. Let z be a complex vector on the unit circle. As stated in complex analysis, the angle between the positive real axis and z is arg z = θ, such that each z maps to a θ. The argument of z can be computed as arg(z) = atan2(im(z), re(z)), where im(z) and re(z) are the imaginary and real parts of z, respectively. Predicting z will bound the predicted angle to [0°, 360°) and introduce smooth transitions from 359° to 0°. In practice, we normalize the output of the final linear layers to unit length, z/||z||₂. A loss is required which penalizes the predicted unit vector z w.r.t. its ground-truth angle θ. We use the squared error between the real and imaginary parts,

$$\mathcal{L}_{\mathrm{circle}}(\theta, z) = (\sin(\theta) - \mathrm{im}(z))^2 + (\cos(\theta) - \mathrm{re}(z))^2. \qquad (5)$$
The unit circle loss is a highly expressive loss function, as it preserves angle distances (Appendix C.1). We provide bounds in Appendix C.2 and show its Lipschitz smoothness in Appendix C.3. The Toeplitz prior acts via a supplementary loss term,

$$\mathcal{L}(\theta, z, P) = \mathcal{L}_{\mathrm{circle}}(\theta, z) + \mathcal{L}_{\mathrm{prior}}(\theta, P) \qquad (6)$$

and

$$\mathcal{L}_{\mathrm{prior}}(\theta, P) = -\sum_{i=0}^{|\mathcal{B}|-1} P_\theta^{(i)} \log P^{(i)}, \qquad (7)$$

where P_θ is a one-hot vector of length |B| that indicates the true rotation angle θ and P is the predicted distribution over angles.
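The following PyTorch sketch implements the unit circle loss of Equation (5) and the prior term of Equation (7) under the convention z = (re, im) and angles given in degrees; a minimal sketch with hypothetical names, not the paper's training code.

```python
import torch

def circle_loss(theta_deg: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """L_circle of Eq. (5): squared error between the predicted unit vector
    z = (re, im) and the ground-truth angle embedded on the unit circle."""
    t = torch.deg2rad(theta_deg)
    return (torch.sin(t) - z[..., 1]) ** 2 + (torch.cos(t) - z[..., 0]) ** 2

def prior_loss(k_true: torch.Tensor, probs: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of Eq. (7) between the true shift index k (as an integer
    index) and the Toeplitz-prior distribution over the |B| discrete rotations."""
    return -torch.log(probs.gather(-1, k_true.unsqueeze(-1)).squeeze(-1))

theta = torch.tensor([90.0, 359.0])
z = torch.tensor([[0.0, 1.0], [1.0, 0.0]])   # exact hit, and roughly 1 degree off
print(circle_loss(theta, z))                 # approx. [0.0, 3.0e-4]
```

The second example illustrates the smooth transition across the 359° to 0° boundary that motivates the unit vector parameterisation.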
Figure 4. (A) Performance gain and stabilization across rotations. (B) Performance comparison with varying |B|. We train/test on either the finite rotation group, C_{|B|}, or on all possible rotations, SO(2). Continuous lines represent training curves, and dashed lines represent testing curves. (C) t-SNE projection of beam embeddings (C_{|B|} orbit).
5. Experiments
Unless stated explicitly, we used the settings outlined in
Appendix D.1. In Section 5.1 we tested the robustness of
BIC to different backgrounds. In Section 5.2 we showed that
the representations learned indeed form a continuous orbit in
the latent space. In Section 5.3 we studied the generalization
capabilities of BIC under varying
|B|
and found that BIC
is able to fit
SO(2)
. In Section 5.4 we showed that BIC
leverages light reflections as a key feature. In Section 5.5
we provide an investigation of the influence of different
beam lengths.
5.1. Downstream Performance Gain
We evaluated our model on the benchmark dataset siscore
(Djolonga et al., 2021). This benchmark for modelling ro-
bustness to Euclidean group transformations contains images with varying backgrounds and group-transformed Imagenet
(Deng et al., 2009) class objects. We used a pre-trained Effi-
cientNetB0 by (Tan & Le, 2019) on Imagenet as a classifier
for siscore. We train BIC on 80% of siscore to learn the rotation-canonicalisation and leverage the remaining 20% for evaluating the classification performance. During training, the Toeplitz prior was disabled for higher flexibility on |B|. Thus, the model performed a continuous unit vector regression for SO(2). We hypothesise that if BIC learns
to canonicalise images in siscore, the downstream classifier
will benefit from the canonicalised inputs, causing a perfor-
mance increase with more robustness against rotations. Our
results are shown in Figure 4A. BIC lifts the top-1 accuracy
on siscore to a stable level of over
0.37
on average. Since the
learned canonicalisation is not optimal (see Appendix D.3
for details), the performance with BIC across rotations is
not as high as for non-rotated inputs.
5.2. Learning Continuous Representation Orbits
The latent structure should also support our hypothesis that image rotations cohere with circular shifts of beam embeddings. We test this by computing the beam embeddings of the entire group orbit of C_{|B|}. The beam embeddings should form a surface topologically equivalent to a circle. Figure 4C shows the 2D t-Distributed Stochastic Neighbor Embedding (t-SNE) (van der Maaten & Hinton, 2008) projection of all beam embeddings of a COIL100 sample. We provide more details and examples in Appendix D.4. The representation is reminiscent of a Möbius strip, indicating exactly the sought-after continuity. Interestingly, such latent structure emerges without the Toeplitz prior. We provide a profound study on the necessity of the prior in Appendix D.2.
5.3. Generalization and varying Number of Beams
We are interested in the generalisation capabilities of BIC.
In that light, we evaluate the variance of our model when
trained and tested on the same distribution. To go one step
further, we also evaluated the test performance on all possible rotations, SO(2), when trained only on a subgroup. All this is conducted using a set of different numbers of beams |B| to reason about the influence of the receptive field on the prediction quality. We used the COIL100 dataset for our investigations, using a beam length of D = W/2 − δ to reduce the risk of overfitting. The performance results are shown in Figure 4B. Generally, we observe a performance improvement with increasing |B|, which was expected:
Figure 5. (left) COIL100 pre-trained BIC tested on synthetic spheres with nine different light positions (upper front, top, left, right, behind, front, upper behind, upper left, upper right). (right) Saliency maps of a COIL100 sample.
The higher the sampling rate, the more information is available to infer the rotation angle.
Blue curves indicate the training and testing performance, respectively, on the finite group C_{|B|}. Orange curves illustrate the performance on SO(2), i.e., continuous angle regression. We found that the variance of the model trained and tested on SO(2) is much larger than in the subgroup experiments. This is reasonable due to the significantly larger output space. It is surprising, however, that the training performance on SO(2) is lower than on C_{|B|}. Most interesting is the red curve, which illustrates the test performance on SO(2) when trained on C_{|B|}. Here, the model is forced to interpolate and can no longer rely on the bijection between k and θ (see Equation (4)). To increase the difficulty even further, we kept the sample complexity constant, i.e., the number of rotations per image in that dataset. The result indicates that BIC can generalise to broader distributions and even to continuous regression. Another exciting finding is that this setting outperforms the SO(2)-trained model. We leave both of these remarkable findings for possible future work.
5.4. Leveraging Light Reflections
As humans, we learn canonical orientations for many differ-
ent object classes, as shown by (Harris et al., 2001). This
originates from our physical understanding of the perceived
world. For instance, our mental model of a giraffe is most
likely in the form of an upstanding animal. What if the
presented object does not come with such a bias, like a ball? Given that the image is shot in a room with static lighting conditions, the lighting can be used as an indicator for its canonical orientation. We hypothesise that this salient feature is the main driver for the canonicalisation of such images. To test our hypothesis, we pre-train our model on a dataset with a spatially static light, such as COIL100. During testing, we utilise different test sets, each with a different light position. If reflections are a salient feature to determine the canonical orientation, then the prediction performance must vary across the sets. We simulated different lighting conditions in Blender (2022) with a sphere as the target object to purify the scene by focusing only on reflections. We generated 100 sample images with different light intensities
for each of the nine different light positions. The results in Figure 5 (left) show noticeable differences in prediction performance across the test sets. This indicates that the model trained on COIL100 indeed learned to leverage the reflections as
the salient feature for the angle regression. This finding is
supported by the saliency maps in Figure 5 (right), where
the model focuses on the outline and the shadow areas of
the object. The plots also show the equivariance between
the input and the feature map under rotation. If the data is
recorded under static lighting conditions, as in industrial
environments, reflections are a robust salient feature for ad-
equately estimating the orientation across the train and test
set. This explains why the variance in all of our experiments
is reasonably low.
5.5. Performance Impact of Beam Length
During the rotation of any image, colour values move on
circular tracks around the centre. However, the image shape
is rectangular rather than circular. Therefore, pixels on outer
circular tracks might not get a colour value assigned. These
rotation artifacts are visualized in Figure 6 and highlighted
in white. We call these edges between the vanilla image
and any padded region (including these artefacts) image-
to-padding borders. Beams that include these regions hold
valuable information for the model to overfit on this sharp
colour gradient. There are three beam lengths, as shown in Figure 6, that offer particularly interesting semantic contexts for the angle regression:
A  We set D = W/2 − δ with δ according to Lemma 4.3. This prevents beams from covering padded pixels across all possible rotations. On the one hand, this
Figure 6. Schematic illustration of the three different beam lengths and their performance impact on two test sets, COIL100 and LFW.
causes a drastic context reduction but, on the other
hand, eliminates the possibility of overfitting on the
image-to-padding border.
B  We set D = W/2. Here the context is increased, but the model might overfit on the included image-to-padding border.
C  We set D = W/2 + δ. The full context is provided under all possible rotations. However, significant amounts of input pixels provide no information.
We conducted two experiments, where we trained and eval-
uated BIC on COIL100 and LFW, respectively. Beams with
full context (C) achieved a consistently lower L_circle loss, as
shown in Figure 6. Due to the significant loss increase un-
der setting A, we hypothesise that BIC overfitted on the
image-to-padding border. To support this hypothesis, we
conducted another experiment.
If the target object on the image lives on a monochrome
background, the image-to-padding border would be elimi-
nated. However, all utilised datasets, including COIL100, do
not show the object on a purely black background. Therefore, we used an adaptive padding technique to colour the background monochromatically. We extracted the
colour value of a corner pixel of the vanilla COIL100 tar-
get image. The gathered colour code is then utilised for
image-individual padding. The saliency maps (Simonyan
et al., 2014) in Figure 11 show that the model learns the two-
dimensional shape of the target object instead of focusing
on the image-to-padding border.
6. Conclusion
We presented an angle regressor for image data based on
radial beams sampled from the input image. Our Radial
Beam-based Image Canonicalization (BIC) maps random
centre-rotated images to their canonicalised orientations.
This allows for model-agnostic rotation-sensitive down-
stream prediction networks. Our model is part of a disjointly trained vision pipeline comprising an object detection and cropping mechanism, followed by Radial Beam-based Image Canonicalization (BIC) and a downstream classifier. Through the object detection and cropping mechanism, we ensure that the target object is centred, removing the centre bias of our radial beam sampling. BIC achieves a regression error of 13° on COIL100 and 12° on LFW with ran-
dom continuous rotations. A possible application domain
is robotic handling, where a robot arm needs to orient its
end-effector according to the orientation of the target object.
BIC holds the potential to supersede known but inefficient
strategies, like augmented training as in (Gudimella et al.,
2017). During our investigations, our angle distance loss
learned a meaningful structured latent space. Besides em-
pirical findings, we provide a mathematical foundation for
our proposed radial beam sampling to ease future investi-
gations. This includes a potential study of the information
decrease along beams towards the centre, i.e., pixels fur-
ther from the centre are more affected by rotations than
pixels closer to the centre. Pixels towards the centre are
represented redundantly due to the overlapping nature of
beams. Furthermore, end-to-end training using BIC as the
localisation sub-network in Spatial Transformers (Jaderberg
et al., 2015) would be an exciting future direction.
7. Acknowledgements
The authors acknowledge the financial support by the
Federal Ministry of Education and Research of Germany
(BMBF) within the framework of the funding for the project PASCAL.
References
Bexten, S., Schmidt, J., Walter, C., and Elkmann, N. Human
action recognition as part of a natural machine operation
framework. In 2021 26th IEEE International Confer-
ence on Emerging Technologies and Factory Automation
(ETFA), pp. 1–8, 2021. doi: 10.1109/ETFA45728.2021.9613331.
Blender. Blender - a 3D modelling and rendering pack-
age. Blender Foundation, Stichting Blender Foundation,
Amsterdam, 2022.
Bresenham, J. Algorithm for computer control of a digital
plotter. IBM Syst. J., 4:25–30, 1965.
Bronstein, M. M., Bruna, J., Cohen, T., and Veličković, P. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv, 2104.13478, 2021.
Chen, S., Dobriban, E., and Lee, J. H. Invariance reduces
variance: Understanding data augmentation in deep learn-
ing and beyond. CoRR, abs/1907.10905, 2019.
Cohen, T. and Welling, M. Group equivariant convolutional
networks. In Balcan, M. F. and Weinberger, K. Q. (eds.),
Proceedings of The 33rd International Conference on
Machine Learning, volume 48 of Proceedings of Machine
Learning Research, pp. 2990–2999, New York, New York,
USA, 20–22 Jun 2016. PMLR.
Dao, T., Gu, A., Ratner, A., Smith, V., De Sa, C., and Re, C.
A kernel theory of modern data augmentation. In Chaud-
huri, K. and Salakhutdinov, R. (eds.), Proceedings of the
36th International Conference on Machine Learning, vol-
ume 97 of Proceedings of Machine Learning Research,
pp. 1528–1537. PMLR, 09–15 Jun 2019.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,
L. Imagenet: A large-scale hierarchical image database.
In 2009 IEEE conference on computer vision and pattern
recognition, pp. 248–255. IEEE, 2009.
Djolonga, J., Yung, J., Tschannen, M., Romijnders, R.,
Beyer, L., Kolesnikov, A., Puigcerver, J., Minderer,
M., D’Amour, A., Moldovan, D., Gelly, S., Houlsby,
N., Zhai, X., and Lucic, M. On robustness and trans-
ferability of convolutional neural networks. In 2021
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition (CVPR), pp. 16453–16463, 2021. doi:
10.1109/CVPR46437.2021.01619.
Ecker, A. S., Sinz, F. H., Froudarakis, E., Fahey, P. G., Ca-
dena, S. A., Walker, E. Y., Cobos, E., Reimer, J., Tolias,
A. S., and Bethge, M. A rotation-equivariant convolu-
tional neural network model of primary visual cortex. In
Seventh International Conference on Learning Represen-
tations (ICLR), pp. 1–11, 2019.
Farina, F. and Slade, E. Symmetry-driven graph neural
networks. arXiv, 2105.14058, 2021.
Gens, R. and Domingos, P. Deep symmetry networks. In
Proceedings of the 27th International Conference on Neu-
ral Information Processing Systems - Volume 2, NIPS’14,
pp. 2537–2545, Cambridge, MA, USA, 2014. MIT Press.
Goodfellow, I. J., Le, Q. V., Saxe, A. M., Lee, H., and
Ng, A. Y. Measuring invariances in deep networks. In
Proceedings of the 22nd International Conference on
Neural Information Processing Systems, NIPS’09, pp.
646–654, Red Hook, NY, USA, 2009. Curran Associates
Inc. ISBN 9781615679119.
Gudimella, A., Story, R., Shaker, M., Kong, R., Brown, M.,
Shnayder, V., and Campos, M. Deep reinforcement learn-
ing for dexterous manipulation with concept networks,
2017.
Hall, B. Lie Groups, Lie Algebras, and Representations:
An Elementary Introduction. Graduate Texts in Mathe-
matics. Springer International Publishing, 2015. ISBN
9783319134673.
Harris, I. M., Harris, J. A., and Caine, D. Object Orienta-
tion Agnosia: A Failure to Find the Axis? Journal of
Cognitive Neuroscience, 13(6):800–812, 08 2001. ISSN
0898-929X. doi: 10.1162/08989290152541467.
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into
rectifiers: Surpassing human-level performance on ima-
genet classification. In 2015 IEEE International Confer-
ence on Computer Vision (ICCV), pp. 1026–1034, 2015.
doi: 10.1109/ICCV.2015.123.
Hu, S. X., Zagoruyko, S., and Komodakis, N. Exploring
weight symmetry in deep neural networks. Computer Vi-
sion and Image Understanding, 187:102786, 2019. ISSN
1077-3142. doi: https://doi.org/10.1016/j.cviu.2019.07.
006.
Jaderberg, M., Simonyan, K., Zisserman, A., and
Kavukcuoglu, K. Spatial transformer networks. In Pro-
ceedings of the 28th International Conference on Neural
Information Processing Systems - Volume 2, NIPS’15, pp.
2017–2025, Cambridge, MA, USA, 2015. MIT Press.
Keller, T. A. and Welling, M. Topographic vaes learn
equivariant capsules. In Ranzato, M., Beygelzimer, A.,
Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Ad-
vances in Neural Information Processing Systems, vol-
ume 34, pp. 28585–28597. Curran Associates, Inc., 2021.
Kingma, D. P. and Ba, J. Adam: A method for stochastic
optimization. CoRR, abs/1412.6980, 2015.
Kipf, T. N. and Welling, M. Semi-supervised classifica-
tion with graph convolutional networks. In International
Conference on Learning Representations, 2017.
Kondor, R., Son, H. T., Pan, H., Anderson, B., and Trivedi,
S. Covariant compositional networks for learning graphs.
2018. doi: 10.48550/ARXIV.1801.02144.
Leibo, J., Mutch, J., and Poggio, T. Learning to discount
transformations as the computational goal of visual cortex.
06 2011. doi: 10.1038/npre.2011.6078.1.
Li, Q., Han, Z., and Wu, X. Deeper insights into graph con-
volutional networks for semi-supervised learning. 32nd
AAAI Conference on Artificial Intelligence, pp. 3538–
3545, 2018.
Marcos, D., Volpi, M., Komodakis, N., and Tuia, D. Ro-
tation equivariant vector field networks. In 2017 IEEE
International Conference on Computer Vision (ICCV).
IEEE, oct 2017. doi: 10.1109/iccv.2017.540.
Nabarro, S., Ganev, S., Garriga-Alonso, A., Fortuin, V.,
van der Wilk, M., and Aitchison, L. Data augmentation
in bayesian neural networks and the cold posterior effect.
ArXiv, abs/2106.05586, 2021.
Saeed, F., Muhammad Jamal, A., Junaid, M., Hong, K., Paul,
A., and Kavitha, M. S. A robust approach for industrial
small-object detection using an improved faster regional
convolutional neural network. Scientific Reports, 11, 12
2021. doi: 10.1038/s41598-021-02805-y.
Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside
convolutional networks: Visualising image classification
models and saliency maps. CoRR, abs/1312.6034, 2014.
Tan, M. and Le, Q. EfficientNet: Rethinking model scaling
for convolutional neural networks. In Chaudhuri, K. and
Salakhutdinov, R. (eds.), Proceedings of the 36th Inter-
national Conference on Machine Learning, volume 97 of
Proceedings of Machine Learning Research, pp. 6105–
6114. PMLR, 2019.
Ustyuzhaninov, I., Cadena, S. A., Froudarakis, E., Fahey,
P. G., Walker, E. Y., Cobos, E., Reimer, J., Sinz, F. H.,
Tolias, A. S., Bethge, M., and Ecker, A. S. Rotation-
invariant clustering of neuronal responses in primary vi-
sual cortex. In International Conference on Learning
Representations, 2020.
van der Maaten, L. and Hinton, G. E. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:
2579–2605, 2008.
Weisfeiler, B. and Lehman, A. A. A Reduction of a Graph
to a Canonical Form and an Algebra Arising During This
Reduction. Nauchno-Technicheskaya Informatsia, Ser. 2
(N9):12–16, 1968.
Worrall, D. E., Garbin, S. J., Turmukhambetov, D., and
Brostow, G. J. Harmonic networks: Deep translation
and rotation equivariance. In 2017 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pp.
7168–7177, 2017. doi: 10.1109/CVPR.2017.758.
Xu, K., Hu, W., Leskovec, J., and Jegelka, S. How powerful
are graph neural networks? 2018. doi: 10.48550/ARXIV.
1810.00826.
Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B.,
Salakhutdinov, R. R., and Smola, A. J. Deep sets. In
Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fer-
gus, R., Vishwanathan, S., and Garnett, R. (eds.), Ad-
vances in Neural Information Processing Systems, vol-
ume 30. Curran Associates, Inc., 2017.
Zhou, A., Knowles, T., and Finn, C. Meta-learning symme-
tries by reparameterization. 2020. doi: 10.48550/ARXIV.
2007.02933.
Figure 7. (left) Beam sketch with |B| = 8 on the pixel grid Ω. Each beam has a fixed length of D; thus a two-dimensional hull of length and height 2D results. (right) One example triangle with base length and height annotated with the deduced formulas, as well as the coverage Γ_Δ^{(cover)}(|B|, D) and overlap Γ_Δ^{(over)}(|B|, D). Within the triangle, beam widths are depicted by dashed lines.
A. Graph Neural Networks
GNNs enable learning on graphs and thus directly leverage the underlying data structure as an inductive bias. The basis for
this is a learnable inference method, i.e. the iterative parameterized message passing procedure. For simplicity, we consider
only graph convolutional neural networks (Kipf & Welling, 2017), which we will use interchangeably with GNNs in this
work. Furthermore, we use node and vertex as synonyms and edge and link to denote connections. At each layer
l
the node
features of each node are updated by the GNN propagation rule
hl+1
vfnode
u∈N(v)
flink hl
v,hl
u,hl
v!,(8)
where h_v^{l+1} is the updated node embedding (output of layer l). ⊕ denotes a permutation-invariant operation, like summation, maximization, minimization, or averaging (Zaheer et al., 2017). This is the reason for GNNs being permutation invariant w.r.t.
their local node set. Therefore, graphs convey the underlying topology, not the geometry, i.e. distances and angles between
nodes. By this, the symmetry of graphs is encoded such that the connection of nodes is relevant and not their mere position.
However, this is not always beneficial since it worsens the representational power and causes steerability downsides for some
use cases (Kondor et al., 2018). Although mean and max-pooling aggregators are well-defined multiset functions, they are
not injective (Xu et al., 2018). For multiset injective aggregators, this learning procedure closely resembles the well-known
(Weisfeiler & Lehman, 1968) Graph Isomorphism Test. This test answers the question of whether or not two graphs are
topologically equivalent. Note, however, that the outcome of this test provides a necessary but insufficient condition for
graph isomorphisms. As shown in (Xu et al., 2018), the sum operator holds the highest expressiveness among permutation
invariant aggregator choices.
(Farina & Slade, 2021) proposed an angle preserving graph network, which encodes the direction of the neighbour nodes.
For l = 0, we initialize h_v^l by the node features x_v. This update mechanism is a composition of a node embedding function f_node and a link embedding function f_link, where both are modelled by neural networks, rendering the procedure learnable. f_link models the message passing procedure, where the query node embedding h_v^l and its neighbour node embedding h_u^l are combined in a non-linear fashion. This allows for an information flow from the neighbour nodes N(v) of query node v to itself. In Equation (8), we used the sum operator as the aggregator of all messages. The result is passed into the node embedding function f_node, modelled by a neural network, which encodes the aggregated information and the current query node embedding. Node states are propagated until an equilibrium is obtained. This iterative convolutional process allows for
the treatment of irregular data.
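A minimal NumPy sketch of one step of the propagation rule in Equation (8) on a directed wheel graph like the one used by the context encoder; here f_link and f_node are toy single-layer tanh maps and the neighbourhood dictionary is hypothetical, whereas the paper models both functions with learned neural networks.

```python
import numpy as np

def gnn_step(h: np.ndarray, neighbours: dict, w_link: np.ndarray, w_node: np.ndarray) -> np.ndarray:
    """One propagation step of Eq. (8): for each node v, sum f_link(h_v, h_u) over
    incoming neighbours u and combine the aggregate with h_v via f_node.
    f_link and f_node are toy single-layer tanh maps standing in for learned networks."""
    def f_link(hv, hu):
        return np.tanh(w_link @ np.concatenate([hv, hu]))
    def f_node(msg, hv):
        return np.tanh(w_node @ np.concatenate([msg, hv]))
    return np.stack([f_node(sum(f_link(h[v], h[u]) for u in neighbours[v]), h[v])
                     for v in range(len(h))])

# Directed wheel graph with |B| = 4 beam nodes (0..3) and a centre node (4):
# every beam node receives from its ring predecessor, the centre from all beams.
neighbours = {0: [3], 1: [0], 2: [1], 3: [2], 4: [0, 1, 2, 3]}
L = 8
h = np.random.randn(5, L)
print(gnn_step(h, neighbours, np.random.randn(L, 2 * L), np.random.randn(L, 2 * L)).shape)  # (5, 8)
```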
B. Proofs
B.1. Beam Overlaps and Coverage Approximation
Proof.
We use a linear relaxation of the discrete pixel grid to quantify the coverage and overlap. We assume reflection-symmetric beams, where the symmetry axis can be positioned at every beam and opposite beam pair (see Figure 7).
Therefore, the space is chopped into equally sized triangles. It is sufficient to estimate quantities for one triangle and extrapolate to the full scope. Let γ = 360°/|B| be the angle of each triangle at the centre of the image. We can leverage the spatial bound of any new beam, which is hemmed in by its adjacent neighbour beams. We say that a pixel of a beam either overlaps another beam's pixel or not. As aforementioned, all beam lengths are limited to D pixels. Let Γ_Δ^{(cover)}, Γ_Δ^{(over)} ∈ R₊ be the coverage and overlap of triangle Δ, respectively. The coverage is a strict upper bound for the overlap, Γ_Δ^{(cover)} > Γ_Δ^{(over)}, since the last pixel of any beam needs to be unique; otherwise the beam already exists in B. Intuitively, the narrower the neighbours, the tighter the spatial bound and the higher the overlap. The base length of each triangle can be computed by

$$2D \, \frac{1}{\frac{1}{4}|\mathcal{B}|} = \frac{8D}{|\mathcal{B}|}. \qquad (9)$$

Part of the approximation is the assumption that all are orthogonal triangles. Under the linear relaxation, we can compute the surface area of a triangle by

$$\Delta(|\mathcal{B}|, D) = \frac{1}{2} \, \frac{8D}{|\mathcal{B}|} \, D = \frac{4D^2}{|\mathcal{B}|}. \qquad (10)$$

The surface area of the inner embedded triangle, i.e., the uncovered area, can be computed using different trigonometric ratios. Here, we leverage the fact that γ appears also as the top angle of the embedded triangle, as illustrated in Figure 7. To estimate the uncovered surface area, we need the width of beams, which is given by (2ϵ + 1) with ϵ being the thickness. Then, the base length of this inner triangle is

$$\frac{8D}{|\mathcal{B}|} - (2\epsilon + 1). \qquad (11)$$

Using a trigonometric ratio for orthogonal triangles, the height of the triangle can be computed by

$$\frac{\frac{8D}{|\mathcal{B}|} - (2\epsilon + 1)}{\tan(\gamma)}. \qquad (12)$$

Then, the uncovered surface area, denoted ∇(|B|, D), is approximately

$$\nabla(|\mathcal{B}|, D) \approx \frac{1}{2} \, \frac{\frac{8D}{|\mathcal{B}|} - (2\epsilon + 1)}{\tan(\gamma)} \left( \frac{8D}{|\mathcal{B}|} - (2\epsilon + 1) \right) \approx \frac{\left( \frac{8D}{|\mathcal{B}|} - (2\epsilon + 1) \right)^2}{2 \tan(\gamma)}. \qquad (13)$$

Adding together the beam thickness and the uncovered area and subtracting the overall surface area gives the overlap

$$\Gamma^{(\mathrm{over})}_{\Delta}(|\mathcal{B}|, D) \approx (2\epsilon + 1) D + \nabla(|\mathcal{B}|, D) - \Delta(|\mathcal{B}|, D). \qquad (14)$$

This allows us to compute the coverage by subtracting this overlap from the beam thickness

$$\Gamma^{(\mathrm{cover})}_{\Delta}(|\mathcal{B}|, D) \approx (2\epsilon + 1) D - \Gamma^{(\mathrm{over})}_{\Delta}(|\mathcal{B}|, D). \qquad (15)$$

Due to the linear relaxation and the partitioning of the area into equivalent triangles, a linear extrapolation to the full image is possible, such that the total overlap and coverage are given by

$$\Gamma_{\mathrm{over}}(|\mathcal{B}|, D) \approx |\mathcal{B}| \, \Gamma^{(\mathrm{over})}_{\Delta}(|\mathcal{B}|, D) \qquad (16)$$
$$\approx |\mathcal{B}| \left[ (2\epsilon + 1) D + \nabla(|\mathcal{B}|, D) - \Delta(|\mathcal{B}|, D) \right] \qquad (17)$$
$$\approx |\mathcal{B}| \left[ (2\epsilon + 1) D + \frac{\left( \frac{8D}{|\mathcal{B}|} - (2\epsilon + 1) \right)^2}{2 \tan(\gamma)} - \frac{4D^2}{|\mathcal{B}|} \right], \qquad (18)$$

$$\Gamma_{\mathrm{cover}}(|\mathcal{B}|, D) \approx |\mathcal{B}| \, \Gamma^{(\mathrm{cover})}_{\Delta}(|\mathcal{B}|, D) \qquad (19)$$
$$\approx |\mathcal{B}| \left[ (2\epsilon + 1) D - (2\epsilon + 1) D - \nabla(|\mathcal{B}|, D) + \Delta(|\mathcal{B}|, D) \right] \qquad (20)$$
$$\approx |\mathcal{B}| \left[ \frac{4D^2}{|\mathcal{B}|} - \frac{\left( \frac{8D}{|\mathcal{B}|} - (2\epsilon + 1) \right)^2}{2 \tan(\gamma)} \right]. \qquad (21)$$

Both approximations are grounded in the assumption of triangles and are thus subject to |B| ≥ 8.
B.2. Maximal Beam Set Cardinality Upper Bound
Proof. Each beam needs to cover at least one pixel not covered by others; otherwise these beams would be redundant. We say that the maximum number of beams |B_max| is reached if the receptive field fully covers the image. Therefore, we use the coverage approximation from Lemma 4.1,

$$\Gamma_{\mathrm{cover}}(|\mathcal{B}|, D) \approx |\mathcal{B}| \left[ \frac{4D^2}{|\mathcal{B}|} - \frac{\left(\frac{8D}{|\mathcal{B}|} - (2\epsilon+1)\right)^2}{2\tan(\gamma)} \right]. \qquad (22)$$

To obtain |B_max| we compute the derivative w.r.t. |B|, such that

$$\frac{\partial \Gamma_{\mathrm{cover}}(|\mathcal{B}|, D)}{\partial |\mathcal{B}|} = \frac{\frac{64D^2}{|\mathcal{B}|^2} - (2\epsilon+1)^2}{2\tan(\gamma)} = 0 \qquad (23)$$

$$\Rightarrow \quad |\mathcal{B}| = \sqrt{\frac{64D^2}{(2\epsilon+1)^2}} = \frac{8D}{2\epsilon+1} \geq |\mathcal{B}_{\max}|. \qquad (24)$$

Due to the violated integer constraint this defines an upper bound for |B_max|.
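For illustration, the following sketch plugs one hypothetical configuration into the approximations of Equations (2)-(3) and the bound of Lemma 4.2; it only evaluates the formulas above and is not part of the paper's code.

```python
import numpy as np

def coverage_overlap(n_beams: int, length: int, eps: int = 1):
    """Evaluate the approximations of Eqs. (2)-(3) and the bound of Lemma 4.2."""
    gamma = np.deg2rad(360.0 / n_beams)                              # triangle apex angle
    inner = (8.0 * length / n_beams - (2 * eps + 1)) ** 2 / (2.0 * np.tan(gamma))
    cover = n_beams * (4.0 * length ** 2 / n_beams - inner)          # Eq. (2)
    over = n_beams * ((2 * eps + 1) * length - 4.0 * length ** 2 / n_beams + inner)  # Eq. (3)
    b_max = 8.0 * length / (2 * eps + 1)                             # Lemma 4.2
    return cover, over, b_max

cover, over, b_max = coverage_overlap(n_beams=32, length=24, eps=1)
print(round(cover), round(over), b_max)   # approximate pixel counts and |B_max| = 64.0
```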
B.3. Optimal Padding Margin
Proof. Let W ∈ N and H ∈ N denote the width and height, respectively, of a squared image X with W = H. This sets the centre point at c = [W/2, W/2]. The image can be zero-padded uniformly at the borders by a margin δ ∈ N. The goal is to find δ such that any arbitrary rotation of the image around its centre will not cause any information loss. Information is lost if any pixel of X, which is bounded spatially by the unpadded vanilla image region, gets rejected during rotation. Using a rotation matrix R ∈ [−1, 1]^{2×2}, pixel vectors are mapped to real vectors. When enforcing the rigid discrete pixel grid, pixels might get cut off. Let d be the maximally distant pixel from c, which clearly lies in the corners of the image. By the Pythagorean theorem we have ‖d‖₂ = √((W/2)² + (H/2)²) = √2 (W/2) using the assumption above. The cut-off is maximal if the image is rotated by 45° and hence d is orthogonal to the width and height axes. So, the required padding can be quantified by the ceiled offset between ‖d‖₂ and W/2, such that

$$\delta = \left\lceil \max\left(0, \|d\|_2 - \frac{W}{2}\right) \right\rceil = \left\lceil \max\left(0, \sqrt{2}\,\frac{W}{2} - \frac{W}{2}\right) \right\rceil = \left\lceil \max\left(0, \frac{1}{2} W \left(\sqrt{2} - 1\right)\right) \right\rceil. \qquad (25)$$
C. Loss Analysis
C.1. Circle Loss preserves Angle Distances
One-dimensional real values lie on the number line R, or on a subset of it if bounds apply. This renders distance measuring trivial; e.g., we might use the absolute difference between two scalars. Angles, on the other hand, lie on a circle, such that 0° and 359° are closer to each other than 0° and 2°. We show that our circle loss L_circle(θ, z) = (sin(θ) − im(z))² + (cos(θ) − re(z))² preserves angle distances and is therefore highly expressive. It is well known that the minimum and maximum numerical angle values are 0° and 360°, respectively. Let d(θ, θ′) ∈ R₊ be a distance measure between two arbitrary angles θ and θ′. We seek a distance measure where the following two conditions hold:

$$\lim_{\theta \to 360°} d(\theta, 0°) = 0, \qquad (26)$$
$$\lim_{\theta \to 0°} d(\theta, 360°) = 0. \qquad (27)$$

Using our circle loss it is easy to show that both conditions hold, i.e.,

$$\lim_{\theta \to 360°} (\sin(0°) - \sin(\theta))^2 + (\cos(0°) - \cos(\theta))^2 = (\sin(0°) - \sin(360°))^2 + (\cos(0°) - \cos(360°))^2 = 0, \qquad (28)$$

$$\lim_{\theta \to 0°} (\sin(360°) - \sin(\theta))^2 + (\cos(360°) - \cos(\theta))^2 = (\sin(360°) - \sin(0°))^2 + (\cos(360°) - \cos(0°))^2 = 0. \qquad (29)$$

Therefore, we conclude that our circle loss indeed respects angle distances.
C.2. Extrema Analysis
Figure 8. A and B illustrate both loss term surface plots, while C depicts the plot of both Lipschitz-relevant inequality terms with im(z), re(z) = √2/2.
For an extrema analysis we first derive the partial derivatives of the multivariable loss L_circle(θ, z) = (sin(θ) − im(z))² + (cos(θ) − re(z))². The input lies in the cubic domain bounded by θ ∈ [0, 2π) and z ∈ [−1, 1]². The gradient is given by

$$\nabla \mathcal{L}_{\mathrm{circle}}(\theta, z) = \left( \frac{\partial \mathcal{L}_{\mathrm{circle}}(\theta, z)}{\partial \theta}, \frac{\partial \mathcal{L}_{\mathrm{circle}}(\theta, z)}{\partial \mathrm{re}(z)}, \frac{\partial \mathcal{L}_{\mathrm{circle}}(\theta, z)}{\partial \mathrm{im}(z)} \right). \qquad (30)$$

This resolves in

$$\frac{\partial}{\partial \theta} \mathcal{L}_{\mathrm{circle}}(\theta, z) = 2\cos(\theta)(\sin(\theta) - \mathrm{im}(z)) - 2\sin(\theta)(\cos(\theta) - \mathrm{re}(z)) = 2\left[ \mathrm{re}(z)\sin(\theta) - \mathrm{im}(z)\cos(\theta) \right], \qquad (31)$$

$$\frac{\partial}{\partial \mathrm{re}(z)} \mathcal{L}_{\mathrm{circle}}(\theta, z) = -2(\cos(\theta) - \mathrm{re}(z)), \qquad (32)$$

$$\frac{\partial}{\partial \mathrm{im}(z)} \mathcal{L}_{\mathrm{circle}}(\theta, z) = -2(\sin(\theta) - \mathrm{im}(z)). \qquad (33)$$

Analysing the behaviour of Equation (32) and Equation (33) at zero indicates that a critical point is given at im(z) = sin(θ) and re(z) = cos(θ). We provide visual support in Figure 8, where we plot both loss terms of Equation (5). The minima lie along sin and cos. Intuitively, due to the squares in Equation (5) and the sum operation as loss-term concatenation, there is no room for other minima. Since predictions and ground-truth values have numerical bounds, the loss is upper bounded as well. To quantify this upper bound, we aim to solve

$$\theta^*, z^* = \underset{\theta, z}{\operatorname{argmax}}\; \mathcal{L}_{\mathrm{circle}}(\theta, z). \qquad (34)$$

We maximize the loss by using diametrical unit vector predictions compared to the ground truth, i.e., im(z)* = cos(θ) and re(z)* = sin(θ). This leads to

$$\theta^* = \underset{\theta}{\operatorname{argmax}}\; (\sin(\theta) - \cos(\theta))^2 + (\cos(\theta) - \sin(\theta))^2 = \frac{3}{4}\pi. \qquad (35)$$

So, we have the upper bound L_circle((3/4)π, [sin(θ), cos(θ)]) = 4.
C.3. Lipschitz Smoothness
Our loss function is Lipschitz smooth iff the following condition holds:

$$\left\| \mathcal{L}_{\mathrm{circle}}(\theta, z) - \mathcal{L}_{\mathrm{circle}}(\theta', z) \right\| \leq K \left\| \theta - \theta' \right\|, \qquad (36)$$

where K is the Lipschitz constant. Using the definition of the loss in Equation (5) gives

$$(\sin(\theta) - \mathrm{im}(z))^2 + (\cos(\theta) - \mathrm{re}(z))^2 - (\sin(\theta') - \mathrm{im}(z))^2 - (\cos(\theta') - \mathrm{re}(z))^2 \leq K \left\| \theta - \theta' \right\|. \qquad (37)$$

This dissolves into

$$\sin^2(\theta) - 2\,\mathrm{im}(z)\sin(\theta) + \mathrm{im}(z)^2 + \cos^2(\theta) - 2\,\mathrm{re}(z)\cos(\theta) + \mathrm{re}(z)^2 - \sin^2(\theta') + 2\,\mathrm{im}(z)\sin(\theta') - \mathrm{im}(z)^2 - \cos^2(\theta') + 2\,\mathrm{re}(z)\cos(\theta') - \mathrm{re}(z)^2 \leq K \left\| \theta - \theta' \right\|. \qquad (38)$$

Using the trigonometric identity sin²(θ) + cos²(θ) = 1 and some rearrangements gives

$$-2 \left[ \mathrm{im}(z)(\sin(\theta) - \sin(\theta')) + \mathrm{re}(z)(\cos(\theta) - \cos(\theta')) \right] \leq K \left\| \theta - \theta' \right\|. \qquad (39)$$

Assuming the non-trivial case, i.e., θ ≠ θ′, the projection on the unit circle always gives smaller norms:

$$\left\| \sin(\theta) - \sin(\theta') \right\| < \left\| \theta - \theta' \right\|, \qquad (40)$$
$$\left\| \cos(\theta) - \cos(\theta') \right\| < \left\| \theta - \theta' \right\|. \qquad (41)$$

Furthermore, per definition we have ‖z‖ = 1, thus only a fraction of both terms above is aggregated. With K = 2 the Lipschitz inequality can be guaranteed. In Figure 8C we plot both functions of the inequality.
D. Implementation Details and Further Experiments
For a better understanding of the solution quality of our angle regressor, we illustrate the non-linear relationship between circle-loss values $\mathcal{L}_\text{circle}$ and angles in degrees in Figure 9A. For example, if we rotate an image by $180^\circ$, the difference between both images can also be quantified by $\mathcal{L}_\text{circle} = 4$. With this in mind, we first provide more implementation details of our basic experimental setup, followed by further experiments.
D.1. Base Model Specification
The beam encoder is partitioned into a proximity encoder and a spatial encoder. The proximity encoder uses 2D convolutional kernels shifting over each beam and compresses the signal down to a latent subspace, $\mathbb{R}^{|\mathcal{B}| \times (2\epsilon+1) \times D \times C} \rightarrow \mathbb{R}^{|\mathcal{B}| \times D/8 \times L/8}$. We use valid padding for all convolutions to reduce the dimensionality. The subsequent spatial encoder processes the embedding with 1D spatial convolutions, $\mathbb{R}^{|\mathcal{B}| \times D/8 \times L/8} \rightarrow \mathbb{R}^{|\mathcal{B}| \times L}$. We use a $128$-dimensional latent space (i.e., $L = 128$). The context and neighbourhood information among beams is encoded using a three-layer GNN. The obtained embeddings and their ordering are decoded by a three-layer LSTM, representing the permutation decoder $\mathbb{R}^{|\mathcal{B}| \times L} \rightarrow \mathbb{R}^{L}$. We utilize LeakyReLU activations, i.e., $\max(0.3x, x)$, as piece-wise non-linearities, which we found empirically to be more performant than ReLUs. Weights are initialised according to He et al. (2015), with biases initialised close to $0$ to suit the rectified network. Finally, three linear layers and a subsequent normalisation transform the final latent state of the permutation decoder into a complex vector, $\mathbb{R}^{L} \rightarrow [-1, 1]^2$.
We use mini-batches of $128$ elements for training, with $|\mathcal{B}| = 32$ beams per image and a thickness of $\epsilon = 1$. We use a split of $80\%$ training and $20\%$ test data. Each dataset is augmented with randomly centre-rotated replicas of the original data points. To this end, we sample $\theta \sim \mathcal{U}\left(\left\{360^\circ |\mathcal{B}|^{-1} k : k \in \{0, 1, \ldots, |\mathcal{B}| - 1\}\right\}\right)$. As we will show in the following, our model generalises well to the test data. Therefore, no explicit regularisation, such as dropout or specific loss terms, is used. We use the popular Adam optimizer (Kingma & Ba, 2015) with a static learning rate of $0.0001$ and momentum terms $\beta_1 = 0.9$ and $\beta_2 = 0.999$ to update the model parameters.
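The augmentation step samples rotation angles aligned with the beam directions. A small sketch of this sampling (with the optimiser configuration noted in comments); NumPy and the function names are our choices for illustration:

```python
import numpy as np

# Discrete training-time rotation angles aligned with the |B| = 32 beam directions.
NUM_BEAMS = 32
angles_deg = 360.0 * np.arange(NUM_BEAMS) / NUM_BEAMS   # {0, 11.25, ..., 348.75}

def sample_rotation(rng: np.random.Generator) -> float:
    """Sample theta ~ U({360 * k / |B| : k = 0, ..., |B| - 1})."""
    return float(rng.choice(angles_deg))

rng = np.random.default_rng(42)
theta = sample_rotation(rng)   # angle used to centre-rotate one training replica
print(theta)

# Optimiser configuration as described above (framework-dependent, e.g. in PyTorch):
# torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
```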
Figure 9. (A) Relationship between the circle loss $\mathcal{L}_\text{circle}$ and angles in degrees. (B) Training and test performance using three different losses over five runs. (C) Translation robustness on COIL100.
D.1.1. Beam Encoder
We partition the beam encoder into a proximity encoder and a spatial encoder to obtain a clean semantic separation. The proximity encoder's receptive field covers the pixels' local spatial neighbourhood. The proximity is compressed by two-dimensional convolutional kernels, $\mathbb{R}^{(2\epsilon+1) \times D \times C} \rightarrow \mathbb{R}^{D/8 \times L/8}$; the layer specification is given in Table 1. Then, the remaining spatial pixel information is encoded by one-dimensional convolutional kernels, mapping $\mathbb{R}^{D/8 \times L/8} \rightarrow \mathbb{R}^{L}$. Due to the compression, the architecture of the spatial encoder is sensitive to $D$. We provide details of these encoders in Table 2, Table 3, Table 4, and Table 5; a minimal sketch of the corresponding encoder stack follows the tables. We aim to preserve the spatial ordering of features in all layers, and hence no explicit pooling layers are used. This originates from the fact that for rotational perturbation detection, the further pixels are from the centre, the more information they convey. Pixels close to the centre not only hold redundancies due to the overlaps but also follow smaller rotation circles, and are hence less sensitive to minor rotations.
Table 1. The proximity encoder for $\epsilon = 1$ and with $D > 8$.
Layer Kernel Padding Strides Non-Linearity Feature Maps
1 (3, 3) no 1 LeakyReLU L/8
Table 2. The spatial encoder for $28 \times 28$ images with $D = 14$, e.g., for FashionMNIST.
Layer Kernel Padding Strides Non-Linearity Feature Maps
1 4 no 1 LeakyReLU L/4
2 4 no 1 LeakyReLU L/2
3 4 no 1 LeakyReLU L/2
4 3 no 1 LeakyReLU L
Table 3. The spatial encoder for $32 \times 32$ images with $D = 16$, e.g., for CIFAR10.
Layer Kernel Padding Strides Non-Linearity Feature Maps
1 4 no 1 LeakyReLU L/4
2 4 no 1 LeakyReLU L/2
3 4 no 1 LeakyReLU L/2
4 4 no 1 LeakyReLU L
5 2 no 1 LeakyReLU L
Table 4. The spatial encoder for $128 \times 128$ images with $D = 64$, e.g., for COIL100.
Layer Kernel Padding Strides Non-Linearity Feature Maps
1 5 no 2 LeakyReLU L/4
2 4 no 2 LeakyReLU L/4
3 4 no 1 LeakyReLU L/2
4 4 no 1 LeakyReLU L/2
5 4 no 1 LeakyReLU L/2
6 3 no 1 LeakyReLU L
7 2 no 1 LeakyReLU L
Table 5. The spatial encoder for $250 \times 250$ images with $D = 125$, e.g., for LFW.
Layer Kernel Padding Strides Non-Linearity Feature Maps
1 4 no 2 LeakyReLU L/4
2 3 no 2 LeakyReLU L/4
3 4 no 2 LeakyReLU L/4
4 4 no 1 LeakyReLU L/2
5 4 no 1 LeakyReLU L/2
6 4 no 1 LeakyReLU L/2
7 3 no 1 LeakyReLU L
8 2 no 1 LeakyReLU L
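To make the table specifications concrete, the following sketch assembles the proximity encoder (Table 1) and the spatial encoder for $32 \times 32$ images (Table 3). The framework (PyTorch), the class name, and the handling of the beam batch dimension are our assumptions for illustration and may differ from the released code:

```python
import torch
import torch.nn as nn

L, D, C, EPS = 128, 16, 3, 1   # latent size, beam length, colour channels, thickness (32x32 images)

class BeamEncoder(nn.Module):
    """Proximity (2D conv, Table 1) followed by spatial (1D convs, Table 3) encoding of one beam."""
    def __init__(self):
        super().__init__()
        act = nn.LeakyReLU(0.3)
        # Proximity encoder: (C, 2*eps+1, D) -> (L/8, 1, D-2), valid padding.
        self.proximity = nn.Sequential(nn.Conv2d(C, L // 8, kernel_size=3), act)
        # Spatial encoder: kernels 4,4,4,4,2 reduce the length 14 -> 11 -> 8 -> 5 -> 2 -> 1.
        chans = [L // 8, L // 4, L // 2, L // 2, L, L]
        kernels = [4, 4, 4, 4, 2]
        layers = []
        for c_in, c_out, k in zip(chans[:-1], chans[1:], kernels):
            layers += [nn.Conv1d(c_in, c_out, kernel_size=k), act]
        self.spatial = nn.Sequential(*layers)

    def forward(self, beams):                   # beams: (batch * |B|, C, 2*eps+1, D)
        x = self.proximity(beams).squeeze(2)    # -> (batch * |B|, L/8, D-2)
        return self.spatial(x).squeeze(-1)      # -> (batch * |B|, L)

enc = BeamEncoder()
out = enc(torch.randn(32 * 4, C, 2 * EPS + 1, D))   # 4 images with |B| = 32 beams each
print(out.shape)                                    # torch.Size([128, 128])
```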
D.1.2. Context Encoder
We build upon the fundamentals provided in Section 4. The graph topology allows for controllable and scalable information exchange between neighbours, controlled by the number of layers, i.e., the number of hops. Since the ordering of neighbours matters in our case, directed edges are used to circumvent the permutation invariance of GNNs (Kipf & Welling, 2017; Kondor et al., 2018; Xu et al., 2018). For the sake of simplicity, we update nodes by $\lambda A f(\mathcal{B}) W$, where $f(\mathcal{B})$ is the output of the beam encoder, $A \in \{0, 1\}^{|\mathcal{B}| \times |\mathcal{B}|}$ is the binary adjacency matrix, and $W \in \mathbb{R}^{L \times L}$ is the learnable weight matrix. To limit the impact of neighbour information, we use $\lambda \in (0, 1]$ as a global edge factor. For the proposed directed wheel graph, we have
as a global edge factor. For the proposed directed wheel graph, we have
an adjacency matrix
A=
0100··· 1
0010··· 1
0001··· 1
0000··· 1
.
.
..
.
..
.
..
.
.....
.
.
1000··· 1
.(42)
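A small NumPy sketch of this adjacency pattern and of the simplified node update $\lambda A f(\mathcal{B}) W$ (random placeholder weights; the helper names are ours):

```python
import numpy as np

def wheel_adjacency(num_beams: int) -> np.ndarray:
    """Directed wheel adjacency as in Equation (42): each beam points to its successor
    and to the last beam, and the last beam closes the cycle back to the first one."""
    A = np.zeros((num_beams, num_beams), dtype=np.float32)
    idx = np.arange(num_beams - 1)
    A[idx, idx + 1] = 1.0      # successor edges (super-diagonal)
    A[:, -1] = 1.0             # every beam is connected to the last one
    A[-1, 0] = 1.0             # the last beam points back to the first one
    return A

NUM_BEAMS, L = 32, 128
A = wheel_adjacency(NUM_BEAMS)
f_B = np.random.randn(NUM_BEAMS, L).astype(np.float32)    # beam-encoder output f(B)
W = np.random.randn(L, L).astype(np.float32) * 0.01       # learnable weight matrix (random here)
lam = 0.5                                                  # global edge factor lambda in (0, 1]

H = lam * A @ f_B @ W    # one (simplified) context-encoder node update
print(H.shape)           # (32, 128)
```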
D.2. Toeplitz Prior Evaluation
Surprisingly, during initial empirical studies we found that the model learns similar latent structures and achieves better performance without the Toeplitz prior. We illustrate our findings in Figure 9B, which shows the mean and standard deviation over five runs using three different losses: only the circle loss, $\mathcal{L} = \mathcal{L}_\text{circle}$; a dynamic linear combination w.r.t. the epoch $e$, $\mathcal{L} = \left(1 - \tfrac{1}{e}\right)\mathcal{L}_\text{circle} + \tfrac{1}{e}\mathcal{L}_\text{prior}$; and both losses summed without any scaling, $\mathcal{L} = \mathcal{L}_\text{circle} + \mathcal{L}_\text{prior}$. As initially stated, training without the prior achieves the best results. This might originate either from a bug in our code or from the sensitivity of the prior to perturbations. That is, during the computation of the similarity matrix, we assume that there exists a beam pair combination which is nearly identical. We formulate possible errors introduced during the rotation procedure in Section 4. If these errors are too significant, the prior might provide an erroneous momentum to the learning, which results in mediocre results.
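For clarity, the three loss variants compared in Figure 9B can be written as a small helper (an illustrative sketch; the mode names are ours, and the dynamic variant assumes epochs are counted from $e = 1$):

```python
def combined_loss(l_circle: float, l_prior: float, epoch: int, mode: str = "dynamic") -> float:
    """Loss variants compared in Figure 9B (illustrative sketch; names are ours)."""
    if mode == "circle":     # L = L_circle
        return l_circle
    if mode == "dynamic":    # L = (1 - 1/e) * L_circle + (1/e) * L_prior, with e >= 1
        return (1.0 - 1.0 / epoch) * l_circle + (1.0 / epoch) * l_prior
    if mode == "sum":        # L = L_circle + L_prior
        return l_circle + l_prior
    raise ValueError(mode)

# Early epochs weight the Toeplitz prior strongly, later epochs rely on the circle loss alone.
print(combined_loss(0.8, 0.3, epoch=1))    # 0.3 (prior only at e = 1)
print(combined_loss(0.8, 0.3, epoch=10))   # ~0.75
```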
Figure 10. (left) siscore samples. (right) t-SNE projections for COIL100 and LFW.
Figure 11. Saliency maps of a COIL100 test sample in four rotations. Trained with standard zero padding (left) and adapted padding (right).
D.3. Downstream Evaluation
We used the rotation subset of the siscore dataset by Djolonga et al. (2021), which comprises $39{,}540$ images of size $500 \times 500 \times 3$. Each image shows a randomly rotated object, like a truck, on a random background. The rotation angle was sampled from $\{1 + (360k)/n \mid k \in \{0, 1, \ldots, n-1\}\}$ with $n = 18$. To match the ImageNet (Deng et al., 2009) shape, each image is downsampled to $224 \times 224 \times 3$. We normalised all colour values from $[0, 255]$ to $[0, 1]$ and split the dataset into $80\%$ for training and $20\%$ for testing. Since the background is insignificant for the classification, no padding is added and the beam length is reduced to $D = 91$. We trained BIC for $8192$ iterations on the training set and achieved a training performance of $\mathcal{L}_\text{circle} = 0.378$ and a test performance of $\mathcal{L}_\text{circle} = 0.492$. This implies that for unseen data the prediction error is on average $41^\circ$, which is why the classification performance is not on the level of non-rotated images. We argue, however, that with some hyper-parameter tuning this error rate can be significantly reduced.
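For reference, a circle-loss value can be translated into an approximate angular error under the assumption of unit-norm predictions, where $\mathcal{L}_\text{circle} = 2 - 2\cos(\Delta\theta)$ (treating the mean loss as if it stemmed from a single angular offset). A minimal sketch:

```python
import numpy as np

def loss_to_angle_error(l_circle: float) -> float:
    """Angular error (degrees) implied by a circle-loss value, assuming unit-norm
    predictions, where L_circle = 2 - 2*cos(delta_theta)."""
    return float(np.degrees(np.arccos(1.0 - l_circle / 2.0)))

print(loss_to_angle_error(0.378))   # ~35.8 degrees (training)
print(loss_to_angle_error(0.492))   # ~41.1 degrees (test), matching the reported ~41 degrees
```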
D.4. Dimensionality Reduction with t-SNE
For dimensionality reduction we utilize t-Distributed Stochastic Neighbor Embedding (t-SNE). We use the sklearn implementation of t-SNE with a Barnes-Hut approximation for $1000$ iterations. We use a random initialisation, a perplexity of $100$, an early exaggeration of $12$, and the Euclidean metric. An automatic learning rate $\rho$ is used, such that $\rho = \max(n/(0.25\,e), 50)$, where $n$ is the sample size and $e$ the early exaggeration value. Further, a minimum gradient norm of $10^{-7}$ is utilized for early stopping. We trained our BIC model using the setup given in Section 5. The model was trained on COIL100 using $C_{|\mathcal{B}|}$ without the Toeplitz prior. We used the beam embeddings from the beam encoder without the context integration to avoid bias. A spectral colour scheme is employed to illustrate the $|\mathcal{B}| = 64$ beams for visual purposes. The results are illustrated in Figure 10 (right). These plots show that beam embeddings sampled from different rotations of an input image form a continuous structure in the latent space.
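A minimal scikit-learn sketch of the described t-SNE configuration (the embedding array is a random placeholder standing in for the beam embeddings):

```python
import numpy as np
from sklearn.manifold import TSNE

# Beam embeddings from the beam encoder (placeholder data; shape: samples x latent dim L).
embeddings = np.random.randn(6400, 128).astype(np.float32)

tsne = TSNE(
    n_components=2,
    perplexity=100,
    early_exaggeration=12,
    learning_rate="auto",    # automatic learning rate as described above
    init="random",
    metric="euclidean",
    method="barnes_hut",     # Barnes-Hut approximation; 1000 iterations is the default
    min_grad_norm=1e-7,      # early-stopping threshold on the gradient norm
)
projection = tsne.fit_transform(embeddings)   # (6400, 2), as visualised in Figure 10 (right)
```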
D.5. Geometric Stability to Translations
In a real-world setting, small translations might perturb input images. BIC is not translation equivariant, since the radial sampling radiates from a spatially fixed centre point. We view BIC as part of a more extensive computer vision pipeline, where other components might compensate for this drawback. However, since exact upstream invariances are unlikely in real-world settings, our model has to fend off these small perturbations for a robust machine vision pipeline. Theoretically, geometric stability to signal deformations is measured by a continuous complexity measure as in Bronstein et al. (2021). In this analysis, we use a subgroup of the translation group $T(2)$ which contains the transformations respecting the discrete pixel grid, i.e., $T(2, \Omega) \subset T(2)$. The violin plot in Figure 9C shows the decrease in test performance when translating the query images horizontally and vertically. Angle predictions below the $5^\circ$ stability boundary are at most $5^\circ$ off from the unperturbed predictions. Considering the mean of the predictions, we conclude that BIC is geometrically stable up to $\pm 3$ pixels. This can be improved by an increased thickness $\epsilon$, which we set to $1$ for our experiments.
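The evaluation behind Figure 9C can be sketched as follows; the helper names are hypothetical, the shift is implemented with a simple `np.roll`, and the border handling may differ from our actual protocol:

```python
import numpy as np

def angular_deviation(pred_deg: np.ndarray, ref_deg: np.ndarray) -> np.ndarray:
    """Smallest absolute angle difference on the circle, in degrees."""
    diff = np.abs(pred_deg - ref_deg) % 360.0
    return np.minimum(diff, 360.0 - diff)

def translation_stability(predict, images, shifts=range(-5, 6), boundary_deg=5.0):
    """Fraction of angle predictions within the stability boundary for each pixel shift.
    `predict` maps a batch of images to angles in degrees (e.g. a trained BIC model)."""
    reference = predict(images)
    results = {}
    for s in shifts:
        shifted = np.roll(images, shift=s, axis=2)   # horizontal shift; use axis=1 for vertical
        dev = angular_deviation(predict(shifted), reference)
        results[s] = float(np.mean(dev <= boundary_deg))
    return results

# Example with a dummy predictor (a trained BIC model would be used instead):
images = np.random.rand(8, 64, 64, 3)
dummy_predict = lambda x: np.full(len(x), 90.0)
print(translation_stability(dummy_predict, images))
```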
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.