Residual Networks as Flows of Diffeomorphisms
François Rousseau · Lucas Drumetz · Ronan Fablet
Abstract This paper addresses the understanding and characterization of
residual networks (ResNets), which are among the state-of-the-art deep
learning architectures for a variety of supervised learning problems. We focus
on the mapping component of ResNets, which maps the embedding space towards
a new unknown space where the prediction or classification can be stated
according to linear criteria. We show that this mapping component can be
regarded as the numerical implementation of continuous flows of
diffeomorphisms governed by ordinary differential equations. In particular,
ResNets with shared weights are fully characterized as numerical
approximations of exponential diffeomorphic operators. We stress both
theoretically and numerically the relevance of enforcing diffeomorphic
properties and the importance of numerical issues in making the continuous
formulation and the discretized ResNet implementation consistent. We further
discuss the resulting theoretical and computational insights on ResNet
architectures.
Keywords Residual Network · Diffeomorphism · Dynamical systems
1 Introduction
Deep learning models are the reference models for a wide range of machine
learning problems. Among deep learning (DL) architectures, residual networks
(also called ResNets) have become state-of-the-art [15,16]. Experimental
evidence highlights critical aspects in the specification of these
architectures, for instance in terms of network depth or combination of
elementary layers, as well as in their stability and genericity. The
understanding and characterization of ResNets, and more widely of DL
architectures, from a theoretical point of view remains a key issue despite
recent advances for CNNs.
F. Rousseau
IMT Atlantique
Technopôle Brest Iroise, 29239 Brest, France
R. Fablet and L. Drumetz
IMT Atlantique
Interesting insights on ResNets have recently been presented in [25,12,31]
from an ordinary/partial differential equation (ODE/PDE) point of view.
ResNets can be regarded as numerical schemes of differential equations. In
particular, in [25], this PDE-driven setting stresses the importance of
numerical stability issues depending on the selected ResNet configuration. It
also makes explicit the interpretation of the ResNet architecture as a
depth-related evolution of an input space towards a new space where the
prediction of the expected output (for instance classes) is solved according
to a linear operator. This interpretation is also pointed out in [13] and
discussed in terms of Riemannian geometry.
In this work, we deepen this analogy between ResNets and deformation flows to
relate ResNets and registration problems [27], especially diffeomorphic
registration [30,5,3,2]. Our contribution is three-fold: (i) we restate ResNet
learning as the learning of a continuous and integral diffeomorphic operator
and investigate different solutions, especially the exponential operator of
velocity fields [2], to enforce diffeomorphic properties; (ii) we make
explicit the interpretation of ResNets as numerical approximations of the
underlying continuous diffeomorphic setting governed by ordinary differential
equations (ODEs); (iii) we provide theoretical and computational insights on
the specification of ResNets and on their properties.
This paper is organized as follows. Section 2 relates ResNets to diffeomorphic
registrations. We introduce in Section 3 the proposed diffeomorphism-based
learning framework. Section 4 reports experiments. Our key contributions are
further discussed in Section 5.
2 From ResNets to diffeomorphic registrations
ResNets [15,16] have become state-of-the-art deep learning architectures for
a variety of problems, including for instance image recognition [15] or
super-resolution [18]. This architecture was proposed in order to explore the
performance of very deep models without degradation of the training accuracy
when adding layers. ResNets proved to be easier to optimize and made it
possible to learn very deep models (up to hundreds of layers).
As illustrated in Fig. 1, ResNets can be decomposed into three main building
blocks:
- the embedding block, which aims to extract relevant features from the input
  variables for the targeted task (such as classification or regression). In
  [15], this block consists of a set of 64 convolution filters of size 7×7
  followed by a non-linear activation function such as ReLU.
- the mapping block, which aims to incrementally map the embedding space to a
  new unknown space, in which the data are, for instance, linearly separable
  in the classification case. In [15], this block consists of a series of
  residual units. A residual unit is defined as $y = F(x, \{W_i\}) + x$, where
  the function $F$ is the residual mapping to be learned. In [15],
  $F(x) = W_2 \sigma(W_1 x)$, where $\sigma$ denotes the activation function
  (biases are omitted to simplify notations). The operation $F(x) + x$ is
  performed by a shortcut connection and element-wise addition.
- the prediction block, which addresses the classification or regression steps
  from the mapped space to the output space. This prediction block is expected
  to involve linear models. In [15], this step is performed with a fully
  connected layer.
Fig. 1 A schematic view of the ResNet architecture [15], decomposed into three
blocks: embedding, mapping and prediction. 'conv' means convolution operations
followed by non-linear activations, and 'fc' means fully connected layer.
In this work, we focus on the definition and characterization of the mapping
block in ResNets. The central idea of ResNets is to learn the additive
residual function $F$ such that the layers in the mapping block are related by
the following equation:
$$x_{l+1} = x_l + F(x_l, W_l) \tag{1}$$
where $x_l$ is the input feature to the $l$th residual unit and $W_l$ is a set
of weights (and biases) associated with the $l$th residual unit. In [16], it
appears that such a formulation exhibits interesting backward propagation
properties. More specifically, it implies that the gradient of a layer does
not vanish even when the weights are arbitrarily small.
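The recursion in Eq. (1) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the residual function is taken to be $F(x) = W_2 \sigma(W_1 x)$ with a ReLU activation and no biases, as recalled above, and the dimensions, depth and random weights are arbitrary choices made for the example.

```python
import numpy as np

def residual_unit(x, W1, W2):
    """One residual unit: x + W2 @ relu(W1 @ x), biases omitted as in the text."""
    return x + W2 @ np.maximum(W1 @ x, 0.0)

def mapping_block(x, weights):
    """Apply the residual units in sequence, following Eq. (1)."""
    for W1, W2 in weights:
        x = residual_unit(x, W1, W2)
    return x

# Illustrative sizes and random weights (assumptions, not values from the paper)
rng = np.random.default_rng(0)
dim, hidden, depth = 4, 8, 5
weights = [(0.1 * rng.standard_normal((hidden, dim)),
            0.1 * rng.standard_normal((dim, hidden))) for _ in range(depth)]
x0 = rng.standard_normal(dim)
x_out = mapping_block(x0, weights)
```

With all weights set to zero, each unit reduces to the identity, which is the sense in which every residual unit is a perturbation of the identity mapping.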
Here, we relate the incremental mapping defined by these ResNets to
diffeomorphic registration models [27]. These registration models, especially
Large Deformation Diffeomorphic Metric Mapping (LDDMM) [30,5], tackle the
registration issue from the composition of a series of incremental
diffeomorphic mappings, each individual mapping being close to the identity.
Similarly, in ResNet architectures, the $l$th residual block provides an
update of the form $x_l + F(x_l, W_l)$. Under the assumption that
$\|F(x_l, W_l)\| \ll \|x_l\|$, the deformation flows generated by ResNet
architectures may be expected to implement the composition of a series of
incremental diffeomorphic mappings.
In [15,16], it is mentioned that the form of the residual function $F$ is
flexible. Several residual blocks have been proposed and experimentally
evaluated, such as bottleneck blocks [15], various shortcut connections [16]
or aggregated residual transformations [32]. However, by making the connection
between ResNets and diffeomorphic mappings, it appears here that the function
$F$ is a parametrization of an elementary deformation flow, which constrains
the space of admissible residual unit architectures.
We argue that this registration-based interpretation motivates the definition
of ResNet architectures as the numerical implementation of continuous flows of
diffeomorphisms. Section 3 details the proposed diffeomorphism-based learning
framework, in which diffeomorphic flows are governed by ODEs as in the LDDMM
setting. ResNets with shared weights relate to a particularly interesting case
yielding the definition of exponential diffeomorphism subgroups in the
underlying Lie group. Overall, the proposed framework results in: i) a
theoretical characterization of the mapping block as an integral diffeomorphic
operator governed by an ODE, ii) the use of deformation flows and Jacobian
maps for the analysis of ResNets, iii) the derivation of ResNet architectures
with additional diffeomorphic constraints.
3 Diffeomorphism-based learning
3.1 Diffeomorphisms and driving velocity vector fields
Registration issues have been widely stated as the estimation of diffeomorphic
transformations between input and output spaces, especially in medical imaging
[27]. Diffeomorphic properties guarantee the invertibility of the
transformations, which includes the conservation of topological features. The
parametrization of diffeomorphic transformations according to time-varying
velocity vector fields has been shown to be very effective in medical imaging
[21]. Beyond its computational performance, this framework embeds the group
structure of diffeomorphisms and results in flows of diffeomorphisms governed
by an Ordinary Differential Equation (ODE):
$$\frac{d\phi(t)}{dt} = V_t(\phi(t)) \tag{2}$$
with $\phi(t)$ the diffeomorphism at time $t$, and $V_t$ the velocity vector
field at time $t$. $\phi(0)$ is the identity and $\phi(1)$ the registration
transformation between embedding space $\mathcal{X}$ and output space
$\mathcal{X}^*$, such that for any element $X$ in $\mathcal{X}$ its mapped
version in $\mathcal{X}^*$ is $\phi(1)(X)$. Given velocity fields $(V_t)_t$,
the computation of $\phi(1)(X)$ comes from the numerical integration of the
above ODE.
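A minimal sketch of this numerical integration is given below, assuming a forward Euler scheme with a uniform step $h = 1/n$. The velocity field $V_t(x) = t\,x$ is a hypothetical choice made only because its flow has the closed form $\phi(1)(x) = x\,e^{1/2}$, which allows the discretization error to be checked. Each Euler step has the same form as the residual update of Eq. (1).

```python
import numpy as np

def euler_flow(x, velocity, n_steps):
    """Approximate phi(1)(x) for d phi/dt = V_t(phi) with forward Euler steps."""
    h = 1.0 / n_steps
    for k in range(n_steps):
        x = x + h * velocity(k * h, x)   # same form as a residual unit
    return x

def velocity(t, x):
    # Hypothetical time-varying field with a closed-form flow: phi(1)(x) = x * exp(1/2)
    return t * x

x0 = np.array([1.0, -2.0])
exact = x0 * np.exp(0.5)
err_coarse = np.linalg.norm(euler_flow(x0, velocity, 5) - exact)
err_fine = np.linalg.norm(euler_flow(x0, velocity, 50) - exact)
```

In this toy setting the error shrinks as the number of steps grows, which mirrors the role of depth as the number of integration steps.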
A specific class of diffeomorphisms refers to stationary velocity fields, that
is to say velocity fields which do not depend on time ($V_t = V, \forall t$).
As introduced in [2], in this case, the resulting diffeomorphisms define a
subgroup structure in the underlying Lie group and yield the definition of the
exponential operators. We only briefly detail these key properties here and
refer the reader to [1] for a detailed and more formal presentation of their
mathematical derivation.
For a stationary velocity field, the resulting diffeomorphisms belong to the
one-parameter subgroup of diffeomorphisms with infinitesimal generator $V$. In
particular, they verify the following property:
$\forall s, t,\ \phi(t) \cdot \phi(s) = \phi(s+t)$, where $\cdot$ stands for
the composition operator in the underlying Lie group. This implies for
instance that computing $\phi(1)$ boils down to composing $\phi(1/2^n)$ with
itself $2^n$ times (that is, $n$ successive squarings) for any integer value
$n$. Interestingly, this one-parameter subgroup yields the definition of
diffeomorphisms $(\phi(t))_t$ as exponentials of the velocity field $V$,
denoted by $(\exp(tV))_t$ and governed by the stationary ODE
$$\frac{d\phi(t)}{dt} = V(\phi(t)) \tag{3}$$
Conversely, any one-parameter subgroup of diffeomorphisms is governed by an
ODE with a stationary velocity field. It may be noted that the above
definition of exponentials of velocity fields generalizes the definition of
exponential operators for matrices and finite-dimensional spaces.
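The finite-dimensional case gives a simple way to illustrate the group property. The sketch below assumes a linear stationary velocity field $V(x) = Ax$, so that $\exp(V)$ is the matrix exponential of $A$, and approximates it by one small Euler step followed by $n$ successive squarings. The rotation generator and the value $n = 12$ are illustrative choices, and `exp_by_squaring` is a name introduced here for the example.

```python
import numpy as np

def exp_by_squaring(A, n):
    """Approximate exp(A) as (I + A / 2**n) squared n times."""
    phi = np.eye(A.shape[0]) + A / 2**n   # Euler step for phi(1 / 2**n)
    for _ in range(n):
        phi = phi @ phi                   # phi(2s) = phi(s) . phi(s)
    return phi

theta = np.pi / 3
A = np.array([[0.0, -theta], [theta, 0.0]])      # generator of a planar rotation
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
approx = exp_by_squaring(A, 12)
```

Here `approx` should agree with the rotation of angle $\theta$ up to the first-order error of the initial Euler step; applying the same step $2^n$ times without squaring corresponds to a weight-shared ResNet of depth $2^n$.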
3.2 Diffeomorphism-based supervised learning
In this section, we view supervised learning issues as the learning of
diffeomorphisms according to some predefined loss function. Let us consider a
typical supervised classification problem in which the goal is to predict a
class $Y$ from an $N$-dimensional real-valued observation $X$. Let $L_\theta$
be a linear classifier model with parameter $\theta$. Within a neural network
setting, $L_\theta$ typically refers to a fully-connected layer with softmax
activations, and the parameter vector $\theta$ to the weight and bias
parameters of this layer. Let $\mathcal{D}$ be the group of diffeomorphisms in
$\mathbb{R}^N$. We state the supervised learning as the joint estimation of an
embedding $E$, a diffeomorphic mapping $\phi \in \mathcal{D}$ and a linear
classification model $L_\theta$ according to:
$$\hat{\phi}, \hat{\theta} = \arg\min_{E,\phi,\theta} \mathrm{loss}\left(\{L_\theta(\phi(E(X_i))), Y_i\}_i\right) \tag{4}$$
with $\{X_i, Y_i\}_i$ the considered training dataset and loss an appropriate
loss function, typically a cross-entropy criterion. Considering the ODE-based
parametrization of diffeomorphisms, the above minimization leads to an
equivalent estimation of the velocity field sequence $(V_t)_t$:
$$(\hat{V}_t)_t, \hat{\theta} = \arg\min_{E,(V_t),\theta} \mathrm{loss}\left(\{L_\theta(\phi(1)(E(X_i))), Y_i\}_i\right) \tag{5}$$
subject to
$$\begin{cases} \dfrac{d\phi(t)}{dt} = V_t(\phi(t)) \\ \phi(0) = I \end{cases} \tag{6}$$
When considering stationary velocity fields [2,3], this minimization
simplifies to:
$$\hat{V}, \hat{\theta} = \arg\min_{E,V,\theta} \mathrm{loss}\left(\{L_\theta(\exp(V)(E(X_i))), Y_i\}_i\right) \tag{7}$$
We may point out that this formulation differs from the image registration
problem in the considered loss function. Whereas image registration usually
involves the minimization of the prediction error
$Y_i - \phi(1)(E(X_i))$ with any pair $X_i, Y_i \in \mathbb{R}^N$, we here
state the inference of the registration operator $\phi(1)$ according to a
classification-based loss function. It may also be noted that the extension to
other loss functions (e.g. for regression tasks) is straightforward.
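To make the structure of objective (7) concrete, the sketch below evaluates it for fixed, hand-picked components. It assumes an Euler approximation of $\exp(V)$, a linear layer followed by a softmax cross-entropy for $L_\theta$, a `tanh` embedding and a toy velocity field; all of these, along with the function names, are hypothetical choices for illustration, and no optimization is performed.

```python
import numpy as np

def exp_flow(x, V, n_steps=10):
    """Euler approximation of exp(V)(x) for a stationary velocity field V."""
    for _ in range(n_steps):
        x = x + V(x) / n_steps
    return x

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def objective(X, Y, E, V, theta):
    """loss({L_theta(exp(V)(E(X_i))), Y_i}_i) with a linear-softmax L_theta."""
    W, b = theta
    return cross_entropy(exp_flow(E(X), V) @ W + b, Y)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))
Y = np.array([0, 1, 0, 1, 1, 0])
E = np.tanh                                                  # hypothetical embedding
swirl = np.array([[0.0, 1.0], [-1.0, 0.0]])
V = lambda z: 0.5 * np.tanh(z @ swirl)                       # hypothetical velocity field
theta0 = (np.zeros((2, 2)), np.zeros(2))                     # untrained classifier
loss0 = objective(X, Y, E, V, theta0)
```

With a zero-initialized classifier the predicted class probabilities are uniform, so `loss0` equals $\log 2$ for two classes whatever the mapping; training would then adjust $E$, $V$ and $\theta$ jointly.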
3.3 Derived NN architecture
To solve minimization problems (5) and (7), additional priors on the velocity
fields can be considered. One may consider the introduction of an additional
term in the minimization, which typically involves the integral of the norm of
the gradient of the velocity fields and favors small registration
displacements between two time steps [5,33]. Parametric priors may also be
considered. They amount to setting some parametrization for the velocity
fields. In image registration studies, spline-based parametrizations have for
instance been explored [3].
Here, we combine these two types of priors. We exploit a parametric approach
and consider neural-network-based representations of the driving velocity
fields in ODEs (2) and (3). More specifically, the discrete parametrization of
the velocity field, $V_t(x)$, can be considered as a linear combination of
basis functions:
$$V_{j,t}(x) = \sum_i \nu_{t,j,i} f_{t,i}(x) \tag{8}$$
where $V_{j,t}$ denotes component $j$ (a scalar) of the learned velocity
field, the $f_{t,i}$ are the basis functions, and the $\nu_{t,j,i}$ are the
weights learned by the 1D convolutional layer.
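Eq. (8) can be illustrated at a fixed time $t$ with a small NumPy sketch. It assumes basis functions of the form $f_i(x) = \tanh(w_i \cdot x)$, that is, a dense layer followed by a tanh activation; omitting the bias, as well as the sizes and random values below, are assumptions made for the example.

```python
import numpy as np

def basis_functions(x, W):
    """f_i(x) = tanh(w_i . x): one dense layer without bias, tanh activation."""
    return np.tanh(W @ x)

def velocity(x, W, nu):
    """V_j(x) = sum_i nu[j, i] * f_i(x), following Eq. (8) at a fixed time t."""
    return nu @ basis_functions(x, W)

rng = np.random.default_rng(1)
dim, n_basis = 2, 10                       # illustrative sizes
W = rng.standard_normal((n_basis, dim))
nu = 0.1 * rng.standard_normal((dim, n_basis))
x = np.array([0.3, -0.7])
v = velocity(x, W, nu)
```

Sharing `W` and `nu` across all residual units corresponds to a stationary velocity field, while using one pair per unit corresponds to a time-varying field.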
The various types of shortcut connections and usages of activation functions
tested in [16] correspond to various forms of the parametrization of the
velocity field. Understanding residual units in a registration-based framework
provides a methodological guide for proposing new valid residual units.
Figure 2 shows three residual block units: the original ResNet [15], the
improved ResNet [16] and the residual block studied in this work. For
instance, it can be noticed that adding an activation function such as ReLU
after the shortcut connection (i.e. after the addition layer) as in [15] (see
Figure 2(a)) makes the mapping no longer bijective, and thus such an
architecture may be less efficient, as shown experimentally in [16].
One way to build diffeomorphisms is to compose small perturbations of the
identity. Using the same notations as in [33], let
$\Omega \subset \mathbb{R}^d$ be open and bounded. We denote by
$C^1_0(\Omega, \mathbb{R}^d)$ the Banach space of continuously differentiable
vector fields $v$ on $\Omega$ such that $v$ and $Dv$ vanish on
$\partial\Omega$ and at infinity. Let $\chi^1_1(T, \Omega)$ be the set of
absolutely integrable functions from $[0, T]$ to
$C^1_0(\Omega, \mathbb{R}^d)$. It can be shown that the flow associated to the
time-dependent vector field $v \in \chi^1_1(T, \Omega)$ is a diffeomorphism of
$\Omega$ [33].
Fig. 2 Various residual units: (a) original ResNet [15], (b) improved ResNet
[16], (c) proposed residual unit.
In this work, we propose a residual block suitable to build flows of
diffeomorphisms. In the two case studies considered in this paper, two
parametrizations of the basis functions are considered. For the experiments of
Section 4.1 on the CIFAR-10 dataset, where the inputs are images, the basis
functions $f_{t,i}$ are parametrized with one convolutional layer and one ReLU
layer. The linear combination of these basis functions can be represented as a
second one-dimensional convolutional layer, with a filter size of 1×1. In the
experiments of Section 4.2 on the spiral datasets, the inputs are
two-dimensional. In that case, the basis functions are modeled through the
output of a dense layer, followed by a tanh activation function. Note that no
biases are considered for the two convolutional layers. In order to control
the magnitude of the velocity field, we propose to use in the residual block a
tanh layer and a scaling layer. Finally, to ensure that
$v \in \chi^1_1(T, \Omega)$, we introduce a windowing layer such that $v$ and
$Dv$ vanish on $\partial\Omega$ and at infinity. This ensures that $v$ is a
Lipschitz continuous mapping [33]. The proposed residual block is shown in
Figure 2(c).
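The following sketch assembles the ingredients listed above (basis functions, linear combination, tanh, scaling, windowing, shortcut) for two-dimensional inputs. The exact ordering of the layers and the form of the windowing function are not specified in the text, so both are assumptions here: the window is a hypothetical polynomial bump on $]-1,1[^d$ that vanishes, together with its gradient, on the boundary, and `diffeo_block` is a name introduced for the example.

```python
import numpy as np

def window(x):
    """Hypothetical window on ]-1, 1[^d: it and its gradient vanish on the boundary."""
    return np.prod(np.clip(1.0 - x**2, 0.0, None) ** 2)

def diffeo_block(x, W, nu, scale):
    """x + window(x) * scale * tanh(nu @ tanh(W @ x)); no biases."""
    v = scale * np.tanh(nu @ np.tanh(W @ x))   # bounded velocity: |v_j| <= scale
    return x + window(x) * v

rng = np.random.default_rng(2)
W = rng.standard_normal((10, 2))     # dense layer producing 10 basis functions
nu = rng.standard_normal((2, 10))    # linear combination weights
x_inside = np.array([0.2, -0.5])
x_boundary = np.array([1.0, 0.3])
y_inside = diffeo_block(x_inside, W, nu, scale=0.1)
y_boundary = diffeo_block(x_boundary, W, nu, scale=0.1)
```

With this choice, points on the boundary of the domain are left fixed and interior points move by at most `scale` per unit, so each block stays close to the identity.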
In the registration-based framework considered so far, the transformation
$\phi$ is only applied to the embedding of the observation $X$. This can
introduce an undesirable asymmetry in the optimization process and have a
significant impact on the registration performance. Inverse consistency, first
introduced by Thirion in [29], can be enforced by adding a penalty term. In
order to implement inverse consistent algorithms, it is useful to be able to
integrate backwards as well as forwards. In the diffeomorphic framework, the
inverse consistency can be written as follows:
$$\phi(1) \cdot \phi(-1) = \phi(-1) \cdot \phi(1) = \phi(0) \tag{9}$$
This inverse consistency can then be achieved by adding the following term in
the overall loss function:
$$\hat{\phi}, \hat{\theta} = \arg\min_{E,\phi,\theta} \alpha \, \mathrm{loss}\left(\{L_\theta(\phi(E(X_i))), Y_i\}_i\right) + (1 - \alpha) \sum \ldots \tag{10}$$
where $E^*(X_i) = \phi(1)(E(X_i))$, $X_i \in \mathcal{X}$ and $\alpha$ is a
weighting parameter. We may stress that this term does not depend on the
targeted task (i.e. classification or regression) and only constrains the
learning of the mapping block. Thus, this regularization term can be extended
to data points that do not belong to the learning set, and more generally to
points in a given domain, such that the inverse consistency property does not
depend on the sampling of the learning set.
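An inverse consistency penalty of this kind can be sketched as follows for a stationary velocity field. The sketch assumes that $\phi(1)$ and $\phi(-1)$ are both approximated with Euler steps (using $V$ and $-V$ respectively) and that the gap is measured with a squared Euclidean norm; the norm, the velocity field and the uniform sampling of the domain are assumptions made for illustration.

```python
import numpy as np

def flow(x, V, n_steps, direction=1.0):
    """Euler approximation of phi(direction): +1 integrates forwards, -1 backwards."""
    for _ in range(n_steps):
        x = x + direction * V(x) / n_steps
    return x

def inverse_consistency(x, V, n_steps):
    """Mean squared gap between x and phi(-1)(phi(1)(x)) (assumed squared L2 norm)."""
    round_trip = flow(flow(x, V, n_steps, 1.0), V, n_steps, -1.0)
    return np.mean(np.sum((round_trip - x) ** 2, axis=-1))

A = np.array([[0.0, -1.0], [1.0, 0.0]])
V = lambda x: np.tanh(x @ A.T)                     # hypothetical stationary velocity field
rng = np.random.default_rng(3)
points = rng.uniform(-1.0, 1.0, size=(100, 2))     # random samples over the domain
gap_coarse = inverse_consistency(points, V, n_steps=5)
gap_fine = inverse_consistency(points, V, n_steps=100)
```

Because the penalty only needs input points and not labels, it can be evaluated on randomly sampled points of the domain as done here. The gap is not zero for the discretized flow even though the continuous flow is exactly invertible, and it decreases as the number of integration steps grows.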
4 Experiments
In this section, we investigate experimentally the potential of the proposed
architecture of residual blocks using the image classification dataset
CIFAR-10 [22] and synthetic 2D data (spiral dataset). CIFAR-10 is used to
explore the performance of the proposed residual unit with respect to other
ResNet architectures. The 2D spiral dataset helps to further investigate
properties of diffeomorphism-based networks and provides geometrical insights
on the estimated flows.
4.1 CIFAR-10
4.1.1 Experimental setting
The CIFAR-10 dataset contains 60,000 32×32 color images in 10 different
classes. 10,000 images are used for testing. The overall architecture is
decomposed into three main parts: embedding, mapping and prediction. First,
the embedding is performed using the following layers: 1) a 5×5 convolutional
layer with 128 filters, 2) batch normalization, 3) a tanh activation layer
(which ensures that $E(X_i) \in \, ]-1,1[^p$, an open and bounded set). Then,
the network consists of several residual blocks as depicted in Figure 2(c)
(3×3 convolutions without bias, with 128 filters). The scaling factor of the
residual units is learned and shared by all units. At the end of the mapping
block, a tanh activation layer is used to ensure that
$\phi(E(X_i)) \in \, ]-1,1[^p$, an open and bounded set. Finally, the
prediction step is performed using the following layers: 1) a 3×3
convolutional layer with 128 filters, 2) batch normalization, 3) a tanh
activation layer, 4) 32×32 average pooling, 5) a fully connected layer.
Methods        #params  1k    2.5k  5k    10k   20k   30k   40k   50k
ResNet d56     1.6M     0.45  0.57  0.69  0.80  0.87  0.90  0.91  0.92
DiffeoNet d5            0.53  0.64  0.73  0.81  0.86  0.89  0.90  0.91
DiffeoNet d10           0.54  0.67  0.74  0.81  0.88  0.89  0.91  0.92
DiffeoNet d20           0.53  0.65  0.75  0.82  0.88  0.90  0.92  0.93
Stationary Velocity Fields
DiffeoNet d5            0.52  0.63  0.70  0.75  0.83  0.86  0.87  0.88
DiffeoNet d10           0.51  0.63  0.70  0.77  0.83  0.86  0.89  0.89
DiffeoNet d20           0.53  0.64  0.70  0.77  0.84  0.87  0.89  0.89
Stationary Velocity Fields and Inverse Consistency
DiffeoNet d5            0.52  0.63  0.69  0.75  0.82  0.86  0.88  0.90
DiffeoNet d10           0.51  0.64  0.70  0.76  0.84  0.87  0.88  0.90
DiffeoNet d20           0.52  0.64  0.70  0.77  0.85  0.86  0.88  0.90
Table 1 Accuracy on CIFAR-10 for ResNet and the proposed approaches
(DiffeoNet: proposed residual unit; stationary velocity fields correspond to
the use of shared weights) with respect to the number of training samples
(from 1,000 to 50,000). Depths (d5, d10, d20) and numbers of parameters are
reported to allow a fair comparison.
Weights are initialized with a truncated normal distribution centered on 0
[14]. We use $\ell_2$ weight-decay regularization set to $2 \cdot 10^{-4}$ and
the SGD optimization method with a starting learning rate of 0.1, a minibatch
size of 128 and 100 epochs.
The goal of this experiment is to study the efficiency of the proposed
residual unit with respect to the original ResNet. The baseline architecture
used for comparison is the ResNet architecture proposed in [16]. In this
experiment, we use the Keras implementation of ResNet (see footnote 1) for
reproducibility purposes: ResNet56v2 (depth of 56 with an increasing number of
convolution filters [16]), with about 1.6M trainable parameters.
Fig. 3 Performance of various non-stationary ResNet architectures on CIFAR-10,
with varying size of the training dataset (left: from 1,000 to 50,000 images,
right: zoomed version for very small dataset sizes).
1 F. Chollet et al. Keras, 2015.
4.1.2 Results
In this work, the dimension of the embedding space is constant throughout the
mapping block. We first compare ResNet56v2 with the proposed approach with the
same number of parameters, which corresponds to a network depth of 5 (i.e. 5
residual units). It can be seen in Figure 3 that the performance of these two
networks is similar when using the entire training dataset. Residual units of
the Keras ResNet are built using three layers: convolution, batch
normalization and ReLU. ReLU units are crucial to promote efficient sparse
parametrizations of the velocity fields. However, the obtained results show
that batch normalization in the residual units is not required to reach
satisfactory classification accuracy. Instead, we use a tanh activation layer
to restrict the embedding set to be open and bounded for the whole flow of
diffeomorphisms. Note that in the proposed network, the dimension of the
transformed embedding space is constant. Similarly to [17], the experimental
results show that progressive dimension changes of the embedding space are not
required, contrary to the popular belief that the performance of deep neural
networks is based on the progressive removal of uninformative variability. To
study the generalization performance of the proposed residual unit, we conduct
experiments with a decreasing number of training samples (see Table 1 for
detailed results). It appears that the use of the proposed residual unit makes
the model more robust to small training sets compared to the Keras ResNet.
We also investigate the impact of the network depth on the classification
results. Increasing the depth leads to more complex models (in terms of
trainable parameters). One could thus expect to observe overfitting when using
very small training datasets. However, Figure 3 shows that increasing the
depth of the proposed network does not lead to a decrease in accuracy. From a
dynamical point of view, increasing the depth corresponds to smaller
integration steps and hence to smoother variations of velocity fields. The
proposed residual architecture is not subject to overfitting for small
datasets, even with a deep diffeomorphic network.
4.2 Spiral data
In this section, we further investigate the behavior of networks based on
flows of diffeomorphisms. Following the work of Hauser et al. on the
differential geometry analysis of ResNet architectures [13], we consider a
classification task on 2-dimensional spiral data.
4.2.1 Experimental setting
No embedding layer is required in this experimental setup. The purpose of the
mapping block is then to warp the input data points $X_i$ into an unknown
space $\mathcal{X}^*$ where the transformed data $X^*$ are linearly separable.
We have considered the following setting: the loss function is the binary
cross-entropy between the output of a sigmoid function applied to the
transformed data points $X^*$ and the true labels. Each network is composed of
20 residual units for which non-linearities are modeled with tanh activation
functions, and 10 basis functions (modeled by dense layers) are used for the
parametrization of the velocity fields. Weights are initialized with the
Glorot uniform initializer (also called Xavier uniform initializer) [9]. We
use $\ell_2$ weight-decay regularization set to $10^{-4}$ and the ADAM
optimization method [20] with a learning rate of 0.001, $\beta_1 = 0.9$,
$\beta_2 = 0.999$, a minibatch size of 300 and 1000 epochs.
We consider four ResNet architectures: a) a ResNet without shared weights
(corresponding to time-varying velocity field modeling), b) a ResNet with
shared weights (corresponding to stationary velocity field modeling), c) a
Data-driven Symmetric ResNet with shared weights (where the inverse
consistency criterion is computed over the training data) and d) a
Domain-driven Symmetric ResNet with shared weights (where the inverse
consistency criterion is computed over the entire domain using a random
sampling scheme).
4.2.2 Characterization of ResNet properties
ResNet architectures have been recently studied from the point of view of
differential geometry in [13]. In this article, Hauser et al. studied the
impact of residual-based approaches (compared to non-residual networks) in
terms of differentiable coordinate transformations. In our work, we propose to
go one step further by considering the characterization of the estimated
deformation fields leading to an adapted configuration for the considered
classification task. More specifically, we consider in this work the maps of
Jacobian values.
The Jacobian (i.e. the determinant of the Jacobian matrices of the
deformations) is defined in a 2-dimensional space as follows:
$$J_\phi(x) = \det \begin{pmatrix} \dfrac{\partial \phi_1(x)}{\partial x_1} & \dfrac{\partial \phi_1(x)}{\partial x_2} \\ \dfrac{\partial \phi_2(x)}{\partial x_1} & \dfrac{\partial \phi_2(x)}{\partial x_2} \end{pmatrix}$$
From a physical point of view, the value of the Jacobian represents the
local volume variation induced by the transformation. A transformation with
a Jacobian value equal to 1 is a transformation that preserves volumes. A
Jacobian value greater than 1 corresponds to a local expansion and a value
less than 1 corresponds to a local contraction. The case where the Jacobian
is zero means that several points are warped onto a single point: this case
corresponds to the limit case from which the bijectivity of the transformation
is not verified anymore, thus justifying the constraint on the positivity of the
Jacobian in several registration methods [27].
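A Jacobian map can be approximated numerically for any mapping given as a function. The sketch below uses central finite differences, which is one possible choice (automatic differentiation would be another); the two test mappings are hypothetical examples chosen because their Jacobians are known in closed form.

```python
import numpy as np

def jacobian_det(phi, x, eps=1e-5):
    """Determinant of the 2x2 Jacobian of phi at x, by central finite differences."""
    J = np.empty((2, 2))
    for k in range(2):
        step = np.zeros(2)
        step[k] = eps
        J[:, k] = (phi(x + step) - phi(x - step)) / (2.0 * eps)   # d phi / d x_k
    return np.linalg.det(J)

expand = lambda x: 2.0 * x                         # doubles lengths: J = 4
fold = lambda x: np.array([abs(x[0]), x[1]])       # not bijective: J < 0 where x_0 < 0
x = np.array([-0.5, 0.2])
j_expand = jacobian_det(expand, x)
j_fold = jacobian_det(fold, x)
```

The uniform expansion gives a Jacobian of 4 everywhere, while the folding map gives a Jacobian of -1 on the half-plane that is reflected. Evaluating `jacobian_det` on a grid of points, with `phi` set to a trained mapping block, would produce the kind of Jacobian map discussed in this section.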
4.2.3 Results
Classification algorithms are usually only evaluated using the classification
accuracy (the proportion of correct predictions among all predictions made).
However, the classification rate is not enough to characterize the performance
of a specific algorithm. In all the experiments shown in this work, the
classification rate is greater than 99%. Visualization of the decision
boundary is an alternative way to provide complementary insights on the
regularity of the solution in the embedding space. Figure 4 shows the decision
boundary for the four considered ResNets. Although all methods achieve very
high classification rates, it can be seen that adding constraints such as the
use of stationary velocity fields (i.e. shared weights) and inverse
consistency leads to smoother decision boundaries with no effect on the
overall accuracy. This is regarded as critical for generalization and
adversarial robustness [28].
Fig. 4 Decision boundaries for the classification task of 2-dimensional spiral
data. From left to right: ResNet without shared weights, ResNet with shared
weights, Data-driven Symmetric ResNet with shared weights, Domain-driven
Symmetric ResNet with shared weights. We refer the reader to the main text for
the correspondence between ResNet architectures and diffeomorphic flows.
Decision boundaries correspond to the projection of the estimated linear
decision boundary in the space $\mathcal{X}^*$ into the embedding space
$\mathcal{X}$. The visualization of decision boundaries does not, however,
provide information regarding the topology of the manifold in the output space
$\mathcal{X}^*$. We also study the deformation flow through the spatial
configuration of data points through the network layers as in [13]. Figure 5
shows how each network untangles the spiral data. Networks with shared weights
exhibit smoother layer-wise transformations. More specifically, this
visualization provides insights on the geometrical properties (such as
topology preservation / connectedness) of the transformed set of input data
points.
To evaluate the quality of the estimated warping transformation, Figure 6
shows the Jacobian maps for each considered network. Local Jacobian sign
changes correspond to locations where bijectivity is not satisfied. It can be
seen that adding constraints such as stationary velocity fields and inverse
consistency leads to more regular geometrical shapes of the deformed manifold.
The domain-driven regularization applied to a ResNet with shared weights leads
to the most regular geometrical pattern. Adding the symmetry consistency leads
to positive Jacobian values over the entire domain, guaranteeing the
bijectivity of the estimated mapping.
Fig. 5 Evolution of the spatial configuration of data points through the 20
residual units. From top to bottom: ResNet without shared weights, ResNet with
shared weights, Data-driven Symmetric ResNet with shared weights,
Domain-driven Symmetric ResNet with shared weights.
Fig. 6 Jacobian maps for the four ResNet architectures. From left to right:
ResNet without shared weights ($J_{min} = -5.59$, $J_{max} = 6.34$), ResNet
with shared weights ($J_{min} = -1.41$, $J_{max} = 2.27$), Data-driven
Symmetric ResNet with shared weights ($J_{min} = 0.55$, $J_{max} = 5.92$),
Domain-driven Symmetric ResNet with shared weights ($J_{min} = 0.30$,
$J_{max} = 1.44$). (Colormap: $J_{min} = -2.5$, $J_{max} = 2.5$, so dark
pixels correspond to negative Jacobian values.)
5 Discussion: Insights on ResNet architectures from a diffeomorphic viewpoint
As illustrated in the previous section, the proposed diffeomorphic formulation
of ResNets provides new theoretical and computational insights for their
interpretation and characterization, as discussed below.
5.1 Theoretical characterization of ResNet architectures
In this work, we make explicit the interpretation of the mapping block of
ResNet architectures as a discretized numerical implementation of a contin-
uous diffeomorphic registration operator. This operator is stated as an inte-
gral operator associated with an ODE governed by velocity fields. Moreover,
ResNet architectures with shared weights are viewed as the numerical imple-
mentation of the exponential of velocity fields, equivalently defined as diffeo-
morphic operators governed by stationary velocity fields. Exponentials of ve-
locity fields are by construction diffeomorphic under smoothness constraints on
the generating velocity fields. Up to the choice of the ODE solver implemented
by the ResNet architecture (in our case an Euler scheme), ResNet architectures
with shared weights are then fully characterized from a mathematical point of
view.
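The link between shared weights and the exponential of a stationary velocity field can be illustrated on a toy case where the exponential is known in closed form. The sketch below is an illustration under stated assumptions, not the architecture used in this work: the velocity field is a hypothetical linear field v(x) = A x generating a planar rotation, so that exp(A) is a rotation matrix, and the function name shared_weight_resnet is ours. Applying the same residual unit n times with step 1/n is the Euler approximation of exp(A).

```python
import numpy as np

theta = 1.0  # hypothetical rotation angle (radians) defining the velocity field
A = np.array([[0.0, -theta], [theta, 0.0]])  # stationary linear field v(x) = A x

def shared_weight_resnet(x, n_units):
    """Apply n_units identical residual units x <- x + (1 / n_units) * A x."""
    h = 1.0 / n_units
    for _ in range(n_units):
        x = x + h * (A @ x)
    return x

# Closed-form exponential of A: the rotation by angle theta.
exp_A = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta), np.cos(theta)]])

x0 = np.array([1.0, 0.0])
exact = exp_A @ x0

# Approximation error of the discretized flow for increasing depth.
errors = {n: float(np.linalg.norm(shared_weight_resnet(x0, n) - exact))
          for n in (5, 20, 80)}
for n, err in errors.items():
    print(f"{n:3d} units: error = {err:.4f}")
```

In this toy setting the error decreases as the number of units grows, which is the expected first-order behaviour of the Euler scheme.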
The diffeomorphic property naturally arises as a critical property in reg-
istration problems, as it relates to invertibility properties. Such invertibility
properties are also at the core of the definition of kernel approaches, which
implicitly define mapping operators [26]. As illustrated for the reported
classification experiments, the diffeomorphic property prevents the mapping
operator from modifying the topology of the manifold structure of the input
data. When such properties are not imposed, for instance in unconstrained
ResNet architectures, the learned deformation flows may present unexpected
topology changes.
The diffeomorphic property may be regarded as a regularization criterion
on the mapping operator, so that the learned mapping enables a linear sep-
aration of the classes while guaranteeing the smoothness of the classification
boundary and of the underlying deformation flow. It is obvious that a ResNet
architecture with shared weights is a special case of an unconstrained ResNet.
Therefore, the training of a ResNet architecture with shared weights may be
viewed as the training of an unconstrained ResNet within a reduced search
space. The same holds for the symmetry property, which further constrains
the search space during training. The latter constraint is shown to be
numerically important so that the discretized scheme complies with the
theoretical diffeomorphic properties.
Overall, this analysis stresses that, among the infinitely many mapping
operators reaching optimal training performance, one may favor those
exhibiting diffeomorphic properties, so that key properties such as
generalization performance, prediction stability and robustness to
adversarial examples may be greatly improved. Numerical schemes which fulfill
such diffeomorphic properties during the training process could be further
investigated and could benefit from the registration literature, including
for diffeomorphic flows governed by non-stationary velocity fields [30, 5, 4].
In particular, the impact of diffeomorphism-based network design on
adversarial example estimation is an open research direction for further
studies.
5.2 Computational issues
Besides theoretical aspects, computational properties also derive from the pro-
posed diffeomorphism-based formulation. Within this continuous setting, the
depth of the network relates to the integration time step and the precision of
the integration scheme. The deeper the network, the smaller the integration
step. In particular, a large integration time step, i.e. a shallower ResNet
architecture, may result in numerical integration instabilities and hence in
non-diffeomorphic transformations. Therefore, deep enough architectures should
be considered to guarantee numerical stability and diffeomorphic properties.
The maximal integration step relates to the regularity of the velocity fields
governing the ODEs. In our experiments, we only consider an explicit
first-order Euler scheme. Higher-order explicit schemes, for instance the
classic fourth-order Runge-Kutta scheme, as well as implicit integration
schemes [8], seem of great interest. Given the spatial variability of the
governing velocity fields, adaptive integration schemes also appear
particularly relevant.
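The relation between network depth, integration step and diffeomorphic behaviour discussed above can be sketched in one dimension. The example below is illustrative only: the velocity field v(x) = -3 tanh(x), the unit integration time and the function names are assumptions, not quantities taken from our experiments. The Jacobian of the composed Euler steps is accumulated with the chain rule, and its sign indicates whether the discretized mapping remains invertible.

```python
import numpy as np

def velocity(x):
    # Hypothetical smooth stationary velocity field (illustration only).
    return -3.0 * np.tanh(x)

def velocity_grad(x):
    # Derivative of the field: d/dx[-3 tanh(x)] = -3 (1 - tanh(x)^2).
    return -3.0 * (1.0 - np.tanh(x) ** 2)

def euler_flow(x0, n_units, total_time=1.0):
    """Integrate dx/dt = v(x) with n_units explicit Euler steps.

    Returns the final positions and the Jacobian d(x_final)/d(x0),
    accumulated with the chain rule across the residual units.
    """
    h = total_time / n_units
    x = np.asarray(x0, dtype=float).copy()
    jac = np.ones_like(x)
    for _ in range(n_units):
        jac = jac * (1.0 + h * velocity_grad(x))  # Jacobian of one residual unit
        x = x + h * velocity(x)                   # residual update
    return x, jac

grid = np.linspace(-2.0, 2.0, 401)
_, jac_shallow = euler_flow(grid, n_units=1)   # one large integration step
_, jac_deep = euler_flow(grid, n_units=20)     # twenty small integration steps
min_jac_shallow = float(jac_shallow.min())
min_jac_deep = float(jac_deep.min())

print("1 unit  : Jmin =", min_jac_shallow)
print("20 units: Jmin =", min_jac_deep)
```

With a single unit the Jacobian becomes negative near the origin, so the discretized mapping folds, whereas with 20 units every elementary step has a positive Jacobian and so does their composition.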
Using a diffeomorphism-based framework leads to specific architectures of
residual units. In this work, for instance, tanh activation layers are used to
constrain the domain to be open and bounded. Such an activation layer
guarantees the diffeomorphic properties of the mapping block. The popular use
of ReLU activation in the embedding block cannot provide such a guarantee.
Several other reversible architectures have been recently proposed [7, 10, 17],
showing the potential of such frameworks for the analysis of residual networks.
In the LDDMM framework [5], the parametrization of the velocity field is often
carried out with Reproducing Kernel Hilbert Spaces (RKHS). Recent work in this
direction connects RKHS and deep networks [6].
In our work, v and Dv vanish on ∂Ω and at infinity. This property guarantees
that the learned residual units are Lipschitz continuous, which is related
to recent works investigating explicit constraints on Lipschitz continuity in
neural networks [11]. Moreover, this condition implies that the Hilbert space
of admissible velocity fields is an RKHS [34]. Further work could focus on the
parametrization of the velocity fields (i.e. residual units) using suitable kernels.
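As a rough illustration of how Lipschitz continuity relates to the integration step, the sketch below bounds the Lipschitz constant of a two-layer tanh unit by the product of the spectral norms of its weight matrices, tanh being 1-Lipschitz. If the step h times this bound is below 1, the residual update x + h v(x) is injective; this is a standard sufficient condition and not a result established in this paper. The weights are random and hypothetical, and the function names are ours.

```python
import numpy as np

def velocity(x, w1, b1, w2):
    """Hypothetical two-layer unit v(x) = W2 tanh(W1 x + b1), batched over rows of x."""
    return np.tanh(x @ w1.T + b1) @ w2.T

def lipschitz_upper_bound(w1, w2):
    """Upper bound ||W2||_2 * ||W1||_2 on the Lipschitz constant of v."""
    return float(np.linalg.norm(w2, 2) * np.linalg.norm(w1, 2))

rng = np.random.default_rng(0)
w1 = rng.normal(size=(16, 2))   # hypothetical first-layer weights
b1 = rng.normal(size=16)        # hypothetical first-layer biases
w2 = rng.normal(size=(2, 16))   # hypothetical second-layer weights

bound = lipschitz_upper_bound(w1, w2)

# Smallest number of Euler steps over a unit integration time with h * bound < 1.
min_units = int(np.floor(bound)) + 1

# Empirical check of the bound on random pairs of points.
xa = rng.uniform(-1.0, 1.0, size=(1000, 2))
xb = rng.uniform(-1.0, 1.0, size=(1000, 2))
num = np.linalg.norm(velocity(xa, w1, b1, w2) - velocity(xb, w1, b1, w2), axis=1)
den = np.linalg.norm(xa - xb, axis=1)
empirical = float((num / den).max())

print("upper bound:", bound, "| empirical ratio:", empirical, "| min units:", min_units)
```

The bound is usually loose, so the depth it suggests is conservative; it only indicates how an explicit Lipschitz constraint could translate into a minimal number of residual units.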
Diffeomorphic mappings defined as exponentials of velocity fields were shown
to be computationally more stable, with smoother integral mappings. They lead
to ResNet architectures with shared weights, which greatly lowers the
computational complexity and memory requirements compared with classic
ResNet architectures. They can be implemented as Recurrent Neural Networks
[19, 23]. Importantly, the NN-based specification of the elementary velocity
field V (Equation 8) becomes the bottleneck in terms of modeling complexity.
The parametrization (Equation 8) may be critical to reach good prediction
performance. Here, we considered a two-layer architecture regarded as a
projection of V onto basis functions. Higher-complexity architectures, for
instance with larger convolution supports, more filters or layers, might be
considered while keeping the numerical stability of the overall ResNet
architectures. By contrast, considering higher-complexity elementary blocks in
ResNet architectures without shared weights would increase numerical
instabilities and may require complementary regularization constraints across
network depth [15,
Regarding training issues, our experiments exploited a classic
backpropagation implementation with a random initialization. From the
considered continuous log-Euclidean perspective, the training may be regarded
as the projection of the random initialization onto the manifold of acceptable
solutions, i.e. solutions satisfying both the minimization of the training loss
and diffeomorphic constraints. In the registration literature [27], the
numerical schemes considered for the inference of the mapping usually combine
a parametric representation of the velocity fields and a multiscale
optimization strategy in space and time. Combining such a multiscale
optimization strategy with backpropagation schemes appears as a promising path
to improve convergence properties, especially the robustness to the
initialization. The different solutions proposed to enforce diffeomorphic
properties are also of interest. Here, we focused on the invertibility
constraints, which result in additional terms to be minimized in the training
loss.
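To make the idea of an invertibility term concrete, the sketch below computes a forward-backward consistency penalty: points are pushed through the discretized flow of V and then through the discretized flow of -V, and the mean squared distance to the starting points is measured. This is one plausible form of such a term, written under our own assumptions; it is not claimed to be the exact loss used in our experiments. The velocity field, its fixed weights and the function names are hypothetical.

```python
import numpy as np

A = np.array([[0.0, 1.0], [-1.0, 0.0]])  # hypothetical fixed weights

def velocity(x):
    """Hypothetical stationary velocity field, batched over rows of x."""
    return 1.5 * np.tanh(x @ A)

def flow(x, n_units, sign=1.0):
    """Euler integration of dx/dt = sign * v(x) over unit time with shared weights."""
    h = sign / n_units
    for _ in range(n_units):
        x = x + h * velocity(x)
    return x

def symmetry_penalty(x, n_units):
    """Mean squared distance between x and its forward-then-backward image."""
    round_trip = flow(flow(x, n_units, +1.0), n_units, -1.0)
    return float(np.mean(np.sum((round_trip - x) ** 2, axis=1)))

coords = np.linspace(-1.0, 1.0, 11)
points = np.stack(np.meshgrid(coords, coords), axis=-1).reshape(-1, 2)

penalty_shallow = symmetry_penalty(points, n_units=2)
penalty_deep = symmetry_penalty(points, n_units=50)

print("2 units :", penalty_shallow)
print("50 units:", penalty_deep)
```

For a fixed velocity field the penalty shrinks as the number of units grows, since the discretized forward and backward flows get closer to exact inverses of each other. Used as an additional loss term during training, it would instead act on the weights of the velocity field.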
6 Conclusion
This paper introduces a novel registration-based formulation of ResNets. We
provide a theoretical interpretation of ResNets as numerical implementations
of continuous flows of diffeomorphisms. Numerical experiments support the
relevance of this interpretation, especially the importance of the enforcement
of diffeomorphic properties, which ensure the stability of a trained ResNet. This
work opens new research avenues to explore further diffeomorphism-based for-
mulations and associated numerical tools for ResNet-based learning, especially
regarding numerical issues.
Acknowledgements We thank B. Chapron and C. Herzet for their comments and sugges-
tions. The research leading to these results has been supported by the ANR MAIA project,
grant ANR-15-CE23-0009 of the French National Research Agency, INSERM and Institut
Mines Télécom Atlantique (Chaire Imagerie médicale en thérapie interventionnelle) and Fon-
dation pour la Recherche Médicale (FRM grant DIC20161236453), Labex Cominlabs (grant
SEACS), CNES (grant OSTST-MANATEE) and Microsoft through AI-for-Earth EU Oceans
Grant (AI4Ocean). We also gratefully acknowledge the support of NVIDIA Corporation
with the donation of the Titan Xp GPU used for this research.
References
1. Arsigny, V.: Processing Data in Lie Groups: An Algebraic Approach. Application to
Non-Linear Registration and Diffusion Tensor MRI. Ph.D. thesis, Ecole Polytechnique
2. Arsigny, V., Commowick, O., Pennec, X., Ayache, N.: A Log-Euclidean Framework for
Statistics on Diffeomorphisms. International Conference on Medical Image Computing
and Computer-Assisted Intervention: MICCAI 9(Pt 1), 924–931 (2006)
3. Ashburner, J.: A fast diffeomorphic image registration algorithm. NeuroImage 38(1),
95–113 (2007)
4. Avants, B., Gee, J.C.: Geodesic estimation for large deformation anatomical shape av-
eraging and interpolation. NeuroImage 23, S139–S150 (2004)
5. Beg, M.F., Miller, M.I., Trouvé, A., Younes, L.: Computing Large Deformation Metric
Mappings via Geodesic Flows of Diffeomorphisms. International Journal of Computer
Vision 61(2), 139–157 (2005)
6. Bietti, A., Mairal, J.: Group Invariance, Stability to Deformations, and Complexity of
Deep Convolutional Representations. (2017)
7. Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., Holtham, E.: Reversible
Architectures for Arbitrarily Deep Residual Neural Networks. Neural Networks cs.CV
8. Davis, P.J., Rabinowitz, P.: Methods of Numerical Integration. Academic Press (1984)
9. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural
networks. AISTATS (2010)
10. Gomez, A.N., Ren, M., Urtasun, R., Grosse, R.B.: The Reversible Residual Network -
Backpropagation Without Storing Activations. NIPS (2017)
11. Gouk, H., Frank, E., Pfahringer, B., Cree, M.J.: Regularisation of Neural Networks by
Enforcing Lipschitz Continuity. Neural Networks stat.ML (2018)
12. Haber, E., Ruthotto, L.: Stable Architectures for Deep Neural Networks. (1),
014004 (2017)
13. Hauser, M., Ray, A.: Principles of Riemannian Geometry in Neural Networks. NIPS
14. He, K., Zhang, X., Ren, S., Sun, J.: Delving Deep into Rectifiers: Surpassing Human-
Level Performance on ImageNet Classification. (2015)
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition.
In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.
770–778. IEEE (2016)
16. He, K., Zhang, X., Ren, S., Sun, J.: Identity Mappings in Deep Residual Networks. (2016)
17. Jacobsen, J.H., Smeulders, A., Oyallon, E.: i-RevNet: Deep Invertible Networks. (2018)
18. Kim, J., Lee, J.K., Lee, K.M.: Accurate Image Super-Resolution Using Very Deep Con-
volutional Networks. CVPR (2016)
19. Kim, J., Lee, J.K., Lee, K.M.: Deeply-Recursive Convolutional Network for Image Super-
Resolution. CVPR (2016)
20. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. (2014)
21. Klein, A., Andersson, J., Ardekani, B.A., Ashburner, J., Avants, B., Chiang, M.C.,
Christensen, G.E., Collins, D.L., Gee, J., Hellier, P., Song, J.H., Jenkinson, M., Lepage,
C., Rueckert, D., Thompson, P., Vercauteren, T., Woods, R.P., Mann, J.J., Parsey,
R.V.: Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI
registration. NeuroImage 46(3), 786–802 (2009)
22. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images (2009)
23. Liao, Q., Poggio, T.A.: Bridging the Gaps Between Residual Learning, Recurrent Neural
Networks and Visual Cortex. Neural Networks cs.LG (2016)
24. Mallat, S.: Understanding deep convolutional networks. Philosophical Transactions
of the Royal Society A: Mathematical, Physical and Engineering Sciences 374(2065),
20150203–16 (2016)
25. Ruthotto, L., Haber, E.: Deep Neural Networks motivated by Partial Differential Equa-
tions. (2018)
26. Schölkopf, B., Smola, A.J.: Learning with Kernels. Support Vector Machines, Regular-
ization, Optimization, and Beyond. MIT Press (2002)
27. Sotiras, A., Davatzikos, C., Paragios, N.: Deformable medical image registration: a sur-
vey. IEEE Transactions on Medical Imaging 32(7), 1153–1190 (2013)
28. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus,
R.: Intriguing properties of neural networks. (2013)
29. Thirion, J.: Image matching as a diffusion process: an analogy with Maxwell’s demons.
Medical Image analysis 2(3), 243–260 (1998)
30. Trouvé, A.: Diffeomorphisms Groups and Pattern Matching in Image Analysis. Inter-
national Journal of Computer Vision 28(3), 213–221 (1998)
31. E, W.: A proposal on machine learning via dynamical systems. Communi-
cations in Mathematics and Statistics 5(1), 1–11 (2017)
32. Xie, S., Girshick, R.B., Dollár, P., Tu, Z., He, K.: Aggregated Residual Transformations
for Deep Neural Networks. CVPR pp. 5987–5995 (2017)
33. Younes, L.: Shapes and Diffeomorphisms, Applied Mathematical Sciences, vol. 171.
Springer Science & Business Media, Berlin, Heidelberg (2010)
34. Younes, L.: Diffeomorphic Learning. (2018)