

Residual Networks as Flows of Diffeomorphisms

François Rousseau · Lucas Drumetz · Ronan Fablet

Received: date / Accepted: date

Abstract This paper addresses the understanding and characterization of residual networks (ResNets), which are among the state-of-the-art deep learning architectures for a variety of supervised learning problems. We focus on the mapping component of ResNets, which maps the embedding space to a new unknown space where the prediction or classification can be stated according to linear criteria. We show that this mapping component can be regarded as the numerical implementation of continuous flows of diffeomorphisms governed by ordinary differential equations. In particular, ResNets with shared weights are fully characterized as numerical approximations of exponential diffeomorphic operators. We stress, both theoretically and numerically, the relevance of enforcing diffeomorphic properties and the importance of numerical issues in making the continuous formulation and the discretized ResNet implementation consistent. We further discuss the resulting theoretical and computational insights on ResNet architectures.

Keywords Residual Network · Diffeomorphism · Dynamical systems

1 Introduction

Deep learning models are the reference models for a wide range of machine learning problems. Among deep learning (DL) architectures, residual networks (also called ResNets) have become state-of-the-art ones [15,16]. Experimental evidence emphasizes critical aspects in the specification of these architectures, for instance in terms of network depth or the combination of elementary layers, as well as in their stability and genericity. The understanding and characterization of ResNets, and more widely of DL architectures, from a theoretical point of view remains a key issue despite recent advances for CNNs [24].

F. Rousseau
IMT Atlantique
Technopôle Brest Iroise, 29239 Brest, France
E-mail: francois.rousseau@imt-atlantique.fr

R. Fablet and L. Drumetz
IMT Atlantique

Interesting insights on ResNets have recently been presented in [25,12,31] from an ordinary/partial differential equation (ODE/PDE) point of view: ResNets can be regarded as numerical schemes for differential equations. In particular, in [25], this PDE-driven setting stresses the importance of numerical stability issues depending on the selected ResNet configuration. Interestingly, it makes explicit the interpretation of the ResNet architecture as a depth-related evolution of an input space towards a new space where the prediction of the expected output (for instance, classes) is solved according to a linear operator. This interpretation is also pointed out in [13] and discussed in terms of Riemannian geometry.

In this work, we deepen this analogy between ResNets and deformation flows to relate ResNets to registration problems [27], especially diffeomorphic registration [30,5,3,2]. Our contribution is three-fold: (i) we restate ResNet learning as the learning of a continuous and integral diffeomorphic operator and investigate different solutions, especially the exponential operator of velocity fields [2], to enforce diffeomorphic properties; (ii) we make explicit the interpretation of ResNets as numerical approximations of the underlying continuous diffeomorphic setting governed by ordinary differential equations (ODEs); (iii) we provide theoretical and computational insights on the specification of ResNets and on their properties.

This paper is organized as follows. Section 2 relates ResNets to diffeomorphic registration. We introduce in Section 3 the proposed diffeomorphism-based learning framework. Section 4 reports experiments. Our key contributions are further discussed in Section 5.

2 From ResNets to diffeomorphic registrations

ResNets [15,16] have become state-of-the-art deep learning architectures for a variety of problems, including for instance image recognition [15] and super-resolution [18]. This architecture was proposed in order to explore the performance of very deep models without the accuracy degradation observed when naively adding layers. ResNets proved to be easier to optimize and made it possible to learn very deep models (up to hundreds of layers).

As illustrated in Fig. 1, ResNets can be decomposed into three main building blocks:

– the embedding block, which aims to extract relevant features from the input variables for the targeted task (such as classification or regression). In [15], this block consists of a set of 64 convolution filters of size 7×7 followed by a non-linear activation function such as ReLU.


Fig. 1 A schematic view of the ResNet architecture [15], decomposed into three blocks: embedding, mapping and prediction. 'conv' denotes convolution operations followed by non-linear activations, and 'fc' denotes a fully connected layer.

– the mapping block, which aims to incrementally map the embedding space to a new unknown space in which the data are, for instance, linearly separable in the classification case. In [15], this block consists of a series of residual units. A residual unit is defined as y = F(x, {W_i}) + x, where the function F is the residual mapping to be learned. In [15], F(x) = W_2 σ(W_1 x), where σ denotes the activation function (biases are omitted to simplify notation). The operation F(x) + x is performed by a shortcut connection and element-wise addition.

– the prediction block, which addresses the classification or regression step from the mapped space to the output space. This prediction block is expected to involve linear models. In [15], this step is performed with a fully connected layer.

In this work, we focus on the definition and characterization of the mapping block in ResNets. The central idea of ResNets is to learn the additive residual function F such that the layers in the mapping block are related by the following equation:

x_{l+1} = x_l + F(x_l, W_l)    (1)

where x_l is the input feature to the l-th residual unit and W_l is the set of weights (and biases) associated with the l-th residual unit. In [16], it is shown that this formulation exhibits interesting backward propagation properties; more specifically, it implies that the gradient of a layer does not vanish even when the weights are arbitrarily small.

Here, we relate the incremental mapping defined by these ResNets to diffeomorphic registration models [27]. These registration models, especially Large Deformation Diffeomorphic Metric Mapping (LDDMM) [30,5], tackle the registration issue through the composition of a series of incremental diffeomorphic mappings, each individual mapping being close to the identity. Conversely, in ResNet architectures, the l-th residual block provides an update of the form x_l + F(x_l, W_l). Under the assumption that ‖F(x_l, W_l)‖ ≪ ‖x_l‖, the deformation flows generated by ResNet architectures may be expected to implement the composition of a series of incremental diffeomorphic mappings.

In [15,16], it is mentioned that the form of the residual function F is flexible. Several residual blocks have been proposed and experimentally evaluated, such as bottleneck blocks [15], various shortcut connections [16] and aggregated residual transformations [32]. However, by making the connection between ResNets and diffeomorphic mappings, it appears here that the function F is a parametrization of an elementary deformation flow, which constrains the space of admissible residual unit architectures.

We argue that this registration-based interpretation motivates the definition of ResNet architectures as the numerical implementation of continuous flows of diffeomorphisms. Section 3 details the proposed diffeomorphism-based learning framework, in which diffeomorphic flows are governed by ODEs as in the LDDMM setting. ResNets with shared weights relate to a particularly interesting case yielding the definition of exponential diffeomorphism subgroups in the underlying Lie algebra. Overall, the proposed framework results in: i) a theoretical characterization of the mapping block as an integral diffeomorphic operator governed by an ODE; ii) the use of deformation flows and Jacobian maps for the analysis of ResNets; iii) the derivation of ResNet architectures with additional diffeomorphic constraints.

3 Diffeomorphism-based learning

3.1 Diffeomorphisms and driving velocity vector fields

Registration problems have been widely stated as the estimation of diffeomorphic transformations between input and output spaces, especially in medical imaging [27]. Diffeomorphic properties guarantee the invertibility of the transformations, which includes the conservation of topological features. The parametrization of diffeomorphic transformations according to time-varying velocity vector fields has been shown to be very effective in medical imaging [21]. Beyond its computational performance, this framework embeds the group structure of diffeomorphisms and results in flows of diffeomorphisms governed by an ordinary differential equation (ODE):

dφ(t)/dt = V_t(φ(t))    (2)

with φ(t) the diffeomorphism at time t and V_t the velocity vector field at time t. φ(0) is the identity and φ(1) the registration transformation between the embedding space X and the output space X*, such that for any element X in X, its mapped version in X* is φ(1)(X). Given velocity fields (V_t)_t, the computation of φ(1)(X) follows from the numerical integration of the above ODE.
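For instance, a forward-Euler integration of Eq. (2) can be sketched as follows (a minimal numpy illustration; `velocity` is a stand-in for the learned field, here checked against a field whose exact flow is known):

```python
import numpy as np

def integrate_flow(x0, velocity, n_steps=100):
    """Forward-Euler integration of dphi/dt = V_t(phi(t)) from t = 0 to
    t = 1, starting from phi(0) = identity; returns an approximation of
    phi(1)(x0). `velocity(t, x)` is a user-supplied velocity field."""
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(k * dt, x)
    return x

# Illustrative stationary field V(x) = -x: the exact flow is phi(1)(x) = e^{-1} x.
x1 = integrate_flow(np.array([1.0, 2.0]), lambda t, x: -x, n_steps=1000)
```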

A specific class of diffeomorphisms refers to stationary velocity fields, that is to say, velocity fields which do not depend on time (V_t = V, ∀t). As introduced in [2], in this case the resulting diffeomorphisms define a subgroup structure in the underlying Lie group and yield the definition of exponential operators. We only briefly recall these key properties here and refer the reader to [1] for a detailed and more formal presentation of their mathematical derivation.

For a stationary velocity field, the resulting diffeomorphisms belong to the one-parameter subgroup of diffeomorphisms with infinitesimal generator V. In particular, they verify the following property: ∀s, t, φ(t) · φ(s) = φ(s + t), where · stands for the composition operator in the underlying Lie group. This implies, for instance, that computing φ(1) boils down to applying φ(1/2^n) 2^n times for any integer value n. Interestingly, this one-parameter subgroup yields the definition of the diffeomorphisms (φ(t))_t as exponentials of the velocity field V, denoted by (exp(tV))_t and governed by the stationary ODE

dφ(t)/dt = V(φ(t))    (3)

Conversely, any one-parameter subgroup of diffeomorphisms is governed by an ODE with a stationary velocity field. It may be noted that the above definition of exponentials of velocity fields generalizes the definition of the exponential operator for matrices and finite-dimensional spaces.
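In the finite-dimensional special case mentioned above, the one-parameter subgroup property yields the classical scaling-and-squaring computation of the matrix exponential: approximate the near-identity map φ(1/2^n) and square it n times. A minimal numpy sketch (illustrative only, not part of the paper's pipeline):

```python
import numpy as np

def expm_scaling_squaring(V, n=20):
    """exp(V) for a matrix V: approximate the near-identity map phi(1/2^n)
    by I + V / 2^n, then square n times, using phi(t) o phi(t) = phi(2t)."""
    phi = np.eye(V.shape[0]) + V / 2.0**n
    for _ in range(n):
        phi = phi @ phi
    return phi

V = np.array([[0.0, 1.0], [-1.0, 0.0]])  # infinitesimal generator of 2D rotations
R = expm_scaling_squaring(V)             # approximates a rotation by 1 radian
```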

3.2 Diffeomorphism-based supervised learning

In this section, we view supervised learning as the learning of diffeomorphisms according to some predefined loss function. Let us consider a typical supervised classification problem in which the goal is to predict a class Y from an N-dimensional real-valued observation X. Let L_θ be a linear classifier model with parameter θ. Within a neural network setting, L_θ typically refers to a fully-connected layer with softmax activations, and the parameter vector θ to the weight and bias parameters of this layer. Let D be the group of diffeomorphisms of R^N. We state the supervised learning problem as the joint estimation of an embedding E, a diffeomorphic mapping φ ∈ D and a linear classification model L_θ according to:

Ê, φ̂, θ̂ = arg min_{E,φ,θ} loss({L_θ(φ(E(X_i))), Y_i}_i)    (4)

with {X_i, Y_i}_i the considered training dataset and loss an appropriate loss function, typically a cross-entropy criterion. Considering the ODE-based parametrization of diffeomorphisms, the above minimization leads to the equivalent estimation of a velocity field sequence (V_t):

Ê, (V̂_t), θ̂ = arg min_{E,(V_t),θ} loss({L_θ(φ(1)(E(X_i))), Y_i}_i)    (5)

subject to

dφ(t)/dt = V_t(φ(t)),  φ(0) = I    (6)


When considering stationary velocity fields [2,3], this minimization simplifies to

Ê, V̂, θ̂ = arg min_{E,V,θ} loss({L_θ(exp(V)(E(X_i))), Y_i}_i)    (7)

We may point out that this formulation differs from the image registration problem in the considered loss function. Whereas image registration usually involves the minimization of the prediction error Y_i − φ(1)(E(X_i)) for pairs X_i, Y_i ∈ R^N, we here state the inference of the registration operator φ(1) according to a classification-based loss function. It may also be noted that the extension to other loss functions (e.g. for regression tasks) is straightforward.

3.3 Derived NN architecture

To solve minimization problems (5) and (7), additional priors on the velocity fields can be considered. One may introduce an additional term in the minimization, typically involving the integral of the norm of the gradient of the velocity fields, which favors small registration displacements between two time steps [5,33]. Parametric priors may also be considered; they amount to setting some parametrization for the velocity fields. In image registration studies, spline-based parametrizations have for instance been explored [3].

Here, we combine these two types of priors. We adopt a parametric approach and consider neural-network-based representations of the driving velocity fields in ODEs (2) and (3). More specifically, the discrete parametrization of the velocity field V_t(x) can be considered as a linear combination of basis functions:

V_{j,t}(x) = Σ_i ν_{t,j,i} f_{t,i}(x)    (8)

where V_{j,t} denotes component j (a scalar) of the learned velocity field, and the ν_{t,j,i} are the weights learned by a 1D convolutional layer.
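Eq. (8) can be sketched as follows for the two-dimensional dense/tanh parametrization used later in Section 4.2 (A and ν are hypothetical learned weights; biases are omitted, as in the text):

```python
import numpy as np

def velocity_field(x, A, nu):
    """Velocity of Eq. (8): V_j(x) = sum_i nu[j, i] * f_i(x), with basis
    functions f_i(x) = tanh(A[i] @ x) (dense layer + tanh, no biases).
    A and nu stand in for hypothetical learned weights."""
    basis = np.tanh(A @ x)   # basis function responses, shape (n_basis,)
    return nu @ basis        # one scalar component V_j per output dimension

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 2))         # 10 basis functions on R^2
nu = 0.1 * rng.standard_normal((2, 10))  # combination weights
v = velocity_field(np.array([0.5, -0.3]), A, nu)
```

Since the basis responses are bounded by the tanh, each component of the velocity is bounded by the absolute sum of its combination weights.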

The various types of shortcut connections and usages of activation functions experimented with in [16] correspond to different parametrizations of the velocity field. Understanding residual units in a registration-based framework thus provides a methodological guide for proposing new valid residual units. Figure 2 shows three residual units: the original ResNet unit [15], the improved ResNet unit [16] and the residual block studied in this work. For instance, it can be noticed that adding an activation function such as ReLU after the shortcut connection (i.e. after the addition layer), as in [15] (see Figure 2(a)), makes the mapping no longer bijective, and thus such an architecture may be less efficient, as shown experimentally in [16].

One way to build diffeomorphisms is to compose small perturbations of the identity. Using the same notation as in [33], let Ω ⊂ R^d be open and bounded. We denote by C^1_0(Ω, R^d) the Banach space of continuously differentiable vector fields v on Ω such that v and Dv vanish on ∂Ω and at infinity. Let χ^1_1(T, Ω) be the set of absolutely integrable functions from [0, T] to C^1_0(Ω, R^d). It can be shown that the flow associated with a time-dependent vector field v ∈ χ^1_1(T, Ω) is a diffeomorphism of Ω [33].

Fig. 2 Various residual units: (a) original ResNet [15], (b) improved ResNet [16], (c) proposed residual unit.

In this work, we propose a residual block suitable for building flows of diffeomorphisms. For the two case studies considered in this paper, two parametrizations of the basis functions are used. In the experiments of Section 4.1, on the CIFAR-10 dataset, where the inputs are images, the basis functions f_{t,i} are parametrized with one convolutional layer and one ReLU layer. The linear combination of these basis functions can then be represented as a second convolutional layer with a filter size of 1×1. In the experiments of Section 4.2, on the spiral datasets, the inputs are two-dimensional; in that case, the basis functions are modeled by the output of a dense layer followed by a tanh activation function. Note that no biases are used in the two convolutional layers. To control the magnitude of the velocity field, we use a tanh layer and a scaling layer in the residual block. Finally, to ensure that v ∈ χ^1_1(T, Ω), we introduce a windowing layer such that v and Dv vanish on ∂Ω and at infinity. This ensures that v is a Lipschitz continuous mapping [33]. The proposed residual block is shown in Figure 2(c).
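The layers above can be sketched as follows for the 2D dense/tanh case. A, ν and the scale are stand-ins for hypothetical learned parameters, and the window Π_k (1 − x_k²)² is one simple illustrative choice, assuming Ω = ]−1, 1[^d, that vanishes together with its gradient on ∂Ω:

```python
import numpy as np

def proposed_residual_block(x, A, nu, scale=0.1):
    """Sketch of the proposed residual unit: basis functions with a tanh
    layer, a scaling layer bounding the velocity magnitude, and a windowing
    term so that the update vanishes (with its gradient) on the boundary of
    Omega = ]-1, 1[^d. A, nu, scale and the window are illustrative choices."""
    basis = np.tanh(A @ x)              # basis functions + tanh layer
    v = scale * (nu @ basis)            # scaling layer
    window = np.prod((1.0 - x**2)**2)   # vanishes with its gradient on the boundary
    return x + window * v               # near-identity residual update

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 2))
nu = rng.standard_normal((2, 10))
x_boundary = np.array([1.0, 0.3])  # on the boundary, the update is exactly the identity
```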

In the registration-based framework considered so far, the transformation φ is only applied to the embedding of the observation X. This can introduce an undesirable asymmetry in the optimization process and have a significant impact on the registration performance. Inverse consistency, first introduced by Thirion in [29], can be enforced by adding a penalty term. To implement inverse-consistent algorithms, it is useful to be able to integrate backwards as well as forwards. In the diffeomorphic framework, inverse consistency can be written as follows:

φ(1) ∘ φ(−1) = φ(−1) ∘ φ(1) = φ(0)    (9)

This inverse consistency can then be achieved by adding the following term to the overall loss function:

Ê, φ̂, θ̂ = arg min_{E,φ,θ} α loss({L_θ(φ(E(X_i))), Y_i}_i) + (1 − α) Σ_i (E(X_i) − φ(−1)(E(X_i)*))²    (10)

where E(X_i)* = φ(1)(E(X_i)), X_i ∈ X, and α is a weighting parameter. We may stress that this term does not depend on the targeted task (i.e. classification or regression) and only constrains the learning of the mapping block. Thus, this regularization term can be extended to data points that do not belong to the training set, and more generally to points in a given domain, so that the inverse consistency property does not depend on the sampling of the training dataset.
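The penalty term of Eq. (10) can be sketched as follows for a stationary field: integrate forward to approximate φ(1), then backward for φ(−1) with a forward-Euler scheme, and penalize the squared distance to the starting points (the `velocity` function is an illustrative stand-in for the learned field):

```python
import numpy as np

def inverse_consistency_penalty(points, velocity, n_steps=50):
    """Penalty of Eq. (10): push points forward through phi(1), then
    backward through phi(-1) (forward-Euler in both directions), and
    penalize the squared distance to the starting points."""
    dt = 1.0 / n_steps
    fwd = np.array(points, dtype=float)
    for _ in range(n_steps):        # approximate phi(1)
        fwd = fwd + dt * velocity(fwd)
    back = fwd
    for _ in range(n_steps):        # approximate phi(-1): integrate -V
        back = back - dt * velocity(back)
    return float(np.sum((points - back) ** 2))

pts = np.array([[0.2, 0.1], [-0.3, 0.4]])
pen = inverse_consistency_penalty(pts, lambda x: 0.5 * x)  # small for a smooth field
```

The residual penalty here only reflects the discretization error of the Euler scheme; it vanishes as the number of integration steps grows.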

4 Experiments

In this section, we investigate experimentally the potential of the proposed residual block architecture using the image classification dataset CIFAR-10 [22] and synthetic 2D data (spiral dataset). CIFAR-10 is used to explore the performance of the proposed residual unit with respect to other ResNet architectures. The 2D spiral dataset helps to further investigate the properties of diffeomorphism-based networks and provides geometrical insights on the estimated flows.

4.1 CIFAR-10

4.1.1 Experimental setting

The CIFAR-10 dataset contains 60,000 32×32 color images in 10 different classes; 10,000 images are used for testing purposes. The overall architecture is decomposed into three main parts: embedding, mapping and prediction. First, the embedding is performed using the following layers: 1) a 5×5 convolutional layer with 128 filters, 2) batch normalization, 3) a tanh activation layer (which ensures that E(X_i) ∈ ]−1, 1[^p, an open and bounded set). Then, the network consists of several residual blocks as depicted in Figure 2(c) (3×3 convolutions without bias, with 128 filters). The scaling factor of the residual units is learned and shared across all units. At the end of the mapping block, a tanh activation layer is used to ensure that φ(E(X_i)) ∈ ]−1, 1[^p, an open and bounded set. Finally, the prediction step is performed using the following layers: 1) a 3×3 convolutional layer with 128 filters, 2) batch normalization, 3) a tanh activation layer, 4) 32×32 average pooling, 5) a fully connected layer.


Methods        #params  1k    2.5k  5k    10k   20k   30k   40k   50k
ResNet d56     1.6M     0.45  0.57  0.69  0.80  0.87  0.90  0.91  0.92
DiffeoNet d5            0.53  0.64  0.73  0.81  0.86  0.89  0.90  0.91
DiffeoNet d10           0.54  0.67  0.74  0.81  0.88  0.89  0.91  0.92
DiffeoNet d20           0.53  0.65  0.75  0.82  0.88  0.90  0.92  0.93
Stationary Velocity Fields
DiffeoNet d5            0.52  0.63  0.70  0.75  0.83  0.86  0.87  0.88
DiffeoNet d10           0.51  0.63  0.70  0.77  0.83  0.86  0.89  0.89
DiffeoNet d20           0.53  0.64  0.70  0.77  0.84  0.87  0.89  0.89
Stationary Velocity Fields and Inverse Consistency
DiffeoNet d5            0.52  0.63  0.69  0.75  0.82  0.86  0.88  0.90
DiffeoNet d10           0.51  0.64  0.70  0.76  0.84  0.87  0.88  0.90
DiffeoNet d20           0.52  0.64  0.70  0.77  0.85  0.86  0.88  0.90

Table 1 Accuracy on CIFAR-10 for ResNet and the proposed approaches (DiffeoNet: proposed residual unit; stationary velocity fields correspond to the use of shared weights) with respect to the number of training samples (1,000 up to 50,000). Depths (d5, d10, d20) and numbers of parameters are reported for a fair comparison.

Weights are initialized with a truncated normal distribution centered on 0 [14]. We use ℓ2 weight-decay regularization set to 2·10^−4 and the SGD optimization method with a starting learning rate of 0.1, a minibatch size of 128, and 100 epochs.

The goal of this experiment is to study the efficiency of the proposed residual unit with respect to the original ResNet. The baseline architecture used for comparison is the ResNet architecture proposed in [16]. In this experiment, we use the Keras¹ implementation of ResNet for reproducibility purposes: ResNet56v2 (depth of 56 with an increasing number of convolution filters [16]), with about 1.6M trainable parameters.

Fig. 3 Performance of various non-stationary ResNet architectures on CIFAR-10, with varying sizes of the training dataset (left: from 1,000 to 50,000 images; right: zoomed version for very small dataset sizes).

¹ F. Chollet et al. Keras, 2015. https://keras.io


4.1.2 Results

In this work, the dimension of the embedding space is constant throughout the mapping block. We first compare ResNet56v2 with the proposed approach with the same number of parameters, which corresponds to a network depth of 5 (i.e. 5 residual units). It can be seen in Figure 3 that the performance of these two networks is similar when using the entire training dataset. Residual units of the Keras ResNet are built using three layers: convolution, batch normalization and ReLU. ReLU units are crucial to promote efficient sparse parametrizations of the velocity fields. However, the obtained results show that batch normalization in the residual units is not required to reach satisfactory classification accuracy. Instead, we use a tanh activation layer to restrict the embedding set to be open and bounded for the whole flow of diffeomorphisms. It has to be noted that in the proposed network, the dimension of the transformed embedding space is constant. Similarly to [17], the experimental results show that progressive dimension changes of the embedding space are not required, contrary to the popular belief that the performance of deep neural networks is based on the progressive removal of uninformative variability. To study the generalization performance of the proposed residual unit, we conduct experiments with a decreasing number of training samples (see Table 1 for detailed results). It appears that the proposed residual unit makes the model more robust to small training sets than the Keras ResNet.

We also investigate the impact of the network depth on the classification results. Increasing the depth leads to more complex models (in terms of trainable parameters). One could thus expect overfitting when using very small training datasets. However, Figure 3 shows that increasing the depth of the proposed network does not decrease accuracy. From a dynamical point of view, increasing the depth corresponds to smaller integration steps and thus to smoother variations of the velocity fields. The proposed residual architecture is not subject to overfitting on small datasets, even with a deep diffeomorphic network.

4.2 Spiral data

In this section, we propose to further deepen the understanding of the behavior of networks based on flows of diffeomorphisms. Following the work of Hauser et al. on the differential geometry analysis of ResNet architectures [13], we consider a classification task on 2-dimensional spiral data.

4.2.1 Experimental setting

No embedding layer is required in this experimental setup. The purpose of the mapping block is then to warp the input data points X_i into an unknown space X* where the transformed data X* are linearly separable. We consider the following setting: the loss function is the binary cross-entropy between the output of a sigmoid function applied to the transformed data points X* and the true labels. Each network is composed of 20 residual units in which non-linearities are modeled with tanh activation functions, and 10 basis functions (modeled by dense layers) are used for the parametrization of the velocity fields. Weights are initialized with the Glorot uniform initializer (also called the Xavier uniform initializer) [9]. We use ℓ2 weight-decay regularization set to 10^−4 and the ADAM optimization method [20] with a learning rate of 0.001, β1 = 0.9, β2 = 0.999, a minibatch size of 300, and 1000 epochs.

We consider four ResNet architectures: a) a ResNet without shared weights (corresponding to time-varying velocity field modeling), b) a ResNet with shared weights (corresponding to stationary velocity field modeling), c) a Data-driven Symmetric ResNet with shared weights (where the inverse consistency criterion is computed over the training data) and d) a Domain-driven Symmetric ResNet with shared weights (where the inverse consistency criterion is computed over the entire domain using a random sampling scheme).

4.2.2 Characterization of ResNet properties

ResNet architectures have recently been studied from the point of view of differential geometry in [13]. In that article, Hauser et al. studied the impact of residual-based approaches (compared to non-residual networks) in terms of differentiable coordinate transformations. In our work, we go one step further by characterizing the estimated deformation fields that lead to a configuration adapted to the considered classification task. More specifically, we consider maps of Jacobian values.

The Jacobian (i.e. the determinant of the Jacobian matrix of the deformation) is defined in a 2-dimensional space as follows:

J_φ(x) = det [ ∂φ_1(x)/∂x_1   ∂φ_1(x)/∂x_2
               ∂φ_2(x)/∂x_1   ∂φ_2(x)/∂x_2 ]    (11)

From a physical point of view, the value of the Jacobian represents the local volume variation induced by the transformation. A transformation with a Jacobian value equal to 1 preserves volumes. A Jacobian value greater than 1 corresponds to a local expansion, and a value less than 1 to a local contraction. A zero Jacobian means that several points are warped onto a single point: this is the limit case beyond which the bijectivity of the transformation is no longer verified, thus justifying the constraint on the positivity of the Jacobian in several registration methods [27].
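Eq. (11) can be evaluated numerically for any learned mapping; a small sketch using central finite differences (an illustrative computation, not the exact procedure used for the figures below):

```python
import numpy as np

def jacobian_determinant(phi, x, eps=1e-5):
    """Determinant of the Jacobian matrix of a 2D mapping `phi` at `x`
    (Eq. (11)), estimated by central finite differences."""
    J = np.zeros((2, 2))
    for k in range(2):
        e = np.zeros(2)
        e[k] = eps
        J[:, k] = (phi(x + e) - phi(x - e)) / (2.0 * eps)
    return np.linalg.det(J)

# A uniform expansion by a factor 2 in each direction: Jacobian value 4,
# i.e. a local area expansion, everywhere positive, hence bijective.
det = jacobian_determinant(lambda x: 2.0 * x, np.array([0.3, -0.1]))
```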

4.2.3 Results

Classification algorithms are usually evaluated using only the classification accuracy (the fraction of correct predictions). However, the classification rate is not enough to characterize the performance of a specific algorithm. In all the experiments reported in this work, the classification rate is greater than 99%. Visualizing the decision boundary is an alternative way to provide complementary insights into the regularity of the solution in the embedding space. Figure 4 shows the decision boundaries for the four considered ResNets. Although all methods achieve very high classification rates, adding constraints such as stationary velocity fields (i.e. shared weights) and inverse consistency leads to smoother decision boundaries with no effect on the overall accuracy. This is regarded as critical for generalization and adversarial robustness [28].

Fig. 4 Decision boundaries for the classification task on 2-dimensional spiral data. From left to right: ResNet without shared weights, ResNet with shared weights, Data-driven Symmetric ResNet with shared weights, Domain-driven Symmetric ResNet with shared weights. We refer the reader to the main text for the correspondence between ResNet architectures and diffeomorphic flows.

Decision boundaries correspond to the projection of the estimated linear decision boundary in the space X* into the embedding space X. The visualization of decision boundaries does not, however, provide information regarding the topology of the manifold in the output space X*. We also study the deformation flow through the spatial configuration of data points across the network layers, as in [13]. Figure 5 shows how each network untangles the spiral data. Networks with shared weights exhibit smoother layer-wise transformations. More specifically, this visualization provides insights into the geometrical properties (such as topology preservation and connectedness) of the transformed set of input data points.

To evaluate the quality of the estimated warping transformation, Figure 6 shows the Jacobian maps for each considered network. Local Jacobian sign changes correspond to locations where bijectivity is not satisfied. It can be seen that adding constraints such as stationary velocity fields and inverse consistency leads to more regular geometrical shapes of the deformed manifold. The domain-driven regularization applied to a ResNet with shared weights leads to the most regular geometrical pattern. Adding the symmetry consistency leads to positive Jacobian values over the entire domain, guaranteeing the bijectivity of the estimated mapping.


Fig. 5 Evolution of the spatial configuration of data points through the 20 residual units. From top to bottom: ResNet without shared weights, ResNet with shared weights, Data-driven Symmetric ResNet with shared weights, Domain-driven Symmetric ResNet with shared weights.

Fig. 6 Jacobian maps for the four ResNet architectures. From left to right: ResNet without shared weights (J_min = −5.59, J_max = 6.34), ResNet with shared weights (J_min = −1.41, J_max = 2.27), Data-driven Symmetric ResNet with shared weights (J_min = 0.55, J_max = 5.92), Domain-driven Symmetric ResNet with shared weights (J_min = 0.30, J_max = 1.44). (Colormap: J_min = −2.5, J_max = 2.5, so dark pixels correspond to negative Jacobian values.)

5 Discussion: Insights on ResNet architectures from a diffeomorphic viewpoint

As illustrated in the previous section, the proposed diffeomorphic formulation of ResNets provides new theoretical and computational insights for their interpretation and characterization, as discussed below.

5.1 Theoretical characterization of ResNet architectures

In this work, we make explicit the interpretation of the mapping block of ResNet architectures as a discretized numerical implementation of a continuous diffeomorphic registration operator. This operator is stated as an integral operator associated with an ODE governed by velocity fields. Moreover, ResNet architectures with shared weights are viewed as the numerical implementation of the exponential of velocity fields, equivalently defined as diffeomorphic operators governed by stationary velocity fields. Exponentials of velocity fields are by construction diffeomorphic under smoothness constraints on the generating velocity fields. Up to the choice of the ODE solver implemented by the ResNet architecture (in our case, an Euler scheme), ResNet architectures with shared weights are then fully characterized from a mathematical point of view.

The diffeomorphic property naturally arises as a critical property in registration problems, as it relates to invertibility properties. Such invertibility properties are also at the core of the definition of kernel approaches, which implicitly define mapping operators [26]. As illustrated by the reported classification experiments, the diffeomorphic property prevents the mapping operator from modifying the topology of the manifold structure of the input data. When such properties are not imposed, for instance in unconstrained ResNet architectures, the learned deformation flows may present unexpected topology changes.
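In two dimensions this topology check is easy to make concrete: a negative Jacobian determinant of the learned mapping signals a fold, which is what the Jacobian maps of Fig. 6 visualize. A minimal finite-difference sketch, where `phi` stands for any R² → R² mapping (a hypothetical placeholder, not a function from the paper):

```python
import numpy as np

def jacobian_det(phi, p, eps=1e-5):
    # Central finite-difference estimate of det(D phi) at point p.
    # A negative value indicates a local fold, i.e. an
    # orientation-reversing, non-diffeomorphic deformation.
    ex = np.array([eps, 0.0])
    ey = np.array([0.0, eps])
    dx = (phi(p + ex) - phi(p - ex)) / (2.0 * eps)
    dy = (phi(p + ey) - phi(p - ey)) / (2.0 * eps)
    return dx[0] * dy[1] - dx[1] * dy[0]
```

Evaluating this determinant on a grid of points yields maps directly comparable to those of Fig. 6.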

The diffeomorphic property may be regarded as a regularization criterion on the mapping operator, so that the learned mapping enables a linear separation of the classes while guaranteeing the smoothness of the classification boundary and of the underlying deformation flow. A ResNet architecture with shared weights is a special case of an unconstrained ResNet. Therefore, training a ResNet architecture with shared weights may be viewed as training an unconstrained ResNet within a reduced search space. The same holds for the symmetry property, which further constrains the search space during training. The latter constraint is shown to be numerically important so that the discretized scheme complies with the theoretical diffeomorphic properties.

Overall, this analysis stresses that, among the infinity of mapping operators reaching optimal training performance, one may favor those exhibiting diffeomorphic properties, so that key properties such as generalization performance, prediction stability and robustness to adversarial examples may be greatly improved. Numerical schemes which fulfill such diffeomorphic properties during the training process could be further investigated and could benefit from the registration literature, including for diffeomorphic flows governed by non-stationary velocity fields [30,5,4]. In particular, the impact of diffeomorphism-based network building on adversarial example estimation is an open research direction for further studies.

5.2 Computational issues

Besides theoretical aspects, computational properties also derive from the proposed diffeomorphism-based formulation. Within this continuous setting, the depth of the network relates to the integration time step and the precision of the integration scheme: the deeper the network, the smaller the integration step. In particular, a large integration time step, i.e. a shallower ResNet architecture, may result in numerical integration instabilities and hence in non-diffeomorphic transformations. Therefore, deep enough architectures should be considered to guarantee numerical stability and diffeomorphic properties.
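A one-dimensional toy case makes this depth/step-size trade-off explicit. For the illustrative field v(x) = −2x (our choice, not one from the experiments), one Euler step is x ↦ (1 − 2Δt)x, which is an increasing, orientation-preserving map only while Δt < 1/2; a larger step, i.e. a shallower "network", folds the line.

```python
def euler_step(x, dt):
    # One Euler step of dx/dt = v(x) with the toy field v(x) = -2x.
    # The resulting map x -> (1 - 2*dt) * x is invertible and
    # orientation-preserving only while dt < 0.5.
    return x + dt * (-2.0 * x)
```

For dt = 0.1 the step preserves the ordering of points; for dt = 1.0 it reverses it, i.e. the discretized transformation is no longer diffeomorphic.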

The maximal integration step relates to the regularity of the velocity fields governing the ODEs. In our experiments, we only consider an explicit first-order Euler scheme. Higher-order explicit schemes, for instance the classic fourth-order Runge-Kutta scheme, are of great interest, as are implicit integration schemes [8]. Given the spatial variability of the governing velocity fields, adaptive integration schemes also appear particularly relevant.
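As a sketch of the accuracy gap between these schemes, the Euler and classic fourth-order Runge-Kutta steps for a generic velocity field `v` read as follows (the function `v` here is any callable, not a trained residual unit):

```python
def euler_step(x, v, dt):
    # Explicit first-order Euler step for dx/dt = v(x).
    return x + dt * v(x)

def rk4_step(x, v, dt):
    # Classic fourth-order Runge-Kutta step: fourth-order accurate
    # in dt, so far fewer "layers" are needed for a given precision.
    k1 = v(x)
    k2 = v(x + 0.5 * dt * k1)
    k3 = v(x + 0.5 * dt * k2)
    k4 = v(x + dt * k3)
    return x + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
```

On the linear test field v(x) = −x, whose exact flow is x(t) = x₀ exp(−t), the Runge-Kutta step is markedly more accurate than Euler for the same number of steps.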

Using a diffeomorphism-based framework leads to specific architectures of residual units. In this work, for instance, tanh activation layers are used to constrain the domain Ω to be open and bounded. Such an activation layer guarantees the diffeomorphic properties of the mapping block. The popular use of ReLU activations in the embedding block cannot provide such a guarantee. Several other reversible architectures have been recently proposed [7,10,17], showing the potential of such frameworks for the analysis of residual networks. In the LDDMM framework [5], the parametrization of the velocity field is often carried out with Reproducing Kernel Hilbert Spaces (RKHS). Recent work has connected RKHS and deep networks in this direction [6].

In our work, v and Dv vanish on ∂Ω and at infinity. This property guarantees that the learned residual units are Lipschitz continuous, which relates to recent works investigating explicit constraints on Lipschitz continuity in neural networks [11]. Moreover, this condition implies that the Hilbert space of admissible velocity fields is an RKHS [34]. Further work could focus on the parametrization of the velocity fields (i.e. residual units) using suitable kernels.
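As one hedged illustration of such a kernel parametrization, an LDDMM-style velocity field can be written as a sum of Gaussian-kernel terms attached to control points; the names `centers` and `momenta` below are illustrative, not taken from the paper.

```python
import numpy as np

def rkhs_velocity(x, centers, momenta, sigma=1.0):
    # v(x) = sum_k K(x, c_k) p_k with a Gaussian kernel K: a smooth
    # field that decays away from the control points, consistent with
    # the requirement that v vanishes at infinity.
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    return K @ momenta
```

The kernel bandwidth sigma then plays the role of the smoothness constraint on the admissible velocity fields.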

Diffeomorphic mappings defined as exponentials of velocity fields were shown to be computationally more stable, with smoother integral mappings. They lead to ResNet architectures with shared weights, which greatly lowers the computational complexity and memory requirements compared with classic ResNet architectures. They can be implemented as Recurrent Neural Networks [19,23]. Importantly, the NN-based specification of the elementary velocity field V (Equation 8) becomes the bottleneck in terms of modeling complexity. The parametrization (Equation 8) may be critical to reach good prediction performance. Here, we considered a two-layer architecture regarded as a projection of V onto basis functions. Higher-complexity architectures, for instance with larger convolution supports, more filters or more layers, might be considered while keeping the numerical stability of the overall ResNet architecture. By contrast, considering higher-complexity elementary blocks in a ResNet architecture without shared weights would increase numerical instabilities and may require complementary regularization constraints across network depth [15,25].

Regarding training issues, our experiments exploited a classic backpropagation implementation with a random initialization. From the considered continuous log-Euclidean perspective, the training may be regarded as the projection of the random initialization onto the manifold of acceptable solutions, i.e. solutions satisfying both the minimization of the training loss and the diffeomorphic constraints. In the registration literature [27], the numerical schemes considered for the inference of the mapping usually combine a parametric representation of the velocity fields and a multiscale optimization strategy in space and time. Combining such multiscale optimization strategies with backpropagation schemes appears as a promising path to improve convergence properties, especially the robustness to the initialization. The different solutions proposed to enforce diffeomorphic properties are also of interest. Here, we focused on invertibility constraints, which result in additional terms to be minimized in the training loss.
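One possible form of such a penalty, assuming a stationary velocity field so that the (approximate) inverse map is obtained by integrating the negated field, could look like the following sketch. This is an assumed illustration, not the paper's exact loss term.

```python
import numpy as np

def flow(x, v, n_steps=100, sign=1.0):
    # Euler integration of dx/dt = sign * v(x) over t in [0, 1];
    # sign = -1 approximates the inverse of the stationary-field flow.
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        x = x + sign * dt * v(x)
    return x

def invertibility_penalty(x, v, n_steps=100):
    # Penalize || phi^{-1}(phi(x)) - x ||^2: an additional
    # training-loss term encouraging the learned mapping to
    # remain invertible on the training samples.
    y = flow(x, v, n_steps, sign=1.0)
    x_rec = flow(y, v, n_steps, sign=-1.0)
    return float(((x_rec - x) ** 2).mean())
```

For a smooth, bounded velocity field and a fine enough integration step, this penalty is close to zero; it grows when the discretized flow starts to fold.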

6 Conclusion

This paper introduces a novel registration-based formulation of ResNets. We provide a theoretical interpretation of ResNets as numerical implementations of continuous flows of diffeomorphisms. Numerical experiments support the relevance of this interpretation, especially the importance of the enforcement of diffeomorphic properties, which ensure the stability of a trained ResNet. This work opens new research avenues to further explore diffeomorphism-based formulations and associated numerical tools for ResNet-based learning, especially regarding numerical issues.

Acknowledgements We thank B. Chapron and C. Herzet for their comments and suggestions. The research leading to these results has been supported by the ANR MAIA project, grant ANR-15-CE23-0009 of the French National Research Agency, INSERM and Institut Mines Télécom Atlantique (Chaire Imagerie médicale en thérapie interventionnelle), Fondation pour la Recherche Médicale (FRM grant DIC20161236453), Labex Cominlabs (grant SEACS), CNES (grant OSTST-MANATEE) and Microsoft through the AI-for-Earth EU Oceans Grant (AI4Ocean). We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.

References

1. Arsigny, V.: Processing Data in Lie Groups: An Algebraic Approach. Application to

Non-Linear Registration and Diﬀusion Tensor MRI. Ph.D. thesis, Ecole Polytechnique

(2006)

2. Arsigny, V., Commowick, O., Pennec, X., Ayache, N.: A Log-Euclidean Framework for

Statistics on Diﬀeomorphisms. International Conference on Medical Image Computing

and Computer-Assisted Intervention: MICCAI 9(Pt 1), 924–931 (2006)

3. Ashburner, J.: A fast diﬀeomorphic image registration algorithm. NeuroImage 38(1),

95–113 (2007)

4. Avants, B., Gee, J.C.: Geodesic estimation for large deformation anatomical shape av-

eraging and interpolation. NeuroImage 23, S139–S150 (2004)

5. Beg, M.F., Miller, M.I., Trouvé, A., Younes, L.: Computing Large Deformation Metric Mappings via Geodesic Flows of Diffeomorphisms. International Journal of Computer Vision 61(2), 139–157 (2005)

6. Bietti, A., Mairal, J.: Group Invariance, Stability to Deformations, and Complexity of

Deep Convolutional Representations. arXiv.org (2017)

7. Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., Holtham, E.: Reversible Architectures for Arbitrarily Deep Residual Neural Networks. arXiv preprint (2017)

8. Davis, P.J., Rabinowitz, P.: Methods of Numerical Integration. Academic Press (1984)

9. Glorot, X., Bengio, Y.: Understanding the diﬃculty of training deep feedforward neural

networks. AISTATS (2010)

10. Gomez, A.N., Ren, M., Urtasun, R., Grosse, R.B.: The Reversible Residual Network -

Backpropagation Without Storing Activations. NIPS (2017)


11. Gouk, H., Frank, E., Pfahringer, B., Cree, M.J.: Regularisation of Neural Networks by Enforcing Lipschitz Continuity. arXiv preprint (2018)

12. Haber, E., Ruthotto, L.: Stable Architectures for Deep Neural Networks. Inverse Problems 34(1), 014004 (2017)

13. Hauser, M., Ray, A.: Principles of Riemannian Geometry in Neural Networks. NIPS

(2017)

14. He, K., Zhang, X., Ren, S., Sun, J.: Delving Deep into Rectiﬁers: Surpassing Human-

Level Performance on ImageNet Classiﬁcation. arXiv.org (2015)

15. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition.

In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.

770–778. IEEE (2016)

16. He, K., Zhang, X., Ren, S., Sun, J.: Identity Mappings in Deep Residual Networks.

arXiv.org (2016)

17. Jacobsen, J.H., Smeulders, A., Oyallon, E.: i-RevNet: Deep Invertible Networks.

arXiv.org (2018)

18. Kim, J., Lee, J.K., Lee, K.M.: Accurate Image Super-Resolution Using Very Deep Con-

volutional Networks. CVPR (2016)

19. Kim, J., Lee, J.K., Lee, K.M.: Deeply-Recursive Convolutional Network for Image Super-

Resolution. CVPR (2016)

20. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. arXiv.org (2014)

21. Klein, A., Andersson, J., Ardekani, B.A., Ashburner, J., Avants, B., Chiang, M.C.,

Christensen, G.E., Collins, D.L., Gee, J., Hellier, P., Song, J.H., Jenkinson, M., Lepage,

C., Rueckert, D., Thompson, P., Vercauteren, T., Woods, R.P., Mann, J.J., Parsey,

R.V.: Evaluation of 14 nonlinear deformation algorithms applied to human brain MRI

registration. NeuroImage 46(3), 786–802 (2009)

22. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images (2009)

23. Liao, Q., Poggio, T.A.: Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex. arXiv preprint (2016)

24. Mallat, S.: Understanding deep convolutional networks. Philosophical Transactions

of the Royal Society A: Mathematical, Physical and Engineering Sciences 374(2065),

20150203–16 (2016)

25. Ruthotto, L., Haber, E.: Deep Neural Networks motivated by Partial Diﬀerential Equa-

tions. arXiv.org (2018)

26. Schölkopf, B., Smola, A.J.: Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press (2002)

27. Sotiras, A., Davatzikos, C., Paragios, N.: Deformable medical image registration: a sur-

vey. IEEE Transactions on Medical Imaging 32(7), 1153–1190 (2013)

28. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus,

R.: Intriguing properties of neural networks. arXiv.org (2013)

29. Thirion, J.: Image matching as a diﬀusion process: an analogy with Maxwell’s demons.

Medical Image analysis 2(3), 243–260 (1998)

30. Trouvé, A.: Diffeomorphisms Groups and Pattern Matching in Image Analysis. International Journal of Computer Vision 28(3), 213–221 (1998)

31. Weinan, E.: A Proposal on Machine Learning via Dynamical Systems. Communications in Mathematics and Statistics 5(1), 1–11 (2017)

32. Xie, S., Girshick, R.B., Doll´ar, P., Tu, Z., He, K.: Aggregated Residual Transformations

for Deep Neural Networks. CVPR pp. 5987–5995 (2017)

33. Younes, L.: Shapes and Diﬀeomorphisms, Applied Mathematical Sciences, vol. 171.

Springer Science & Business Media, Berlin, Heidelberg (2010)

34. Younes, L.: Diﬀeomorphic Learning. arXiv.org (2018)

Content available from Journal of Mathematical Imaging and Vision.