Evaluation of continuous image features learned by ODE Nets*
Fabio Carrara1[0000-0001-5014-5089], Giuseppe Amato1[0000-0003-0171-4315],
Fabrizio Falchi1[0000-0001-6258-5313], and Claudio Gennaro1[0000-0002-0967-5050]

Institute of Information Science and Technologies (ISTI), Italian National Research Council (CNR), Via G. Moruzzi 1, 56124 Pisa, Italy
{fabio.carrara, giuseppe.amato, fabrizio.falchi, claudio.gennaro}@isti.cnr.it
Abstract. Deep-learning approaches to data-driven modeling rely on learning a finite number of transformations (and representations) of the data that are structured in a hierarchy and are often instantiated as deep neural networks (and their internal activations). State-of-the-art models for visual data usually implement deep residual learning: the network learns to predict a finite number of discrete updates that are applied to the internal network state to enrich it. Pushing the residual learning idea to the limit, ODE Net, a novel network formulation involving continuously evolving internal representations that gained the best paper award at NeurIPS 2018, has recently been proposed. Differently from traditional neural networks, in this model the dynamics of the internal states are defined by an ordinary differential equation with learnable parameters that defines a continuous transformation of the input representation. These representations can be computed using standard ODE solvers, and their dynamics can be steered to learn the input-output mapping by adjusting the ODE parameters via standard gradient-based optimization. In this work, we investigate the image representations learned in the continuous hidden states of ODE Nets. In particular, we train image classifiers including ODE-defined continuous layers and perform preliminary experiments to assess the quality, in terms of transferability and generality, of the learned image representations, and we compare them to standard representations extracted from residual networks. Experiments on the CIFAR-10 and Tiny-ImageNet-200 datasets show that representations extracted from ODE Nets are more transferable and suggest an improved robustness to overfitting.
Keywords: Transfer Learning · Image Representations · Continuous Neural Networks · Ordinary Differential Equations.
* This work was partially supported by “Automatic Data and documents Analysis to enhance human-based processes” (ADA), CUP CIPE D55F17000290009, and by the AI4EU project, funded by the EC (H2020 - Contract n. 825619). We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Tesla K40 GPU used for this research.
1 Introduction
The last decade witnessed the renaissance of neural networks and deep differentiable models for multi-level representation learning known as Deep Learning, which greatly improved Artificial Intelligence (AI) and Machine Perception with a special emphasis on Computer Vision. The AI renaissance started in 2012, when a deep neural network built by Hinton's team won the ImageNet Large Scale Visual Recognition Challenge [18]; since then, the astonishing results obtained by deep-learning approaches for data-driven modeling have produced an exponentially growing research activity in this field. Deep Learning methods have been, and still are, the driving force behind this renaissance, and impressive results have been obtained through the adoption of deep learning in tasks such as image classification [18, 14], object detection [27, 26], cross-media retrieval [6], image sentiment analysis [31], recognition [1], etc. Being a representation learning approach, the rationale behind deep-learning methods is to automatically discover a set of multi-level representations from raw data that are specialized for the specific task to be solved, such as object detection or classification [19]. Starting from the raw data, each level of representation captures features of the input at an increasing level of abstraction that are useful for building successive representations. Following this definition, we understand how relevant the representations learned in the intermediate layers of deep learning architectures are. In the context of visual data modeling, the architectures of models, mostly based on convolutional neural networks, rapidly evolved from simple feed-forward networks to very deep models with complex interactions between intermediate representations, such as residual [15] or densely connected networks [16].
Recently, in the NeurIPS 2018 best paper [9], Chen et al. proposed ODE Nets, a novel model formulation with continuous intermediate representations defined by parametric ordinary differential equations (ODEs). This model can be used as a generic building block for neural modeling: the evolution of the activations and the gradients with respect to the parameters can be computed by calling a generic ODE solver. This formulation provides several benefits, including natural continuous-time modeling, O(1) memory cost, adaptive computation, and a tunable trade-off between speed and accuracy at inference time. The authors demonstrated ODE blocks in image classifiers trained on the MNIST dataset, actually creating a continuous and evolving activation space of image representations.
In this work, we analyze the continuous feature hierarchy created by ODE Nets when classifying natural images in terms of generality and transferability, and we compare it to representations extracted with standard neural networks. We investigate multiple architectures in which a different amount of processing is delegated to ODE blocks: we analyze standard residual networks, mixed residual-ODE networks, and finally we also consider ODE-only architectures. Preliminary experiments on the CIFAR-10 and Tiny-ImageNet-200 datasets show promising results, with continuous representations extracted by ODE Nets outperforming similar-sized standard residual networks on a transfer learning benchmark.
2 Related Work
Neural Image Representations. Ever since the recent breakthroughs in the deep learning field, extracting image representations from deep models, especially convolutional neural networks, has led to unprecedented accuracy in many vision tasks. Early studies explored features extracted from generic object classifiers trained on ImageNet: activations of late fully-connected layers played the role of global descriptors and provided a strong baseline as robust image representations [29, 5]. With the definition of more complex networks, the attention shifted to feature maps obtained from convolutional layers. Effective representations can be extracted from convolutional feature maps via spatial max-pooling [3, 30, 25], sum-pooling [4, 17], or more complex aggregation methods [2, 21, 24]. Better representations can be obtained by fine-tuning the pretrained networks to the retrieval task via siamese [23] or triplet [2, 12] learning approaches. To the best of our knowledge, we are the first to investigate ODE-derived continuous image representations.
ODE-inspired Neural Architectures. Most current state-of-the-art models implement some sort of residual learning [14, 15], in which each layer or block computes an update to be added to its input to obtain its output, instead of predicting the output directly. Recently, several works showed a strong parallelism between residual networks and discretized ODE solutions, specifically demonstrating that residual networks can be seen as the discretization of the Euler solution [33, 22]. This interpretation sprouted novel residual network architectures inspired by advanced discretizations of differential equations. [22] and [35] derived residual architectures justified by approximating the Linear Multi-step and Runge-Kutta methods, respectively. Comparisons with dynamical systems inspired works on reversibility and stability of residual networks [13, 8, 28, 7]. The authors of [9] propose to directly adopt ODE solvers to implement continuous dynamics inside neural networks. Traditional variable-step ODE solvers enable sample-wise adaptive computation in a natural way, while previously proposed methods for adaptive computation on classical networks [32, 8] require additional parameters to be trained.
3 ODE Nets
In this section, we review the main concepts of ODE Nets, including their formulation and training approach. For a fully detailed description, see [9].
An ODE Net is a neural network that includes one or more blocks whose internal states are defined by a parametric ordinary differential equation (ODE). Let z(t) be the vector of activations at a specific time t of its evolution. We define its dynamics by a first-order ODE parametrized by θ:

\[ \frac{dz(t)}{dt} = f(z(t), t, \theta) \,. \tag{1} \]
Given the initial value of the state z(t0), i.e. the input of the ODE block, we can compute the value of the state at a future time z(t1), which we consider the output of the ODE block, via integration of Equation 1:

\[ z(t_1) = z(t_0) + \int_{t_0}^{t_1} \frac{dz(t)}{dt}\, dt = z(t_0) + \int_{t_0}^{t_1} f(z(t), t, \theta)\, dt \,. \tag{2} \]
This computation can be efficiently performed by modern ODE solvers, such as the ones belonging to the Runge-Kutta family. Thus, the forward pass of an ODE block is implemented as a call to a generic ODE solver:

\[ z(t_1) = \mathrm{ODESolver}(f, z(t_0), t_0, t_1, \theta) \,, \tag{3} \]

where f can be an arbitrary function parametrized by θ which is implemented as a standard neural network.
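To make Equation 3 concrete, the sketch below shows the forward pass of an ODE block using the torchdiffeq library referenced in Section 5; the dynamics network, tensor shapes, and solver choice are illustrative assumptions, not the exact implementation used in the paper.

```python
# Minimal sketch of an ODE block forward pass (Equation 3) with torchdiffeq.
# The dynamics network and sizes below are illustrative assumptions.
import torch
import torch.nn as nn
from torchdiffeq import odeint


class Dynamics(nn.Module):
    """f(z(t), t, theta): an arbitrary network defining the state dynamics."""

    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, t, z):
        return torch.tanh(self.conv(z))   # t is ignored in this toy example


func = Dynamics()
z0 = torch.randn(8, 64, 16, 16)                 # z(t0): input of the ODE block
t = torch.tensor([0.0, 1.0])                    # integrate from t0 = 0 to t1 = 1
z1 = odeint(func, z0, t, method='dopri5')[-1]   # z(t1): output of the ODE block
```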
To train ODE Nets, we need to adjust the parameters θ so that the continuous internal state follows the correct dynamics for our specific task. Thus, given a loss function L, we need to compute its gradient with respect to the parameters, dL/dθ, to perform a gradient descent step. Although we could keep track of all the internal operations of the specific ODE solver used and apply backpropagation, this leads to a huge memory overhead, especially when the dynamics of the internal state are complex and the ODE solver requires many steps to find the solution. Instead, Chen et al. [9] proposed to adopt the adjoint sensitivity method. The adjoint state a(t) is defined as the derivative of the loss with respect to the internal state z(t)

\[ a(t) = \frac{\partial L}{\partial z(t)} \,, \tag{4} \]

and its dynamics can be described by the following ODE

\[ \frac{da(t)}{dt} = -a(t)^{\top} \frac{\partial f(z(t), t, \theta)}{\partial z(t)} \,. \tag{5} \]
The quantity we are interested in, i.e. the derivative of the loss with respect to the parameters dL/dθ, can be expressed as a function of the adjoint a(t)

\[ \frac{dL}{d\theta} = \int_{t_0}^{t_1} a(t)\, \frac{\partial f(z(t), t, \theta)}{\partial \theta}\, dt \,, \tag{6} \]

where ∂f(z(t), t, θ)/∂θ is known and defined by the structure of f. To compute a(t), and thus dL/dθ, we need to know the entire trajectory of z(t), but this can be recovered starting from the last state z(t1) and solving its ODE (Equation 1) backward in time. With a clever formulation, Chen et al. [9] also showed that it is possible to combine the processes of finding z(t), a(t), and dL/dθ in a unique additional call to the ODE solver.
Among the properties of ODE Nets, noteworthy benefits are a) O(1) memory cost, since no intermediate activations need to be stored for either the forward or the backward pass, b) adaptive computation, as modern adaptive ODE solvers automatically adjust the step size required to find the solution depending on the complexity of the dynamics induced by a specific input, and c) inference-time speed-accuracy trade-off tuning, as the tolerance of adaptive solvers can be relaxed at inference time to obtain less accurate solutions faster, or vice versa.
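Continuing the previous sketch, the snippet below illustrates how properties a) and c) surface in the torchdiffeq interface: gradients are obtained through the adjoint variant of the solver call, and solver tolerances can be relaxed at inference time. The tolerance values are arbitrary examples.

```python
# Sketch: O(1)-memory gradients via the adjoint method, and tolerance tuning
# at inference time (reuses func and z0 from the previous sketch).
import torch
from torchdiffeq import odeint_adjoint

t = torch.tensor([0.0, 1.0])

# Training: gradients w.r.t. the parameters of func are computed by solving
# the adjoint ODE backward in time instead of backpropagating through the solver.
z1 = odeint_adjoint(func, z0, t, rtol=1e-3, atol=1e-3)[-1]
loss = z1.mean()          # placeholder loss, for illustration only
loss.backward()

# Inference: a looser tolerance trades accuracy for speed (property c).
with torch.no_grad():
    z1_fast = odeint_adjoint(func, z0, t, rtol=1e-1, atol=1e-1)[-1]
```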
4 Tested architectures
In this section, we describe the architectures of the image classifiers implemented with ODE Nets that we are going to analyze. We test three architectures in total. The first two are the ones defined by Chen et al. [9], i.e. a standard residual network with 8 residual blocks, and a mixed architecture with two residual blocks and an ODE block. In addition, we analyze an architecture with the minimum amount of standard layers, which is thus composed of a single convolutional layer and an ODE block. A detailed description of the architectures follows.
Residual Net. We choose a standard residual network (ResNet) as a baseline image classifier, with the same architecture chosen by Chen et al. [9]. Starting from the input, the ResNet is composed of two residual blocks, each with a downsample factor of 2, followed by six additional residual blocks. The output of the last residual block is average-pooled and followed by a fully-connected layer with softmax activation that produces the final classification. The formulation of the residual block is the standard one proposed in [15], but the batch normalization operation is replaced with group normalization [34]. Thus, the residual block is composed of two 3×3 256-filter convolutions, each preceded by a 32-group normalization and a ReLU activation, plus a last group normalization: GroupNorm-ReLU-Conv-GroupNorm-ReLU-Conv-GroupNorm. For the first two blocks, we use 64-filter convolutions, and we employ 1×1 convolutions with stride 2 in the shortcut connections to downsample the input.
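A sketch of this residual block in PyTorch may look as follows; this is our reading of the description above (class and argument names are ours), not the authors' exact code.

```python
import torch.nn as nn


class ResBlock(nn.Module):
    """GroupNorm-ReLU-Conv-GroupNorm-ReLU-Conv-GroupNorm residual block
    (identity-shortcut variant with 256-filter 3x3 convolutions)."""

    def __init__(self, channels=256, groups=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.GroupNorm(groups, channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(groups, channels), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GroupNorm(groups, channels),
        )

    def forward(self, x):
        return x + self.body(x)   # residual update added to the block input
```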
Res-ODE Net. The first ODE-defined architecture tested is the one proposed by Chen et al. [9]. They proposed to keep the first part of the architecture as in the previously described ResNet and substitute the last six residual blocks with an ODE block that evolves a continuous state z(t) in a normalized time interval [0, 1]. The ODE function f defining its dynamics is implemented using the same network used in the residual blocks. In addition, this module takes the value of the current time t as input to the convolutional layers, as a constant feature map concatenated to the other input maps. Similarly to the ResNet, the output of the ODE block z(1) is average-pooled and fed to a fully-connected layer with softmax activation.
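The time-conditioned convolution can be sketched as follows: the scalar t is broadcast to a constant feature map and concatenated to the input channels before each convolution. This is a plausible reading of the description; module names and channel sizes are our own choices.

```python
import torch
import torch.nn as nn


class ConcatConv2d(nn.Module):
    """3x3 convolution that receives the current time t as an extra constant channel."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 1, out_ch, kernel_size=3, padding=1)

    def forward(self, t, z):
        tt = torch.ones_like(z[:, :1]) * t          # constant feature map filled with t
        return self.conv(torch.cat([tt, z], dim=1))


class ODEFunc(nn.Module):
    """f(z(t), t, theta) with the same GroupNorm-ReLU-Conv structure as the residual block."""

    def __init__(self, channels=64, groups=32):
        super().__init__()
        self.norm1 = nn.GroupNorm(groups, channels)
        self.conv1 = ConcatConv2d(channels, channels)
        self.norm2 = nn.GroupNorm(groups, channels)
        self.conv2 = ConcatConv2d(channels, channels)
        self.norm3 = nn.GroupNorm(groups, channels)

    def forward(self, t, z):
        out = self.conv1(t, torch.relu(self.norm1(z)))
        out = self.conv2(t, torch.relu(self.norm2(out)))
        return self.norm3(out)
```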
ODE-only Net. To fully exploit the ODE block and analyze its internal evolution, we explore an additional architecture composed only of a single convolutional layer and an ODE block. The convolutional layer has 256 4×4 filters applied with stride 2 and is not followed by any non-linear activation. The ODE block, defined as in the Res-ODE architecture, takes the output of the convolution as the initial state z(0). As in the other architectures, the final state z(1) is taken as output and fed to the classification layer.
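Assembling the pieces, the ODE-only classifier reduces to a single convolution, an ODE block, global average pooling, and a linear classifier. The sketch below assumes the ODEFunc defined in the previous snippet, CIFAR-10 inputs (3 channels, 10 classes), and the torchdiffeq adjoint solver; it is an illustration, not the authors' code.

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint_adjoint


class ODEOnlyNet(nn.Module):
    """Sketch of the ODE-only classifier: conv -> ODE block -> pool -> fc."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.stem = nn.Conv2d(3, 256, kernel_size=4, stride=2)   # no non-linearity
        self.odefunc = ODEFunc(channels=256)                     # from the previous sketch
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):
        t = torch.tensor([0.0, 1.0], device=x.device)            # normalized time interval
        z0 = self.stem(x)                                        # initial state z(0)
        z1 = odeint_adjoint(self.odefunc, z0, t)[-1]             # final state z(1)
        return self.fc(self.pool(z1).flatten(1))                 # logits (softmax in the loss)
```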
5 Experimental Evaluation
Following [29], we evaluate the effectiveness and generality of the learned image representations by measuring their performance in a transfer learning scenario [11].
We learn feature extractors for a particular image classification task (source), and we evaluate them by using the learned representations as high-level features for another image classification task with a similar domain (target).
For our investigation, we used two low-resolution datasets: CIFAR-10 for the source task, and Tiny-ImageNet-200 for the target task. CIFAR-10 [20] is a small-resolution 10-class image classification dataset with 50k training images and 10k test images. Tiny-ImageNet-200¹ is a 200-class classification dataset with 64×64 images extracted from the famous ImageNet subset used for the ILSVRC challenge. Each class has 500 training images, 50 validation images, and 50 test images, for a total of 100k, 10k, and 10k images for the training, validation, and test sets, respectively.
We train all the models (Residual Net, Res-ODE Net, ODE-only Net) for 200 epochs on the CIFAR-10 dataset, adopting the SGD optimizer with a momentum of 0.9, a batch size of 128, a learning rate of 0.1 decreased by a factor of 10 when the loss plateaus, and an L2 weight decay of 10⁻⁴. We employ commonly used data augmentation techniques for CIFAR-10, that is, random cropping, color jittering, and horizontal flipping, and we apply dropout with a 0.5 drop probability on the layer preceding the classifier. As the ODE solver in ODE Nets, we employ a GPU implementation² of the adaptive-step fourth-order Runge-Kutta method [10], which performs six function evaluations per step plus the initial and final timestep evaluations, i.e. number of function evaluations = 6 × steps + 2.
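The number of solver steps reported below can be estimated by counting the calls to f during a forward pass. A minimal (assumed) way to instrument the ODE function is sketched here; it reuses the ODEFunc sketch from Section 4 with illustrative sizes.

```python
# Sketch: counting function evaluations (NFE) of the ODE block; with the
# adaptive Runge-Kutta solver used here, steps ~= (NFE - 2) / 6.
import torch
import torch.nn as nn
from torchdiffeq import odeint


class CountedFunc(nn.Module):
    def __init__(self, func):
        super().__init__()
        self.func = func
        self.nfe = 0

    def forward(self, t, z):
        self.nfe += 1                    # one evaluation of f per solver call
        return self.func(t, z)


counted = CountedFunc(ODEFunc(channels=256))
z0 = torch.randn(1, 256, 16, 16)
_ = odeint(counted, z0, torch.tensor([0.0, 1.0]), method='dopri5')
steps = (counted.nfe - 2) / 6            # estimated number of adaptive solver steps
```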
Table 1 reports, for each model, the best test classification error obtained and the complexity in terms of both the number of parameters and ODE solver steps. The introduction of ODE blocks in the image classification pipeline drastically reduces the number of parameters of the model, but it also introduces a slight degradation of the overall classification performance. Also note that, for ODE Nets, the number of steps required by the ODE solver to compute a forward pass of the network depends on the complexity of the dynamics of the internal state induced by a specific input. For Res-ODE models, the ODE solver requires 3 to 4 steps to process an image, indicating that the learned dynamics of the hidden state are quite simple, and most of the information extraction process is due to the preceding standard layers. On the other hand, in ODE-only networks the ODE block is responsible for modeling the entire feature extraction process and thus has to learn more complex dynamics of the hidden state; as a consequence, the mean number of solver steps required is higher, but it is more variable depending on the input image. Figure 1 shows the top-5 and bottom-5 images of the CIFAR-10 test set in terms of the number of solver steps required to make a prediction; we can notice that the more prototypical and easily recognizable images require fewer steps, while additional processing is adaptively employed by the ODE solver when more challenging images are presented.
We extract intermediate activations from all the trained models as image representations for the target task (Tiny-ImageNet-200). For Residual Nets, we test the outputs of the last 7 residual modules before the classifier.
¹ https://tiny-imagenet.herokuapp.com/
² https://github.com/rtqichen/torchdiffeq
Fig. 1: The most (left) and least (right) demanding images of the CIFAR-10 test set in terms of the number of solver steps required by the ODE solver (reported near each image).
             Test Error  Params  Solver steps
Residual Net    7.28%    7.92M   -
Res-ODE Net     7.80%    2.02M   3.8 ± 0.4
ODE-only Net    9.17%    1.20M   7.8 ± 1.5

Table 1: Classification performance on CIFAR-10.
For both ODE Nets, there is an infinite number of intermediate states z(t), t ∈ [0, 1], that we can extract; we sample z(t) between 0 and 1 with a step of 0.05 and test every sample as the image representation for the target task. For all the extracted representations, we apply global average pooling to obtain a spatially-agnostic feature vector.
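A sketch of this sampling procedure, assuming the trained ODE function odefunc and the ODE block input z0 from the sketches above, is shown below; tensor names are ours.

```python
# Sketch: sampling the continuous hidden state z(t) at 21 time stamps in [0, 1]
# (step 0.05) and global-average-pooling each sample into a feature vector.
import torch
from torchdiffeq import odeint

ts = torch.linspace(0.0, 1.0, steps=21)      # t = 0.00, 0.05, ..., 1.00
with torch.no_grad():
    zs = odeint(odefunc, z0, ts)             # shape: (21, batch, C, H, W)
    feats = zs.mean(dim=(-2, -1))            # global average pooling -> (21, batch, C)
```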
We train a linear SVM classifier on the extracted features, using the validation set of Tiny-ImageNet-200 (for which labels are provided): we perform a grid search over the penalty parameter C ∈ {0.01, 0.1, 1, 10, 100}, keeping track of the configuration that obtains the best 5-fold cross-validated accuracy. We then retrain this configuration on the whole set and report its accuracy.

Fig. 2: Accuracy (%) on the Tiny-ImageNet-200 validation set of a linear SVM trained on z(t). Results obtained using the 7 intermediate layers of the Residual Net are evenly placed between 0 and 1 on the x-axis.

In Figure 2, we report the accuracies obtained by all the SVMs trained on the different internal activations of all the tested models. The x-axis indicates the time stamp t used to extract the internal representation z(t) of the ODE Nets, while the y-axis indicates the obtained accuracy. For convenience, we place the 7 points obtained from the 7 intermediate layers of the Residual Net evenly spaced on the x-axis between 0 and 1.
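The grid search described above can be sketched with scikit-learn as follows; the feature matrix X and label vector y are assumed to be built from the pooled representations of a single time stamp and the Tiny-ImageNet-200 labels, and the variable names are ours.

```python
# Sketch: linear-SVM transfer evaluation with a 5-fold cross-validated grid
# search over C, then refitting the best configuration on the whole set.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# X: (N, C) pooled features for one time stamp t; y: (N,) class labels.
grid = GridSearchCV(LinearSVC(), {'C': [0.01, 0.1, 1, 10, 100]},
                    cv=5, refit=True)        # refit=True retrains the best C on all data
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)   # best C and its cross-validated accuracy
```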
In both ODE Nets, we observe a concave trend of the accuracy as a function of the time stamp, with the maximum accuracy obtained using intermediate features extracted from the early or mid evolution of the continuous hidden state (21% at t = .45 for ODE-only and 19.5% at t = .1 for Res-ODE). As already suggested by findings in other works [5, 3], mid-features seem to be more transferable. In the Res-ODE, mid-features are already extracted by the preceding standard layers, thus they occur early in the evolution of the continuous hidden state. ODE Nets provide a more general and transferable image representation with respect to Residual Nets, which instead provide a lower and practically constant performance on the target task, suggesting a higher degree of overfitting to the source task.
Notwithstanding that, the CIFAR-10 dataset is not able to provide enough information about all the classes of the target dataset to obtain competitive accuracies, and a larger and more complex dataset should be used as the source task. Unfortunately, training ODE Nets currently has a high computational cost, as also suggested by the fact that the evaluation of their proposers was limited to the MNIST dataset for image classification. This limits our ability to perform larger-scale experiments, which are left for future work.
6 Conclusions
In this paper, we investigated the representations learned by ODE Nets, a promising and potentially revolutionary deep-learning approach in which hidden states are defined by an ordinary differential equation with learnable parameters. We conducted our experiments in a transfer learning scenario: we trained three deep-learning architectures (ODE-only Net, Res-ODE Net, and Residual Net) on a particular image classification task (CIFAR-10), and we evaluated them by using the learned representations as high-level features for another image classification task (Tiny-ImageNet-200). The results show that ODE Nets provide a more transferable, and thus more general, image representation with respect to standard residual networks. Considering also other intrinsic advantages of ODE Nets, such as O(1) memory cost and adaptive and adjustable inference-time computational cost, this preliminary analysis justifies and encourages additional research on the optimization of this kind of network and its adoption in image representation learning.
References
1. Amato, G., Falchi, F., Vadicamo, L.: Visual recognition of ancient inscriptions
using convolutional neural network and fisher vector. Journal on Computing and
Cultural Heritage (JOCCH) 9(4), 21 (2016)
2. Arandjelovic, R., Gronat, P., Torii, A., Pajdla, T., Sivic, J.: Netvlad: Cnn ar-
chitecture for weakly supervised place recognition. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. pp. 5297–5307 (2016)
3. Azizpour, H., Sharif Razavian, A., Sullivan, J., Maki, A., Carlsson, S.: From generic
to specific deep representations for visual recognition. In: Proceedings of the IEEE
conference on computer vision and pattern recognition workshops. pp. 36–45 (2015)
4. Babenko, A., Lempitsky, V.: Aggregating local deep features for image retrieval.
In: Proceedings of the IEEE international conference on computer vision. pp. 1269–
1277 (2015)
5. Babenko, A., Slesarev, A., Chigorin, A., Lempitsky, V.: Neural codes for image
retrieval. In: European conference on computer vision. pp. 584–599. Springer (2014)
6. Carrara, F., Esuli, A., Fagni, T., Falchi, F., Moreo Fernández, A.: Picture it in your mind: generating high level visual representations from textual descriptions. Information Retrieval Journal 21(2), 208–229 (Jun 2018). https://doi.org/10.1007/s10791-017-9318-6
7. Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., Holtham, E.: Reversible
architectures for arbitrarily deep residual neural networks. In: Thirty-Second AAAI
Conference on Artificial Intelligence (2018)
8. Chang, B., Meng, L., Haber, E., Tung, F., Begert, D.: Multi-level residual networks
from dynamical systems view. In: International Conference on Learning Represen-
tations (2018), https://openreview.net/forum?id=SyJS-OgR-
9. Chen, T.Q., Rubanova, Y., Bettencourt, J., Duvenaud, D.K.: Neural ordinary dif-
ferential equations. In: Advances in Neural Information Processing Systems. pp.
6572–6583 (2018)
10. Dormand, J.R., Prince, P.J.: A family of embedded runge-kutta formulae. Journal
of computational and applied mathematics 6(1), 19–26 (1980)
11. Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. MIT press (2016)
12. Gordo, A., Almazan, J., Revaud, J., Larlus, D.: End-to-end learning of deep vi-
sual representations for image retrieval. International Journal of Computer Vision
124(2), 237–254 (2017)
13. Haber, E., Ruthotto, L.: Stable architectures for deep neural networks. Inverse
Problems 34(1), 014004 (2017)
14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016)
15. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks.
In: European conference on computer vision. pp. 630–645. Springer (2016)
16. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected
convolutional networks. In: Proceedings of the IEEE conference on computer vision
and pattern recognition. pp. 4700–4708 (2017)
17. Kalantidis, Y., Mellina, C., Osindero, S.: Cross-dimensional weighting for aggre-
gated deep convolutional features. In: European conference on computer vision.
pp. 685–701. Springer (2016)
18. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. In: Advances in neural information processing systems.
pp. 1097–1105 (2012)
19. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. nature 521(7553), 436 (2015)
20. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al.: Gradient-based learning ap-
plied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
21. Li, Y., Xu, Y., Wang, J., Miao, Z., Zhang, Y.: Ms-rmac: Multiscale regional maxi-
mum activation of convolutions for image retrieval. IEEE Signal Processing Letters
24(5), 609–613 (2017)
22. Lu, Y., Zhong, A., Li, Q., Dong, B.: Beyond finite layer neural networks:
Bridging deep architectures and numerical differential equations. arXiv preprint
arXiv:1710.10121 (2017)
23. Radenović, F., Tolias, G., Chum, O.: Cnn image retrieval learns from bow: Unsupervised fine-tuning with hard examples. In: European conference on computer vision. pp. 3–20. Springer (2016)
24. Radenović, F., Tolias, G., Chum, O.: Fine-tuning cnn image retrieval with no human annotation. IEEE transactions on pattern analysis and machine intelligence (2018)
25. Razavian, A.S., Sullivan, J., Carlsson, S., Maki, A.: Visual instance retrieval with
deep convolutional networks. ITE Transactions on Media Technology and Appli-
cations 4(3), 251–258 (2016)
26. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified,
real-time object detection. In: Proceedings of the IEEE conference on computer
vision and pattern recognition. pp. 779–788 (2016)
27. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detec-
tion with region proposal networks. In: Advances in neural information processing
systems. pp. 91–99 (2015)
28. Ruthotto, L., Haber, E.: Deep neural networks motivated by partial differential
equations. arXiv preprint arXiv:1804.04272 (2018)
29. Sharif Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: Cnn features off-the-
shelf: an astounding baseline for recognition. In: Proceedings of the IEEE confer-
ence on computer vision and pattern recognition workshops. pp. 806–813 (2014)
30. Tolias, G., Sicre, R., Jégou, H.: Particular object retrieval with integral max-pooling of cnn activations. arXiv preprint arXiv:1511.05879 (2015)
31. Vadicamo, L., Carrara, F., Cimino, A., Cresci, S., Dell’Orletta, F., Falchi, F.,
Tesconi, M.: Cross-media learning for image sentiment analysis in the wild. In:
2017 IEEE International Conference on Computer Vision Workshops (ICCVW).
pp. 308–317 (Oct 2017). https://doi.org/10.1109/ICCVW.2017.45
32. Veit, A., Belongie, S.: Convolutional networks with adaptive inference graphs. In:
Proceedings of the European Conference on Computer Vision (ECCV). pp. 3–18
(2018)
33. Weinan, E.: A proposal on machine learning via dynamical systems. Communica-
tions in Mathematics and Statistics 5(1), 1–11 (2017)
34. Wu, Y., He, K.: Group normalization. In: Proceedings of the European Conference
on Computer Vision (ECCV). pp. 3–19 (2018)
35. Zhu, M., Chang, B., Fu, C.: Convolutional neural networks combined with runge-
kutta methods. arXiv preprint arXiv:1802.08831 (2018)