Improved Few-Shot Visual Classification
Peyman Bateni1, Raghav Goyal1,3, Vaden Masrani1, Frank Wood1,2,4, Leonid Sigal1,3,4
1University of British Columbia, 2MILA, 3Vector Institute, 4CIFAR AI Chair
{pbateni, rgoyal14, vadmas, fwood, lsigal}@cs.ubc.ca
Abstract
Few-shot learning is a fundamental task in computer vi-
sion that carries the promise of alleviating the need for ex-
haustively labeled data. Most few-shot learning approaches
to date have focused on progressively more complex neural
feature extractors and classifier adaptation strategies, and
the refinement of the task definition itself. In this paper, we
explore the hypothesis that a simple class-covariance-based
distance metric, namely the Mahalanobis distance, adopted
into a state of the art few-shot learning approach (CNAPS
[30]) can, in and of itself, lead to a significant performance
improvement. We also discover that it is possible to learn
adaptive feature extractors that allow useful estimation of
the high dimensional feature covariances required by this
metric from surprisingly few samples. The result of our
work is a new “Simple CNAPS” architecture which has up
to 9.2% fewer trainable parameters than CNAPS and per-
forms up to 6.1% better than state of the art on the standard
few-shot image classification benchmark dataset.
1. Introduction
Deep learning successes have led to major computer vi-
sion advances [11,13,37]. However, most methods behind
these successes have to operate in fully-supervised, high
data availability regimes. This limits the applicability of
these methods, effectively excluding domains where data
is fundamentally scarce or impossible to label en masse.
This inspired the field of few-shot learning [42,43] which
aims to computationally mimic human reasoning and learn-
ing from limited data.
The goal of few-shot learning is to automatically adapt
models such that they work well on instances from classes
not seen at training time, given only a few labelled exam-
ples for each new class. In this paper, we focus on few-shot
image classification where the ultimate aim is to develop a
classification methodology that automatically adapts to new
classification tasks at test time, and particularly in the case
where only a very small number of labelled “support” im-
ages are available per class.
(a) Squared Euclidean Distance (b) Squared Mahalanobis Distance
Figure 1: Class-covariance metric: Two-dimensional il-
lustration of the embedded support image features output by
a task-adapted feature extractor (points), per-class embed-
ding means (inset icons), explicit (left) and implied class
decision boundaries (right), and test query instance (gray
point and inset icon) for two classifiers: standard L2²-based
(left) and ours, class-covariance-based (Mahalanobis dis-
tance, right). An advantage of using a class-covariance-
based metric during classification is that taking into ac-
count the distribution in feature space of each class can re-
sult in improved non-linear classifier decision boundaries.
What cannot explicitly appear in this figure, but we wish
to convey here regardless, is that the task-adaptation mech-
anism used to produce these embeddings is trained end-
to-end from the Mahalanobis-distance-based classification
loss. This means that, in effect, the task-adaptation feature
extraction mechanism learns to produce embeddings that re-
sult in informative task-adapted covariance estimates.
Few-shot learning approaches typically take one of two
forms: 1) nearest neighbor approaches and their variants,
including matching networks [40], which effectively ap-
ply nearest-neighbor or weighted nearest neighbor clas-
sification on the samples themselves, either in a feature
[15,16,34] or a semantic space [5]; or 2) embedding meth-
ods that effectively distill all of the examples to a single pro-
totype per class, where a prototype may be learned [9,30]
or implicitly derived from the samples [36] (e.g. mean em-
bedding). The prototypes are often defined in feature or
semantic space (e.g. word2vec [44]). Most research in this
domain has focused on learning non-linear mappings, of-
ten expressed as neural nets, from images to the embed-
[Figure 2 chart: few-shot methods (k-NN, Finetune, Siamese Networks, Matching Networks, Meta-LSTM, Reptile, SNAIL, MAML, Proto-MAML, Prototypical Networks, TADAM, Relation Networks, CNAPS, Simple CNAPS) arranged by feature extractor adaptation (pre-trained, globally trained, partially adapted, fully adapted) versus final classification method (dot product, cosine similarity, weighted L1 distance, squared Euclidean distance, squared Mahalanobis distance, MLP, linear classifier, adapted linear classifier).]
Figure 2: Approaches to few-shot image classification:
organized by image feature extractor adaptation scheme
(vertical axis) versus final classification method (horizon-
tal axis). Our method (Simple CNAPS) partially adapts
the feature extractor (which is architecturally identical to
CNAPS) but is trained with, and uses, a fixed, rather than
adapted, Mahalanobis metric for final classification.
ding space, subject to a pre-defined metric in that embedding
space used for final nearest-class classification; usually co-
sine similarity between query image embedding and class
embedding. Most recently, CNAPS [30] achieved state of
the art (SoTA) few-shot visual image classification by utiliz-
ing sparse FiLM [27] layers within the context of episodic
training to avoid problems that arise from trying to adapt
the entire embedding network using few support samples.
Overall much less attention has been given to the met-
ric used to compute distances for classification in the em-
bedding space. Presumably this is because common wis-
dom dictates that flexible non-linear mappings are ostensi-
bly able to adapt to any such metric, making the choice of
metric apparently inconsequential. In practice, as we find
in this paper, the choice of metric is quite important. In
[36] the authors analyze the underlying distance function
used in order to justify the use of sample means as pro-
totypes. They argue that Bregman divergences [1] are the
theoretically sound family of metrics to use in this setting,
but only utilize a single instance within this class, the squared
Euclidean distance, which they find to perform better than
the more traditional cosine metric. However, the choice of
Euclidean metric involves making two flawed assumptions:
1) that feature dimensions are un-correlated and 2) that they
have uniform variance. Also, it is insensitive to the distribution
of within-class samples with respect to their prototype,
and recent results [26,36] suggest that this is problematic.
Modeling this distribution (in the case of [1] using extreme
value theory) is, as we find, a key to better performance.
Our Contributions: Our contributions are four-fold: 1) A
robust empirical finding of a significant 6.1% improvement,
on average, over SoTA (CNAPS [30]) in few-shot image
classification, obtained by utilizing a test-time-estimated
class-covariance-based distance metric, namely the Maha-
lanobis distance [6], in final, task-adapted classification. 2)
The surprising finding that we are able to estimate such
a metric even in the few shot classification setting, where
the number of available support examples, per class, is far
too few in theory to estimate the required class-specific co-
variances. 3) A new “Simple CNAPS” architecture that
achieves this performance despite removing 788,485 pa-
rameters (3.2%-9.2% of the total) from the original CNAPS
architecture, replacing them with fixed, not-learned, deter-
ministic covariance estimation and Mahalanobis distance
computations. 4) Evidence that should make readers ques-
tion the common understanding that CNN feature extractors
of sufficient complexity can adapt to any final metric (be it
cosine similarity/dot product or otherwise).
2. Related Work
Most of last decade’s few-shot learning works [43] can
be differentiated along two main axes: 1) how images are
transformed into vectorized embeddings, and 2) how “dis-
tances” are computed between vectors in order to assign la-
bels. This is shown in Figure 2.
Siamese networks [16], an early approach to few-shot
learning and classification, used a shared feature extractor
to produce embeddings for both the support and query im-
ages. Classification was then done by picking the small-
est weighted L1 distance between query and labelled im-
age embeddings. Relation networks [38], and recent GCNN
variants [15,34], extended this by parameterizing and learn-
ing the classification metric using a Multi-Layer Perceptron
(MLP). Matching networks [40] learned distinct feature ex-
tractors for support and query images which were then used
to compute cosine similarities for classification.
The feature extractors used by these models were, no-
tably, not adapted to test-time classification tasks. It has
become established that adapting feature extraction to new
tasks at test time is generally a good thing to do. Fine tun-
ing transfer-learned networks [45] did this by fine-tuning
the feature extractor network using the task-specific support
images but found limited success due to problems related
to overfitting to the (generally very few) support examples.
MAML [3] (and its many extensions [23,24,28]) mitigated
this issue by learning a set of meta-parameters that specif-
ically enabled the feature extractors to be adapted to new
tasks given few support examples using few gradient steps.
The two methods most similar to our own are CNAPS
[30] (and the related TADAM [26]) and Prototypical net-
works [36]. CNAPS is a few-shot adaptive classifier based
on conditional neural processes (CNP) [7]. It is the state of
the art approach for few-shot image classification [30]. It
uses a pre-trained feature extractor augmented with FiLM
Figure 3: Overview of the feature extractor adaptation methodology in CNAPS: the task encoder g_φ(·) provides the adaptation
network ψ^j_φ at each block j with the task representation g_φ(S^τ) to produce the FiLM parameters (γ_j, β_j). For details on
the auto-regressive variant (AR-CNAPS), architectural implementations, and FiLM layers see Appendix B. For an in-depth
explanation, refer to the original paper [30].
layers [27] that are adapted for each task using the support
images specific to that task. CNAPS uses a dot-product dis-
tance in a final linear classifier; the parameters of which
are also adapted at test-time to each new task. We describe
CNAPS in greater detail when describing our method.
Prototypical networks [36] do not use a feature adapta-
tion network; they instead use a simple mean pool operation
to form class “prototypes.” Squared Euclidean distances
to these prototypes are then subsequently used for classi-
fication. Their choice of the distance metric was motivated
by the theoretical properties of Bregman divergences [1], a
family of functions of which the squared Euclidean distance
is a member. These properties allow for a mathematical
correspondence between the use of the squared Euclidean
distance in a Softmax classifier and performing density es-
timation. Expanding on [36] in our paper, we also exploit
similar properties of the squared Mahalanobis distance as a
Bregman divergence [1] to draw theoretical connections to
multi-variate Gaussian mixture models.
Our work differs from CNAPS [30] and Prototypical net-
works [36] in the following ways. First, while CNAPS has
demonstrated the importance of adapting the feature extrac-
tor to a specific task, we show that adapting the classifier
is actually unnecessary to obtain good performance. Sec-
ond, we demonstrate that an improved choice of Bregman
divergence can significantly impact accuracy. Specifically
we show that regularized class-specific covariance estima-
tion from task-specific adapted feature vectors allows the
use of the Mahalanobis distance for classification, achieving
a significant improvement over state of the art. A high-level
diagrammatic comparison of our “Simple CNAPS” archi-
tecture to CNAPS can be found in Figure 4.
More recently, [4] also explored using the Mahalanobis
distance by incorporating its use in Prototypical networks
[36]. In particular, they used a neural network to produce
per-class diagonal covariance estimates; however, this approach
is restrictive and limits performance. Unlike [4],
Simple CNAPS generates regularized full covariance esti-
mates from an end-to-end trained adaptation network.
3. Formal Problem Definition
We frame few-shot image classification as an amortized
classification task. Assume that we have a large labelled
dataset D = {(x_i, y_i)}_{i=1}^{N} of images x_i and labels y_i. From
this dataset we can construct a very large number of classification
tasks D^τ ⊆ D by repeatedly sampling without
replacement from D. Let τ ∈ Z^+ uniquely identify a classification
task. We define the support set of a task to be
S^τ = {(x_i, y_i)}_{i=1}^{N^τ} and the query set Q^τ = {(x*_i, y*_i)}_{i=1}^{N^τ},
where D^τ = S^τ ∪ Q^τ, x_i, x*_i ∈ R^D are vectorized
images, and y_i, y*_i ∈ {1, ..., K} are class labels. Our objective
is to find parameters θ of a classifier f_θ that maximize
E_τ[ \prod_{Q^τ} p(y*_i | f_θ(x*_i, S^τ)) ].
In practice, D is constructed by concatenating large image
classification datasets, and the set of classification tasks
{D^τ} is sampled in a more complex way than simply
sampling without replacement. In particular, constraints are placed
on the relationship of the image label pairs present in the
support set and those present in the query set. For instance,
in few-shot learning, the constraint that the query set labels
are a subset of the support set labels is imposed. With this
constraint imposed, the classification task reduces to cor-
rectly assigning each query set image to one of the classes
present in the support set. Also, in this constrained few-shot
classification case, the support set can be interpreted as be-
ing the “training data” for implicitly training (or adapting)
a task-specific classifier of query set images. Note that,
consistent with [30,39] and unlike earlier work [36,3,40],
we do not impose any constraints requiring the support set
to be balanced or to contain a fixed number of classes, although
we do conduct experiments in this narrower setting too.
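For illustration only, a task of this form could be assembled as in the following sketch; the function sample_task and its way/shot/query counts are hypothetical placeholders and do not reproduce the Meta-Dataset sampling procedure described later.

```python
# A minimal sketch (ours, purely illustrative) of constructing a single few-shot task
# from a labelled dataset, respecting the constraint that query labels are a subset of
# support labels. `dataset` is assumed to be a list of (image, label) pairs; the
# way/shot/query counts are placeholders, not Meta-Dataset's task sampler.
import random
from collections import defaultdict

def sample_task(dataset, num_ways=5, num_shots=5, num_query=10):
    by_class = defaultdict(list)
    for x, y in dataset:
        by_class[y].append(x)
    classes = random.sample(list(by_class), num_ways)          # the task's classes
    support, query = [], []
    for new_label, c in enumerate(classes):                    # relabel to 0..num_ways-1
        examples = random.sample(by_class[c], num_shots + num_query)
        support += [(x, new_label) for x in examples[:num_shots]]
        query += [(x, new_label) for x in examples[num_shots:]]
    return support, query
```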
4. Method
Our classifier shares its feature adaptation architecture with
CNAPS [30], but deviates from CNAPS by replacing its
adaptive classifier with a simpler classification scheme
based on estimating Mahalanobis distances. To explain our
classifier, namely “Simple CNAPS”, we first detail CNAPS
[Figure 4 diagram: support and query images pass through the shared adapted feature extractor; the CNAPS classifier (788K parameters) feeds class means through a classification adaptor (FC/ELU layers) to produce classification weights and biases, while the Simple CNAPS classifier (0 parameters) computes class means and class covariance estimates and applies a softmax over the resulting distances.]
Figure 4: Comparison of the feature extraction and classification in CNAPS versus Simple CNAPS: Both CNAPS and
Simple CNAPS share the feature extraction adaptation architecture detailed in Figure 3. CNAPS and Simple CNAPS differ in
how distances between query feature vectors and class feature representations are computed for classification. CNAPS uses
a trained, adapted linear classifier whereas Simple CNAPS uses a differentiable but fixed and parameter-free deterministic
distance computation. Components in light blue have parameters that are trained, specifically f^τ_θ in both models and ψ^c_φ in the
CNAPS adaptive classification. CNAPS classification requires 788K parameters while Simple CNAPS is fully deterministic.
in Section 4.1, before presenting our model in Section 4.2.
4.1. CNAPS
Conditional Neural Adaptive Processes (CNAPS) consist
of two elements: a feature extractor and a classifier,
both of which are task-adapted. Adaptation is performed
by trained adaptation modules that take the support set as input.
The feature extractor architecture used in both CNAPS
and Simple CNAPS is shown in Figure 3. It consists of a
ResNet18 [10] network pre-trained on ImageNet [31] which
also has been augmented with FiLM layers [27]. The pa-
rameters {γ_j, β_j}_{j=1}^{4} of the FiLM layers can scale and shift
the extracted features at each layer of the ResNet18, allowing
the feature extractor to focus on or disregard different features
on a task-by-task basis. A feature adaptation module ψ^f_φ
is trained to produce {γ_j, β_j}_{j=1}^{4} based on the support
examples S^τ provided for the task.
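As a small illustration (ours, not the FiLM or CNAPS code), a FiLM operation of this kind can be written as a per-channel affine transformation of the convolutional feature maps:

```python
# A minimal sketch (ours) of a FiLM layer as described above: per-channel scale (gamma)
# and shift (beta), predicted from the support set, modulate a ResNet block's features.
import torch

def film(features, gamma, beta):
    # features: (batch, channels, height, width); gamma, beta: (channels,)
    return gamma.view(1, -1, 1, 1) * features + beta.view(1, -1, 1, 1)
```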
The feature extractor adaptation module ψ^f_φ consists of
two stages: support set encoding followed by FiLM layer
parameter production. The set encoder g_φ(·), parameterized
by a deep neural network, produces a permutation-invariant
task representation g_φ(S^τ) based on the support images S^τ.
This task representation is then passed to ψ^j_φ, which
produces the FiLM parameters {γ_j, β_j} for each block j
in the ResNet. Once the FiLM parameters have been set,
the feature extractor has been adapted to the task. We use
f^τ_θ to denote the feature extractor adapted to task τ. The
CNAPS paper [30] also proposes an auto-regressive adaptation
method which conditions each adaptor ψ^j_φ on the output
of the previous adaptor ψ^{j-1}_φ. We refer to this variant as
AR-CNAPS, but for conciseness we omit the details of this
architecture here and instead refer the interested reader to
[30] or to Appendix B.1 for a brief overview.
Classification in CNAPS is performed by a task-adapted
linear classifier where the class probabilities for a query
image x*_i are computed as softmax(W f^τ_θ(x*_i) + b). The
classification weights W and biases b are produced by
the classifier adaptation network ψ^c_φ, forming [W, b] =
[ψ^c_φ(μ_1) ψ^c_φ(μ_2) . . . ψ^c_φ(μ_K)]^T, where for each class k
in the task, the corresponding row of classification weights
(a) Euclidean Norm (b) Mahalanobis Distance
Figure 5: Problematic nature of the unit-normal as-
sumption: The Euclidean Norm (left) assumes embedded
image features fθ(xi)are distributed around class means
µkaccording to a unit normal. The Mahalanobis distance
(right) considers cluster variance when forming decision
boundaries, indicated by the background colour.
is produced by ψ^c_φ from the class mean μ_k. The class mean
μ_k is obtained by mean-pooling the feature vectors of the
support examples for class k extracted by the adapted feature
extractor f^τ_θ. A visual overview of the CNAPS adapted
classifier architecture is shown in Figure 4 (bottom left, in red).
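For illustration, a minimal sketch (ours, based on the description above rather than the CNAPS release) of this adapted linear classifier; psi_c stands in for ψ^c_φ and is assumed here to emit a (D+1)-dimensional vector whose first D entries form one row of W and whose last entry is the corresponding bias.

```python
# A minimal sketch (ours, not the CNAPS code) of the adapted linear classifier:
# psi_c (a small MLP standing in for psi^c_phi) maps each class mean to one row
# of W and one entry of b; logits are then W f + b for the query features.
import torch

def cnaps_logits(psi_c, support_feats, support_labels, query_feats, num_classes):
    rows, biases = [], []
    for k in range(num_classes):
        mu_k = support_feats[support_labels == k].mean(dim=0)  # class mean from adapted features
        wb = psi_c(mu_k)                                       # assumed output: D weights + 1 bias
        rows.append(wb[:-1])
        biases.append(wb[-1])
    W = torch.stack(rows)                                      # (K x D)
    b = torch.stack(biases)                                    # (K,)
    return query_feats @ W.T + b                               # softmax over these gives class probabilities
```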
4.2. Simple CNAPS
In Simple CNAPS, we also use the same pre-trained
ResNet18 for feature extraction with the same adaptation
module ψ^f_φ, although, because of the classifier architecture
we use, it becomes trained to do something different than
it does in CNAPS. This choice, like for CNAPS, allows for
a task-specific adaptation of the feature extractor. Unlike
CNAPS, we directly compute

p(y*_i = k | f^τ_θ(x*_i), S^τ) = softmax(−d_k(f^τ_θ(x*_i), μ_k))    (1)

using a deterministic, fixed d_k:

d_k(x, y) = (1/2) (x − y)^T (Q^τ_k)^{-1} (x − y).    (2)
Here Q^τ_k is a covariance matrix specific to the task and class.
As we cannot know the value of Q^τ_k ahead of time, it
must be estimated from the feature embeddings of the task-specific
support set. As the number of examples in any particular
support set is likely to be much smaller than the dimension
of the feature space, we use a regularized estimator

Q^τ_k = λ^τ_k Σ^τ_k + (1 − λ^τ_k) Σ^τ + β I,    (3)

formed from a convex combination of the class-within-task
and all-classes-in-task covariance matrices Σ^τ_k and Σ^τ, respectively.
We estimate the class-within-task covariance matrix Σ^τ_k
using the feature embeddings f^τ_θ(x_i) of all x_i ∈ S^τ_k, where
S^τ_k is the set of examples in S^τ with class label k:

Σ^τ_k = (1 / (|S^τ_k| − 1)) \sum_{(x_i, y_i) ∈ S^τ_k} (f^τ_θ(x_i) − μ_k)(f^τ_θ(x_i) − μ_k)^T.

If the number of support instances of that class is one,
i.e. |S^τ_k| = 1, then we define Σ^τ_k to be the zero matrix of
the appropriate size. The all-classes-in-task covariance Σ^τ
is estimated in the same way as the class-within-task covariance, except
that it uses all the support set examples x_i ∈ S^τ regardless
of their class.
We choose a particular, deterministic scheme for computing
the weighting of the class- and task-specific covariance
estimates, λ^τ_k = |S^τ_k| / (|S^τ_k| + 1). This choice means that
in the case of a single labeled instance for a class in the support
set, a single "shot," Q^τ_k = 0.5 Σ^τ_k + 0.5 Σ^τ + β I. This
can be viewed as increasing the strength of the regularization
parameter β relative to the task covariance Σ^τ. When
|S^τ_k| = 2, λ^τ_k becomes 2/3 and Q^τ_k only partially favors the
class-level covariance over the all-class-level covariance. In
a high-shot setting, λ^τ_k tends to 1 and Q^τ_k mainly consists of
the class-level covariance. The intuition behind this formula
for λ^τ_k is that the higher the number of shots, the better the
class-within-task covariance estimate gets, and the more Q^τ_k
starts to look like Σ^τ_k. We considered other ratios and making
the λ^τ_k learnable parameters, but found that out of all the
considered alternatives the simple deterministic ratio above
produced the best results. The architecture of the classifier
in Simple CNAPS appears in Figure 4 (bottom right, in blue).
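As a concrete illustration of Equations 1-3, the following is a minimal PyTorch sketch (ours, not the released Simple CNAPS code); the function name simple_cnaps_logits, the argument names, and the beta default are illustrative placeholders, and the feature tensors are assumed to already come from the task-adapted extractor f^τ_θ.

```python
# A minimal sketch (not the authors' code) of the Simple CNAPS classification head,
# assuming `support_feats` (N x D), `support_labels` (N,), and `query_feats` (M x D)
# are task-adapted feature vectors already produced by the adapted extractor.
import torch

def simple_cnaps_logits(support_feats, support_labels, query_feats, num_classes, beta=1.0):
    # beta is the regularization hyperparameter from Eq. (3); 1.0 is a placeholder value.
    D = support_feats.shape[1]
    eye = torch.eye(D)
    # All-classes-in-task covariance Sigma^tau (zero matrix if only one support example).
    task_mean = support_feats.mean(dim=0, keepdim=True)
    centered = support_feats - task_mean
    n = support_feats.shape[0]
    task_cov = centered.T @ centered / (n - 1) if n > 1 else torch.zeros(D, D)

    logits = []
    for k in range(num_classes):
        class_feats = support_feats[support_labels == k]           # S^tau_k
        mu_k = class_feats.mean(dim=0)                             # class mean
        n_k = class_feats.shape[0]
        if n_k > 1:
            c = class_feats - mu_k
            class_cov = c.T @ c / (n_k - 1)                        # Sigma^tau_k
        else:
            class_cov = torch.zeros(D, D)                          # zero matrix for a single shot
        lam = n_k / (n_k + 1.0)                                    # lambda^tau_k = |S^tau_k| / (|S^tau_k| + 1)
        Q_k = lam * class_cov + (1.0 - lam) * task_cov + beta * eye  # regularized estimator, Eq. (3)
        diff = query_feats - mu_k                                  # (M x D)
        # Squared Mahalanobis distance, Eq. (2): 0.5 * diff^T Q_k^{-1} diff per query.
        d_k = 0.5 * (diff @ torch.linalg.inv(Q_k) * diff).sum(dim=1)
        logits.append(-d_k)                                        # negative distance as logit, Eq. (1)
    return torch.stack(logits, dim=1)

# Taking a softmax over the returned logits gives the class probabilities of Equation 1.
```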
5. Theory
The class label probability calculation appearing in
Equation 1 corresponds to an equally-weighted exponential
family mixture model as λ → 0 [36], where the exponential
family distribution is uniquely determined by a regular
Bregman divergence [1]

D_F(z, z') = F(z) − F(z') − ∇F(z')^T (z − z')    (4)

for a differentiable and strictly convex function F. The
squared Mahalanobis distance in Equation 2 is a Bregman
divergence generated by the convex function F(x) = (1/2) x^T Σ^{-1} x
and corresponds to the multivariate normal exponential
family distribution. When all Q^τ_k → Σ^τ + β I,
we can view the class probabilities in Equation 1 as the "responsibilities"
in a Gaussian mixture model

p(y*_i = k | f^τ_θ(x*_i), S^τ) = π_k N(f^τ_θ(x*_i); μ_k, Q^τ_k) / \sum_{k'} π_{k'} N(f^τ_θ(x*_i); μ_{k'}, Q^τ_{k'})    (5)

with equally weighted mixing coefficients π_k = 1/K.
This perspective immediately highlights a problem with
the squared Euclidean norm, used by a number of ap-
proaches as shown in Fig. 2. The squared Euclidean distance, which
corresponds to the squared Mahalanobis distance with
Q^τ_k = I, implicitly assumes each cluster is distributed according
to a unit normal, as seen in Figure 5. By contrast,
the squared Mahalanobis distance considers cluster covari-
ance when computing distances to the cluster centers.
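For completeness, here is a short derivation (ours, not reproduced from the paper) confirming that the quadratic generator above yields the squared Mahalanobis distance of Equation 2:

```latex
% Bregman divergence (Eq. 4) of the quadratic generator F(z) = \tfrac{1}{2} z^\top \Sigma^{-1} z,
% for symmetric positive-definite \Sigma, so that \nabla F(z') = \Sigma^{-1} z':
\begin{aligned}
D_F(\mathbf{z}, \mathbf{z}')
  &= \tfrac{1}{2}\mathbf{z}^\top \Sigma^{-1}\mathbf{z}
   - \tfrac{1}{2}\mathbf{z}'^\top \Sigma^{-1}\mathbf{z}'
   - (\Sigma^{-1}\mathbf{z}')^\top(\mathbf{z}-\mathbf{z}') \\
  &= \tfrac{1}{2}\mathbf{z}^\top \Sigma^{-1}\mathbf{z}
   - \mathbf{z}'^\top \Sigma^{-1}\mathbf{z}
   + \tfrac{1}{2}\mathbf{z}'^\top \Sigma^{-1}\mathbf{z}' \\
  &= \tfrac{1}{2}(\mathbf{z}-\mathbf{z}')^\top \Sigma^{-1}(\mathbf{z}-\mathbf{z}'),
\end{aligned}
```

which is d_k of Equation 2 with Σ = Q^τ_k; setting Σ = I recovers half the squared Euclidean distance, matching the unit-normal assumption discussed above.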
6. Experiments
We evaluate Simple CNAPS on the Meta-Dataset [39]
family of datasets, demonstrating improvements compared
to nine baseline methodologies including the current SoTA,
CNAPS. Benchmark results reported come from [39,30].
6.1. Datasets
Meta-Dataset [39] is a benchmark for few-shot learn-
ing encompassing 10 labeled image datasets: ILSVRC-
2012 (ImageNet) [31], Omniglot [18], FGVC-Aircraft (Air-
craft) [22], CUB-200-2011 (Birds) [41], Describable Tex-
tures (DTD) [2], QuickDraw [14], FGVCx Fungi (Fungi)
[35], VGG Flower (Flower) [25], Traffic Signs (Signs) [12]
and MSCOCO [20]. In keeping with prior work, we re-
port results using the first 8 datasets for training, reserv-
ing Traffic Signs and MSCOCO for “out-of-domain” per-
formance evaluation. Additionally, from the eight datasets
used for training, some classes are held out for
testing, to evaluate "in-domain" performance. Following
[30], we extend the out-of-domain evaluation with 3 more
datasets: MNIST [19], CIFAR10 [17] and CIFAR100 [17].
We report results using standard test/train splits and bench-
mark baselines provided by [39], but, importantly, we have
cross-validated our critical empirical claims using different
test/train splits and our results are robust across folds (see
Appendix C). For details on task generation, distribution of
shots/ways and hyperparameter settings, see Appendix A.
Mini/tieredImageNet [29,40] are two smaller but more
widely used benchmarks that consist of subsets of ILSVRC-
2012 (ImageNet) [31] with 100 classes (60,000 images) and
608 classes (779,165 images) respectively. For comparison
to more recent work [8,21,26,32] for which Meta-Dataset
evaluations are not available, we use mini/tieredImageNet.
Note that in the mini/tieredImageNet setting, all tasks are of
the same pre-set number of classes and number of support
examples per class, making learning comparatively easier.
6.2. Results
Reporting format: Bold indicates best performance on
each dataset while underlines indicate statistically signif-
icant improvement over baselines. Error bars represent a
95% confidence interval over tasks.
In-domain performance: The in-domain results for Sim-
ple CNAPS and Simple AR-CNAPS, which uses the au-
toregressive feature extraction adaptor, are shown in Table
1. Simple AR-CNAPS outperforms previous SoTA on 7
out of the 8 datasets while matching past SoTA on FGVCx
Fungi (Fungi). Simple CNAPS outperforms baselines on
6 out of 8 datasets while matching performance on FGVCx
Fungi (Fungi) and Describable Textures (DTD). Overall, in-
domain performance gains are considerable in the few-shot
domain with 2-6% margins. Simple CNAPS achieves an
Figure 6: Accuracy vs. Shots: Average number of support
examples per class (log scale) vs. accuracy. For each
class in each of the 7,800 sampled Meta-Dataset tasks (13
datasets, 600 tasks each) used at test time, the classification
accuracy on the class' query examples was obtained.
These class accuracies were then grouped according to the
class shot, averaged, and plotted to show how the accuracy of
CNAPS, the squared Euclidean (L2²) variant, and Simple CNAPS scales with higher shots.
average 73.8% accuracy on in-domain few-shot classifica-
tion, a 4.2% gain over CNAPS, while Simple AR-CNAPS
achieves 73.5% accuracy, a 3.8% gain over AR-CNAPS.
Out-of-domain performance: As shown in Table 2, Sim-
ple CNAPS and Simple AR-CNAPS produce substantial
gains in performance on out-of-domain datasets, each ex-
ceeding the SoTA baseline. With an average out-of-domain
accuracy of 69.7% and 67.6%, Simple CNAPS and Sim-
ple AR-CNAPS outperform SoTA by 8.2% and 7.8%. This
means that Simple CNAPS/AR-CNAPS generalizes to out-
of-domain datasets better than baseline models. Also, Sim-
ple AR-CNAPS under-performs Simple CNAPS, suggest-
ing that the auto-regressive feature adaptation approach may
overfit to the domain of datasets it has been trained on.
Overall performance: Simple CNAPS achieves the best
overall classification accuracy at 72.2% with Simple AR-
CNAPS trailing very closely at 71.2%. Since the overall
performance of the two variants is statistically indistinguishable,
we recommend Simple CNAPS over Simple AR-
CNAPS as it has fewer parameters.
Comparison to other distance metrics: To test the sig-
nificance of our choice of Mahalanobis distance, we sub-
stitute other distance metrics within our architecture:
absolute difference (L1), squared Euclidean (L2²), cosine
similarity, and negative dot product. Performance comparisons
are shown in Tables 3 and 4. We observe that using the
Mahalanobis distance results in the best in-domain, out-of-
domain, and overall average performance on all datasets.
In-Domain Accuracy (%)
Model ImageNet Omniglot Aircraft Birds DTD QuickDraw Fungi Flower
MAML [3] 32.4±1.0 71.9±1.2 52.8±0.9 47.2±1.1 56.7±0.7 50.5±1.2 21.0±1.0 70.9±1.0
RelationNet [38] 30.9±0.9 86.6±0.8 69.7±0.8 54.1±1.0 56.6±0.7 61.8±1.0 32.6±1.1 76.1±0.8
k-NN [39] 38.6±0.9 74.6±1.1 65.0±0.8 66.4±0.9 63.6±0.8 44.9±1.1 37.1±1.1 83.5±0.6
MatchingNet [40] 36.1±1.0 78.3±1.0 69.2±1.0 56.4±1.0 61.8±0.7 60.8±1.0 33.7±1.0 81.9±0.7
Finetune [45] 43.1±1.1 71.1±1.4 72.0±1.1 59.8±1.2 69.1±0.9 47.1±1.2 38.2±1.0 85.3±0.7
ProtoNet [36] 44.5±1.1 79.6±1.1 71.1±0.9 67.0±1.0 65.2±0.8 64.9±0.9 40.3±1.1 86.9±0.7
ProtoMAML [39] 47.9±1.1 82.9±0.9 74.2±0.8 70.0±1.0 67.9±0.8 66.6±0.9 42.0±1.1 88.5±0.7
CNAPS [30] 51.3±1.0 88.0±0.7 76.8±0.8 71.4±0.9 62.5±0.7 71.9±0.8 46.0±1.1 89.2±0.5
AR-CNAPS [30] 52.3±1.0 88.4±0.7 80.5±0.6 72.2±0.9 58.3±0.7 72.5±0.8 47.4±1.0 86.0±0.5
Simple AR-CNAPS 56.5±1.1 91.1±0.6 81.8±0.8 74.3±0.9 72.8±0.7 75.2±0.8 45.6±1.0 90.3±0.5
Simple CNAPS 58.6±1.1 91.7±0.6 82.4±0.7 74.9±0.8 67.8±0.8 77.7±0.7 46.9±1.0 90.7±0.5
Table 1: In-domain few-shot classification accuracy of Simple CNAPS and Simple AR-CNAPS compared to the baselines.
With the exception of (AR-)CNAPS where the reported results are from [30], all other benchmarks are reported from [39].
Out-of-Domain Accuracy (%) Average Accuracy (%)
Model Signs MSCOCO MNIST CIFAR10 CIFAR100 In-Domain Out-Domain Overall
MAML [3] 34.2±1.3 24.1±1.1 NA NA NA 50.4±1.0 29.2±1.2 46.2±1.1
RelationNet [38] 37.5±0.9 27.4±0.9 NA NA NA 58.6±0.9 32.5±0.9 53.3±0.9
k-NN [39] 40.1±1.1 29.6±1.0 NA NA NA 59.2±0.9 34.9±1.1 54.3±0.9
MatchingNet [40] 55.6±1.1 28.8±1.0 NA NA NA 59.8±0.9 42.2±1.1 56.3±1.0
Finetune [45] 66.7±1.2 35.2±1.1 NA NA NA 60.7±1.1 51.0±1.2 58.8±1.1
ProtoNet [36] 46.5±1.0 39.9±1.1 74.3±0.8 66.4±0.7 54.7±1.1 64.9±1.0 56.4±0.9 61.6±0.9
ProtoMAML [39] 52.3±1.1 41.3±1.0 NA NA NA 67.5±0.9 46.8±1.1 63.4±0.9
CNAPS [30] 60.1±0.9 42.3±1.0 88.6±0.5 60.0±0.8 48.1±1.0 69.6±0.8 59.8±0.8 65.9±0.8
AR-CNAPS [30] 60.2±0.9 42.9±1.1 92.7±0.4 61.5±0.7 50.1±1.0 69.7±0.8 61.5±0.8 66.5±0.8
Simple AR-CNAPS 74.7±0.7 44.3±1.1 95.7±0.3 69.9±0.8 53.6±1.0 73.5±0.8 67.6±0.8 71.2±0.8
Simple CNAPS 73.5±0.7 46.2±1.1 93.9±0.4 74.3±0.7 60.5±1.0 73.8±0.8 69.7±0.8 72.2±0.8
Table 2: Middle) Out-of-domain few-shot classification accuracy of Simple CNAPS and Simple AR-CNAPS compared to
the baselines. Right) In-domain, out-of-domain and overall mean classification accuracy of Simple CNAPS and Simple AR-
CNAPS compared to the baselines. With the exception of CNAPS and AR-CNAPS where the reported results come from
[30], all other benchmarks are reported directly from [39].
Figure 7: Accuracy vs. Ways: Number of ways (classes
in the task) vs. accuracy. Tasks in the test set are grouped
by number of classes, and the accuracies are averaged
to obtain a value for each class count.
Impact of the task regularizer Σ^τ: We also consider a
variant of Simple CNAPS where the all-classes-within-task
covariance matrix Σ^τ is not included in the covariance regularization
(denoted with the "-TR" tag). This is equivalent
to setting λ^τ_k to 1 in Equation 3. As shown in Table 4, we
observe that, while removing the task level regularizer only
marginally reduces overall performance, the difference on
individual datasets such as ImageNet can be large.
Sensitivity to the number of support examples per class:
Figure 6 shows how the overall classification accuracy
varies as a function of the average number of support examples
per class (shots) over all tasks. We compare Simple
CNAPS, the original CNAPS, and the L2² variant of our method.
As expected, the average number of support examples per
class is highly correlated with the performance. All meth-
ods perform better with more labeled examples per support
class, with Simple CNAPS performing substantially better
as the number of shots increases. The surprising discovery
is that Simple CNAPS is effective even when the number
of labeled instances is as low as four, suggesting both that
even poor estimates of the task and class specific covariance
matrices are helpful and that the regularization scheme we
have introduced works remarkably well.
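The per-shot grouping behind Figure 6 can be sketched as follows (our illustration, not the authors' analysis code), assuming records is a list of (shot, class accuracy) pairs collected over every class of every sampled test task:

```python
# A minimal sketch (ours) of the per-shot grouping used for Figure 6.
from collections import defaultdict

def accuracy_by_shot(records):
    grouped = defaultdict(list)
    for shot, acc in records:
        grouped[shot].append(acc)          # group class accuracies by their shot count
    return {shot: sum(accs) / len(accs)    # average within each shot group
            for shot, accs in sorted(grouped.items())}
```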
In-Domain Accuracy (%)
Metric ImageNet Omniglot Aircraft Birds DTD QuickDraw Fungi Flower
Negative Dot Product 48.0±1.1 83.5±0.9 73.7±0.8 69.0±1.0 66.3±0.6 66.5±0.9 39.7±1.1 88.6±0.5
Cosine Similarity 51.3±1.1 89.4±0.7 80.5±0.8 70.9±1.0 69.7±0.7 72.6±0.9 41.9±1.0 89.3±0.6
Absolute Distance (L1) 53.6±1.1 90.6±0.6 81.0±0.7 73.2±0.9 61.1±0.7 74.1±0.8 47.0±1.0 87.3±0.6
Squared Euclidean (L2²) 53.9±1.1 90.9±0.6 81.8±0.7 73.1±0.9 64.4±0.7 74.9±0.8 45.8±1.0 88.8±0.5
Simple CNAPS -TR 56.7±1.1 91.1±0.7 83.0±0.7 74.6±0.9 70.2±0.8 76.3±0.9 46.4±1.0 90.0±0.6
Simple CNAPS 58.6±1.1 91.7±0.6 82.4±0.7 74.9±0.8 67.8±0.8 77.7±0.7 46.9±1.0 90.7±0.5
Table 3: In-domain few-shot classification accuracy of Simple CNAPS compared to ablated alternatives: the negative dot
product, absolute difference (L1), squared Euclidean (L2²), and removing the task regularization (λ^τ_k = 1), denoted by "-TR".
Out-of-Domain Accuracy (%) Average Accuracy (%)
Metric Signs MSCOCO MNIST CIFAR10 CIFAR100 In-Domain Out-Domain Overall
Negative Dot Product 53.9±0.9 32.5±1.0 86.4±0.6 57.9±0.8 38.8±0.9 66.9±0.9 53.9±0.8 61.9±0.9
Cosine Similarity 65.4±0.8 41.0±1.0 92.8±0.4 69.5±0.8 53.6±1.0 70.7±0.9 64.5±0.8 68.3±0.8
Absolute Distance (L1) 66.4±0.8 44.7±1.0 88.0±0.5 70.0±0.8 57.9±1.0 71.0±0.8 65.4±0.8 68.8±0.8
Squared Euclidean (L2²) 68.5±0.7 43.4±1.0 91.6±0.5 70.5±0.7 57.3±1.0 71.7±0.8 66.3±0.8 69.6±0.8
Simple CNAPS -TR 74.1±0.6 46.9±1.1 94.8±0.4 73.0±0.8 59.2±1.0 73.5±0.8 69.6±0.8 72.0±0.8
Simple CNAPS 73.5±0.7 46.2±1.1 93.9±0.4 74.3±0.7 60.5±1.0 73.8±0.8 69.7±0.8 72.2±0.8
Table 4: Middle) Out-of-domain few-shot classification accuracy of Simple CNAPS compared to ablated alternatives: the
negative dot product, absolute difference (L1), squared Euclidean (L2²), and removing the task regularization (λ^τ_k = 1), denoted
by "-TR". Right) In-domain, out-of-domain and overall mean classification accuracies of the ablated models.
miniImageNet tieredImageNet
Model 1-shot 5-shot 1-shot 5-shot
ProtoNet [36] 46.14 65.77 48.58 69.57
Gidariss et al. [8] 56.20 73.00 N/A N/A
TADAM [26] 58.50 76.70 N/A N/A
TPN [21] 55.51 69.86 59.91 73.30
LEO [32] 61.76 77.59 66.33 81.44
CNAPS [30] 77.99 87.31 75.12 86.57
Simple CNAPS 82.16 89.80 78.29 89.01
Table 5: Accuracy (%) compared to mini/tieredImageNet
baselines. Performance measures reported for CNAPS and
Simple CNAPS are averaged across 5 different runs.
Sensitivity to the number of classes in the task: In Fig-
ure 7, we examine average accuracy as a function of the
number of classes in the task. We find that, irrespective of
the number of classes in the task, we maintain accuracy im-
provement over both CNAPS and our L2² variant.
Accuracy on mini/tieredImageNet: Table 5 shows that
Simple CNAPS outperforms recent baselines on all of the
standard 1- and 5-shot 5-way classification tasks. These re-
sults should be interpreted with care as both CNAPS and
Simple CNAPS use a ResNet18 [10] feature extractor pre-
trained on ImageNet. Like other models in this table, here
Simple CNAPS was trained for these particular shot/way
configurations. That Simple CNAPS performs well here in
the 1-shot setting, improving even on CNAPS, suggests that
Simple CNAPS is able to specialize to particular few-shot
classification settings in addition to performing well when
the number of shots and ways is unconstrained as it was in
the earlier experiments.
7. Discussion
Few-shot learning is a fundamental task in modern AI
research. In this paper we have introduced a new method
for amortized few-shot image classification which estab-
lishes a new SoTA performance benchmark by making a
simplification to the current SoTA architecture. Our spe-
cific architectural choice, that of deterministically estimat-
ing and using Mahalanobis distances for classification of
task-adjusted class-specific feature vectors, seems to pro-
duce, via training, embeddings that generally allow for use-
ful covariance estimates, even when the number of labeled
instances, per task and class, is small. The effectiveness of
the Mahalanobis distance in feature space for distinguishing
classes suggests connections to hierarchical regularization
schemes [33] that could enable performance improvements
even in the zero-shot setting. In the future, the exploration of
other Bregman divergences is a potentially
fruitful avenue of research. Additional enhancements in the form of
data and task augmentation could also boost performance.
8. Acknowledgements
We acknowledge the support of the Natural Sciences
and Engineering Research Council of Canada (NSERC),
the Canada Research Chairs (CRC) Program, the Canada
CIFAR AI Chairs Program, Compute Canada, Intel, and
DARPA under its D3M and LWLL programs.
References
[1] A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Cluster-
ing with bregman divergences. Journal of machine learning
research, 6(Oct):1705–1749, 2005. 2,3,5
[2] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and
A. Vedaldi. Describing textures in the wild. In Proceed-
ings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 3606–3613, 2014. 6,12
[3] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-
learning for fast adaptation of deep networks. In Proceedings
of the 34th International Conference on Machine Learning-
Volume 70, pages 1126–1135. JMLR. org, 2017. 2,3,7,11
[4] S. Fort. Gaussian prototypical networks for few-shot learn-
ing on omniglot. CoRR, abs/1708.02735, 2017. 3
[5] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, M. A.
Ranzato, and T. Mikolov. Devise: A deep visual-semantic
embedding model. In C. J. C. Burges, L. Bottou, M. Welling,
Z. Ghahramani, and K. Q. Weinberger, editors, Advances
in Neural Information Processing Systems 26, pages 2121–
2129. Curran Associates, Inc., 2013. 1
[6] P. Galeano, E. Joseph, and R. E. Lillo. The mahalanobis dis-
tance for functional data with applications to classification.
Technometrics, 57(2):281–291, 2015. 2
[7] M. Garnelo, D. Rosenbaum, C. J. Maddison, T. Ramalho,
D. Saxton, M. Shanahan, Y. W. Teh, D. J. Rezende, and
S. M. A. Eslami. Conditional neural processes. CoRR,
abs/1807.01613, 2018. 2
[8] S. Gidaris and N. Komodakis. Dynamic few-shot visual
learning without forgetting. CoRR, abs/1804.09458, 2018.
6,8
[9] S. Gidaris and N. Komodakis. Generating classification
weights with gnn denoising autoencoders for few-shot learn-
ing. arXiv preprint arXiv:1905.01102, 2019. 1
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning
for image recognition. CoRR, abs/1512.03385, 2015. 4,8,
12,13
[11] M. Z. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga. A
comprehensive survey of deep learning for image captioning.
ACM Comput. Surv., 51(6):118:1–118:36, Feb. 2019. 1
[12] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and
C. Igel. Detection of traffic signs in real-world images: The
german traffic sign detection benchmark. In The 2013 inter-
national joint conference on neural networks (IJCNN), pages
1–8. IEEE, 2013. 6
[13] L. Jiao, F. Zhang, F. Liu, S. Yang, L. Li, Z. Feng, and R. Qu.
A survey of deep learning-based object detection. CoRR,
abs/1907.09408, 2019. 1
[14] J. Jongejan, H. Rowley, T. Kawashima, J. Kim, and N. Fox-
Gieg. The quick, draw!-ai experiment.(2016), 2016. 6,12
[15] J. Kim, T. Kim, S. Kim, and C. D. Yoo. Edge-
labeling graph neural network for few-shot learning. CoRR,
abs/1905.01436, 2019. 1,2
[16] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neu-
ral networks for one-shot image recognition. In ICML deep
learning workshop, volume 2, 2015. 1,2
[17] A. Krizhevsky. Learning multiple layers of features from
tiny images. Technical report, 2009. 6
[18] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum. Human-
level concept learning through probabilistic program induc-
tion. Science, 350(6266):1332–1338, 2015. 6,12
[19] Y. LeCun and C. Cortes. MNIST handwritten digit database.
2010. 6
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra-
manan, P. Dollár, and C. L. Zitnick. Microsoft coco: Com-
mon objects in context. In European conference on computer
vision, pages 740–755. Springer, 2014. 6
[21] Y. Liu, J. Lee, M. Park, S. Kim, and Y. Yang. Trans-
ductive propagation network for few-shot learning. CoRR,
abs/1805.10002, 2018. 6,8
[22] S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi.
Fine-grained visual classification of aircraft. arXiv preprint
arXiv:1306.5151, 2013. 6,12
[23] N. Mishra, M. Rohaninejad, X. Chen, and P. Abbeel.
Meta-learning with temporal convolutions. CoRR,
abs/1707.03141, 2017. 2
[24] A. Nichol, J. Achiam, and J. Schulman. On first-order meta-
learning algorithms. CoRR, abs/1803.02999, 2018. 2
[25] M.-E. Nilsback and A. Zisserman. Automated flower classi-
fication over a large number of classes. In 2008 Sixth Indian
Conference on Computer Vision, Graphics & Image Process-
ing, pages 722–729. IEEE, 2008. 6,12
[26] B. Oreshkin, P. Rodríguez López, and A. Lacoste. Tadam:
Task dependent adaptive metric for improved few-shot learn-
ing. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman,
N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural
Information Processing Systems 31, pages 721–731. Curran
Associates, Inc., 2018. 2,6,8
[27] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and
A. Courville. Film: Visual reasoning with a general con-
ditioning layer. In Thirty-Second AAAI Conference on Arti-
ficial Intelligence, 2018. 2,3,4,11
[28] S. Ravi and H. Larochelle. Optimization as a model for few-
shot learning. In 5th International Conference on Learning
Representations, ICLR 2017, Toulon, France, April 24-26,
2017, Conference Track Proceedings, 2017. 2
[29] M. Ren, E. Triantafillou, S. Ravi, J. Snell, K. Swersky,
J. B. Tenenbaum, H. Larochelle, and R. S. Zemel. Meta-
learning for semi-supervised few-shot classification. CoRR,
abs/1803.00676, 2018. 6
[30] J. Requeima, J. Gordon, J. Bronskill, S. Nowozin, and
R. E. Turner. Fast and flexible multi-task classification us-
ing conditional neural adaptive processes. arXiv preprint
arXiv:1906.07697, 2019. 1,2,3,4,6,7,8,11,12,13,14
[31] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,
S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein,
et al. Imagenet large scale visual recognition challenge.
International journal of computer vision, 115(3):211–252,
2015. 4,6,12
[32] A. A. Rusu, D. Rao, J. Sygnowski, O. Vinyals, R. Pascanu,
S. Osindero, and R. Hadsell. Meta-learning with latent em-
bedding optimization. CoRR, abs/1807.05960, 2018. 6,8
[33] R. Salakhutdinov, J. Tenenbaum, and A. Torralba. One-shot
learning with a hierarchical nonparametric bayesian model.
In I. Guyon, G. Dror, V. Lemaire, G. Taylor, and D. Silver,
editors, Proceedings of ICML Workshop on Unsupervised
and Transfer Learning, volume 27 of Proceedings of Ma-
chine Learning Research, pages 195–206, Bellevue, Wash-
ington, USA, 02 Jul 2012. PMLR. 8
[34] V. G. Satorras and J. B. Estrach. Few-shot learning with
graph neural networks. In International Conference on
Learning Representations, 2018. 1,2
[35] B. Schroeder and Y. Cui. Fgvcx fungi classification
challenge 2018. https://github.com/visipedia/
fgvcx_fungi_comp, 2018. 6,12
[36] J. Snell, K. Swersky, and R. Zemel. Prototypical networks
for few-shot learning. In Advances in Neural Information
Processing Systems, pages 4077–4087, 2017. 1,2,3,5,7,8,
11
[37] M. Sornam, K. Muthusubash, and V. Vanitha. A survey on
image classification and activity recognition using deep con-
volutional neural network architecture. In 2017 Ninth In-
ternational Conference on Advanced Computing (ICoAC),
pages 121–126, Dec 2017. 1
[38] F. Sung, Y. Yang, L. Zhang, T. Xiang, P. H. Torr, and T. M.
Hospedales. Learning to compare: Relation network for
few-shot learning. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 1199–
1208, 2018. 2,7
[39] E. Triantafillou, T. Zhu, V. Dumoulin, P. Lamblin, K. Xu,
R. Goroshin, C. Gelada, K. Swersky, P.-A. Manzagol,
and H. Larochelle. Meta-dataset: A dataset of datasets
for learning to learn from few examples. arXiv preprint
arXiv:1903.03096, 2019. 3,6,7,11,12
[40] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al.
Matching networks for one shot learning. In Advances in
neural information processing systems, pages 3630–3638,
2016. 1,2,3,6,7
[41] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie.
The caltech-ucsd birds-200-2011 dataset. 2011. 6,12
[42] W. Wang, V. W. Zheng, H. Yu, and C. Miao. A survey
of zero-shot learning: Settings, methods, and applications.
ACM Trans. Intell. Syst. Technol., 10(2):13:1–13:37, Jan.
2019. 1
[43] Y. Wang and Q. Yao. Few-shot learning: A survey. CoRR,
abs/1904.05046, 2019. 1,2
[44] C. Xing, N. Rostamzadeh, B. N. Oreshkin, and P. O. Pin-
heiro. Adaptive cross-modal few-shot learning. CoRR,
abs/1902.07104, 2019. 1
[45] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How
transferable are features in deep neural networks? CoRR,
abs/1411.1792, 2014. 2,7
Appendices
A. Experimental Setting
Section 3.2 of [39] explains the sampling procedure to
generate tasks from the Meta-Dataset [39], used during
both training and testing. This results in tasks with varying
numbers of shots/ways. Figures 9a and 9b show the
ways/shots frequency graphs at test time. For evaluating
on Meta-Dataset and mini/tiered-ImageNet datasets, we use
episodic training [36] to train models to remain consistent
with the prior works [3,30,36,39]. We train for 110K
tasks, 16 tasks per batch, totalling 6,875 gradient steps us-
ing Adam with learning rate of 0.0005. We validate (on 8
in-domain and 1 out-of-domain datasets) every 10K tasks,
saving the best model/checkpoint for testing. Please see
the PyTorch implementation of Simple CNAPS for details.
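A minimal sketch (ours, not the released training script) of this episodic training schedule might look as follows; model, sample_task, and episodic_loss are assumed placeholders for the Simple CNAPS model, the Meta-Dataset task sampler, and the query-set classification loss, and validation/checkpointing is omitted:

```python
# A minimal sketch (ours) of the episodic training schedule described above:
# 110K tasks, 16 tasks per batch (6,875 gradient steps), Adam with learning rate 0.0005.
import torch

def train(model, sample_task, episodic_loss, num_tasks=110_000, tasks_per_batch=16):
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
    for step in range(num_tasks // tasks_per_batch):   # 6,875 gradient steps
        optimizer.zero_grad()
        loss = 0.0
        for _ in range(tasks_per_batch):
            support, query = sample_task()              # one few-shot task (episode)
            loss = loss + episodic_loss(model, support, query)
        (loss / tasks_per_batch).backward()
        optimizer.step()
```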
B. (Simple) CNAPS in Details
B.1. Auto-Regressive CNAPS
In [30], an additional auto-regressive variant for adapting
the feature extractor is proposed, referred to as AR-CNAPS.
As shown in Figure 10, AR-CNAPS extends CNAPS by introducing
a block-level set encoder g^{AR_j}_φ at each block
j. These set encoders use the output obtained by pushing
the support set S^τ through all previous blocks 1 : j−1
to form the block-level set representation g^{AR_j}_φ(f^{τ_j}_θ(S^τ)).
This representation is then used as input to
the adaptation network ψ^j_φ in addition to the task representation
g_φ(S^τ). This way the adaptation network is not just
conditioned on the task, but is also aware of the potential
changes in the previous blocks as a result of the adaptation
being performed by the adaptation networks before it (i.e.,
[Figure 8 image: a screenshot adapted from the appendix of [30], showing the FiLM parameter adaptation network and the FiLM-augmented ResNet-18 basic blocks.]
Figure 8: Architectural overview of the feature extractor
adaptation network ψ^f_φ: the figure has been adapted from
[30] and showcases the neural architecture used for each
adaptation module ψ^j_φ (corresponding to residual block j)
in the feature extractor adaptation network ψ^f_φ.
(a) Number of Tasks vs. Ways (b) Number of Classes vs. Shots
Figure 9: Test-time distribution of tasks: a) Frequency
of number of tasks as grouped by the number of classes in
the tasks (ways). b) Frequency of the number of classes
grouped by the number of examples per class (shots). Both
figures are for test tasks sampled when evaluating on the
Meta-Dataset [39].
ψ^1_φ, ..., ψ^{j-1}_φ). The auto-regressive nature of AR-CNAPS allows
for a more dynamic adaptation procedure that boosts
performance in certain domains.
B.2. FiLM Layers
Proposed by [27], Feature-wise Linear Modulation
(FiLM) layers were used for visual question answering,
where the feature extractor could be conditioned on the
question. As shown in Figure 11, these layers are inserted
within residual blocks, where the feature channels
are scaled and linearly shifted using the respective FiLM
parameters γ_{i,ch} and β_{i,ch}. This can be extremely powerful
in transforming the extracted feature space. In our work and
[30], these FiLM parameters are conditioned on the support
images in the task S^τ. This way, the adapted feature extractor
f^τ_θ is able to modify the feature space to extract the
features that allow the classes in the task to be distinguished
most distinctly. This is particularly powerful when
the classification metric is changed to the Mahalanobis distance,
as with this new objective the feature extractor adaptation
network ψ^f_φ is able to learn to extract better features
(see the difference between with and without ψ^f_φ in Table 7 for
CNAPS and Simple CNAPS).
B.3. Network Architectures
We adopt the same architectural choices for the task encoder
g_φ, the auto-regressive set encoders g^{AR_1}_φ, ..., g^{AR_J}_φ, and
the feature extractor adaptation network ψ^f_φ = {ψ^1_φ, ..., ψ^J_φ}
as [30]. The neural architecture for each adaptation module
inside ψ^f_φ is shown in Figure 8. The neural configurations
for the task encoder g_φ and the auto-regressive
set encoders g^{AR_1}_φ, ..., g^{AR_J}_φ used in AR-CNAPS are shown
in Figure 12-a and Figure 12-b respectively. Note that for
the auto-regressive set encoders, there is no need for convolutional
layers. The input to these networks comes from the
output of the corresponding residual block adapted to that
Figure 10: Overview of the auto-regressive feature extractor adaptation in CNAPS: in addition to the structure shown in
Figure 3, AR-CNAPS takes advantage of a series of pre-block set encoders g^{AR_j}_φ to further condition the output of each
ψ^j_φ on the set representation g^{AR_j}_φ(f^{τ_j}_θ(S^τ)). The set representation is formed by first adapting the previous blocks 1 : j−1,
then pushing the support set S^τ through the adapted blocks to form an auto-regressive adapted set representation at block j.
This way, adaptive functions later in the pipeline are more explicitly aware of the changes made by the previous adaptation
networks, and can adjust accordingly.
Figure 11: Overview of FiLM layers: Figure is from [30]. Left) A FiLM layer operating on a series of channels indexed by ch,
scaling and shifting the feature channels as defined by the respective FiLM parameters γ_{i,ch} and β_{i,ch}. Right) Placement of
these FiLM modules within a ResNet18 [10] basic block.
level (denoted by f^{τ_j}_θ for block j), which has already been
processed with convolutional filters.
Unlike CNAPS, we do not use the classifier adaptation
network ψ^c_φ. As shown in Figure 12-c, the classification
weights adaptor ψ^c_φ consists of an MLP with three
fully connected (FC) layers and the intermediary non-linearity
ELU, which is a continuous approximation to
ReLU as defined below:

ELU(x) = x if x > 0, and e^x − 1 if x ≤ 0.    (6)

As mentioned previously, without the need to learn the
three FC layers in ψ^c_φ, Simple CNAPS has 788,485 fewer
parameters while outperforming CNAPS by considerable
margins.
C. Cross Validation
The Meta-Dataset [39] and its 8 in-domain / 2 out-of-domain
split define the benchmark setting
for the baseline results provided. The splits between the
datasets were intended to capture an extensive set of visual
domains for evaluating the models.
However, although all past work relies directly
on the provided setup, we go further by verifying that our
model is not overfitting to the proposed splits and is able
to consistently outperform the baseline with different permutations
of the datasets. We examine this through a 4-fold
cross-validation of Simple CNAPS and CNAPS on the
following 8 datasets: ILSVRC-2012 (ImageNet) [31], Omniglot
[18], FGVC-Aircraft [22], CUB-200-2011 (Birds)
[41], Describable Textures (DTD) [2], QuickDraw [14],
FGVCx Fungi [35] and VGG Flower [25]. During each
fold, two of the datasets are excluded from training, and both
Simple CNAPS and CNAPS are trained and evaluated in
that setting.
As shown by the classification results in Table 6, in all
four folds of validation, Simple CNAPS is able to outperform
CNAPS on 7-8 out of the 8 datasets. The in-domain,
out-of-domain, and overall averages for each fold noted in
Table 8 also show Simple CNAPS's accuracy gains over
CNAPS with substantial margins. In fact, the smaller number
of in-domain datasets in the cross-validation (6 vs. 8)
Classification Accuracy (%)
Model ILSVRC Omniglot Aircraft CUB DTD QuickDraw Fungi Flower
CNAPS 49.6±1.1 87.2±0.8 81.0±0.7 69.7±0.9 61.3±0.7 72.0±0.8 *32.2±1.0 *70.9±0.8
Simple CNAPS 55.6±1.1 90.9±0.8 82.2±0.7 75.4±0.9 74.3±0.7 75.5±0.8 *39.9±1.0 *88.0±0.8
CNAPS 50.3±1.1 86.5±0.8 77.1±0.7 71.6±0.9 *64.3±0.7 *33.5±0.9 46.4±1.1 84.0±0.6
Simple CNAPS 58.1±1.1 90.8±0.8 83.8±0.7 75.2±0.9 *74.6±0.7 *64.0±0.9 47.7±1.1 89.9±0.6
CNAPS 51.5±1.1 87.8±0.8 *38.2±0.8 *58.7±1.0 62.4±0.7 72.5±0.8 46.9±1.1 89.4±0.5
Simple CNAPS 56.0±1.1 91.1±0.8 *66.6±0.8 *68.0±1.0 71.3±0.7 76.1±0.8 45.6±1.1 90.7±0.5
CNAPS *42.4±0.9 *59.6±1.4 77.2±0.8 69.3±0.9 62.9±0.7 69.1±0.8 40.9±1.0 88.2±0.5
Simple CNAPS *49.1±0.9 *76.0±1.4 83.0±0.8 74.5±0.9 74.4±0.7 74.8±0.8 44.0±1.0 91.0±0.5
Table 6: Cross-validated classification accuracy results. Note that * denotes that this dataset was excluded from training, and
therefore, signifies out-of-domain performance. Values in bold indicate significant statistical gains over CNAPS.
Average Accuracy with ψ^f_φ (%)    Average Accuracy without ψ^f_φ (%)
Metric/Model Variant In-Domain Out-Domain Overall In-Domain Out-Domain Overall
Negative Dot Product 66.9±0.9 53.9±0.8 61.9±0.9 38.4±1.0 44.7±1.0 40.8±1.0
CNAPS 69.6±0.8 59.8±0.8 65.9±0.8 54.4±1.0 55.7±0.9 54.9±0.9
Absolute Distance (L1) 71.0±0.8 65.4±0.8 68.8±0.8 54.9±1.0 62.2±0.8 57.7±0.9
Squared Euclidean (L2²) 71.7±0.8 66.3±0.8 69.6±0.8 55.3±1.0 61.8±0.8 57.8±0.9
Simple CNAPS -TR 73.5±0.8 69.6±0.8 72.0±0.8 52.3±1.0 61.7±0.9 55.9±1.0
Simple CNAPS 73.8±0.8 69.7±0.8 72.2±0.8 56.0±1.0 64.8±0.8 59.3±0.9
Table 7: Comparing in-domain, out-of-domain and overall accuracy averages of each metric/model variant when feature
extractor adaptation is performed (denoted as "with ψ^f_φ") vs. when no adaptation is performed (denoted as "without ψ^f_φ").
Values in bold signify the best performance in the column, while underlined values signify superior performance of Simple
CNAPS (and the -TR variant) compared to the CNAPS baseline.
Average Classification Accuracy (%)

Fold   Model      In-Domain   Out-Domain   Overall
1      CNAPS      70.1±0.4    51.6±0.4     65.5±0.4
1      S. CNAPS   75.7±0.3    64.0±0.4     72.7±0.3
2      CNAPS      69.3±0.4    48.9±0.3     64.2±0.4
2      S. CNAPS   74.3±0.4    69.3±0.4     73.0±0.3
3      CNAPS      68.4±0.4    48.5±0.4     63.4±0.4
3      S. CNAPS   71.8±0.4    67.3±0.5     70.7±0.4
4      CNAPS      67.9±0.3    51.0±0.7     63.7±0.4
4      S. CNAPS   73.6±0.3    62.6±0.6     70.9±0.4
Avg    CNAPS      69.0±1.4    50.0±1.8     64.2±1.6
Avg    S. CNAPS   73.8±1.3    65.8±1.8     71.8±1.4

Table 8: Cross-validated in-domain, out-of-domain and overall classification accuracies averaged across each fold and combined. Note that for conciseness, Simple CNAPS has been shortened to “S. CNAPS”. Simple CNAPS values in bold indicate statistically significant gains over CNAPS.
In fact, the smaller number of in-domain datasets in the cross-validation setting (6 vs. 8) actually leads to wider gaps between Simple CNAPS and CNAPS. This suggests that Simple CNAPS is the more powerful alternative when fewer training domains are available. Furthermore, these results illustrate that our gains are not specific to the standard Meta-Dataset setup.
D. Ablation Study of the Feature Extractor Adaptation Network
In addition to the choice-of-metric ablation study referenced in Section 6.2, we examine the behaviour of the model when the feature extractor adaptation network ψ^f_φ has been turned off. In such a setting, the feature extractor consists only of the pre-trained ResNet18 [10] f_θ. Consistent with [30], we refer to this setting as “No Adaptation” (or “No Adapt” for short). We compare the “No Adapt” variant to the feature-extractor-adaptive case for each of the metrics/model variants examined in Section 6.2. The in-domain, out-of-domain and overall classification accuracies are shown in Table 7. As shown, without ψ^f_φ all models lose approximately 15, 5, and 12 percentage points in in-domain, out-of-domain and overall accuracy respectively, while Simple CNAPS continues to hold the lead, especially in out-of-domain classification accuracy. It is interesting to note that without the task-specific regularization term (denoted “-TR”), there is a considerable performance drop in the “No Adaptation” setting, whereas when the feature extractor adaptation network ψ^f_φ is present, the difference is marginal. This signifies two important observations. First, it shows the importance of learning the feature extractor adaptation module end-to-end with the Mahalanobis distance, as it is able to adapt the feature space to best suit the squared Mahalanobis distance. Second, the adaptation function ψ^f_φ can reduce the importance of the task regularizer by properly de-correlating and normalizing variance within the feature vectors. However, where this is not possible, as in the “No Adaptation” case, the all-classes task-level covariance estimate added as a regularizer in Equation 2 becomes crucial in maintaining superior performance.
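To make the role of this regularizer concrete, the sketch below shows one way the blended class-covariance estimate and the resulting squared Mahalanobis distance could be computed. It is a minimal PyTorch illustration in the spirit of Equation 2; the blending weight n_k/(n_k + 1) and the ridge term β are stated here as assumptions, not a verbatim transcription of our implementation.

```python
# Minimal sketch of a regularized class-covariance estimate in the spirit of
# Equation 2: the class covariance is blended with the all-classes task-level
# covariance plus a small ridge term. The blending weight lam_k is assumed to
# grow with the class support size; treat the exact form as a placeholder.
import torch

def regularized_class_covariance(class_feats, task_feats, beta=1.0):
    """class_feats: (n_k, d) support features of class k (after the feature extractor).
    task_feats:  (n, d) support features of all classes in the task."""
    n_k, d = class_feats.shape
    # Class-level covariance; undefined for a single sample, so fall back to zeros.
    sigma_k = torch.cov(class_feats.T) if n_k > 1 else torch.zeros(d, d)
    # Task-level covariance over all support examples (the added regularizer).
    sigma_task = torch.cov(task_feats.T)
    lam_k = n_k / (n_k + 1.0)  # assumed convex-combination weight
    return lam_k * sigma_k + (1.0 - lam_k) * sigma_task + beta * torch.eye(d)

def mahalanobis_sq(query, mu_k, q_k):
    """Squared Mahalanobis distance of a query feature (d,) to class k."""
    diff = (query - mu_k).unsqueeze(-1)                    # (d, 1)
    return (diff.T @ torch.linalg.solve(q_k, diff)).squeeze()
```

In the “No Adaptation” case, the task-level term sigma_task is what keeps the estimate well conditioned when per-class support sets are tiny, which is consistent with the drop observed for the “-TR” variant in Table 7.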
Figure 12: Overview of architectures used in (Simple) CNAPS: a) the auto-regressive set encoder g^{AR_j}_φ; note that since this is conditioned on the channel outputs of the convolutional filter, it is not convolved any further. b) The task encoder g_φ, which mean-pools convolutionally filtered support examples to produce the task representation. c) Architectural overview of the classifier adaptation network ψ^c_φ, consisting of a 3-layer MLP (fully connected layers with ELU non-linearities) with a residual connection. Diagrams are based on Tables E.8, E.9, and E.11 in [30].
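As a rough illustration of panel (c), a 3-layer fully connected block with ELU non-linearities and a residual connection could be written as below. The hidden width is a placeholder assumption and is not taken from Table E.11 of [30].

```python
import torch.nn as nn

class ClassifierAdaptationMLP(nn.Module):
    """Rough sketch of the classifier adaptation network psi^c_phi in Figure 12(c):
    a 3-layer MLP whose output is added back to its input via a residual connection.
    The hidden width is an arbitrary placeholder, not the value used in [30]."""
    def __init__(self, dim, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return x + self.mlp(x)  # residual connection
```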
E. Projection Networks
We additionally explored metric learning where, in addition to changing the distance metric, we considered projecting each support feature vector f^τ_θ(x_i) and query feature vector f^τ_θ(x*_i) into a new decision space in which the squared Mahalanobis distance would then be used for classification.
Average Classification Accuracy (%)
Model In-Domain Out-Domain Overall
Simple CNAPS +P 72.4±0.9 67.1±0.8 70.4±0.8
Simple CNAPS 73.8±0.8 69.7±0.8 72.2±0.8
Table 9: Comparing the in-domain, out-of-domain and
overall classification accuracy of Simple CNAPS +P (with
projection networks) to Simple CNAPS. Values in bold
show the statistically significant best result.
Specifically, we trained a projection network u_φ such that, for Equations 2 and 3, µ_k, Σ^τ_k and Σ^τ were calculated based on the projected feature vectors {u_φ(f^τ_θ(x_i))}_{x_i ∈ S^τ_k} as opposed to the raw feature vector set {f^τ_θ(x_i)}_{x_i ∈ S^τ_k}. Similarly, the projected query feature vector u_φ(f^τ_θ(x*_i)) was used for classifying the query example as opposed to the bare feature vector f^τ_θ(x*_i) used within Simple CNAPS. We define u_φ in our experiments to be the following:
u_φ(f^τ_θ(x*_i)) = W_1(ELU(W_2(ELU(W_3 f^τ_θ(x*_i)))))    (7)

where ELU, a continuous approximation to ReLU as previously noted, is used as the choice of non-linearity, and W_1, W_2 and W_3 are learned parameters.
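A minimal PyTorch sketch of this projection head, read directly off Equation 7, is given below; the hidden and output widths are placeholder assumptions, and bias terms are omitted to match the equation as written.

```python
import torch.nn as nn

class ProjectionNetwork(nn.Module):
    """Sketch of u_phi from Equation 7: three linear maps with ELU non-linearities,
    u_phi(f) = W1 ELU(W2 ELU(W3 f)). Layer widths are placeholder assumptions."""
    def __init__(self, in_dim, hidden=512, out_dim=512):
        super().__init__()
        self.w3 = nn.Linear(in_dim, hidden, bias=False)
        self.w2 = nn.Linear(hidden, hidden, bias=False)
        self.w1 = nn.Linear(hidden, out_dim, bias=False)
        self.elu = nn.ELU()

    def forward(self, f):
        # f: (batch, in_dim) feature vectors f^tau_theta(x)
        return self.w1(self.elu(self.w2(self.elu(self.w3(f)))))
```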
We refer to this variant of our model as “Simple CNAPS +P”, with the “+P” tag signifying the addition of the projection function u_φ. The results for this variant are compared to the base Simple CNAPS in Table 9. As shown, the projection network generally results in lower performance, although not to a statistically significant degree in in-domain and overall accuracy. Where the addition of the projection network results in a substantial loss of performance is in the out-of-domain setting, with Simple CNAPS +P's average accuracy of 67.1±0.8 compared to 69.7±0.8 for Simple CNAPS. We hypothesize that this significant loss in out-of-domain performance is due to the projection network overfitting to the in-domain datasets.