Multi-Modal Fusion by Meta-Initialization
Matthew T. Jackson
Department of Engineering Science
University of Oxford
jackson@robots.ox.ac.uk
Shreshth A. Malik*
Department of Engineering Science
University of Oxford
shreshth@robots.ox.ac.uk
Michael T. Matthews
Department of Computer Science
University College London
Yousuf Mohamed-Ahmed
Department of Computer Science
University College London
Abstract
When experience is scarce, models may have insufficient information to adapt
to a new task. In this case, auxiliary information—such as a textual descrip-
tion of the task—can enable improved task inference and adaptation. In this
work, we propose an extension to the Model-Agnostic Meta-Learning algorithm
(MAML), which allows the model to adapt using auxiliary information as well as
task experience. Our method, Fusion by Meta-Initialization (FuMI), conditions
the model initialization on auxiliary information using a hypernetwork, rather than
learning a single, task-agnostic initialization. Furthermore, motivated by the short-
comings of existing multi-modal few-shot learning benchmarks, we constructed
iNat-Anim—a large-scale image classification dataset with succinct and visually
pertinent textual class descriptions. On iNat-Anim, FuMI significantly outper-
forms uni-modal baselines such as MAML in the few-shot regime. The code for
this project and a dataset exploration tool for iNat-Anim are publicly available at
https://github.com/s-a-malik/multi-few.
1 Introduction
Learning effectively in resource-constrained environments is an open challenge in machine learning
[1, 2, 3]. Yet humans are capable of rapidly learning new tasks from limited experience, in part by
drawing on auxiliary information about the task. This information can be particularly helpful in
the few-shot regime, as it can highlight features that have not been seen directly in task experience,
but are necessary to solve the task. For example, Figure 1 shows an image classification task where a text description of the class provides discriminative information that is not contained in the training (support) images. Designing algorithms that can incorporate auxiliary information into
meta-learning approaches has consequently attracted much attention [4, 5, 6, 7, 8, 9, 10].
Model-agnostic meta-learning (MAML) [1] is a popular method for few-shot learning. However, it
cannot incorporate auxiliary task information. In this work, we propose Fusion by Meta-Initialization
(FuMI), an extension of MAML which uses a hypernetwork [11] to learn a mapping from auxiliary
task information to a parameter initialization. While MAML learns an initialization that facilitates
rapid learning across all tasks, FuMI conditions the initialization on the specific task to enable
improved adaptation.
Existing multi-modal few-shot learning benchmarks largely rely on hand-crafted feature vectors for
each class [12, 13], or use noisy language descriptions from sources such as Wikipedia [14, 15].
*Equal contribution.
Preprint. Under review.
arXiv:2210.04843v1 [cs.LG] 10 Oct 2022
Figure 1: An example few-shot learning task, using images and class descriptions from our proposed
dataset, iNat-Anim. Here, the class description contains information (the colour of the bird's breast) which is not found in the class images (as the birds are all facing away).
For this reason, we release iNat-Anim—a large animal species image classification dataset with
high quality descriptions of visual features. On this benchmark, we find that FuMI significantly
outperforms MAML in the very-few-shot regime.
2 Background

In the meta-learning framework [1], we suppose tasks are drawn from a task distribution $p(\mathcal{T})$. At meta-train time, the model $f_\theta$ is evaluated on a series of tasks $\mathcal{T}_i \in \mathcal{D}_{\text{train}}$, where $\mathcal{D}_{\text{train}}$ is a finite set of samples from $p(\mathcal{T})$. This gives task loss $\mathcal{L}_{\mathcal{T}_i}$, which is used to update the model parameters $\theta$ in accordance with the meta-learning algorithm. At meta-test time, the trained model is evaluated on all tasks in $\mathcal{D}_{\text{test}}$, another set of samples from $p(\mathcal{T})$.

In an $N$-shot, $K$-way multi-modal classification problem², a task $\mathcal{T} = (\mathcal{S}, \mathcal{Q})$ is defined by a support set $\mathcal{S} = \{(\{x_{i,j}\}_{j=1}^{N}, t_i, y_i)\}_{i=1}^{K}$ and a query set $\mathcal{Q} = \{(\{x_{i,j}\}_{j=1}^{M}, y_i)\}_{i=1}^{K}$, where $M$ is the number of query shots. The support set contains $N$ samples and auxiliary class information $t_i$ for each of the $K$ classes, which are used by the meta-learner to train an adapted model. Once this has been trained, the adapted model is evaluated on the unseen query set, giving task loss $\mathcal{L}_{\mathcal{Q}}$. In the context of our work, $t_i$ denotes the textual description of the class $y_i$, meaning each class has a textual description and $N$ support images. Figure 1 shows an example task using the notation outlined here.
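To make the episode structure above concrete, the following Python sketch builds an $N$-shot, $K$-way multi-modal task. This is our own illustration rather than code from the paper; the class and helper names (MultiModalTask, sample_task) and the tensor shapes are hypothetical.

```python
from dataclasses import dataclass
from typing import List
import torch

@dataclass
class MultiModalTask:
    """One N-shot, K-way task T = (S, Q) as defined above (names are illustrative)."""
    support_images: torch.Tensor   # shape (K, N, C, H, W) -- N support shots per class
    support_texts: List[str]       # K class descriptions t_1, ..., t_K
    query_images: torch.Tensor     # shape (K, M, C, H, W) -- M query shots per class
    query_labels: torch.Tensor     # shape (K, M), entries in {0, ..., K-1}

def sample_task(images_by_class, texts_by_class, K=5, N=1, M=20):
    """Sample one task, standing in for drawing T_i ~ p(T) from a pool of classes."""
    class_ids = torch.randperm(len(images_by_class))[:K]
    support, query, texts, labels = [], [], [], []
    for new_label, c in enumerate(class_ids.tolist()):
        perm = torch.randperm(images_by_class[c].shape[0])
        support.append(images_by_class[c][perm[:N]])
        query.append(images_by_class[c][perm[N:N + M]])
        texts.append(texts_by_class[c])
        labels.append(torch.full((M,), new_label))
    return MultiModalTask(torch.stack(support), texts,
                          torch.stack(query), torch.stack(labels))
```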
3 Data
Existing Multi-Modal Few-shot Benchmarks.
While there are a number of popular uni-modal
few-shot learning benchmarks [16, 17, 18], multi-modal benchmarks are less common. Some
works simply extend few-shot benchmarks by using the class label as auxiliary information [6, 19].
Benchmarks explicitly incorporating auxiliary modalities include Animals with Attributes (AWA)
[12] and Caltech-UCSD-Birds (CUB) [13] which augment images of animals/birds with hand-crafted
class attributes. While semantic class features can be highly discriminative, they require manual
labelling and are thus difficult to obtain at scale. Recent work instead takes the more general approach of using natural language descriptions, for example by augmenting CUB with Wikipedia articles [14, 15]. However, these articles are subject to change and their visual information is sparse, reducing the relative benefit of the auxiliary information.
The iNat-Anim Dataset. Motivated by these shortcomings, we constructed the iNat-Anim³ dataset. iNat-Anim consists of 195,605 images across 673 animal species, which is orders of magnitude larger than existing benchmarks (AWA and CUB). The images are a subset of the iNaturalist 2021 CVPR challenge [20] and have been augmented with textual descriptions from Animalia [21] to provide auxiliary information about each species. The descriptions are typically short and are qualitatively pertinent to the visual characteristics of the animal (Figure 1). See Appendix C for further details.

² For consistency with our dataset, the problem setting formulation is for classification. However, our method can also be applied to regression and reinforcement learning.
³ https://doi.org/10.5281/zenodo.6703088
4 Method

We propose Fusion by Meta-Initialization (FuMI): a gradient-based model for multi-modal few-shot learning. This model extends MAML by conditioning the meta-initialization of task-specific model parameters on their associated task information, thereby incorporating the auxiliary information into the tuned model.

Suppose we are training a neural network for $K$-way classification, with a fully-connected final layer (head). The parameters of the final layer $\theta^{\text{Head}}$ can be partitioned such that each $\theta^{\text{Head}}_i$ generates the class probability density $p(c_i \mid x)$ for a particular class $c_i$. MAML learns an initialization $\theta = (\theta^{\text{Head}}, \theta^{\text{Body}})$ and updates it with gradient descent in the inner loop, uninformed by the auxiliary task information $t$. However, given $t$, we may instead condition the initialization of each class head $\theta^{\text{Head}}_i$ on the auxiliary information for its associated class $t_i$, thereby generating a class-specific initialization.

In FuMI (Algorithm 1), we use a hypernetwork $g_\phi$ [11] to generate this initialization for the final layer, by computing $\theta^{\text{Head}}_i = g_\phi(t_i)$ for each class description $t_i$. As in MAML, a shared initialization $\theta^{\text{Body}}$ is used for the remainder of the network. All network weights $\theta$ are then tuned by gradient descent in the inner loop, giving $\theta'$. In the outer loop, the query set loss $\mathcal{L}_{\mathcal{Q}}(f_{\theta'})$ is used to update both the network body initialization $\theta^{\text{Body}}$ and the hypernetwork parameters $\phi$.

Algorithm 1: FuMI for few-shot classification (the differences from MAML are the hypernetwork-generated head initialization and the hypernetwork update).
Require: $p(\mathcal{T})$: distribution over tasks
Require: $\alpha, \beta$: step size hyperparameters
Randomly initialize $\theta^{\text{Body}}, \phi$
while not done do
    Sample task $(\mathcal{S}, \mathcal{Q}) \sim p(\mathcal{T})$
    for all class information $t_i$ in $\mathcal{S}$ do
        $\theta^{\text{Head}}_i = g_\phi(t_i)$
    end for
    $\theta = (\theta^{\text{Head}}, \theta^{\text{Body}})$
    Adapt parameters: $\theta' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{S}}(f_\theta)$
    Update network body initialization: $\theta^{\text{Body}} \leftarrow \theta^{\text{Body}} - \beta \nabla_{\theta^{\text{Body}}} \mathcal{L}_{\mathcal{Q}}(f_{\theta'})$
    Update hypernetwork: $\phi \leftarrow \phi - \beta \nabla_\phi \mathcal{L}_{\mathcal{Q}}(f_{\theta'})$
end while
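As a rough PyTorch rendering of Algorithm 1, the sketch below performs one FuMI meta-training step. It is our own simplified illustration, not the released implementation: the helpers (body, g_phi, encode_text, encode_image, meta_opt) are assumed components, the head is a bias-free linear layer handled functionally, and for brevity only the head is adapted in the inner loop, whereas Algorithm 1 adapts all of $\theta$.

```python
import torch
import torch.nn.functional as F

def fumi_meta_train_step(body, g_phi, encode_text, encode_image,
                         task, meta_opt, alpha=0.01, inner_steps=5):
    """One outer-loop update for FuMI (illustrative sketch).

    body         : trainable network body applied to frozen image-encoder features
    g_phi        : hypernetwork mapping a text embedding to one head weight row per class
    encode_text  : frozen text encoder, returns a (K, d_text) tensor for K descriptions
    encode_image : frozen image encoder, returns a (B, d_img) tensor for B images
    """
    K, N = task.support_images.shape[:2]

    # theta^Head_i = g_phi(t_i): class-conditional initialization of the head
    text_emb = encode_text(task.support_texts)                         # (K, d_text)
    head = g_phi(text_emb)                                             # (K, d_body)

    # Inner loop: adapt the head on the support set S by gradient descent
    sup_feat = body(encode_image(task.support_images.flatten(0, 1)))   # (K*N, d_body)
    sup_labels = torch.arange(K, device=sup_feat.device).repeat_interleave(N)
    for _ in range(inner_steps):
        loss_S = F.cross_entropy(sup_feat @ head.t(), sup_labels)
        (grad,) = torch.autograd.grad(loss_S, head, create_graph=True)
        head = head - alpha * grad                                     # differentiable update

    # Outer loop: query-set loss L_Q(f_theta') updates the body and the hypernetwork
    qry_feat = body(encode_image(task.query_images.flatten(0, 1)))     # (K*M, d_body)
    loss_Q = F.cross_entropy(qry_feat @ head.t(), task.query_labels.flatten())
    meta_opt.zero_grad()
    loss_Q.backward()
    meta_opt.step()
    return loss_Q.item()
```

Here meta_opt would be an optimizer over the parameters of body and g_phi, so the query loss trains both the shared body and the hypernetwork, mirroring the two outer-loop updates of Algorithm 1.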
5 Experiments
5.1 Setup
Experimental setup
The multi-modal few-shot problem is described in Section 2. We evaluated
5-way classification accuracy on iNat-Anim with up to 10 shots. We report the average meta-test
accuracy for each model across 5 random seeds. The meta-test split consisted of 1,000 randomly
sampled tasks where all classes in this split had not previously been seen in training. Each test task
had 20 randomly-sampled query images from each class, ensuring there was no dataset imbalance.
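For concreteness, a sketch of this meta-test protocol is given below. It is our own illustration; the adapt_and_predict helper, which stands in for adapting a model on a task's support set and predicting its query labels, is an assumption.

```python
def meta_test_accuracy(adapt_and_predict, test_tasks):
    """Mean accuracy over meta-test tasks, e.g. 1,000 sampled 5-way tasks
    with 20 query images per class, repeated over 5 random seeds."""
    correct, total = 0, 0
    for task in test_tasks:
        preds = adapt_and_predict(task)   # adapt on the support set, predict query labels
        correct += (preds == task.query_labels.flatten()).sum().item()
        total += task.query_labels.numel()
    return 100.0 * correct / total
```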
Baselines
We compared the few-shot learning performance of FuMI to MAML as a natural uni-modal baseline. We additionally compared performance to metric-based meta-learning approaches: 1) Prototypical Networks [2], which computes a mean image embedding (prototype) for each class and classifies query images as the class corresponding to the closest prototype by Euclidean distance; and 2) AM3 [6], a multi-modal extension to Prototypical Networks, which learns an adaptive convex combination of the image prototype with another prototype computed from the auxiliary modality.
We used the same pre-trained image and text encoders (BERT [22] and ResNet-152 [23]) for all
models to enable fair comparison across methods. Appendix B discusses implementation details.
5.2 Results
Multi-modal fusion improves performance in the very-few-shot regime.
Figure 2 shows the
relative performance gain of FuMI compared to MAML. We find that using the task-specific initial-
ization provides significant improvements given very limited task data, whilst performance is similar
with additional examples. This is as expected, since the relative information gain from auxiliary
information is greater when there are fewer examples per class.
Table 1: Few-shot classification accuracy for uni-modal (top) and multi-modal (bottom) models on iNat-Anim over 5 random seeds.

Model            |              5-way Test Accuracy (%)
                 | 0-shot  | 1-shot  | 3-shot  | 5-shot  | 10-shot
Proto. Nets [2]  |   -     | 71.7(2) | 83.9(3) | 85.9(2) | 88.3(2)
MAML [1]         |   -     | 72(1)   | 81(1)   | 84(2)   | 87.1(1)
AM3 [6]          | 71.0(8) | 80.8(4) | 85.9(5) | 86.3(6) | 88.5(2)
FuMI (ours)      |   -     | 78.9(4) | 82.7(6) | 85.1(4) | 87.1(2)
[Figure 2: plot of accuracy gain vs. MAML (%) against number of shots (0-10) for FuMI.]
Figure 2: Few-shot classification accuracy of FuMI on iNat-Anim compared to MAML. Error bars show the uncertainty across 5 random seeds.
Metric-based methods outperform gradient-based methods on iNat-Anim. Table 1 shows the results for FuMI against the other uni- and multi-modal baselines. We find that FuMI underperforms compared to the other meta-learning approaches. We note that gradient-based meta-learning models can be particularly sensitive to hyperparameters; metric-based approaches were observed to be more robust on our dataset.
6 Related Work
A range of other MAML extensions have been
recently proposed, with improvements including
training stability [24], avoiding computational overhead from second-order derivatives [25, 26], and
exploration in meta-reinforcement learning [27]. Vuorio et al. [28] use the entire uni-modal support
set (rather than auxiliary task information) to modulate the initialization of the entire network. Raghu
et al. [29] find no decrease in performance when updating only the network head in the inner loop,
thereby concluding that the features learned in the network body are directly reused across tasks.
Based on this, in addition to early experimentation, we use the hypernetwork to directly initialize only
the network head in FuMI. An alternative approach to few-shot learning is metric-based meta-learning
[2, 6, 30, 31], which we evaluate on iNat-Anim in Section 5.
7 Conclusions
Contributions
In this work, we have introduced Fusion by Meta-Initialization, a multi-modal
gradient-based meta-learning algorithm. FuMI significantly outperforms MAML baselines given very limited data, highlighting the benefit of auxiliary information for few-shot performance. To fill
the need for large-scale benchmarks, we also constructed iNat-Anim, a few-shot image classification
dataset with high-quality class descriptions. We hope that this will enable further work on the
intersection of meta-learning and multi-modal models.
Limitations and Further Work
Methodologically, we note that the gradient-based inner-loop of
FuMI makes it vulnerable to catastrophic forgetting of the auxiliary information used for initialization.
Insights from continual learning could help mitigate this [32]. Experimentally, we plan to
broaden our evaluation of FuMI to further image-text few-shot benchmarks [13]. In addition, it would
be informative to evaluate on other modalities (e.g. audio [33]) as well as multi-modal regression and
reinforcement learning tasks.
Acknowledgments and Disclosure of Funding
MJ and SM acknowledge funding from EPSRC Centre for Doctoral Training in Autonomous Intelli-
gent Machines and Systems (Grant No: EP/S024050/1). MJ also acknowledges funding from AWS
in collaboration with the Oxford-Singapore Human Machine Collaboration initiative.
Work done partially while at University College London. We would also like to thank Yihong Chen
and Tim Rocktäschel for initially facilitating the project.
References

[1] Chelsea Finn, Pieter Abbeel, and Sergey Levine. "Model-agnostic meta-learning for fast adaptation of deep networks". In: International Conference on Machine Learning. PMLR. 2017, pp. 1126–1135.
[2] Jake Snell, Kevin Swersky, and Richard Zemel. "Prototypical networks for few-shot learning". In: Advances in Neural Information Processing Systems 30 (2017).
[3] Yaqing Wang et al. "Generalizing from a few examples: A survey on few-shot learning". In: ACM Computing Surveys (CSUR) 53.3 (2020), pp. 1–34.
[4] Yao Ma et al. "Multimodality in meta-learning: A comprehensive survey". In: Knowledge-Based Systems (2022), p. 108976.
[5] Mengmeng Ma et al. "SMIL: Multimodal learning with severely missing modality". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. 3. 2021, pp. 2302–2310.
[6] Chen Xing et al. "Adaptive cross-modal few-shot learning". In: Advances in Neural Information Processing Systems 32 (2019).
[7] Zeynep Akata et al. "Label-embedding for image classification". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 38.7 (2015), pp. 1425–1438.
[8] Xin Wang et al. "TAFE-Net: Task-aware feature embeddings for low shot learning". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 1831–1840.
[9] Yao-Hung Hubert Tsai, Liang-Kang Huang, and Ruslan Salakhutdinov. "Learning robust visual-semantic embeddings". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3571–3580.
[10] Edgar Schonfeld et al. "Generalized zero- and few-shot learning via aligned variational autoencoders". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019, pp. 8247–8255.
[11] David Ha, Andrew Dai, and Quoc V. Le. "HyperNetworks". In: arXiv preprint arXiv:1609.09106 (2016).
[12] Yongqin Xian et al. "Zero-Shot Learning—A Comprehensive Evaluation of the Good, the Bad and the Ugly". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 41.9 (2019), pp. 2251–2265. DOI: 10.1109/TPAMI.2018.2857768.
[13] Catherine Wah et al. "The Caltech-UCSD Birds-200-2011 dataset". In: (2011).
[14] Tzuf Paz-Argaman et al. "ZEST: Zero-shot learning from text descriptions using textual similarity and visual summarization". In: arXiv preprint arXiv:2010.03276 (2020).
[15] Mohamed Elhoseiny et al. "Link the head to the 'beak': Zero shot learning from noisy text description at part precision". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 5640–5649.
[16] Oriol Vinyals et al. "Matching networks for one shot learning". In: Advances in Neural Information Processing Systems 29 (2016).
[17] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. "Human-level concept learning through probabilistic program induction". In: Science 350.6266 (2015), pp. 1332–1338.
[18] Grant Van Horn et al. "The iNaturalist species classification and detection dataset". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 8769–8778.
[19] Eli Schwartz et al. "Baby steps towards few-shot learning with multiple semantics". In: Pattern Recognition Letters 160 (2022), pp. 142–147.
[20] Grant Van Horn and Oisin Mac Aodha. iNat Challenge 2021. URL: https://sites.google.com/view/fgvc8/competitions/inatchallenge2021?authuser=0.
[21] Animalia: online animal encyclopedia. https://animalia.bio/.
[22] Jacob Devlin et al. "BERT: Pre-training of deep bidirectional transformers for language understanding". In: arXiv preprint arXiv:1810.04805 (2018).
[23] Kaiming He et al. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.
[24] Antreas Antoniou, Harrison Edwards, and Amos Storkey. "How To Train Your MAML". In: ICLR (2018).
[25] Xingyou Song et al. "ES-MAML: Simple Hessian-Free Meta Learning". In: International Conference on Learning Representations. 2020. URL: https://openreview.net/forum?id=S1exA2NtDB.
[26] Alex Nichol, Joshua Achiam, and John Schulman. "On first-order meta-learning algorithms". In: arXiv preprint arXiv:1803.02999 (2018).
[27] Bradly Stadie et al. "The importance of sampling in meta-reinforcement learning". In: Advances in Neural Information Processing Systems 31 (2018).
[28] Risto Vuorio et al. "Multimodal model-agnostic meta-learning via task-aware modulation". In: Advances in Neural Information Processing Systems 32 (2019).
[29] Aniruddh Raghu et al. "Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML". In: ICLR. 2020.
[30] Gregory Koch, Richard Zemel, Ruslan Salakhutdinov, et al. "Siamese neural networks for one-shot image recognition". In: ICML Deep Learning Workshop. Vol. 2. Lille. 2015.
[31] Han Hu et al. "Relation Networks for Object Detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). June 2018.
[32] Matthias De Lange et al. "A continual learning survey: Defying forgetting in classification tasks". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 44.7 (2021), pp. 3366–3385.
[33] Valentin Vielzeuf et al. "CentralNet: a multilayer approach for multimodal fusion". In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops. 2018.
[34] Adam Paszke et al. "PyTorch: An imperative style, high-performance deep learning library". In: Advances in Neural Information Processing Systems 32 (2019).
[35] Thomas Wolf et al. "HuggingFace's Transformers: State-of-the-art natural language processing". In: arXiv preprint arXiv:1910.03771 (2019).
[36] Tristan Deleu et al. "Torchmeta: A meta-learning library for PyTorch". In: arXiv preprint arXiv:1909.06576 (2019).
[37] Diederik P. Kingma and Jimmy Ba. "Adam: A method for stochastic optimization". In: arXiv preprint arXiv:1412.6980 (2014).
[38] Nitish Srivastava et al. "Dropout: a simple way to prevent neural networks from overfitting". In: The Journal of Machine Learning Research 15.1 (2014), pp. 1929–1958.
Checklist
1. For all authors...
(a)
Do the main claims made in the abstract and introduction accurately reflect the paper’s
contributions and scope? [Yes]
(b) Did you describe the limitations of your work? [Yes] See Section 7
(c) Did you discuss any potential negative societal impacts of your work? [Yes] See the Broader Impact section (Appendix A).
(d)
Have you read the ethics review guidelines and ensured that your paper conforms to
them? [Yes]
2. If you are including theoretical results...
(a) Did you state the full set of assumptions of all theoretical results? [N/A]
(b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
(a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] https://github.com/s-a-malik/multi-few
(b)
Did you specify all the training details (e.g., data splits, hyperparameters, how they
were chosen)? [Yes] See Section 5.1 and Appendix B
(c)
Did you report error bars (e.g., with respect to the random seed after running experi-
ments multiple times)? [Yes] See Table 1 and Figure 2
(d)
Did you include the total amount of compute and the type of resources used (e.g., type
of GPUs, internal cluster, or cloud provider)? [Yes] See Appendix B
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
(a)
If your work uses existing assets, did you cite the creators? [Yes] See Section 3 and
Appendix C
(b) Did you mention the license of the assets? [Yes] See Appendix C
(c)
Did you include any new assets either in the supplemental material or as a URL? [Yes]
https://doi.org/10.5281/zenodo.6703088
(d)
Did you discuss whether and how consent was obtained from people whose data you’re
using/curating? [Yes] See Appendix C
(e)
Did you discuss whether the data you are using/curating contains personally identifiable
information or offensive content? [Yes] See Appendix C
5. If you used crowdsourcing or conducted research with human subjects...
(a)
Did you include the full text of instructions given to participants and screenshots, if
applicable? [N/A]
(b)
Did you describe any potential participant risks, with links to Institutional Review
Board (IRB) approvals, if applicable? [N/A]
(c)
Did you include the estimated hourly wage paid to participants and the total amount
spent on participant compensation? [N/A]
A Broader Impact
In this work we seek to develop better methods for few-shot learning. Few-shot learning has the potential to democratise access to powerful machine learning methods, as it enables their use in resource-constrained environments and by minority groups that may not be well-represented in datasets, which have largely been curated in Western cultures. However, it could also have negative implications; for example, it could be used in facial recognition, reducing the privacy of individuals.

The dataset we release with this work (iNat-Anim) was developed to help evaluate multi-modal few-shot learning. It consists of shorter and less noisy textual descriptions than previous works [13]. This enables method development in the field with smaller models, and could therefore reduce the environmental impact of training large models for research purposes. While we inspected the data as much as possible, we have not checked all of the descriptions obtained from the Animalia website. There always remains a risk with web-scraped data that harmful or biased descriptions could be present, and as such the data must be used with care.
B Implementation Details
All models were implemented in PyTorch [34]. Each training run took 1.5 to 3 hours on a single free-tier Google Colaboratory GPU. We used BERT [22] (bert-base-uncased) from the Hugging Face library [35] as the pre-trained text encoder for all models. We followed the standard pre-processing routine for BERT, which involved truncating descriptions that were longer than the maximum sequence length for the model. We used ResNet-152 [23], which has a feature dimension of 2048, as the pre-trained image encoder for all models. The final layer (head) was fine-tuned but all other parameters were frozen during training.
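A sketch of this encoder setup is shown below. It is our own illustration under stated assumptions (the wrapper functions and the choice of BERT pooled output are not taken from the released code); the frozen encoders simply expose fixed text features and 2048-dimensional image features to the models above.

```python
import torch
from torchvision import models
from transformers import BertModel, BertTokenizer

# Frozen pre-trained text encoder (bert-base-uncased); the tokenizer truncates
# descriptions longer than the model's maximum sequence length.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").eval()
for p in bert.parameters():
    p.requires_grad = False

def encode_text(descriptions):
    tokens = tokenizer(descriptions, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return bert(**tokens).pooler_output      # (K, 768) text embeddings

# Frozen pre-trained image encoder (ResNet-152) with the classification layer removed,
# exposing 2048-dimensional features; only downstream layers are trained.
resnet = models.resnet152(pretrained=True).eval()
resnet.fc = torch.nn.Identity()
for p in resnet.parameters():
    p.requires_grad = False

def encode_image(images):                        # images: (B, 3, H, W)
    with torch.no_grad():
        return resnet(images)                    # (B, 2048) image features
```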
The Torchmeta library [36] was used to construct meta-splits for the dataset. We used a 60:20:20 train:validation:test class split. Due to computational restrictions we could not perform extensive hyperparameter tuning. Instead, hyperparameters were chosen via simple heuristics that maximized accuracy on the validation split, using suggestions from the literature as starting points. The validation split was also used to select the best model checkpoint using the validation loss. Our code has been open-sourced⁴. Salient hyperparameters were as follows:
AM3/Prototypical Networks We set the prototype dimension to 512, and used a single-hidden-layer neural network with hidden dimension 512 for each of the $g$ and $h$ networks. We found that a number of tasks per batch of 5, 3, 2 and 1, and a query set size during training of 10, 8, 8 and 8, worked well heuristically for 1, 3, 5 and 10 shots respectively. In all cases, we used the Adam optimizer [37] with learning rate 0.001. Additionally, we used dropout [38] with $p = 0.2$ and an L2 weight decay of 0.0005 to prevent overfitting. For the zero-shot AM3 model, we forced $\lambda = 0$ to remove the dependence of the prototype on the support images. For the uni-modal prototypical network baseline [2], we simply used our AM3 implementation but manually forced $\lambda = 1$, which removes the dependence of the prototype on the text.
FuMI/MAML We used 4 tasks per batch, and a query set size during training of 32 for all shots. For both FuMI and MAML, the model consisted of three fully-connected layers, with hidden layer widths of 256 and 64. The FuMI hypernetwork also consisted of two fully-connected layers, with a hidden layer width of 256. A dropout rate of $p = 0.25$ was used. Again, we used the Adam optimizer with learning rate 0.00003 and an L2 weight decay of 0.0005 for outer-loop training. For the inner loop, a step size of 0.01 was used, with 5 training updates on the support set at meta-train time. At meta-test time, 50, 50, 100 and 100 inner-loop updates were performed on the support set for 1, 3, 5 and 10 shots respectively.
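Based on the architecture stated above, the following is a sketch of the FuMI hypernetwork $g_\phi$ that maps a class-description embedding to that class's head initialization. It is illustrative only: the output dimensionality (one weight row over 64-dimensional body features, with no generated bias) and the placement of dropout are our assumptions.

```python
import torch.nn as nn

class FuMIHyperNetwork(nn.Module):
    """g_phi: text embedding t_i -> initial head parameters theta^Head_i for one class."""
    def __init__(self, d_text=768, d_hidden=256, d_body_out=64, dropout=0.25):
        super().__init__()
        # Two fully-connected layers with a 256-unit hidden layer, as stated above.
        self.net = nn.Sequential(
            nn.Linear(d_text, d_hidden),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_hidden, d_body_out),   # one head weight row per class
        )

    def forward(self, text_emb):               # (K, d_text) -> (K, d_body_out)
        return self.net(text_emb)
```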
C iNat-Anim Details
The images are a subset of the images from the iNaturalist 2021 CVPR challenge [20] and have been augmented with textual descriptions of each species from Animalia [21], an online animal encyclopedia. Full permission for website scraping and dataset publication was obtained from the owners of Animalia prior to release of the dataset. We place a CC BY-NC 4.0⁵ licence on the textual descriptions scraped from Animalia, whilst retaining the per-image licensing from the relevant subset of iNaturalist. The descriptions are curated by the website owners and we manually inspected a significant proportion for quality. To the best of our knowledge, the descriptions do not contain any personally identifiable or offensive material. Figure 3 shows the distributions of classes and description lengths across the dataset.
[Figure 3: pie chart of species counts (348, 195 and 129) across birds (Aves), reptiles (Reptilia) and mammals (Mammalia); histogram of description lengths, spanning roughly 50-200 words.]
Figure 3: The left pane shows the distribution of species in iNat-Anim across birds, reptiles and
mammals. The right pane shows a histogram of the number of words in each description of each
species.
⁴ https://github.com/s-a-malik/multi-few
⁵ https://creativecommons.org/licenses/by-nc/4.0/