Content uploaded by Harsh Sikka

Author content

All content in this area was uploaded by Harsh Sikka on Oct 20, 2020

Content may be subject to copyright.

Multimodal Modular Meta-Learning

Harshvardhan D. Sikka

Georgia Institute of Technology

Manifold Computing

OpenMined

harsh@manifoldcomputing.com

Atharva A. Tendle

University of Nebraska-Lincoln

Manifold Computing

OpenMined

atharva.tendle@huskers.unl.edu

Amr Kayid

German University in Cairo

Manifold Computing

OpenMined

amrmkayid@gmail.com

Abstract

Many real world prediction problems involve structured tasks across multiple

modalities. We propose to extend previous work in modular meta learning to the

multimodal setting. Speciﬁcally, we present an algorithmic approach to apply task

aware modulation to a modular meta learning system that decomposes structured

multimodal problems into a set of modules that can be reassembled to learn new

tasks. We also propose a series of experiments to compare this approach with

state of the art modular and multimodal meta learning approaches on multimodal

function prediction and image classiﬁcation tasks.

1 Introduction

Leveraging previous experiences to acquire new skills is relatively easy for humans, but it presents

signiﬁcant theoretical and computational challenges for machine learning systems. Current machine

learning systems are "specialists" which excel in the tasks they are trained for but often fall apart

when attempting a different task. Meta-learning is a ﬁeld that has the potential to create "generalist"

algorithms, and it does so through the process of "learning to learn" [

1

,

2

]. These "generalist"

algorithms can adapt to newer tasks by leveraging prior experiences. In recent years there have been

advancements that allow machines to adapt to new tasks by learning an internal representation over

the tasks in the training-data distribution [

3

–

6

]. These advancements have achieved some degree of

success in unimodal domains, and some have been extended to the multimodal setting. For example,

MMAML [

7

] attempts to solve the multi-modal problem by creating a framework that identiﬁes the

mode of sampled tasks and then modulates meta-learned priors that better ﬁt the mode.

A wide variety of interesting and relevant real world problems demonstrate multimodality along with

some inherent structure, including robotics tasks, autonomous navigation, and multimodal language

and vision tasks. Many model agnostic meta learning approaches will intuitively struggle in these

domains due to the constraints of the learning algorithms and biases they employ to generalize to new

tasks. For example, in the MMAML methodology introduced in [

7

] the structure of the architecture

stays constant and therefore the performance is limited to the priors generated by their modulation

network, leaving room for improvement.

Modular approaches provide an interesting avenue for addressing structured problem domains and

have achieved some success with solving hierarchical problems in the past [

8

–

11

]. An interesting

direction in Meta-learning is learning a set of reusable modules from the decomposition of a structured

task, and then recombining those modules to solve new tasks [12].

Previous attempts at solving real world multimodal problems highlighted the inherent structure in

many of these problem spaces. In this paper, we propose an extension to modular meta-learning that

learns a set of modules and combines them for tasks in a variety of domains. The text is structured

as follows: In Section 2, we discuss related work in the areas of modularity, multimodality, and

meta-learning. In Section 3, we discuss some critical preliminaries to develop a modular meta

learning approach for multiple modalities. Following this, in Section 4 we break down our approach

and explicitly outline a pseudocode meta learning algorithm, T4ML. Finally, we discuss experimental

aims in detail in Section 5.

2 Related Work

Our work builds primarily from two sources: multimodal meta-learning and modular meta-learning.

Meta-learning empowers machines with the ability of learning to learn by designing models that

rapidly learn new skills with a few training examples. Notable examples of meta-learning are Model-

Agnostic Meta-Learning MAML [

3

] and related optimization-based methods [

4

–

6

,

13

]. MAML is a

general optimization algorithm that aims to estimate a good initialization of a model’s parameters to

achieve optimal fast learning on a new task with only a small number of gradient steps. However,

having a common initialization for all tasks can restrict the performance on a multimodal task

distribution.

Multimodal Model-Agnostic Meta-Learning [

7

] is a more powerful model-agnostic meta-learning

framework for the multimodal setting. It augments MAML to identify tasks sampled from a mul-

timodal task distribution and adapts quickly through gradient updates. This framework achieved

superior generalization performance in multimodal few-shot regression, image classiﬁcation, and

reinforcement learning tasks.

Recently, investigating the structure of neural networks and designing modular networks has become

important for achieving efﬁcient performance. Modularity is an important principle as it provides a

natural way of achieving compositionality and generalization, and has been successfully applied to

building static neural networks [

8

,

14

–

16

]. Moreover some studies have found that some types of

modular structures emerged in standard neural networks [

17

,

18

]. New strategies have been proposed

for combining the modularity of neural networks with meta learning [

12

,

19

,

20

], with a general trend

of learning modules that can be recombined to solve new tasks, leading to better performance and

combinatorial generalization.

We aim to develop a more efﬁcient and adaptable framework that is able to deal with multimodal

task distributions while providing modularity and using efﬁcient neural network architectures for new

tasks.

3 Preliminaries

As presented in [

3

], a task is deﬁned by the joint distribution

PT(x, y)

over the input, output pairs

(x, y)

. Meta-learning aims to learn functions that approximate mappings for

K

number of input and

output data

(xk, yk)Kt

k=1

across a number of different tasks

t∈T

. Data for each task

t

is split into

training and testing datasets, Dtrain

tand Dtest

t.

MAML:

The goal of MAML [

3

] is to ﬁnd an initialization of parameters

θ

for the meta learning

algorithm such that convergence to good performance on a new task can be accomplished with

relatively few gradient steps trained on

Dtrain

t

and evaluated for generalization on

Dtest

t

. The

initialization

θ

is found by training on groups of tasks and evaluating computed parameters from

those tasks to calculate the test losses on the whole test data for the batch of tasks. The gradients

of the losses are then used to update

θ

. We adopt the deﬁnition of unimodality and multimodality

presented in [

7

]. If the task distribution contains tasks that belong to a single input domain, it is

considered a unimodal distribution. Alternatively, if there are multiple label and input domains, we

consider the distribution to be multimodal.

Multimodal MAML:

Presented in [

7

], MMAML extends MAML in a framework that allows for

learning novel tasks in a multimodal setting. The central idea involves using a modulation network

to predict the modality of the task and initialize optimal parameters

θ

for that modality in the task

2

network. The modulation network works by taking in the input output data

(xk, yk)Kt

k=1

and passing

them to a task encoder,

h

. The encoder produces an embedding

v

, which is then used to compute the

task-speciﬁc parameters

τ

that are used to later modulate the meta-learned parameters of the task

network. vand τare formalized as the following:

v=h({(xk, yk)})Kt

k=1;wh)(1)

τ=gi(v;wg)N

i=1 (2)

Modulations are practically achieved by applying transformations to each building block in the task

network, which is an arbitrarily parameterized function like a Neural Network. Building blocks are

denoted by

i

, and transformations scale and shift the outputs of the neurons in a given block. The

function

g

is made up of feedforward neural networks, each trained to ﬁnd

τ

for a single block in

the task network. After modulation, the task network undergoes a few steps of gradient descent to

achieve optimal performance on the task t.

Modular Meta Learning:

In [

12

], the authors present BounceGrad, an approach that learns a set

of modules and combines them to map to new tasks. Starting with a compositional rule and a set of

modules, the authors present a hypothesis space deﬁned by the set of functional mappings

(C, F, Θ)

.

Modules

f

in the basis set

F

are neural networks with varied architectures, parameterized by

θ∈Θ

.

C

corresponds to a compositional scheme for the generation of complex functions from simpler ones,

and involves operators that allow for the composition of the neural modules mentioned earlier.

S∈S

is a particular structure in the space of all particular structure generated by C through the composition

of neural modules

f

. BounceGrad consists of 2 phases, ﬁrst learning the optimal structure

S

and

subsequently ﬁnding the

Θ

that minimizes the average generalization performance. During the ﬁrst

phase,

Θ

is ﬁxed, and the candidate structure

S

is tested on the training split from a set of Data held

out of the main data, known as the meta-test data. The formulation for Phase 1 is as follows

S∗

Θ= arg min

S∈S

e(Dtrain

meta−test, S, Θ) (3)

where

e(D, S, Θ) = Σ{(x,y)∈D}L(SΘ(x), y )

deﬁnes the loss of a candidate structure with parameters

Θ

. During the second phase, the structure is now speciﬁed, and the goal is to ﬁnd parameters for

the modules that can be used to solve the training tasks. The authors use validation sets for the

meta-training tasks to prevent ﬁnding parameters that overﬁt:

J(Θ) = Σm

j=1e(Dtest

j,arg min

S∈S

e(Dtrain

j, S, Θ),Θ) (4)

The authors used simulated annealing [

21

] to search for an optimal structure S, starting with an initial

structure and randomly proposing changes that are accepted or rejected. In this work, we propose to

extend the BounceGrad approach to the multimodal setting via task aware modulation. We describe

our algorithm in the following section.

4 Method

Our goal is to develop a modular approach to multimodal meta-learning through task aware mod-

ulation (T4ML). We present an initial T4ML algorithm, which learns a collection of modules that

can be combined for a new task samples from a multimodal task distribution. A visual overview of

the approach, as well as the complete concrete algorithm in the form of pseudo-code is included in

Figure 1 and Algorithm 1, respectively.

T4ML builds on the two phase optimization approach outlined in 3 and 4 by adding a third phase

involving task aware modulation. First, after being provided with a basis set of modules

F

, a

compositional scheme

C

, and an initial set of parameters for the modules

Θ

, T4ML searches for

an optimal structure

S∗

in the same way as 3. Following BounceGrad, the simulated annealing

search is performed starting with an initial candidate structure and randomly modifying it based on

the constraints of

C

to ﬁnd potentially better candidates. This search is performed over a batch of

3

tasks, as outlined explicitly in lines 3 through 8 of Algorithm 1. Once an optimal structure

S∗

is

found, Phase 2 begins.

Θ

is now found such that it minimizes the generalization performance of the

candidate structure as described in 4. This process is described in lines 9 through 14 of Algorithm 1.

With optimal structure

S∗

and optimal initilizations

Θ

found, T4ML introduces task aware modulation

as described in MAML. A modulation network is introduced, consisting of a task encoder and a

series of deep feed forward neural networks (DNNs). The task encoder generates task embeddings

v

as described in 1. This is passed to the DNNs, which generate modulation parameters

τ

as outlined

in 2.

S∗

functions as the task network in the MMAML setting, and task aware initiliazations are

found through applying modulations to the module parameters

Θ

,

Θi. . . τi

, where

i= 1, ..., N

for

N

modules. Different modulations can be used, including softmax based modulation. Different

modulation operators will be explored in the experiments outlined in Section 5. During Phase 3, the

modulation network and task network are trained end to end on the batch of tasks, as outlined in lines

15 through 23 of Algorithm 1. On a per task basis,

j

, the gradients for the loss with regards to the

data samples

K

are calculated, and the parameters of

Θ

are updated. This process is outlined in lines

16 through 19 of Algorithm 1. After training has been done on all tasks in the batch, the parameters

are updated for the task network

Θ

. The parameters for the modulation network are also updated

in the same way, denoted by

wh

for the task encoder

v

and

wg

for the DNNs that output

τ

. These

updates are described in lines 11 through 13 in Algorithm 1.

We aim to develop and demonstrate the described algorithm, and compare it with state of the art

approaches in Modular and Multimodal meta-learning, as described in Section 5.

Figure 1:

Left:

T4ML Overview. Step 1 involves ﬁnding optimal network structure

S∗

and initializa-

tion Θ. Step 2 modulates S∗

Θusing the generated parameters τ.Right: T4ML algorithm.

4

5 Experimental Approach

To assess the effectiveness of the proposed method in multi-modal settings, we compare it against

other meta-learning approaches across multi-modal domains. We adopt and extend the experimental

procedures demonstrated in previous meta-learning papers [

7

,

12

] for clarity and reproduceability.

T4ML presents an initial exploration into a general approach for modular, multi-modal meta-learning,

and as such the methods compared in the experimental approach are speciﬁcally those that are

known to operate across domains well. Task domains to be explored include multimodal few shot

regression and multimodal image classiﬁcation. These tasks serve as important benchmarks in the

broader meta-learning community, and lay the foundation for research into more complicated problem

formulations like Reinforcement Learning or Robotics. We seek to compare the following baseline

meta-learning methods on the mentioned tasks:

MAML: Model-agnostic meta learners with a ﬁxed task network across different task domains.

MMAML: The architecture of the task network in the MMAML setup will be identical to MAML.

BounceGrad

: The modules in this approach will be shared with T4ML to demonstrate a meaningful

baseline for modular approaches that weren’t designed with multimodal tasks in mind.

In comparing these distinct methods with T4ML, we hope to highlight the usefulness of modular

approaches specialized for multi-modal domains. We outline the Experimental approach for each

domain in detail below.

5.1 Regression Domain

We will begin by testing the various baselines on a combination of different function prediction

tasks. We follow the general premise introduced in [

12

] and extend it with the setup of the Sine

function prediction task introduced in [

3

]. We setup 4 different one dimensional functions: sinusoidal

functions, linear functions, sums of common non-linear functions, and quadratic functions. Data

is samples with gaussian noise added to the output values. Pairs of input and output samples will

be sampled from the function being tested and passed to the different meta-learning baselines for

learning. The model is tasked with predicting output values for associated inputs.

Method Conﬁgurations

MAML and MMAML will both make use of deep feedforward neural

networks as task networks. MMAML will use an LSTM as the modulation network because of its

success with sequential inputs and its use in the original MMAML conﬁguration. As compared to

the previous baselines, BounceGrad and T4ML make use of a compositional structure to construct

the equivalent of the task network in MMAML. The general compositional scheme for sinusoidal

functions, linear functions, and quadratic functions will be

h(x) = fi(fj(x))

and

h(x) = fi(x) +

fj(x)

for sums of non-linear functions, following the general setup of experiments in [

12

]. The

compositional space

F

consists of 10 feedforward neural network modules, half of which have 1

hidden layer and half of which have 2. T4ML makes use of the same LSTM based modulation

network as MMAML. Modulation approaches will include FiLM [22] and softmax [23].

5.2 Image Classiﬁcation

For Multimodal Image Classiﬁcation, we generally follow the experimental procedure set forth in [

7

].

The task can be surmised as classifying images into a set of classes with a few number of samples

available. We will combine several popular image datasets into a multimodal few-shot image dataset.

The datasets to be used for this task are Mini-ImageNet [

24

], FC100 [

25

], CUB [

26

], AIRCRAFT [

27

],

and OMNIGLOT [

28

]. classiﬁcation Models are trained on different combinations of modalities,

mainly 3 and 5 mode combinations. This procedure follows the some of the experiments presented in

the original MMAML paper, and allows for a direct comparison between T4ML and MMAML.

Method Conﬁgurations

In these experiments, MAML and MMAML will now make use of

small 5 layer convolutional neural networks, along with the same LSTM modulation network. The

compositional scheme used by BounceGrad and T4ML will be

h(x) = fi(fj(x))

owing to the

heirarchical nature of the image domain, and the compositional space will consist of CNN modules,

with 3 and 5 layer combinations. T4ML will continue to use an LSTM for task aware modulation,

and will use FiLM and softmax.

5

References

[1]

Jürgen Schmidhuber. Evolutionary principles in self-referential learning; On learning how to

learn: The meta-meta-...hook. PhD thesis, Institut f. Informatik, Tech. Univ. Munich, 1987.

[2] S. Thrun and L. Pratt. Learning to learn. Springer, 1998.

[3]

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adapta-

tion of deep networks. arXiv preprint arXiv:1703.03400, 2017.

[4]

Alex Nichol, Joshua Achiam, and John Schulman. On ﬁrst-order meta-learning algorithms.

arXiv preprint arXiv:1803.02999, 2018.

[5]

Tianhe Yu, Chelsea Finn, Annie Xie, Sudeep Dasari, Tianhao Zhang, Pieter Abbeel, and Sergey

Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv

preprint arXiv:1802.01557, 2018.

[6]

Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. In

Advances in Neural Information Processing Systems, pages 9516–9527, 2018.

[7]

Risto Vuorio, Shao-Hua Sun, Hexiang Hu, and Joseph J Lim. Multimodal model-agnostic meta-

learning via task-aware modulation. In Advances in Neural Information Processing Systems,

pages 1–12, 2019.

[8]

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks. In

Proceedings of the IEEE conference on computer vision and pattern recognition, pages 39–48,

2016.

[9]

Gasser Auda and Mohamed Kamel. Modular neural networks: a survey. International Journal

of Neural Systems, 9(02):129–151, 1999.

[10]

Bart LM Happel and Jacob MJ Murre. Design and evolution of modular neural network

architectures. Neural networks, 7(6-7):985–1004, 1994.

[11]

Harshvardhan Sikka. Creating, managing, and understanding large, sparse, multitask neural

networks. 2020.

[12]

Ferran Alet, Tomás Lozano-Pérez, and Leslie P Kaelbling. Modular meta-learning. arXiv

preprint arXiv:1806.10166, 2018.

[13]

Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn.

Bayesian model-agnostic meta-learning. In Advances in Neural Information Processing Systems,

pages 7332–7342, 2018.

[14]

Michael B Chang, Abhishek Gupta, Sergey Levine, and Thomas L Grifﬁths. Automati-

cally composing representation transformations as a means for generalization. arXiv preprint

arXiv:1807.04640, 2018.

[15]

Dzmitry Bahdanau, Shikhar Murty, Michael Noukhovitch, Thien Huu Nguyen, Harm de Vries,

and Aaron Courville. Systematic generalization: what is required and can it be learned? arXiv

preprint arXiv:1811.12889, 2018.

[16]

Louis Kirsch, Julius Kunze, and David Barber. Modular networks: Learning to decompose

neural computation. In Advances in Neural Information Processing Systems, pages 2408–2418,

2018.

[17]

Chihiro Watanabe, Kaoru Hiramatsu, and Kunio Kashino. Modular representation of layered

neural networks. Neural Networks, 97:62–73, 2018.

[18]

Daniel Filan, Shlomi Hod, Cody Wild, Andrew Critch, and Stuart Russell. Neural networks are

surprisingly modular. arXiv preprint arXiv:2003.04881, 2020.

[19]

Rohan Chitnis, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. Learning quickly to plan

quickly using modular meta-learning. In 2019 International Conference on Robotics and

Automation (ICRA), pages 7865–7871. IEEE, 2019.

6

[20]

Yutian Chen, Abram L Friesen, Feryal Behbahani, Arnaud Doucet, David Budden, Matthew W

Hoffman, and Nando de Freitas. Modular meta-learning with shrinkage. arXiv preprint

arXiv:1909.05557, 2019.

[21] Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi. Optimization by simulated annealing.

science, 220(4598):671–680, 1983.

[22]

Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film:

Visual reasoning with a general conditioning layer. arXiv preprint arXiv:1709.07871, 2017.

[23]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,

Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information

processing systems, pages 5998–6008, 2017.

[24] Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.

[25]

Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. Tadam: Task dependent adaptive

metric for improved few-shot learning. In Advances in Neural Information Processing Systems,

pages 721–731, 2018.

[26]

Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The

caltech-ucsd birds-200-2011 dataset. 2011.

[27]

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-

grained visual classiﬁcation of aircraft. arXiv preprint arXiv:1306.5151, 2013.

[28]

Brenden Lake, Ruslan Salakhutdinov, Jason Gross, and Joshua Tenenbaum. One shot learning

of simple visual concepts. In Proceedings of the annual meeting of the cognitive science society,

volume 33, 2011.

7