Active Word Learning through Self-supervision
Lieke Gelderloos (l.j.gelderloos@uvt.nl)
Department of Cognitive Science and Artificial Intelligence
Tilburg University
Alireza Mahmoudi Kamelabad¹ (a.m.kamelabad@gmail.com)
CIMeC - Center for Mind/Brain Sciences
University of Trento

Afra Alishahi (a.alishahi@uvt.nl)
Department of Cognitive Science and Artificial Intelligence
Tilburg University

¹ Project carried out while visiting the Department of Cognitive Science and Artificial Intelligence at Tilburg University.
Abstract
Models of cross-situational word learning typically character-
ize the learner as a passive observer, but a language learn-
ing child can actively participate in verbal and non-verbal
communication. We present a computational study of cross-
situational word learning to investigate whether a curious word
learner who actively influences linguistic input in each context
has an advantage over a passive learner. Our computational
model learns to map words to objects in real images by self-
supervision through simulating both word comprehension and
production. We examine different curiosity measures as guid-
ing input selection, and analyze the relative impact of each
method. Our results suggest that active learning leads to higher
overall performance, and a formulation of curiosity which re-
lies both on subjective novelty and plasticity yields the best
performance and learning stability.
Keywords: Cross-situational word learning; Computational
modelling; Active learning; Curiosity.
Introduction
An important task in language acquisition is learning which
words refer to which objects in the world. Cross-situational
word learning is the process by which learners match words
to objects by tracking word-object co-occurrences over many
instances. Often when a word is encountered, there are many
candidate objects in the context that the word might refer to.
However, when one takes into account multiple occurrences
of the word, some object(s) will be consistently present, mak-
ing them more likely candidates.
Models of cross-situational word learning typically char-
acterize the learner as a passive observer. However, a lan-
guage learning child can actively participate in verbal and
non-verbal communication. In doing so, they may actively
shape their linguistic input. Bloom, Margulis, Tinker, and
Fujita (1996) found that in early child-caregiver interaction,
children often introduce a new topic into the conversation,
and parents are likely to follow up by continuing to talk about
the same topic. Another process through which children may
shape their own language input is joint attention: certain ob-
jects are the focus of attention of both participants in a con-
versation. Caregivers are sensitive and responsive to chil-
dren’s attention in several ways, and talk about objects when
they are the focus of attention (Chang, de Barbaro, & Deák,
2016). Several studies have found that caregivers following
the child’s attention (as opposed to drawing a child’s atten-
tion towards a certain object), and talking about objects that
are already in focus, are correlated with word learning (Akhtar,
Dunham, & Dunham, 1991; Tomasello & Farrar, 1986).
In this work, we consider the possibility that language
learners use their active role in communication to elicit the
most informative linguistic input from their interlocutors. In
several other domains, children attend to or sample informa-
tive data. When searching for rewards, children, more than
adults, explore uncertain options (Schulz, Wu, Ruggeri, &
Meder, 2019). Kidd, Piantadosi, and Aslin (2012, 2014) show
a Goldilocks effect in infants’ attention to visual and auditory
stimuli. Infants attend especially to stimuli that are somewhat
complex, but not too complex; somewhat predictable, but not
entirely, according to their current knowledge state. We pro-
pose that children may also display such curiosity-driven be-
haviour during language acquisition, and study the potential
effects of an active input selection mechanism on word learn-
ing in a computational setup.
In artificial intelligence research, different implementa-
tions of active learning have been proposed and studied, in-
cluding definitions based on novelty, predictability, or task
success in the long run (see Oudeyer and Kaplan (2009) for
an overview). In a cognitively motivated study, Twomey and
Westermann (2018) show that a model of visual category
learning achieves maximal learning results when it selects its
input according to a curiosity metric that balances the novelty
of the stimulus and potential knowledge update. This concept
was applied to the task of cross-situational word learning in
a computational study of Keijser, Gelderloos, and Alishahi
(2019). They show that an agent that curiously selects objects
to receive linguistic input for eventually learns word-object
mappings more accurately and robustly. However, in their
study, learning to understand which word maps to which ob-
ject is a supervised process – an unrealistic assumption with
respect to the human language acquisition process. Also, they
only investigate a single formulation of curiosity, and it is not
clear which component of this formulation yields the gain ob-
served in their simulation results.
Inspired by Keijser et al. (2019), in this paper we use a
computational study of cross-situational word learning to in-
vestigate whether a curious word learner who actively selects
the linguistic input in each context has an advantage over a
passive learner. In our study,
• we simulate word learning as a self-supervised process,
  where instead of relying on corrective feedback on labels
  of objects or referents of words from the environment, the
  model relies on consistency when applying its own ac-
  quired linguistic knowledge to both word comprehension
  and production tasks;
• we use real images as visual context, and use pre-identified
  objects and their annotated labels as learning material for
  our computational model;
• we examine different curiosity measures as guiding input
  selection, and analyze the relative impact of each method
  on the overall performance and stability of the word learn-
  ing model.
Our results suggest that a curious learner who actively influ-
ences linguistic input has an advantage over a passive learner.
Furthermore, a formulation of curiosity which relies both on
subjective novelty and plasticity yields the best performance
and learning stability, compared to using each of these factors
alone.
Method
Task
We operationalize word reference learning as a dual process,
involving both comprehension, or understanding which ob-
ject a word refers to, and production, or the ability to use a
word to describe a given object. The task then is twofold: for
the receptive part of the agent, the goal is to be able to identify
the true referent of a word among a set of candidate objects;
and for the productive part, the goal is to refer to an object by
the appropriate word.
Comprehension The learner receives as input a set of ob-
jects from a visual scene, and a word. Objects are vectors
carrying high-level visual information, each representing one
patch of the image processed by a pretrained object recogni-
tion model. The task is to determine which of the objects is
the best referent for the input word.
Production The learner receives one visual object as input,
and its task is to determine which word is the best label to
refer to this object. As in comprehension, the object is repre-
sented as a visual vector. Unlike in comprehension, the pro-
duction module does not take into account the rest of the vi-
sual scene.²
² In reality, the most effective word or expression to describe an
object depends not just on the object in question, but also on what
other objects are present in the scene.
Figure 1: Architecture of the model, adapted from Keijser et al. (2019). [Diagram: the comprehension module takes a word embedding paired with each object's VGG vector, as many pairs as there are objects in the scene, and outputs a distribution over the objects; the production module takes an object in the scene and outputs a distribution over the vocabulary.]
Data
The data consist of images with multiple objects in them, and
a word associated with each object. We use the dataset that
was introduced by Keijser et al. (2019). It is based on the
Flickr30k Entities corpus (Plummer et al., 2015), which con-
tains annotations for the bounding boxes for referring expres-
sions in the captions of the Flickr30k dataset (Young, Lai, Ho-
dosh, & Hockenmaier, 2014).
Keijser et al. (2019) filtered out any expressions that de-
scribe multiple bounding boxes. The referring expressions
are simplified by selecting only the most frequent word in
the expressions as referent per object. Objects therefore were
only included if at least two referring expressions contained
the same word. Several heuristics were used to select those
words that denote the object. Cardinal numbers one through
ten as well as colour names were filtered out. Very frequent,
but visually irrelevant words were also filtered out, such as
articles and possessives. The data was simplified by keeping
only one object per word per image, thereby removing ambi-
guity. Only those images were included for which after filter-
ing there were still at least two objects. The dataset contains
24,670 images, 4,237 unique words, and 86,748 word-object
pairs. 1000 images were used as validation data and another
1000 for testing. A full description of the dataset can be found
in Keijser et al. (2019).
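As an illustration, the filtering described above can be condensed into a short sketch. The word lists and the helper name below are our own assumptions, and the actual preprocessing code of Keijser et al. (2019) may differ in details such as the order of the filtering steps.

```python
from collections import Counter

# Assumed (incomplete) filter lists; the exact lists used by
# Keijser et al. (2019) are not specified here.
STOPWORDS = {"the", "a", "an", "his", "her", "their"}   # articles, possessives
NUMBERS = {"one", "two", "three", "four", "five",
           "six", "seven", "eight", "nine", "ten"}      # cardinals one..ten
COLOURS = {"red", "blue", "green", "black", "white"}    # colour names (subset)

def label_for_object(referring_expressions):
    """Reduce an object's referring expressions to a single word label.

    Keeps the most frequent remaining word across the expressions, and
    only if it occurs in at least two expressions; returns None otherwise.
    """
    filtered = STOPWORDS | NUMBERS | COLOURS
    words = [w
             for expr in referring_expressions
             for w in set(expr.lower().split())   # count each word once per expression
             if w not in filtered]
    if not words:
        return None
    word, count = Counter(words).most_common(1)[0]
    return word if count >= 2 else None
```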
Model
The model consists of a production and a comprehension
module. The architecture is sketched in Figure 1. The out-
put of each module can serve as input to the other. This al-
lows for introspection: when the production module outputs
a word, the comprehension module can in turn try to interpret
it; and vice versa, when the comprehension module selects an
object as referent for a word, the production module can try
to name the object. The learner can check if it understands its
own language production. It is this introspective property of
the model that makes learning under self-supervision possi-
ble.
Comprehension The comprehension module learns to map
a given word to its referent in the visual scene consisting of
a number of objects. We represent objects by visual feature
vectors extracted from the last fully connected layer of the
VGG-16 object recognition model (Simonyan & Zisserman, 2015), pre-
trained on the ImageNet dataset (Deng et al., 2009). Words
are represented by 256-dimensional vectors (or embeddings),
which are learned during training. The candidacy of every
object in a scene as the referent of a given word is consid-
ered in parallel: when the module receives a word as input,
it concatenates the word embedding to the visual feature of
every object in the scene separately. This concatenation of
word embedding and a single object representation is input
to a 256-unit hidden layer followed by a sigmoid activation
function, which is fully connected to a single output unit, also
followed by sigmoid activation. The object with the highest
output value is most likely to be the referent.
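As a concrete illustration, here is a minimal PyTorch sketch of the comprehension module as described above. The class name, the 4096-dimensional VGG-16 feature size, and the one-scene-at-a-time interface are our assumptions; the authors' released implementation (linked in a later footnote) may differ.

```python
import torch
import torch.nn as nn

class Comprehension(nn.Module):
    """Scores every object in a scene as the referent of a word (sketch)."""

    def __init__(self, vocab_size, embed_dim=256, visual_dim=4096, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)    # learned word embeddings
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim + visual_dim, hidden_dim),  # 256-unit hidden layer
            nn.Sigmoid(),
            nn.Linear(hidden_dim, 1),                       # single output unit
            nn.Sigmoid(),
        )

    def forward(self, word_idx, objects):
        # word_idx: scalar LongTensor; objects: (n_objects, visual_dim)
        w = self.embed(word_idx).expand(objects.size(0), -1)  # copy word for every object
        pairs = torch.cat([w, objects], dim=1)                # concatenate word + object
        return self.scorer(pairs).squeeze(1)                  # (n_objects,) scores in (0, 1)
```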
Production The production module learns to output a
word, given an object. The input is again in the form of
a VGG vector, which is fed to a 256-unit hidden layer fol-
lowed by sigmoid activation, which is fully connected to
the vocabulary-sized output layer: every unit representing a
word. The word unit with the highest output value is the best
candidate to describe the target object.
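A matching sketch of the production module, under the same assumptions. Whether the vocabulary-sized output layer carries its own activation is not specified above, so we leave it as logits, which suits the cross-entropy loss used during training.

```python
class Production(nn.Module):
    """Maps one object's visual vector to scores over the vocabulary (sketch)."""

    def __init__(self, vocab_size, visual_dim=4096, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim),   # 256-unit hidden layer
            nn.Sigmoid(),
            nn.Linear(hidden_dim, vocab_size),   # one output unit per word
        )

    def forward(self, obj):
        return self.net(obj)   # logits; the highest unit is the best candidate word
```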
Self-supervised learning Although the model consists of
two modules that can be used independently, the whole agent
is trained in one go. The setup is inspired by Rohrbach,
Rohrbach, Hu, Darrell, and Schiele (2016), who use a sim-
ilar setup for training a computer vision model to ground re-
ferring expressions in the visual scene. During training, once
a word is processed by the comprehension module, we use
the softmaxed output vector as attention over the objects in
the scene. Input to the production module consists of the sum
of the visual feature vectors of all the objects in the scene,
weighted by the output of the comprehension module. The
whole agent, including the production and the comprehension
module, can now be updated in one go, by comparing the out-
put values of the production module with the word that was
input to the comprehension module in the first place.
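Put together, one self-supervised update might look like the following sketch, reusing the Comprehension and Production classes sketched above; the function name and the single word-scene pair (rather than a minibatch) are our simplifications.

```python
import torch.nn.functional as F

def training_step(comp, prod, optimizer, word_idx, objects):
    """One self-supervised update (sketch): word -> attention -> word."""
    scores = comp(word_idx, objects)                  # comprehension output per object
    attention = F.softmax(scores, dim=0)              # softmaxed output as attention
    attended = (attention.unsqueeze(1) * objects).sum(dim=0)  # weighted sum of VGG vectors
    word_logits = prod(attended)                      # produce a word for the attended scene
    # Cross-entropy between the produced distribution and the input word
    loss = F.cross_entropy(word_logits.unsqueeze(0), word_idx.unsqueeze(0))
    optimizer.zero_grad()
    loss.backward()                                   # updates both modules in one go
    optimizer.step()
    return loss.item()
```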
Details of the implementation The model was imple-
mented in PyTorch (Paszke et al., 2019). It was updated ac-
cording to the cross-entropy between the one-hot encoding
of the input word and the output of the production module.
The model was trained using Adam optimization (Kingma &
Ba, 2015) in minibatches of 40 instances for 40 epochs. To
decide on an initial learning rate, we ran the model in all con-
ditions with a learning rate from .1 to .00001 for 40 epochs.
A learning rate of .001 yielded the best results on validation
data for both the comprehension and production modules and
was used for training all models reported in the results sec-
tion. We ran 20 different initializations per condition.³

³ The implementation, as well as the code used to report results and further analysis, is available on GitHub: https://github.com/horotat/curiosity
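For concreteness, the reported hyperparameter search could be set up roughly as below; train_and_validate is a hypothetical placeholder for the actual training loop, which is not reproduced here.

```python
import torch

vocab_size = 4237   # unique words in the dataset
for lr in [0.1, 0.01, 0.001, 0.0001, 0.00001]:   # learning-rate sweep reported above
    comp, prod = Comprehension(vocab_size), Production(vocab_size)
    optimizer = torch.optim.Adam(
        list(comp.parameters()) + list(prod.parameters()), lr=lr)
    # Minibatches of 40 instances, 40 epochs; lr = .001 performed best on
    # validation data for both modules and was used for all reported models.
    # train_and_validate(comp, prod, optimizer, epochs=40, batch_size=40)
```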
Input selection
In any given environment, a language learning child has the
choice to direct their attention to whatever object in their
vicinity is most interesting to them. On top of that, they have
influence over the environment itself – by moving to a dif-
ferent location or manipulating objects. Although our learner
has no influence over its environment (it cannot choose which
image it sees), it can choose for which object in the scene it
receives language input. We compare a learner that receives
language input for a randomly drawn object to learners that
select an object to receive input about according to an esti-
mate of learning potential. Importantly, since all learners see
each image once per epoch, all learners have access to the
same number of data points.
The input selection is an introspective process: to select an
object, the learner inspects its knowledge of all the objects
in the scene. For every object, first the production module
tries to find the corresponding word. This word is then fed as
input to the comprehension module, which in turn tries to find
the corresponding object. The contrast between the object
input to the production module and the distribution over the
objects output by the comprehension module is the basis for
estimating which object holds the most learning potential.
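In code, this introspective roundtrip might look like the following sketch, again building on the module sketches above; the function name is ours. Selection itself does not update the model, so gradients are disabled.

```python
def introspect(comp, prod, objects):
    """Production -> comprehension roundtrip for every object (sketch).

    Row i holds the comprehension module's distribution over all objects
    after the production module has named object i.
    """
    rows = []
    with torch.no_grad():                          # selection does not update the model
        for i in range(objects.size(0)):
            word_idx = prod(objects[i]).argmax()   # best word for object i
            rows.append(comp(word_idx, objects))   # re-interpret that word
    return torch.stack(rows)                       # (n_objects, n_objects)
```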
There are different ways to estimate learning potential.
Approaches in reinforcement learning often include a
definition of expected reward and cannot be applied easily
to our learning problem. The metrics we use for input
selection are based on Twomey and Westermann (2018):
subjective novelty, plasticity, and curiosity. These metrics
were defined for category learning and translate well to our
set-up. Subjective novelty favours the most unknown objects,
plasticity selects based on how much the learner expects to
learn from a given input word, and curiosity is the product of
those two. Each of them is calculated for every object in the
scene. The object that maximizes the metric is selected as the
one to receive linguistic input about.
Subjective novelty is defined in equation 1. t is the true dis-
tribution over objects; that is, it is a one-hot vector encoding
the object we are calculating subjective novelty for. o is the
guessed output by the comprehension module, and n is the
number of objects in the scene. Subjective novelty, then, is
the average absolute difference between the comprehension
module’s guesses and the true distribution. Intuitively, sub-
jective novelty selects the object for which the learner expects
to be most wrong, either because it is misnamed in produc-
tion, or because the produced word is misinterpreted in com-
prehension.

$$s(t, o) = \frac{\sum_{i=1}^{n} |t_i - o_i|}{n} \qquad (1)$$
Plasticity is defined in equation 2. Because every element of
o is the result of a sigmoid function, $o_i(1-o_i)$ is the deriva-
tive of $o_i$. This is the value on which model updates are based.
Intuitively, the larger plasticity is, the more effective an up-
date to the model will be, in the sense that the comprehension
module’s guesses will change a lot.

$$p(o) = \frac{\sum_{i=1}^{n} o_i (1 - o_i)}{n} \qquad (2)$$

Curiosity is defined in equation 3. It is the product of sub-
jective novelty and plasticity, averaged over the objects in the
scene. Curiosity balances plasticity and subjective novelty,
favoring objects for which both are high.

$$c(t, o) = \frac{\sum_{i=1}^{n} |t_i - o_i| \, o_i (1 - o_i)}{n} \qquad (3)$$
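The three metrics and the resulting selection rule translate directly into code. The sketch below follows equations 1-3 and reuses the introspect helper from the previous section; the function names are ours.

```python
def subjective_novelty(t, o):
    return (t - o).abs().mean()                   # equation (1)

def plasticity(t, o):                             # t is unused; kept for a uniform interface
    return (o * (1 - o)).mean()                   # equation (2)

def curiosity(t, o):
    return ((t - o).abs() * o * (1 - o)).mean()   # equation (3)

def select_object(comp, prod, objects, metric=curiosity):
    """Return the index of the object that maximizes the metric (sketch)."""
    o = introspect(comp, prod, objects)   # (n, n) roundtrip outputs
    n = objects.size(0)
    t = torch.eye(n)                      # row i: one-hot target for object i
    values = torch.stack([metric(t[i], o[i]) for i in range(n)])
    return values.argmax().item()
```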
Results
Table 1 shows the accuracy of trained models in each con-
dition on held-out test data. For testing, we do not use
the input selection mechanisms, but rather test every single
word/object in every image. Therefore, these scores are com-
parable across models trained using different input selection
mechanisms. We trained 20 models in every condition, each
starting from a different randomly initialized state. For ev-
ery run, we select the model after the epoch with maximum
validation scores (selection was separate for production and
comprehension). We report the average accuracy and stan-
dard deviation over all runs in every condition. We also re-
port a baseline for both comprehension and production. For
comprehension, this is the score a random guesser would ob-
tain. For production, it is the accuracy when always guessing
the most frequent word. Please note that the production task
is considerably more difficult than the comprehension task.
The difference is more extreme than the baselines suggest;
whereas the comprehension module has to decide between
only 2 to 10 objects in the scene, the production module al-
ways has as many options as there are words in the vocabu-
lary.
When we look at the accuracy scores for the comprehen-
sion module, we see that models trained with input selec-
tion according to curiosity ultimately attain the highest ac-
curacy scores. However, neither models trained with plastic-
ity nor subjective novelty as selection mechanism outperform
models trained with random object selection. In fact, mod-
els trained using subjective novelty perform at chance level.
When we look at the standard deviations for the conditions
in which learning is successful, we see that this is lowest for
the curiosity condition, meaning that the differently initial-
ized models reach more comparable scores in this condition.
The general pattern of scores in comprehension is also re-
flected in the scores for the production module, with the ex-
ception that here, models trained using subjective novelty for
input selection do beat the baseline by some margin, although
they still score the lowest by far. Here, too, curiosity is the
highest accuracy condition and also has the smallest standard
deviation, whereas neither plasticity nor subjective novelty
outperform models trained with random input selection.
Table 1: Average accuracy and standard deviation on test data.

                     | Comprehension     | Production
                     | Acc.      SD      | Acc.      SD
Random               | .5458     .0746   | .2093     .0139
Plasticity           | .5119     .1035   | .1801     .0223
Subjective novelty   | .2874     .0026   | .1214     .0082
Curiosity            | .6626     .0190   | .2132     .0046
Baseline             | .2863             | .0893
In order to understand these results, we also look at the
intermediate scores during training of all models on training
and test data. Note that all models see exactly the same ob-
jects in the test setting, namely all objects in the test set, but
in the training setting, they only see one object per image that
was selected according to their input selection method. The
scores on test data are therefore more comparable than those
on training data.
The results during training are visualized in Figures 2 and
3. Every line represents a single model’s training trajectory.
Comprehension scores of models in the plasticity, curious,
and random conditions show the general pattern we might ex-
pect when looking at the averaged data. Models in the curious
condition immediately outperform those in other conditions,
with models in the random condition gradually catching up
to some extent. The trajectories of different models in the cu-
riosity condition lie close together, whereas those in the ran-
dom and plasticity condition are more spread out. The high
training scores for many models in the subjective novelty con-
dition, but baseline performance on the test set, indicate that
these models are prone to overfitting.
When we look at the production scores in Figure 3, we see
that all models are prone to overfitting on this task. Never-
theless, we also see clear differences in test scores between
the conditions. As in the comprehension task, models trained
using curious input selection immediately outperform mod-
els in other conditions. More than in the comprehension task,
models receiving random input eventually catch up, although
there is more variance amongst models in this condition than
amongst models trained using curiosity. Models in the sub-
jective novelty condition are again especially prone to overfit-
ting; initially there is some learning that generalizes to the test
set, but this knowledge is gradually forgotten as the models
tune to the specifics of the training data.
Analysis of input selection metrics
As we saw in the last section, selecting input based on cu-
riosity improves performance in both the comprehension and
production tasks, but its components, plasticity and subjec-
tive novelty, are outperformed by random selection. In order
to understand how curiosity can help learning when its mov-
ing parts by themselves do not, we take a closer look at their
object selection behaviour. One may expect that curiosity,
being the product of subjective novelty and plasticity, some-
times mimics one, and sometimes the other. However, because
neither of the two components of curiosity yields comparable
results by itself, we expect that curiosity may make its own
choices altogether. In images with three objects a, b, and c,
where a has maximal plasticity and b has maximal subjective
novelty, perhaps curiosity, rather than choosing object a or b,
selects object c.

Figure 2: Comprehension train and test accuracy during training. Every line represents the training trajectory of a single model.

Figure 3: Production train and test accuracy during training. Please note that the y-axis for test accuracy is scaled for visibility.
To analyze the overlap in selection behaviour between the
three mechanisms, we select the optimal input object accord-
ing to all three selection mechanisms for all 1,000 validation
images, and then do a pairwise comparison between the se-
lections. The analysis is done before any training and after
each training epoch. Since models in the input selection con-
ditions are necessarily shaped by the selection mechanism
with which they were trained, we decide to do this analysis
on models trained in the random condition. The results of
this analysis are illustrated in Figure 4. Each line represents
the number of images for which two, or all three mechanisms
made the same choice, averaged over all 20 models in the
random condition.
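This overlap analysis amounts to a small counting routine; the sketch below assumes the select_object helper defined earlier and an iterable of per-image object-feature tensors for the validation set.

```python
from collections import Counter

def overlap_counts(comp, prod, validation_scenes):
    """Count pairwise and three-way agreement of the selection mechanisms."""
    metrics = {"novelty": subjective_novelty,
               "plasticity": plasticity,
               "curiosity": curiosity}
    counts = Counter()
    for objects in validation_scenes:
        choice = {name: select_object(comp, prod, objects, metric=m)
                  for name, m in metrics.items()}
        for a in metrics:
            for b in metrics:
                if a < b and choice[a] == choice[b]:
                    counts[(a, b)] += 1          # pairwise agreement on this image
        if len(set(choice.values())) == 1:
            counts["all three"] += 1             # all mechanisms agree
    return counts
```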
Before any training (visualized as ‘epoch 0’), curiosity and
subjective novelty overlap strongly, on average choosing the
same object in 953.55 out of 1,000 images, whereas plasticity
often selects a different object, overlapping with curiosity in
336.35 images on average. In epochs after training, it is plas-
ticity and curiosity that overlap (steadily growing from 571.7
after epoch 1 to 747.85 after epoch 40). There is compara-
tively little overlap between curiosity and subjective novelty,
at an average 267.9 after epoch 1, slightly decreasing until
220.2 after epoch 3, then again growing steadily until 370.45
after epoch 40. In many cases after training, when curiosity
and subjective novelty overlap, all three mechanisms overlap.

Figure 4: Overlapping choices of the different selection mechanisms for images in the validation set during training. Overlap scores are averaged over the 20 models in the random condition.

The overlap between curiosity and subjec-
tive novelty, but not plasticity, comprises 126.35 images after
epoch 1, and this proportion decreases until epoch 8, where it
stabilizes around 80 out of 1,000 images.
After 1 epoch of training, curiosity chooses a different ob-
ject than both plasticity and subjective novelty in 301.95 im-
ages on average. The proportion of unique choices by curios-
ity gradually decreases until 170.7 after epoch 40.
In summary, before any training, curiosity mainly agrees
with subjective novelty. In the early stages of training, cu-
riosity usually aligns with plasticity, and in a large minority
of cases chooses distinctly from both plasticity and subjec-
tive novelty. After training, the overlap between curiosity and
subjective novelty alone is small, particularly from epoch 8
on. The proportion of unique choices by curiosity becomes
smaller over training, while agreement between curiosity and
plasticity grows.
Discussion
We simulated word learning as a self-supervised process.
The learner learns to comprehend and produce words through
self-supervision. Using the introspective quality of the model,
we studied the effect of active input selection on the word
learning process. Just as Keijser et al. (2019) have shown in a
supervised word learning task, we find that curious input se-
lection leads to better performance, faster learning, and more
robust convergence, as compared to random input. In addi-
tion, we also find that neither plasticity nor subjective nov-
elty by itself leads to similar improvements. In fact, random
input selection outperforms subjective novelty and plasticity.
Input selection based on subjective novelty is very prone to
overfitting. A possible explanation is that by selecting ob-
jects it expects to be wrong about, subjective novelty is likely
to select ‘exceptions to the rule’ and leads models to fit to
idiosyncrasies of the training set, rather than to generaliz-
able knowledge. Although the learning trajectory of mod-
els trained under plasticity looks qualitatively more similar to
that of models in the random or curiosity condition, eventual
performance is lower.
Since curiosity is a function of plasticity and subjective
novelty, it is somewhat surprising that neither of these mecha-
nisms by themselves yield comparable advantages to random
input selection. We analyzed the overlap between all three
mechanisms. For a completely untrained model, selection ac-
cording to curiosity shows near complete overlap with selec-
tion according to subjective novelty, but this overlap quickly
disappears. Although curiosity and plasticity show consider-
able overlap during training, there is still a substantial portion
of images where curiosity selects a different object than both
plasticity and novelty, particularly in the early epochs. Cu-
riosity, then, seems to be doing more than simply balancing
the selection of subjective novelty and plasticity.
Of course, this work only shows what effect active input
selection might have on learning trajectories, if learners
indeed use such strategies. We have no empirical
evidence that, in fact, they do. This work should be seen as
a proof of concept; if we find that learners actively solicit in-
put according to a certain definition of curiosity, then this can
influence their learning, and models of word learning should
take it into account. Whether learners do employ such a strat-
egy must be established in empirical research. Lab studies
can give insight into what learners do when they can explicitly
control their input (Kachergis, Yu, & Shiffrin, 2013). How-
ever, to understand whether this is a natural part of the lan-
guage acquisition process, it is necessary to study children’s
introduction of topics in child-caregiver interaction.
References
Akhtar, N., Dunham, F., & Dunham, P. J. (1991). Directive
interactions and early vocabulary development: The
role of joint attentional focus. Journal of Child Lan-
guage, 18(1), 41–49.
Bloom, L., Margulis, C., Tinker, E., & Fujita, N. (1996).
Early conversations and word learning: Contributions
from child and adult. Child Development, 67(6), 3154–3175.
Chang, L., de Barbaro, K., & Deák, G. (2016). Contingencies
between infants’ gaze, vocal, and manual actions and
mothers’ object-naming: Longitudinal changes from 4
to 9 months. Developmental Neuropsychology, 41(5-8), 342–361.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L.
(2009). ImageNet: A large-scale hierarchical image
database. In 2009 IEEE Conference on Computer Vi-
sion and Pattern Recognition (pp. 248–255).
Kachergis, G., Yu, C., & Shiffrin, R. M. (2013). Ac-
tively learning object names across ambiguous situa-
tions. Topics in Cognitive Science, 5(1), 200–213.
Keijser, D., Gelderloos, L., & Alishahi, A. (2019). Curious
topics: A curiosity-based model of first language word
learning. In Proceedings of the 41st Annual Confer-
ence of the Cognitive Science Society (pp. 1991–1997).
Cognitive Science Society.
Kidd, C., Piantadosi, S. T., & Aslin, R. N. (2012). The
Goldilocks effect: Human infants allocate attention
to visual sequences that are neither too simple nor
too complex. PLOS ONE, 7(5). doi: 10.1371/journal.pone.0036399
Kidd, C., Piantadosi, S. T., & Aslin, R. N. (2014). The
Goldilocks effect in infant auditory attention. Child
Development, 85(5), 1795–1804.
Kingma, D. P., & Ba, J. (2015). Adam: A method for stochas-
tic optimization. In 3rd International Conference on
Learning Representations.
Oudeyer, P.-Y., & Kaplan, F. (2009). What is intrin-
sic motivation? A typology of computational ap-
proaches. Frontiers in Neurorobotics, 1. doi:
10.3389/neuro.12.006.2007
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J.,
Chanan, G., Killeen, T., Lin, Z., Gimelshein, N.,
Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito,
Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner,
B., Fang, L., Bai, J., & Chintala, S. (2019). PyTorch:
An imperative style, high-performance deep learning
library. In Advances in Neural Information Processing
Systems 32 (pp. 8024–8035).
Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C.,
Hockenmaier, J., & Lazebnik, S. (2015). Flickr30k En-
tities: Collecting region-to-phrase correspondences for
richer image-to-sentence models. In The IEEE Inter-
national Conference on Computer Vision (ICCV) (pp.
2641–2649).
Rohrbach, A., Rohrbach, M., Hu, R., Darrell, T., & Schiele,
B. (2016). Grounding of textual phrases in images by
reconstruction. In European Conference on Computer
Vision (pp. 817–834).
Schulz, E., Wu, C. M., Ruggeri, A., & Meder, B. (2019).
Searching for rewards like a child means less general-
ization and more directed exploration. Psychological
Science, 30(11), 1561–1572.
Simonyan, K., & Zisserman, A. (2015). Very deep convo-
lutional networks for large-scale image recognition. In
3rd International Conference on Learning Representa-
tions.
Tomasello, M., & Farrar, M. J. (1986). Joint attention
and early language. Child Development, 57(6), 1454–
1463.
Twomey, K. E., & Westermann, G. (2018). Curiosity-based
learning in infants: a neurocomputational approach.
Developmental Science, 21(4).
Young, P., Lai, A., Hodosh, M., & Hockenmaier, J. (2014).
From image descriptions to visual denotations: New
similarity metrics for semantic inference over event de-
scriptions. Transactions of the Association for Compu-
tational Linguistics, 2, 67–78.