Article

Learning Words from Sights and Sounds: A Computational Model

Authors: Deb Roy and Alex Pentland

Abstract

This paper presents an implemented computational model of word acquisition which learns directly from raw multimodal sensory input. Set in an information theoretic framework, the model acquires a lexicon by finding and statistically modeling consistent cross-modal structure. The model has been implemented in a system using novel speech processing, computer vision, and machine learning algorithms. In evaluations the model successfully performed speech segmentation, word discovery and visual categorization from spontaneous infant-directed speech paired with video images of single objects. These results demonstrate the possibility of using state-of-the-art techniques from sensory pattern recognition and machine learning to implement cognitive models which can process raw sensor data without the need for human transcription or labeling.
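In broad strokes, the information-theoretic criterion mentioned in the abstract amounts to keeping only those pairings of recurrent speech segments and co-occurring visual categories for which one channel strongly predicts the other. Below is a minimal sketch of such a mutual-information filter over co-occurrence counts; the toy table and function names are assumptions made for exposition, not the paper's actual implementation.

```python
import numpy as np

def mutual_information(counts: np.ndarray) -> float:
    """Mutual information (bits) of a speech/visual co-occurrence count table.

    counts[i, j] = number of learning episodes in which candidate speech
    prototype i and visual category j were observed together.
    """
    joint = counts / counts.sum()
    p_speech = joint.sum(axis=1, keepdims=True)
    p_visual = joint.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(joint > 0, joint / (p_speech * p_visual), 1.0)
    return float((joint * np.log2(ratio)).sum())

# Toy example: a speech segment that reliably co-occurs with one visual category
# carries high mutual information and is kept as a lexical candidate.
consistent = np.array([[40.0, 2.0], [3.0, 35.0]])
shuffled = np.array([[20.0, 20.0], [19.0, 21.0]])
print(mutual_information(consistent))   # clearly above zero -> keep
print(mutual_information(shuffled))     # close to zero -> reject
```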


... Perhaps the first serious computational implementation of learning spoken language via visual grounding was described by Roy (1999), and further developed in Roy and Pentland (2002) as the CELL (Cross-channel Early Lexical Learning) model. The model originated in Roy (1999), and there were a number of subsequent versions and tweaks of CELL (e.g. ...
... The final step is to apply K-means clustering to each modality separately, and compute an audio-visual affinity between each pair of clusters by summing over the candidate pairwise similarities for items in the pair of clusters. As such, this approach is somewhat reminiscent of association mining techniques used in Roy and Pentland (2002); Yu et al. (2005). The resulting lexicon is evaluated in terms of cluster purity metrics as well as qualitatively. ...
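Read literally, the clustering-and-affinity step in the excerpt above is a small amount of bookkeeping on top of two separate K-means runs. The sketch below is a rough, generic rendering of it, with synthetic feature arrays and cosine similarity standing in for whatever candidate score the cited work actually uses.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical candidates: row i of each array is one audio-visual pair (a speech
# segment and the visual features it co-occurred with), assumed here to be
# embedded in a shared 64-dimensional space.
audio_emb = rng.normal(size=(200, 64))
visual_emb = rng.normal(size=(200, 64))

# Per-candidate audio-visual similarity (cosine), used below as the weight.
sims = np.einsum("ij,ij->i", audio_emb, visual_emb)
sims /= np.linalg.norm(audio_emb, axis=1) * np.linalg.norm(visual_emb, axis=1)

# Step 1: K-means clustering applied to each modality separately.
k = 10
audio_cl = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(audio_emb)
visual_cl = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(visual_emb)

# Step 2: the affinity of an (audio cluster, visual cluster) pair is the sum of
# candidate similarities over all items assigned to that pair of clusters.
affinity = np.zeros((k, k))
np.add.at(affinity, (audio_cl, visual_cl), sims)

# The highest-affinity cluster pairs are read off as word-object pairings.
print(np.unravel_index(np.argmax(affinity), affinity.shape))
```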
... According to Roy and Pentland (2002), semantic accuracy can be measured for the baseline because the visual prototype was carried through from input to output and "this model assumes that when a speech segment is selected as a prototype for a lexical candidate, the best choice of its meaning is whatever context co-occurred with the speech prototype." Additionally, the visual features are extracted by a hard-wired component, but this is a reasonable setup at least in the context of modeling human language acquisition: the human visual system is largely functional by the time children start learning language in earnest. ...
Preprint
Full-text available
This survey provides an overview of the evolution of visually grounded models of spoken language over the last 20 years. Such models are inspired by the observation that when children pick up a language, they rely on a wide range of indirect and noisy clues, crucially including signals from the visual modality co-occurring with spoken utterances. Several fields have made important contributions to this approach to modeling or mimicking the process of learning language: Machine Learning, Natural Language and Speech Processing, Computer Vision and Cognitive Science. The current paper brings together these contributions in order to provide a useful introduction and overview for practitioners in all these areas. We discuss the central research questions addressed, the timeline of developments, and the datasets which enabled much of this work. We then summarize the main modeling architectures and offer an exhaustive overview of the evaluation metrics and analysis techniques.
... One of the very first models of Visually Grounded Speech (VGS) is the CELL (Crosschannel Early Lexical Learning) model developed by Roy & Pentland (2002) and Roy (2003). This model was explicitly developed so as to understand how the interaction of visual and auditory stimuli enabled lexical acquisition. ...
... In a model similar to Roy & Pentland (2002), Yu et al. (2005) tested whether audio-visual mapping was enhanced by gaze information. To test this, they had a picture book in a foreign language read aloud by a native speaker and recorded. ...
... Their results show that audio-visual mapping was easier when gaze information was available to the model than when it was not. Consequently, their experiment shows that co-occurrence statistics do not seem to be sufficient, even though they enable computational models to learn a few reliable word-object mappings, as in Roy & Pentland (2002). Word-object mappings can be learned more reliably when additional information (such as the speaker's attention) is available. ...
Thesis
In recent years, deep learning methods have allowed the creation of neural models that are able to process several modalities at once. Neural models of Visually Grounded Speech (VGS) are one such kind of model and are able to jointly process a spoken input and a matching visual input. They are commonly used to solve a speech-image retrieval task: given a spoken description, they are trained to retrieve the closest image that matches the description. Such models have sparked interest among linguists and cognitive scientists as they are able to model complex interactions between two modalities --- speech and vision --- and can be used to simulate child language acquisition and, more specifically, lexical acquisition. In this thesis, we study a recurrent model of VGS and analyse the linguistic knowledge such models are able to derive as a by-product of the main task they are trained to solve. We introduce a novel data set that is suitable for training models of visually grounded speech. Unlike most data sets, which are in English, this data set is in Japanese and allows us to study the impact of the input language on the representations learnt by the neural models. We then focus on the analysis of the attention mechanisms of two VGS models, one trained on the English data set, the other on the Japanese data set, and show the models have developed a language-general behaviour by using their attention weights to focus on specific nouns in the spoken input. Our experiments reveal that such models are able to adopt a language-specific behaviour by taking into account particularities of the input language so as to better solve the task they are given. We then study whether VGS models are able to map isolated words to their visual referents. This allows us to investigate whether the model has implicitly segmented the spoken input into sub-units. We further investigate how isolated words are stored in the weights of the network by borrowing a methodology stemming from psycholinguistics, the gating paradigm, and show that word onset plays a major role in successful activation. Finally, we introduce a simple method to add segment boundary information to a neural model of speech processing. This allows us to test whether the implicit segmentation that takes place in the network is as effective as an explicit segmentation. We investigate several types of boundaries, ranging from phone to word boundaries, and show that the latter yield the best results. We also observe that giving the network several types of boundaries at the same time is beneficial, as it allows the network to take into account the hierarchical nature of the linguistic input.
... In contrast to viewing language learning as a composition of different learning tasks, an alternative picture of the process can also be painted: what if processes such as word segmentation or phonetic category acquisition are not necessary stepping stones for speech comprehension, and language learning could instead be bootstrapped by meaning-driven predictive learning, where the learner attempts to connect the (initially unsegmented) auditory stream to the objects and events in the observable surroundings (Räsänen and Rasilo, 2015; also referred to as discriminative learning in Baayen et al., 2015; see also Ramscar and Port, 2016)? While tackling this idea has been challenging in empirical terms, a number of computational studies have explored it over the years (e.g., but not limited to, Yu et al., 2005; Roy and Pentland, 2002; Räsänen and Rasilo, 2015; Chrupała et al., 2017; Alishahi et al., 2017; Räsänen and Khorrami, 2019; ten Bosch et al., 2008; Ballard and Yu, 2004). These models have demonstrated successful learning of speech comprehension skills in terms of connecting words in continuous speech to their visual referents with minimal or fully absent prior linguistic knowledge. ...
... A number of existing computational studies and machine learning algorithms have studied the use of concurrent speech and visual input to bootstrap language learning from sensory experience. In the early works (e.g., Roy and Pentland, 2002; Ballard and Yu, 2004; Räsänen et al., 2008; ten Bosch et al., 2008; Driesen and Van hamme, 2011; Yu et al., 2005; Mangin et al., 2015; Räsänen and Rasilo, 2015), visual information has been primarily used to support concurrent word segmentation, identification, and meaning acquisition. The basic idea in these models has been to combine cross-situational word learning (Smith and Yu, 2008), the idea that infants learn word meanings by tracking co-occurrence probabilities of word forms and their visual referents across multiple learning situations, with simultaneous "statistical learning" of patterns from the acoustic speech signal. ...
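The cross-situational mechanism described in the excerpt above, accumulating word-referent co-occurrences over many individually ambiguous situations, is easy to state concretely. The toy sketch below illustrates only the counting scheme and is not a reconstruction of any of the cited models.

```python
from collections import defaultdict

# Each situation pairs the words of an utterance with the set of visual referents
# in view; neither side is segmented or disambiguated for the learner.
situations = [
    ({"look", "at", "the", "ball"}, {"BALL", "TABLE"}),
    ({"the", "ball", "rolls"},      {"BALL", "DOG"}),
    ({"nice", "dog"},               {"DOG", "TABLE"}),
    ({"the", "dog", "barks"},       {"DOG", "BALL"}),
]

cooc = defaultdict(lambda: defaultdict(int))
word_freq = defaultdict(int)
for words, referents in situations:
    for w in words:
        word_freq[w] += 1
        for r in referents:
            cooc[w][r] += 1

# A word's best referent is the one with the highest conditional co-occurrence.
for w in ("ball", "dog", "the"):
    probs = {r: c / word_freq[w] for r, c in cooc[w].items()}
    print(w, "->", max(probs, key=probs.get), probs)
```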
... In parallel, a number of robot studies have investigated the grounding of speech patterns into concurrent percepts or actions (e.g., Salvi et al., 2012; Iwahashi, 2003). However, the acoustic input of some studies has been pre-processed into phoneme-like features (Roy and Pentland, 2002; Ballard and Yu, 2004; Salvi et al., 2012) or word segments (Salvi et al., 2012) using supervised learning. Alternatively, the visual input to the models has been rather simplified, such as simulated categorical symbols for visual referents (e.g., ten Bosch et al., 2008; Räsänen and Rasilo, 2015; Driesen and Van hamme, 2011). ...
Preprint
Full-text available
Decades of research have studied how language-learning infants learn to discriminate speech sounds, segment words, and associate words with their meanings. While gradual development of such capabilities is unquestionable, the exact nature of these skills and the underlying mental representations still remains unclear. In parallel, computational studies have shown that basic comprehension of speech can be achieved by statistical learning between speech and concurrent referentially ambiguous visual input. These models can operate without prior linguistic knowledge such as representations of linguistic units, and without learning mechanisms specifically targeted at such units. This has raised the question of to what extent knowledge of linguistic units, such as phone(me)s, syllables, and words, could actually emerge as latent representations supporting the translation between speech and representations in other modalities, and without the units being proximal learning targets for the learner. In this study, we formulate this idea as the so-called latent language hypothesis (LLH), connecting linguistic representation learning to general predictive processing within and across sensory modalities. We review the extent to which the audiovisual aspect of LLH is supported by the existing computational studies. We then explore LLH further in extensive learning simulations with different neural network models for audiovisual cross-situational learning, comparing learning from both synthetic and real speech data. We investigate whether the latent representations learned by the networks reflect phonetic, syllabic, or lexical structure of input speech by utilizing an array of complementary evaluation metrics related to linguistic selectivity and temporal characteristics of the representations. As a result, we find that representations associated...
... Word segmentation algorithms usually take as input phonological, symbolic text-like representations such as phonemes or syllables, with few exceptions (e.g., Ludusan, Seidl, Dupoux and Cristia, 2015, and Roy and Pentland, 2002, who applied segmentation algorithms on raw speech data). There is evidence that even newborns have access to syllables (or vowels) as perceptual units (Jusczyk, Jusczyk, Kennedy, Schomberg & Koenig, 1995), and that representation of phoneme sequences is available as early as by four months (Seidl, Cristià, Bernard & Onishi, 2009). ...
Preprint
Full-text available
How can infants detect where words or morphemes start and end in the continuous stream of speech? Previous computational studies have investigated this question mainly for English, where morpheme and word boundaries are often isomorphic. Yet in many languages, words are often multimorphemic, such that word and morpheme boundaries do not align. Our study employed corpora of two languages that differ in the complexity of inflectional morphology, Chintang (Sino-Tibetan) and Japanese (in Experiment 1), as well as corpora of artificial languages ranging in morphological complexity, as measured by the ratio and distribution of morphemes per word (in Experiments 2 and 3). We used two baselines and three conceptually diverse word segmentation algorithms, two of which rely purely on sublexical information using distributional cues, and one that builds a lexicon. The algorithms’ performance was evaluated on both word- and morpheme-level representations of the corpora. Segmentation results were better for the morphologically simpler languages than for the morphologically more complex languages, in line with the hypothesis that languages with greater inflectional complexity could be more difficult to segment into words. We further show that the effect of morphological complexity is relatively small, compared to that of algorithm and evaluation level. We therefore recommend that infant researchers look for signatures of the different segmentation algorithms and strategies, before looking for differences in infant segmentation landmarks across languages varying in complexity.
... Once a number of associations are learned, the agent compares the (representation of the) image of any new object with the ones it knows using a nearest neighbour algorithm and associates the new object with the name of the known object it is most similar to. Similar statistical or connectionist algorithms for associating words in an utterance with objects in view have also been developed for other artificial agents (Sales et al., 1996; Nenov and Dyer, 1997; Oates et al., 2000; Roy and Pentland, 2000; Wachsmuth et al., 2000; Roy, 2002; Bredeche et al., 2003) and recently, the learning of image-language associations for applications such as multimedia information retrieval has also been attempted (Ahmad et al., 2002). We have generally observed that in these approaches to learning: ...
... 2000; Roy and Pentland, 2000; Wachsmuth et al., 2000). For example, Wachsmuth et al. use Bayesian networks for associating words and images: given an image and its corresponding verbal description, the visual classes of the objects depicted in the former and the linguistic classes of the words in the latter are inferred. ...
Thesis
Full-text available
This thesis explores the issue of vision-language integration from the Artificial Intelligence perspective of building intentional artificial agents able to combine their visual and linguistic abilities automatically. While such a computational vision-language integration is a sine qua non requirement for developing a wide range of intelligent multimedia systems, the deeper issue still remains in the research background. What does integration actually mean? Why is it needed in Artificial Intelligence systems, how is it currently achieved and how far can we go in developing fully automatic vision-language integration prototypes? Through a parallel theoretical investigation of visual and linguistic representational systems, the nature and characteristics of the subjects of this integration study, vision and language, are determined. Then, the notion of their computational integration itself is explored. An extensive review of the integration resources and mechanisms used in a wide range of vision-language integration prototypes leads to a descriptive definition of this integration as a process of establishing associations between images and language. The review points to the fact that state-of-the-art prototypes fail to perform real integration, because they rely on human intervention at key integration stages, in order to overcome difficulties related to features vision and language inherently lack. In looking into these features so as to discover the real need for integrating vision and language in multimodal situations, intentionality-related issues appear to play a central role in justifying integration. These features are correlated with Searle's theory of intentionality and the Symbol Grounding problem. This leads to a view of the traditionally advocated grounding of language in visual perceptions as a bi-directional, not one-directional, process. It is argued that vision-language integration is rather a case of double-grounding, in which linguistic representations are grounded in visual ones for getting direct access to the physical world, while visual representations, in their turn, are grounded in linguistic ones for acquiring a controlled access to mental aspects of the world. Last, the feasibility of developing a prototype able to achieve this double-grounding with minimal human intervention is explored. VLEMA is presented, a prototype which is fed with automatically reconstructed building-interior scenes, which it subsequently describes in natural language. The prototype includes a number of unique features which point to new directions in building agents endowed with real vision-language integration abilities.
... Word segmentation algorithms usually take as input phonological, symbolic text-like representations such as phonemes or syllables, with few exceptions (e.g., Ludusan, Seidl, Dupoux and Cristia, 2015, and Roy and Pentland, 2002, who applied segmentation algorithms on raw speech data). There is evidence that even newborns have access to syllables (or vowels) as perceptual units (Jusczyk, Jusczyk, Kennedy, Schomberg & Koenig, 1995), and that representation of phoneme sequences is available as early as by four months (Seidl, Cristià, Bernard & Onishi, 2009). ...
Article
Full-text available
How can infants detect where words or morphemes start and end in the continuous stream of speech? Previous computational studies have investigated this question mainly for English, where morpheme and word boundaries are often isomorphic. Yet in many languages, words are often multimorphemic, such that word and morpheme boundaries do not align. Our study employed corpora of two languages that differ in the complexity of inflectional morphology, Chintang (Sino-Tibetan) and Japanese (in Experiment 1), as well as corpora of artificial languages ranging in morphological complexity, as measured by the ratio and distribution of morphemes per word (in Experiments 2 and 3). We used two baselines and three conceptually diverse word segmentation algorithms, two of which rely purely on sublexical information using distributional cues, and one that builds a lexicon. The algorithms' performance was evaluated on both word- and morpheme-level representations of the corpora. Segmentation results were better for the morphologically simpler languages than for the morphologically more complex languages, in line with the hypothesis that languages with greater inflectional complexity could be more difficult to segment into words. We further show that the effect of morphological complexity is relatively small, compared to that of algorithm and evaluation level. We therefore recommend that infant researchers look for signatures of the different segmentation algorithms and strategies, before looking for differences in infant segmentation landmarks across languages varying in complexity.
... Analysing these models not only helps to understand their technological limitations, but may also yield insight into the cognitive processes at work in humans (Dupoux, 2018) who learn from contextually grounded speech utterances (either visually, haptically, socially, etc.). It is with this idea in mind that one of the first computational models of visually grounded word acquisition was introduced by Roy and Pentland (2002). More recently, Harwath et al. (2016) and others were among the first to propose neural models integrating these two modalities. ...
... SHORTLIST (Norris, 1994) is another model which builds upon COHORT and TRACE by taking into consideration other features such as word stress. To sum up, models of spoken word recognition consider that a set of words matching the spoken input to a certain extent is simultaneously activated, and these models involve at some point a form of competition between the set of activated words before reaching the stage of recognition. Roy and Pentland (2002) were among the first to propose a computational model, known as CELL, that integrates both speech and vision to study child language acquisition. However, CELL required both speech and images to be pre-processed: canonical shapes were first extracted from images and represented as histograms, and speech was discretised into phonemes. ...
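For orientation, the two pre-processed representations mentioned above invite equally simple comparison operations: an edit distance over phoneme strings and a divergence over shape histograms. The functions below are illustrative stand-ins under that assumption, not CELL's actual distance metrics.

```python
import numpy as np

def phoneme_edit_distance(a, b):
    """Levenshtein distance over phoneme symbols (unit insert/delete/substitute costs)."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1,
                          d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return int(d[-1, -1])

def histogram_divergence(p, q):
    """Symmetric chi-square style divergence between two normalised shape histograms."""
    p, q = p / p.sum(), q / q.sum()
    denom = p + q
    mask = denom > 0
    return float(((p[mask] - q[mask]) ** 2 / denom[mask]).sum())

# Toy phoneme sequences and 3-bin shape histograms.
print(phoneme_edit_distance(["b", "aa", "l"], ["b", "ao", "l"]))                   # -> 1
print(histogram_divergence(np.array([5.0, 1.0, 0.0]), np.array([4.0, 2.0, 0.0])))
```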
Preprint
In this paper, we study how word-like units are represented and activated in a recurrent neural model of visually grounded speech. The model used in our experiments is trained to project an image and its spoken description in a common representation space. We show that a recurrent model trained on spoken sentences implicitly segments its input into word-like units and reliably maps them to their correct visual referents. We introduce a methodology originating from linguistics to analyse the representation learned by neural networks -- the gating paradigm -- and show that the correct representation of a word is only activated if the network has access to the first phoneme of the target word, suggesting that the network does not rely on a global acoustic pattern. Furthermore, we find that not all speech frames (MFCC vectors in our case) play an equal role in the final encoded representation of a given word, but that some frames have a crucial effect on it. Finally, we suggest that word representations could be activated through a process of lexical competition.
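Procedurally, the gating analysis can be read as a small evaluation loop: present ever longer prefixes ("gates") of a word's frames to the trained encoder and record the earliest gate at which the correct referent wins the retrieval. The sketch below shows only that loop, with untrained random stand-ins for the speech encoder and image embeddings; with such stand-ins it will usually return None, whereas a trained model reveals how much of the word onset is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a trained model: a speech encoder (mean-pooled MFCC frames through
# a fixed random projection) and precomputed, L2-normalised image embeddings.
proj = rng.normal(size=(13, 32))
def encode_speech(frames):                 # frames: (n_frames, 13) MFCC vectors
    v = frames.mean(axis=0) @ proj
    return v / np.linalg.norm(v)

image_emb = rng.normal(size=(50, 32))      # 50 candidate visual referents
image_emb /= np.linalg.norm(image_emb, axis=1, keepdims=True)

def earliest_activating_gate(word_frames, target, step=2):
    """Smallest prefix length (in frames) at which the target image is ranked first."""
    for gate in range(step, len(word_frames) + 1, step):
        query = encode_speech(word_frames[:gate])
        if int(np.argmax(image_emb @ query)) == target:
            return gate
    return None

word = rng.normal(size=(30, 13))           # 30 MFCC frames of one isolated word
print(earliest_activating_gate(word, target=7))
```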
... Attempts to model or simulate the acquisition of spoken language via grounding in the visual modality date to the beginning of this century (Roy and Pentland, 2002) but have gained momentum recently with the revival of neural networks (e.g. Synnaeve et al., 2014; Harwath and Glass, 2015; Harwath et al., 2016; Harwath et al., 2018; Merkx et al., 2019; Havard et al., 2019a; Rouditchenko et al., 2020; Khorrami and Räsänen, 2021; Peng and Harwath, 2021). ...
... Early attempts at simulating grounded language learning focus on interactions between adults and young children while playing with a set of objects from different categories (Roy, 1999; Gorniak and Roy, 2003; Mukherjee and Roy, 2003). In a representative study from this series, Roy and Pentland (2002) use speech recorded from such interactions paired with different views of the visible objects to identify linguistic units (i.e. words) and visual categories, and to map these two modalities together. ...
Preprint
Full-text available
Attempts to computationally simulate the acquisition of spoken language via grounding in perception have a long tradition but have gained momentum in the past few years. Current neural approaches exploit associations between the spoken and visual modality and learn to represent speech and visual data in a joint vector space. A major unresolved issue from the point of ecological validity is the training data, typically consisting of images or videos paired with spoken descriptions of what is depicted. Such a setup guarantees an unrealistically strong correlation between speech and the visual world. In the real world the coupling between the linguistic and the visual is loose, and often contains confounds in the form of correlations with non-semantic aspects of the speech signal. The current study is a first step towards simulating a naturalistic grounding scenario by using a dataset based on the children's cartoon Peppa Pig. We train a simple bi-modal architecture on the portion of the data consisting of naturalistic dialog between characters, and evaluate on segments containing descriptive narrations. Despite the weak and confounded signal in this training data our model succeeds at learning aspects of the visual semantics of spoken language.
... Word segmentation algorithms usually take as input phonological, symbolic text-like representations such as phonemes or syllables, with few exceptions (e.g., Ludusan et al., 2015 and Roy and Pentland, 2002, who applied segmentation algorithms on raw speech data). There is evidence that even newborns have access to syllables (or vowels) as perceptual units (Jusczyk et al., 1995), and that representation of phoneme sequences is available as early as by four months (Seidl et al., 2009). ...
Thesis
Full-text available
Language is acquired by children all around the globe, but probably along different developmental paths, at varying rates and with varying outcomes, depending on the input provided. In this dissertation, we take a closer look at the astonishing diversity of input children grow up hearing, and we ask how this diversity matters to language acquisition. For this, we employ highly interdisciplinary methods and involve several projects. We consider the type of language and culture as two principal sources of diversity, and we will investigate them in two distinct parts of this dissertation.
... To follow such an idea, several computational models of LA have utilized the idea of concurrent audiovisual learning from acoustic speech input. Pioneering models such as CELL by Roy and Pentland [12] and that of Yu and Ballard [13] were followed by others, such as visually-conditioned higher-order Markov chains [11,14,15], audiovisual non-negative matrix factorization [16,17] (see also [18,19] for related work) and DP-n-grams [20]. These models have demonstrated successful word learning from acoustic speech when the speech comes with related visual information, provided as an unaligned bag of categorical labels for each utterance. ...
... However, according to our knowledge, none of the existing models for joint segmentation, meaning acquisition, and subword unit learning have been tested with real naturalistic input available to language learning infants (see also [5]). Instead, the models (or robots) have used supervised phone recognizer front-ends [12,13,27], or simplified enacted high-quality caregiver speech (e.g., using CAREGIVER corpus [28], as in [14][15][16][17]). This leaves it unclear whether the audiovisual learning strategy also scales up to the real-world experiences of human infants, for whom speech and visual experience may be much more unconstrained than in the idealized speech corpora. ...
Preprint
Full-text available
Earlier research has suggested that human infants might use statistical dependencies between speech and non-linguistic multimodal input to bootstrap their language learning before they know how to segment words from running speech. However, the feasibility of this hypothesis in terms of real-world infant experiences has remained unclear. This paper presents a step towards a more realistic test of the multimodal bootstrapping hypothesis by describing a neural network model that can learn word segments and their meanings from referentially ambiguous acoustic input. The model is tested on recordings of real infant-caregiver interactions using utterance-level labels for concrete visual objects that were attended by the infant when the caregiver spoke an utterance containing the name of the object, and using random visual labels for utterances during the absence of attention. The results show that the beginnings of lexical knowledge may indeed emerge from individually ambiguous learning scenarios. In addition, the hidden layers of the network show gradually increasing selectivity to phonetic categories as a function of layer depth, resembling models trained for phone recognition in a supervised manner.
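Architecturally, this setup amounts to multi-label training of an acoustic encoder against an utterance-level bag of object labels, with no word boundaries anywhere in the pipeline. The PyTorch sketch below illustrates that training signal generically; it is not the cited model's architecture.

```python
import torch
import torch.nn as nn

class UtteranceEncoder(nn.Module):
    """Maps a log-mel spectrogram of a whole utterance to object-label scores."""
    def __init__(self, n_mels=40, n_objects=100):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=9, padding=4), nn.ReLU(),
        )
        self.classifier = nn.Linear(128, n_objects)

    def forward(self, spec):                  # spec: (batch, n_mels, frames)
        h = self.conv(spec).mean(dim=2)       # temporal mean pooling
        return self.classifier(h)             # unnormalised label scores

model = UtteranceEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()              # multi-label: several objects per utterance

# One toy training step on random data: 8 utterances of 300 frames, each paired
# with a sparse bag of attended-object labels (random here).
spec = torch.randn(8, 40, 300)
labels = (torch.rand(8, 100) < 0.03).float()
loss = loss_fn(model(spec), labels)
loss.backward()
optimizer.step()
print(float(loss))
```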
... To follow such an idea, several computational models of LA have utilized the idea of concurrent audiovisual learning from acoustic speech input. Pioneering models such as CELL by Roy and Pentland [12] and that of Yu and Ballard [13] were followed by others, such as visually-conditioned higher-order Markov chains [11,14,15], audiovisual non-negative matrix factorization [16,17] (see also [18,19] for related work) and DP-n-grams [20]. These models have demonstrated successful word learning from acoustic speech when the speech comes with related visual information, provided as an unaligned bag of categorical labels for each utterance. ...
... However, according to our knowledge, none of the existing models for joint segmentation, meaning acquisition, and subword unit learning have been tested with real naturalistic input available to language learning infants (see also [5]). Instead, the models (or robots) have used supervised phone recognizer front-ends [12,13,27], or simplified enacted high-quality caregiver speech (e.g., using CAREGIVER corpus [28], as in [14][15][16][17]). This leaves it unclear whether the audiovisual learning strategy also scales up to the real-world experiences of human infants, for whom speech and visual experience may be much more unconstrained than in the idealized speech corpora. ...
Conference Paper
Full-text available
Earlier research has suggested that human infants might use statistical dependencies between speech and non-linguistic multimodal input to bootstrap their language learning before they know how to segment words from running speech. However, the feasibility of this hypothesis in terms of real-world infant experiences has remained unclear. This paper presents a step towards a more realistic test of the multimodal bootstrapping hypothesis by describing a neural network model that can learn word segments and their meanings from referentially ambiguous acoustic input. The model is tested on recordings of real infant-caregiver interactions using utterance-level labels for concrete visual objects that were attended by the infant when the caregiver spoke an utterance containing the name of the object, and using random visual labels for utterances during the absence of attention. The results show that the beginnings of lexical knowledge may indeed emerge from individually ambiguous learning scenarios. In addition, the hidden layers of the network show gradually increasing selectivity to phonetic categories as a function of layer depth, resembling models trained for phone recognition in a supervised manner.
... This ability stands in stark contrast to previous computational models of language learning that use statistical association over large amounts of training data (e.g. work by Yu and Ballard (2004) and Roy and Pentland (2002)). We report initial results on a project that attempts to achieve one-shot learning of word meanings by incorporating some of the heuristics that children appear to use. ...
... We choose to learn words and concepts, rather than phonemes as in work by Roy and Pentland (2002). Roy and Pentland tackled symbol grounding, word learning, and segmentation simultaneously, assuming no prior lexical, syntactic, or semantic knowledge. ...
Article
We describe ongoing research towards building a cognitively plausible system for near one-shot learning of the meanings of attribute words and object names, by grounding them in a sensory model. The system learns incrementally from human demonstrations recorded with the Microsoft Kinect, in which the demonstrator can use unrestricted natural language descriptions. We achieve near one-shot learning of simple objects and attributes by focusing solely on examples where the learning agent is confident, ignoring the rest of the data. We evaluate the system's learning ability by having it generate descriptions of presented objects, including objects it has never seen before, and comparing the system response against collected human descriptions of the same objects. We propose that our method of retrieving object examples with a k-nearest neighbor classifier using Mahalanobis distance corresponds to a cognitively plausible representation of objects. Our initial results show promise for achieving rapid, near one-shot, incremental learning of word meanings.
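The retrieval step described above, a k-nearest-neighbour classifier over object feature vectors under Mahalanobis distance, can be sketched directly with scikit-learn by passing an inverse covariance matrix as the metric parameter. The features below are synthetic placeholders, not the Kinect-derived representation used in the paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for sensor-derived object features (e.g. colour/shape statistics).
X = np.vstack([rng.normal(loc=m, scale=[1.0, 0.2], size=(20, 2))
               for m in ([0, 0], [3, 0], [0, 3])])
y = np.array(["cup"] * 20 + ["ball"] * 20 + ["box"] * 20)

VI = np.linalg.inv(np.cov(X, rowvar=False))          # inverse covariance for Mahalanobis
knn = KNeighborsClassifier(n_neighbors=5, metric="mahalanobis",
                           metric_params={"VI": VI}, algorithm="brute")
knn.fit(X, y)

# A new observation is named after its nearest stored examples, with the metric
# discounting feature directions that naturally vary a lot.
print(knn.predict([[2.7, 0.3]]))
```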
... The presence of a dog can be used as a supervisory signal to train language learning systems to parse and segment words that consistently occur in the same context, and semantically group words that occur in similar visual contexts. In order to develop AI systems or models of human learning with similar multimodal language learning skills, a number of models and learning algorithms have been proposed throughout the years (e.g., [1][2][3][4][5][6]). These systems have used various types of speech data with simulated or robot vision-based visual input. ...
... To represent a variety of participants, two baseline models have been produced. One uses a low budget (72 GPU hours); the other uses a high budget (165 GPU hours), corresponding to the following submission tracks: ...
Preprint
Full-text available
We present the visually-grounded language modelling track that was introduced in the Zero-Resource Speech challenge, 2021 edition, 2nd round. We motivate the new track and discuss participation rules in detail. We also present the two baseline systems that were developed for this track.
... Fazly et al. 2010; Siskind 1996). In some approaches, this meaning is grounded in perception, but is typically limited to objects (Hsiao et al. 2008; Roy et al. 2002). Other approaches (e.g. ...
Thesis
Full-text available
Most theories and models of language acquisition so far have adopted a ‘mapping’ paradigm according to which novel words or constructions are ‘mapped’ onto existing, priorly acquired or innate concepts. Departing from this mapping approach, the thesis develops a computational model of the co-emergence of linguistic and conceptual structures with a particular focus on the case of action verbs. The model is inspired by emergentist theories of language acquisition and transfers the underlying ideas also to the domain of action learning. The emergentist cross-modal learning process spells out how a learner can distill the essence of the meaning of a verbal construction as a process of incremental generalization of the meaning of action verbs, starting from a meaning that is specific to a certain situation in which the verb has been encountered. The meaning of action verbs is understood as evoking a grounded simulation rather than a static concept. We show that cross-modal learning can provide an advantage over uni-modal models especially when observations are ambiguous and hard to differentiate. The connection between the theoretical foundation and the technical implementation is bidirectional. On the one hand, the technical properties of the model, such as the fully incremental and data-driven approach to learning concepts within a modality that are grounded in concepts with similar semantics of another modality, are relevant in many technical applications that are common in human-computer interaction and robotics. On the other hand, technical implementations that are closely based on key concepts of theoretical frameworks can also serve as computer-implemented theories that allow other researchers to interactively test hypotheses that arise from the theoretical foundation. The thesis details connections to ongoing interdisciplinary research in linguistics and cognitive sciences.
... In computational studies, researchers have built models that implement in-principle learning algorithms, and created training sets to test the abilities of the models to find statistical regularities in the input data. Some earlier work in modeling word learning has used sensory data collected from adult learners or robots [11,14], while more recent models take symbolic data or simplified inputs [14]. Little is known about whether these models can scale up to address the same problems faced by infants in real-world learning. ...
Conference Paper
Full-text available
Human infants have the remarkable ability to learn the associations between object names and visual objects from inherently ambiguous experiences. Researchers in cognitive science and developmental psychology have built formal models that implement in-principle learning algorithms, and then used pre-selected and pre-cleaned datasets to test the abilities of the models to find statistical regularities in the input data. In contrast to previous modeling approaches, the present study used egocentric video and gaze data collected from infant learners during natural toy play with their parents. This allowed us to capture the learning environment from the perspective of the learner's own point of view. We then used a Convolutional Neural Network (CNN) model to process sensory data from the infant's point of view and learn name-object associations from scratch. As the first model that takes raw egocentric video to simulate infant word learning, the present study provides a proof of principle that the problem of early word learning can be solved, using actual visual data perceived by infant learners. Moreover, we conducted simulation experiments to systematically determine how attentional properties of infants' sensory experiences may affect word learning.
... In our model, agents observe signals in their physical environment as well as in their social environment; their codebooks are grounded in their physical and social experiences [26], [27]. Signals in an agent's physical environment are randomly generated from a local continuous source distribution, whereas signals in an agent's social environment are received from peers in the network. ...
Preprint
We consider a strategic network quantizer design setting where agents must balance fidelity in representing their local source distributions against their ability to successfully communicate with other connected agents. We study the problem as a network game and show existence of Nash equilibrium quantizers. For any agent, under Nash equilibrium, the word representing a given partition region is the conditional expectation of the mixture of local and social source probability distributions within the region. Since having knowledge of the original source of information in the network may not be realistic, we show that under certain conditions, the agents need not know the source origin and yet still settle on a Nash equilibrium using only the observed sources. Further, the network may converge to equilibrium through a distributed version of the Lloyd-Max algorithm. In contrast to traditional results in the evolution of language, we find several vocabularies may coexist in the Nash equilibrium, with each individual having exactly one of these vocabularies. The overlap between vocabularies is high for individuals that communicate frequently and have similar local sources. Finally, we argue that error in translation along a chain of communication does not grow if and only if the chain consists of agents with shared vocabulary. Numerical results are given.
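The equilibrium condition quoted in the abstract, each region's representative word being the conditional expectation of the mixed source within that region, is the familiar Lloyd-Max fixed point. Below is a minimal one-dimensional, sample-based Lloyd-Max iteration, not the distributed, game-theoretic version studied in the paper.

```python
import numpy as np

def lloyd_max(samples, k, iters=50):
    """Sample-based Lloyd-Max: alternate nearest-codeword partitioning with
    conditional-mean updates until the codebook stabilises."""
    codebook = np.quantile(samples, np.linspace(0.05, 0.95, k))       # rough init
    for _ in range(iters):
        # Partition step: assign each sample to its nearest codeword.
        assign = np.argmin(np.abs(samples[:, None] - codebook[None, :]), axis=1)
        # Update step: each codeword becomes the conditional mean of its region.
        for j in range(k):
            if np.any(assign == j):
                codebook[j] = samples[assign == j].mean()
    return np.sort(codebook)

rng = np.random.default_rng(0)
# Mixture of a "local" and a "social" source, echoing the paper's setting.
mix = np.concatenate([rng.normal(-2, 0.5, 5000), rng.normal(3, 1.0, 5000)])
print(lloyd_max(mix, k=4))   # codewords concentrate around the two mixture modes
```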
... In computational studies, researchers have built models that implement in-principle learning algorithms, and created training sets to test the abilities of the models to find statistical regularities in the input data. Some work in modeling word learning has used sensory data collected from adult learners or robots (Roy & Pentland, 2002; Yu & Ballard, 2007; Rasanen & Khorrami, 2019), while many models take symbolic data or simplified inputs (Frank et al., 2009; Kachergis & Yu, 2017; K. Smith, Smith, & Blythe, 2011; Fazly, Alishahi, & Stevenson, 2010; Yu & Ballard, 2007). ...
Conference Paper
Full-text available
Human infants have the remarkable ability to learn the associations between object names and visual objects from inherently ambiguous experiences. Researchers in cognitive science and developmental psychology have built formal models that implement in-principle learning algorithms, and then used preselected and pre-cleaned datasets to test the abilities of the models to find statistical regularities in the input data. In contrast to previous modeling approaches, the present study used egocentric video and gaze data collected from infant learners during natural toy play with their parents. This allowed us to capture the learning environment from the perspective of the learner’s own point of view. We then used a Convolutional Neural Network (CNN) model to process sensory data from the infant’s point of view and learn name-object associations from scratch. As the first model that takes raw egocentric video to simulate infant word learning, the present study provides a proof of principle that the problem of early word learning can be solved, using actual visual data perceived by infant learners. Moreover, we conducted simulation experiments to systematically determine how visual, perceptual, and attentional properties of infants’ sensory experiences may affect word learning.
... Grounded Reasoning. Early attempts to relate language understanding (Winograd, 1972) or learning (Roy & Pentland, 2002) to the physical world mostly rely on manually created linguistic and physical rules (Hermann et al., 2017; Berant et al., 2013). Recent studies have claimed that pre-trained large-scale LMs have already memorized enough world knowledge (Roberts et al., 2020; Brown et al., 2020), and enhanced reasoning ability can be achieved by proper prompting (Nye et al., 2021; Wei et al., 2021; Sanh et al., 2021). ...
Preprint
Full-text available
Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience of humans in the real-world -- their failure to relate language to the physical world causes knowledge to be misrepresented and obvious mistakes in their reasoning. We present Mind's Eye, a paradigm to ground language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind's MuJoCo) to simulate the possible outcomes, and then use the simulation results as part of the input, which enables language models to perform reasoning. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot, and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can obtain similar performance to models that are 100x larger. Finally, we confirm the robustness of Mind's Eye through ablation studies.
... These enable robots to acquire various kinds of knowledge by inferring the latent variables from their own observations. A further advancement of such cognitive systems allows the robots to find meanings of words by treating a linguistic input as another modality [13][14][15]. Cognitive models have recently become more complex in realizing various cognitive capabilities: grammar acquisition [16], language model learning [17], hierarchical concept acquisition [18,19], spatial concept acquisition [20], motion skill acquisition [21], and task planning [7] (see Fig. 1). This results in an increase in the development cost of each cognitive system. ...
Article
Full-text available
This paper describes a framework for the development of an integrative cognitive system based on probabilistic generative models (PGMs) called Neuro-SERKET. Neuro-SERKET is an extension of SERKET, which can compose elemental PGMs developed in a distributed manner and provide a scheme that allows the composed PGMs to learn throughout the system in an unsupervised way. In addition to the head-to-tail connection supported by SERKET, Neuro-SERKET supports tail-to-tail and head-to-head connections, as well as neural network-based modules, i.e., deep generative models. As an example of a Neuro-SERKET application, an integrative model was developed by composing a variational autoencoder (VAE), a Gaussian mixture model (GMM), latent Dirichlet allocation (LDA), and automatic speech recognition (ASR). The model is called VAE + GMM + LDA + ASR. The performance of VAE + GMM + LDA + ASR and the validity of Neuro-SERKET were demonstrated through a multimodal categorization task using image data and a speech signal of numerical digits.
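As a very rough, feed-forward caricature of such a composition (which omits Neuro-SERKET's message passing between modules and substitutes random arrays for the VAE latents and ASR output), one can chain a GMM over image latents with an LDA over the resulting category-word co-occurrences:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

def one_hot(idx, n):
    out = np.zeros((len(idx), n))
    out[np.arange(len(idx)), idx] = 1
    return out

# Stand-in for VAE image latents: 300 observations drawn from 3 visual clusters.
latents = np.vstack([rng.normal(m, 0.3, size=(100, 8)) for m in (-2.0, 0.0, 2.0)])
# Stand-in for ASR output: a recognised digit word (0-9) per observation.
words = rng.integers(0, 10, size=300)

# GMM module: discretise the continuous latents into visual categories.
visual_cat = GaussianMixture(n_components=3, random_state=0).fit_predict(latents)

# LDA module: each observation becomes a tiny "document" containing its visual
# category token and its word token, so topics capture cross-modal co-occurrence.
docs = np.hstack([one_hot(visual_cat, 3), one_hot(words, 10)])
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(docs)
print(lda.components_.round(2))   # per-topic weights over 3 visual + 10 word tokens
```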
... Visual grounding of speech is a form of self-supervised learning (Virginia de Sa, 1994), which is powerful in part because it offers a way of training models with a discriminative objective that does not depend on traditional transcriptions or annotations. The first work in this direction relied on phone strings to represent the speech (Roy & Pentland, 2002; Roy, 2003), but more recently this learning has been shown to be possible directly on the speech signal (Synnaeve et al., 2014; Harwath & Glass, 2015; Harwath et al., 2016). Subsequent work on visually-grounded models of speech has investigated improvements and alternatives to the modeling or training algorithms (Leidal et al., 2017; Kamper et al., 2017c; Havard et al., 2019a; Merkx et al., 2019; Scharenborg et al., 2018a; Ilharco et al., 2019; Eloff et al., 2019a), application to multilingual settings (Harwath et al., 2018a; Kamper & Roth, 2017; Azuh et al., 2019; Havard et al., 2019a), analysis of the linguistic abstractions, such as words and phones, which are learned by the models (Harwath et al., 2018b; Drexler & Glass, 2017; Havard et al., 2019b), and the impact of jointly training with textual input (Holzenberger et al., 2019; Chrupała, 2019; Pasad et al., 2019). ...
Preprint
Full-text available
In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is capable of capturing both word-level and sub-word units, depending on how it is configured. What differentiates this paper from prior work on speech unit learning is the choice of training objective. Rather than using a reconstruction-based loss, we use a discriminative, multimodal grounding objective which forces the learned units to be useful for semantic image retrieval. We evaluate the sub-word units on the ZeroSpeech 2019 challenge, achieving a 27.3% reduction in ABX error rate over the top-performing submission, while keeping the bitrate approximately the same. We also present experiments demonstrating the noise robustness of these units. Finally, we show that a model with multiple quantizers can simultaneously learn phone-like detectors at a lower layer and word-like detectors at a higher layer. We show that these detectors are highly accurate, discovering 279 words with an F1 score of greater than 0.5.
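The central ingredient, a vector quantization layer inserted into a speech encoder, is typically implemented in the VQ-VAE style: nearest-codebook lookup, straight-through gradients, and a commitment term. The sketch below shows that standard formulation, which is not necessarily identical to the configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with straight-through gradients."""
    def __init__(self, num_codes=256, dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1 / num_codes, 1 / num_codes)
        self.beta = beta

    def forward(self, z):                                 # z: (batch, time, dim)
        flat = z.reshape(-1, z.shape[-1])
        dists = torch.cdist(flat, self.codebook.weight)   # distances to every code
        idx = dists.argmin(dim=1)
        q = self.codebook(idx).view_as(z)
        # Codebook and commitment losses (VQ-VAE style).
        loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        # Straight-through estimator: copy gradients from q back to z.
        q = z + (q - z).detach()
        return q, idx.view(z.shape[:-1]), loss

vq = VectorQuantizer()
speech_states = torch.randn(4, 50, 128, requires_grad=True)   # e.g. encoder outputs
quantized, codes, vq_loss = vq(speech_states)
print(quantized.shape, codes.shape, float(vq_loss))
```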
... In computational studies, researchers have built models that implement in-principle learning algorithms, and created training sets to test the abilities of the models to find statistical regularities in the input data. Some work in modeling word learning has used sensory data collected from adult learners or robots (Roy & Pentland, 2002;Yu & Ballard, 2007;Rasanen & Khorrami, 2019), while many models take symbolic data or simplified inputs (Frank et al., 2009;Kachergis & Yu, 2017;K. Smith, Smith, & Blythe, 2011;Fazly, Alishahi, & Stevenson, 2010;Yu & Ballard, 2007). ...
Preprint
Full-text available
Human infants have the remarkable ability to learn the associations between object names and visual objects from inherently ambiguous experiences. Researchers in cognitive science and developmental psychology have built formal models that implement in-principle learning algorithms, and then used pre-selected and pre-cleaned datasets to test the abilities of the models to find statistical regularities in the input data. In contrast to previous modeling approaches, the present study used egocentric video and gaze data collected from infant learners during natural toy play with their parents. This allowed us to capture the learning environment from the perspective of the learner's own point of view. We then used a Convolutional Neural Network (CNN) model to process sensory data from the infant's point of view and learn name-object associations from scratch. As the first model that takes raw egocentric video to simulate infant word learning, the present study provides a proof of principle that the problem of early word learning can be solved, using actual visual data perceived by infant learners. Moreover, we conducted simulation experiments to systematically determine how visual, perceptual, and attentional properties of infants' sensory experiences may affect word learning.
... On the methodological level, and given that our experiments crucially involve dyads consisting of humans and a small humanoid robot, the work falls into the area of Human-Robot Interaction (HRI). In no less important ways it builds upon recent research on symbol grounding in developmental and cognitive robotics [7,34,35,39,40]. We adopt a constructive, or synthetic, approach to cognitive science that has been termed 'cognitive developmental robotics' elsewhere [2]. ...
Article
“No” is one of the first ten words used by children and embodies the first form of linguistic negation. Despite its early occurrence, the details of its acquisition remain largely unknown. The circumstance that “no” cannot be construed as a label for perceptible objects or events puts it outside the scope of most modern accounts of language acquisition. Moreover, most symbol grounding architectures will struggle to ground the word due to its non-referential character. The presented work extends symbol grounding to encompass affect and motivation. In a study involving the child-like robot iCub, we attempt to illuminate the acquisition process of negation words. The robot is deployed in speech-wise unconstrained interaction with participants acting as its language teachers. The results corroborate the hypothesis that affect or volition plays a pivotal role in the acquisition process. Negation words are prosodically salient within prohibitive utterances and negative intent interpretations such that they can be easily isolated from the teacher’s speech signal. These words subsequently may be grounded in negative affective states. However, observations of the nature of prohibition and the temporal relationships between its linguistic and extra-linguistic components raise questions over the suitability of Hebbian-type algorithms for certain types of language grounding.
... Deep learning has been used to learn robust feature representation for speech over varying speaker and background characteristics such as in [13,14,15,16]. [2,3,4,17,18,19,20] explored unsupervised learning of speech features using visual context. Cross-lingual translation research has focused on text-to-text translation [21,22] as well as speech-to-text from one language to another [23,24,25]. ...
... In lieu of labels, self-supervised learning algorithms leverage informative context found e.g., in another modality. An early example of this is the CELL model introduced by [11] which learned to associate words, represented by phoneme strings, with the visual images they described. Recently, [12,13,14] introduced models capable of learning the semantic correspondences between raw speech waveforms and natural images at the pixel level. ...
... Finally, the acquired information is consumed by a high-level task planner that triggers the pertinent behaviors. Although processing raw audio signals is technically possible [6,33], this is still unexplored in RoboCup@Home. Typically, no external filters are used, leaving filtering to the microphone and the ASR engine [7]. ...
Preprint
Full-text available
Scientific competitions are crucial in the field of service robotics. They foster knowledge exchange and allow teams to test their research in unstandardized scenarios and compare results. Such is the case of RoboCup@Home. However, keeping track of all the technologies and solution approaches used by teams to solve the tests can be a challenge in itself. Moreover, after eleven years of competitions, it's easy to delve too deep into the field, losing perspective and forgetting about the user's needs and long-term goals. In this paper, we aim to tackle these problems by presenting a summary of the trending solutions and approaches used in RoboCup@Home, and discussing the attained achievements and the challenges to overcome in relation to the progress required to fulfill the long-term goal of the league. Hence, considering the current capabilities of the robots and their limitations, we propose a set of milestones to address in upcoming competitions. With this work we lay the foundations towards the creation of roadmaps that can help to direct efforts in testing and benchmarking in robotics competitions.
... Machine learning for pattern recognition is obviously a hot topic in the research community, with applications in image [21] and sound [32] processing. Regarding applications to cache-timing attacks, the authors of [39] used support vector machines, a machine learning algorithm, to classify vectors of cache access timings into a sequence of operations of a modular exponentiation. ...
Article
Full-text available
Cache-timing attacks are serious security threats that exploit cache memories to steal secret information. We believe that the identification of a sequence of function calls from cache-timing data measurements is not a trivial step when building an attack. We present a recurrent neural network model able to automatically retrieve a sequence of operations from cache timings. Inspired by natural language processing, our model is able to learn on partially labelled data. We use the model to unfold an end-to-end automated attack on OpenSSL ECDSA on the secp256k1 curve. Our attack is able to extract the 256 bits of the secret key by automatic analysis of about 2400 traces without any human processing.
... Apart from phrase localization, the most promising data source is image captioning datasets with sentence-to-image mappings (or mappings discovered from multimodal documents, as in Hessel et al. (2019)). Image captions belong to a specific type of language called grounded language (Roy and Pentland, 2002; Hermann et al., 2017), which has an explicit grounding to external existence or physical actions. However, grounded language diverges substantially from other types of natural language (e.g., News, Wiki, and Textbooks). ...
Preprint
Humans learn language by listening, speaking, writing, reading, and also, via interaction with the multimodal real world. Existing language pre-training frameworks show the effectiveness of text-only self-supervision while we explore the idea of a visually-supervised language model in this paper. We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora. Therefore, we develop a technique named "vokenization" that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images (which we call "vokens"). The "vokenizer" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora. Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks such as GLUE, SQuAD, and SWAG. Code and pre-trained models publicly available at https://github.com/airsplay/vokenization
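The core retrieval step described above, contextually mapping each language token to a related image, can be sketched as nearest-neighbour search in a shared embedding space. The sketch below is not the released vokenization code; it assumes hypothetical, already-trained token and image embeddings (random arrays stand in for them here) and only illustrates the assignment step.

import numpy as np

def assign_vokens(token_embeddings, image_embeddings):
    """Nearest-neighbour token-to-image assignment by cosine similarity.

    token_embeddings: (num_tokens, d) contextual embeddings of the tokens.
    image_embeddings: (num_images, d) embeddings of a fixed image inventory.
    Returns an index into the image inventory for every token (its "voken").
    """
    t = token_embeddings / np.linalg.norm(token_embeddings, axis=1, keepdims=True)
    v = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    similarity = t @ v.T          # (num_tokens, num_images) cosine similarities
    return similarity.argmax(axis=1)

# Toy example with random embeddings standing in for a trained matching model.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 16))    # e.g. one row per subword token in a sentence
images = rng.normal(size=(100, 16))  # a small image inventory
print(assign_vokens(tokens, images))

In the actual approach the matching model is trained on image captioning data before being applied to language-only corpora.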
... In order to bridge the gap between formal linguistics and bio-inspired systems, several valuable computational models have been developed that bring together language and an agent's multimodal perception and action. In their seminal Cross-channel Early Lexical Learning (CELL) model, Roy and Pentland (2002) demonstrate word learning from real sound and vision input. Each of these inputs is processed into a fixed-length vector, then lexical items arise by associations between vectors that represent the corresponding speech and an object's shape. ...
Preprint
Full-text available
Human infants are able to acquire natural language seemingly easily at an early age. Their language learning seems to occur simultaneously with learning other cognitive functions as well as with playful interactions with the environment and caregivers. From a neuroscientific perspective, natural language is embodied, grounded in most, if not all, sensory and sensorimotor modalities, and acquired by means of crossmodal integration. However, characterising the underlying mechanisms in the brain is difficult and explaining the grounding of language in crossmodal perception and action remains challenging. In this paper, we present a neurocognitive model for language grounding which reflects bio-inspired mechanisms such as an implicit adaptation of timescales as well as end-to-end multimodal abstraction. It addresses developmental robotic interaction and extends its learning capabilities using larger-scale knowledge-based data. In our scenario, we utilise the humanoid robot NICO in obtaining the EMIL data collection, in which the cognitive robot interacts with objects in a children's playground environment while receiving linguistic labels from a caregiver. The model analysis shows that crossmodally integrated representations are sufficient for acquiring language merely from sensory input through interaction with objects in an environment. The representations self-organise hierarchically and embed temporal and spatial information through composition and decomposition. This model can also provide the basis for further crossmodal integration of perceptually grounded cognitive representations.
... To overcome the grounding problem, an important question to ask is: do agents really need to learn language grounding from scratch through random exploration in an environment where success is determined by chance? Perhaps nature has a different answer; previous studies in cognitive science and evolutionary linguistics [21,38,41,42] have provided evidence for the hypothesis that communication first started from sounds whose meanings are grounded in the physical environment, and that creatures then adapted to make sense of those sounds and make use of them. Inspired by language learning in natural species, we propose a novel framework for grounding multi-agent communication: first ground speaking through learned representations of the world, then learn listening to interpret these grounded utterances. ...
Preprint
Full-text available
Communication requires having a common language, a lingua franca, between agents. This language could emerge via a consensus process, but it may require many generations of trial and error. Alternatively, the lingua franca can be given by the environment, where agents ground their language in representations of the observed world. We demonstrate a simple way to ground language in learned representations, which facilitates decentralized multi-agent communication and coordination. We find that a standard representation learning algorithm -- autoencoding -- is sufficient for arriving at a grounded common language. When agents broadcast these representations, they learn to understand and respond to each other's utterances and achieve surprisingly strong task performance across a variety of multi-agent communication environments.
... The actions for the system were selecting a target object and generating the trajectory for the arm movement. Roy et al. proposed a system to learn vocabulary from continuously spoken utterances [18]. It could also learn associations between words and image objects. ...
Article
Full-text available
Human babies are born without knowledge of any specific language. They acquire language directly from observation and dialogue without being limited by the availability of labeled data. We propose spoken language acquisition agents that simulate this process. Such an ability requires multiple types of learning, including 1) word discovery, 2) symbol grounding, 3) message generation, and 4) pronunciation generation. Several studies have targeted one or a combination of these learning types to elucidate human intelligence and aimed to equip spoken dialogue systems with human-like flexible language learning ability. However, their language abilities lacked some of these components. Our agents are the first to integrate them all. Our key concept is to design an architecture that integrates unsupervised, self-supervised, and reinforcement learning to utilize clues naturally existing in raw sensory signals and to drive the learning based on the agent’s intrinsic motivation. Experimental results show that the agents successfully acquire spoken language from scratch by interacting with an environment in which they act by speaking. Our proposed focusing mechanism significantly improves learning efficiency. We also demonstrate that our agents can learn a neural vocoder and the concept of logical negation as part of language acquisition.
... For a more cognitively realistic approach, the Cross-channel Early Lexical Learning (CELL) model is a simple example of a model that learns directly from raw 'first-person-perspective' sensory data to ground words, particularly relating to object shapes [Roy & Pentland, 2002]. In essence, CELL achieves this by inferring the most probable word-to-symbol associations by abstracting consistent word-to-context patterns from multiple situations, specifically correlations between visual input (of objects with different shapes) and words for shape types. ...
Preprint
Recent hype surrounding the increasing sophistication of language processing models has renewed optimism regarding machines achieving a human-like command of natural language. The area of natural language understanding in artificial intelligence claims to have been making great strides in this direction; however, the lack of conceptual clarity in how 'understanding' is used in this and other disciplines has made it difficult to discern how close we actually are. A comprehensive, interdisciplinary overview of current approaches and remaining challenges is yet to be carried out. Beyond linguistic knowledge, this requires considering our species-specific capabilities to categorize, memorize, label and communicate our (sufficiently similar) embodied and situated experiences. Moreover, gauging the practical constraints requires critically analyzing the technical capabilities of current models, as well as deeper philosophical reflection on theoretical possibilities and limitations. In this paper, I unite all of these perspectives -- the philosophical, cognitive-linguistic, and technical -- to unpack the challenges involved in reaching true (human-like) language understanding. By unpacking the theoretical assumptions inherent in current approaches, I hope to illustrate how far we actually are from achieving this goal, if indeed it is the goal.
... Other approaches based on associative learning (e.g. Colunga and Smith, 2005;Landau et al., 1988;Regier, 2005;Roy and Pentland, 2002), have not been formalized to interface with the main experimental paradigm discussed in this paper (Xu and Tenenbaum, 2007b;Spencer et al., 2011;Lewis and Frank, 2018) and so I will not address them further here (which might otherwise require a chapter-length discussion on its own). Instead this chapter focuses primarily on comparison to the Bayesian inference model of word learning. ...
Article
This dissertation investigates the wide-ranging implications of a simple fact: language unfolds over time. Whether as cognitive symbols in our minds, or as their physical realization in the world, if linguistic computations are not made over transient and shifting information as it occurs, they cannot be made at all. This dissertation explores the interaction between the computations, mechanisms, and representations of language acquisition and language processing—with a central theme being the unique study of the temporal restrictions inherent to information processing that I term the immediacy of linguistic computation. This program motivates the study of intermediate representations recruited during online processing and acquisition rather than simply an Input/Output mapping. While ultimately extracted from linguistic input, such intermediate representations may differ significantly from the underlying distributional signal. I demonstrate that, due to the immediacy of linguistic computation, such intermediate representations are necessary, discoverable, and offer an explanatory connection between competence (linguistic representation) and performance (psycholinguistic behavior). The dissertation is comprised of four case studies. First, I present experimental evidence from a perceptual learning paradigm that the intermediate representation of speech consists of probabilistic activation over discrete linguistic categories but includes no direct information about the original acoustic-phonetic signal. Second, I present a computational model of word learning grounded in category formation. Instead of retaining experiential statistics over words and all their potential meanings, my model constructs hypotheses for word meanings as they occur. Uses of the same word are evaluated (and revised) with respect to the learner's intermediate representation rather than to their complete distribution of experience. In the third case study, I probe predictions about the time-course, content, and structure of these intermediate representations of meaning via a new eye-tracking paradigm. Finally, the fourth case study uses large-scale corpus data to explore syntactic choices during language production. I demonstrate how a mechanistic account of production can give rise to highly "efficient" outcomes even without explicit optimization. Taken together these case studies represent a rich analysis of the immediacy of linguistic computation and its system-wide impact on the mental representations and cognitive algorithms of language.
... One line of research aims to elucidate the lexical acquisition process by imitating human functions and expressing them via machine learning methods [10]-[15]. This type of approach is referred to as a constructive approach. ...
Preprint
Human infants acquire their verbal lexicon from minimal prior knowledge of language, based on the statistical properties of phonological distributions and the co-occurrence of other sensory stimuli. In this study, we propose a novel, fully unsupervised learning method for discovering speech units that utilizes phonological information as a distributional cue and object information as a co-occurrence cue. The proposed method can not only (1) acquire words and phonemes from speech signals using unsupervised learning, but can also (2) utilize object information based on multiple modalities (i.e., vision, tactile, and auditory) simultaneously. The proposed method is based on the Nonparametric Bayesian Double Articulation Analyzer (NPB-DAA), which discovers phonemes and words from phonological features, and Multimodal Latent Dirichlet Allocation (MLDA), which categorizes multimodal information obtained from objects. In the experiment, the proposed method showed higher word discovery performance than the baseline methods. In particular, words that expressed the characteristics of an object (i.e., words corresponding to nouns and adjectives) were segmented accurately. Furthermore, we examined how learning performance is affected by differences in the importance of linguistic information. When the weight of the word modality was increased, the performance improved further compared to the fixed condition.
... Several researchers have investigated methods to combine the two cognitive modalities to understand semantic processing (Badler, 1975; Waltz, 1980; Herzog & Wazinski, 1994; Srihari, 1995). Deb Roy proposed a technique to integrate vision and language elicited from infants using a mutual information model (Roy, 2000; Roy & Pentland, 2002). In the last decade, several researchers began studying the multimodal integration problem in relation to sentence prediction and object naming in scenic images (Coco & Keller, 2012; Clarke et al., 2013; Yun et al., 2013a, 2013b). ...
Article
Full-text available
Despite many recent advances in the field of computer vision, there remains a disconnect between how computers process images and how humans understand them. To begin to bridge this gap, we propose a framework that integrates human-elicited gaze and spoken language to label perceptually important regions in an image. Our work relies on the notion that gaze and spoken narratives can jointly model how humans inspect and analyze images. Using an unsupervised bitext alignment algorithm originally developed for machine translation, we create meaningful mappings between participants' eye movements over an image and their spoken descriptions of that image. The resulting multimodal alignments are then used to annotate image regions with linguistic labels. The accuracy of these labels exceeds that of baseline alignments obtained using purely temporal correspondence between fixations and words. We also find differences in system performances when identifying image regions using clustering methods that rely on gaze information rather than image features. The alignments produced by our framework can be used to create a database of low-level image features and high-level semantic annotations corresponding to perceptually important image regions. The framework can potentially be applied to any multimodal data stream and to any visual domain. To this end, we provide the research community with access to the computational framework.
... In lieu of labels, self-supervised learning algorithms leverage informative context found, e.g., in another modality. An early example of this is the CELL model introduced by [11], which learned to associate words, represented by phoneme strings, with the visual images they described. Recently, [12,13,14] introduced models capable of learning the semantic correspondences between raw speech waveforms and natural images at the pixel level. ...
Preprint
In this paper, we investigate the manner in which interpretable sub-word speech units emerge within a convolutional neural network model trained to associate raw speech waveforms with semantically related natural image scenes. We show how diphone boundaries can be superficially extracted from the activation patterns of intermediate layers of the model, suggesting that the model may be leveraging these events for the purpose of word recognition. We present a series of experiments investigating the information encoded by these events.
... Unlike human infants, such a system depends on being properly prepared with manually annotated data prior to any online interaction. For a more cognitively realistic approach, the Cross-channel Early Lexical Learning (CELL) model is a simple example of a model that learns directly from raw 'first-person-perspective' sensory data to ground words, particularly relating to object shapes (Roy & Pentland, 2002). ...
Thesis
Full-text available
In this thesis, I carry out a novel and interdisciplinary analysis of various complex factors involved in human natural-language acquisition, use and comprehension, aimed at uncovering some of the basic requirements that would have to be met if we were to try to develop artificially intelligent (AI) agents with similar capacities.
... Over the past two decades, people have studied, for example, how children can learn basic social interaction skills such as joint attention [3,14,21]. The problem of language grounding was also already being studied 20 or 30 years ago, even before developmental robotics started as a field [28,29,27]. ...
Preprint
This paper outlines a perspective on the future of AI, discussing directions for machine models of human-like intelligence. We explain how developmental and evolutionary theories of human cognition should further inform artificial intelligence. We emphasize the role of ecological niches in sculpting intelligent behavior, and in particular that human intelligence was fundamentally shaped to adapt to a constantly changing socio-cultural environment. We argue that a major limit of current work in AI is that it is missing this perspective, both theoretically and experimentally. Finally, we discuss the promising approach of developmental artificial intelligence, modeling infant development through multi-scale interaction between intrinsically motivated learning, embodiment and a rapidly changing socio-cultural environment. This paper takes the form of an interview of Pierre-Yves Oudeyer by Manfred Eppe, organized within the context of a KI - Künstliche Intelligenz special issue on developmental robotics.
... Our framework draws on a multimodal semantic representational space that is inspired partly by recent work on visually grounded word learning (Lazaridou et al., 2016;Roy & Pentland, 2002;Yu, 2005). This line of research uses visual features in the environment to model word learning as a process grounded in visual perception. ...
Article
Overextension—the phenomenon that children extend known words to describe referents outside their vocabulary—is a hallmark of lexical innovation in early childhood. Overextension is a subject of extensive inquiry in linguistics and developmental psychology, but there exists no coherent formal account of this phenomenon. We develop a general computational framework that captures important properties of overextension reported separately in the previous literature. We operationalize overextension as probabilistic inference over a conceptual space that draws on a fusion of knowledge from lexical semantics, deep neural networks, and psychological experiments to support both production and comprehension. We show how this minimally parameterized framework explains overextension in young children over a comprehensive set of noun-referent pairs previously reported in child speech, and it also predicts the behavioral asymmetry in children's overextensional production and comprehension reported in lab settings. Our work offers a computational theory for the origins of word meaning extension and supports a single-system view of language production and comprehension.
Article
In order to learn the mappings from words to referents, children must integrate co‐occurrence information across individually ambiguous pairs of scenes and utterances, a challenge known as cross‐situational word learning. In machine learning, recent multimodal neural networks have been shown to learn meaningful visual‐linguistic mappings from cross‐situational data, as needed to solve problems such as image captioning and visual question answering. These networks are potentially appealing as cognitive models because they can learn from raw visual and linguistic stimuli, something previous cognitive models have not addressed. In this paper, we examine whether recent machine learning approaches can help explain various behavioral phenomena from the psychological literature on cross‐situational word learning. We consider two variants of a multimodal neural network architecture and look at seven different phenomena associated with cross‐situational word learning and word learning more generally. Our results show that these networks can learn word‐referent mappings from a single epoch of training, mimicking the amount of training commonly found in cross‐situational word learning experiments. Additionally, these networks capture some, but not all of the phenomena we studied, with all of the failures related to reasoning via mutual exclusivity. These results provide insight into the kinds of phenomena that arise naturally from relatively generic neural network learning algorithms, and which word learning phenomena require additional inductive biases.
Thesis
Full-text available
As most endangered languages exist only in spoken form, it is crucial for linguists to transcribe them in order to preserve records of linguistic events and support language learning. The urgency of language endangerment has also prompted linguists to incorporate computational speech processing tools into the transcription pipeline. However, treating indigenous knowledge as mere data disenfranchises local language speakers. In order to truly support oral language speakers in their desire to maintain the language, speech technological solutions need to put human collaboration at the centre of the transcription pipeline. With the goal of collaboration with language speakers, we explored a transcription pipeline that incorporates respeaking and word-spotting. Including respeaking, a linguistic field method that utilises slow repeated speech, enables oral language speakers to participate in the transcription process. Using the Indigenous Australian language of Kunwinjku as an example, we showed the positive effect of respeaking on word-spotting. We also demonstrate how additional collaborative tasks in respeaking can be included to significantly improve the performance of word-spotting in the transcription pipeline.
Chapter
This chapter presents work on developmental machine learning strategies applied to robots for language acquisition. The authors focus on learning by scaffolding and emphasize the role of the human caregiver for robot learning. Indeed, language acquisition does not occur in isolation, neither can it be a robot’s “genetic legacy.” Rather, they propose that language is best acquired incrementally, in a social context, through human-robot interactions in which humans guide the robot, as if it were a child, through the learning process. The authors briefly discuss psychological models related to this work and describe and discuss computational models that they implemented for robot language acquisition. The authors aim to introduce robots into our society and treat them as us, using child development as a metaphor for robots’ developmental language learning.
Preprint
Full-text available
Language acquisition is the process by which humans acquire the capacity to perceive and comprehend language (in other words, gain the ability to be aware of language and to understand it), as well as to produce and use words and sentences to communicate (Wikipedia). Language acquisition involves structures, rules and representation. The capacity to use language successfully requires one to acquire a range of tools including phonology, morphology, syntax, semantics, and an extensive vocabulary. Language can be vocal, as in speech, or manual, as in sign (Pichler, 2015; Algburi and Igaab, 2021: 39; Betti, 1990: 24). Human language capacity is represented in the brain. Even though human language capacity is finite, one can say and understand an infinite number of sentences, which is based on a syntactic principle called recursion. Evidence suggests that every individual has three recursive mechanisms that allow sentences to be extended indefinitely. These three mechanisms are: relativization, complementation and coordination.
Article
Full-text available
The Symbolic Grounding Problem is viewed as a by-product of the classical cognitivist approach to studying the mind. In contrast, an epigenetic interpretation of connectionist approaches to studying the mind is shown to offer an account of symbolic skills as an emergent, developmental phenomenon. We describe a connectionist model of concept formation and vocabulary growth that auto-associates image representations and their associated labels. The image representations consist of clusters of random dot figures, generated by distorting prototypes. Any given label is associated with a cluster of random dot figures. The network model is tested on its ability to reproduce image representations given input labels alone (comprehension) and to identify labels given input images alone (production). The model implements several well-documented findings in the literature on early semantic development; the occurrence of over- and under-extension errors; a vocabulary spurt; a comprehension/production asymmetry; and a prototype effect. It is shown how these apparently disparate findings can be attributed to the operation of a single underlying mechanism rather than by invoking separate explanations for each phenomenon. The model represents a first step in the direction of providing a formal explanation of the emergence of symbolic behaviour in young children.
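A minimal sketch of the auto-associative idea described in this abstract is given below: a small one-hidden-layer network is trained to reproduce concatenated [image | label] patterns, with halves of the input occasionally hidden during training (a denoising-style simplification assumed here, not necessarily the original training regime), and it is then probed with labels alone ("comprehension") or images alone ("production"). The data, layer sizes, and learning rate are invented for illustration.

import numpy as np

rng = np.random.default_rng(1)

# Toy data: 3 concepts, each a cluster of noisy 8-d "image" vectors plus a one-hot label.
prototypes = rng.normal(size=(3, 8))
images = np.vstack([p + 0.1 * rng.normal(size=(20, 8)) for p in prototypes])
labels = np.repeat(np.eye(3), 20, axis=0)
data = np.hstack([images, labels])   # pattern to auto-associate: [image | label]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_in, n_hid, lr = data.shape[1], 6, 0.05
W1 = 0.1 * rng.normal(size=(n_in, n_hid))
W2 = 0.1 * rng.normal(size=(n_hid, n_in))
for _ in range(5000):
    # Occasionally hide one half of the input so the network learns to complete
    # partial patterns (an assumption of this sketch).
    inp = data.copy()
    mask = rng.random(len(data))
    inp[mask < 0.25, :8] = 0.0    # hide the image part
    inp[mask > 0.75, 8:] = 0.0    # hide the label part
    h = sigmoid(inp @ W1)
    err = h @ W2 - data           # reconstruction error against the full pattern
    W2 -= lr * h.T @ err / len(data)
    W1 -= lr * inp.T @ ((err @ W2.T) * h * (1 - h)) / len(data)

def complete(pattern):
    """One pass through the trained auto-associator."""
    return sigmoid(pattern @ W1) @ W2

# "Production": image in, read out the label part of the reconstruction.
print("produced label:", complete(np.hstack([images[0], np.zeros(3)]))[8:].round(2))
# "Comprehension": label in, compare the reconstructed image part with the prototype.
recon = complete(np.hstack([np.zeros(8), np.eye(3)[0]]))[:8]
print("image reconstruction error:", round(float(np.abs(recon - prototypes[0]).mean()), 3))

Whether such a toy network reproduces the specific developmental phenomena reported in the article (overextension, a vocabulary spurt, the comprehension/production asymmetry) is beyond the scope of this illustration.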
Article
Full-text available
Language research thrives on data collected from spontaneous interactions in naturally occurring situations. However, the process of collecting, transcribing, and analyzing naturalistic data can be extremely time-consuming and often unreliable. This book describes three basic tools for language analysis of transcript data by computer that have been developed in the context of the "Child Language Data Exchange System (CHILDES)" project. These are: the "CHAT" transcription and coding format, the "CLAN" package of analysis programs, and the "CHILDES" database. These tools have brought about significant changes in the way research is conducted in the child language field. They are being used with great success by researchers working with second language learning, adult conversational interactions, sociological content analyses, and language recovery in aphasia, as well as by students of child language development. The tools are widely applicable, although this book concentrates on their use in the child language field, believing that researchers from other areas can make the necessary analogies to their own topics. This thoroughly revised 2nd edition includes documentation on a dozen new computer programs that have been added to the basic system for transcript analysis. The most important of these new programs is the "CHILDES" Text Editor (CED) which can be used for a wide variety of purposes, including editing non-Roman orthographies, systematically adding codes to transcripts, checking the files for correct use of "CHAT," and linking the files to digitized audio and videotape. In addition to information on the new computer programs, the manual documents changes to the "CHILDES/BIB" system--given a major update in 1994--which now uses a new computer database system. The documentation for the "CHILDES" transcript database has been updated to include new information on old corpora and information on more than a dozen new corpora from many different languages. Finally, the system of "CHAT" notations for file transcripts has been clarified to emphasize the ways in which the codes are used by particular "CLAN" programs. The new edition concludes with a discussion of new directions in transcript analysis and links between the "CHILDES" database and other developments in multimedia computing and global networking. It also includes complete references organized by research topic area for the more than 300 published articles that have made use of the "CHILDES" database and/or the "CLAN" programs. LEA also distributes the "CLAN" programs and the complete "CHILDES" Database--including corpora from several languages and discourse situations--described in "The CHILDES Project." Be sure to choose the correct platform (IBM or Macintosh) for the "CLAN" programs; the "CHILDES" Database CD-ROM runs on both platforms.
Article
Full-text available
It has been claimed that young children use object names overgenerally and undergenerally because they do not have notions of objects of particular kinds, but rather, complexive notions of objects and their habitual actions or locations. However, for overgeneral uses in particular, it is difficult to differentiate word meaning from word use because communicative functions are not explicitly expressed in single-word speech. In the present paper, we identify three types of overgeneral uses, and argue that two of these reflect communicative functions rather than complexive meanings. We obtained production data from 10 children in the single-word period, using a standardized method of recording utterance contexts. Most uses of object names were for appropriate instances of the adult categories. Of the overgeneral uses, most were attributable to communicative functions rather than complexive meanings, and there was no evidence of undergeneral use. The results provide strong evidence that, from the very start, children's object names, like those of adults, apply to objects of particular kinds.
Article
Full-text available
Linguistic experience affects phonetic perception. However, the critical period during which experience affects perception and the mechanism responsible for these effects are unknown. This study of 6-month-old infants from two countries, the United States and Sweden, shows that exposure to a specific language in the first half year of life alters infants' phonetic perception.
Article
Full-text available
Previous work indicates that Ss can learn to classify sets of patterns which are distortions of a prototype they have not seen. It is shown that after learning a set of patterns, the prototype (schema) of that set is more easily classified than control patterns also within the learned category. As the variability among the memorized patterns increases, so does the ability of Ss to classify highly distorted new instances. These findings argue that information about the schema is abstracted from the stored instances with very high efficiency. It is unclear whether the abstraction of information involved in classifying the schema occurs while learning the original patterns or whether the abstraction process occurs at the time of the first presentation of the schema.
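The schema-abstraction result can be illustrated with a toy computation (not a re-implementation of the original experiments): distortions of an unseen prototype are "memorized", the stored instances are summarized by their mean, and the never-seen prototype then tends to lie closer to that summary than a fresh control distortion does. All patterns, dimensions, and noise levels below are invented.

import numpy as np

rng = np.random.default_rng(2)

# Two categories, each defined by an unseen prototype (here simply a vector,
# standing in for the coordinates of a random dot pattern).
prototypes = rng.normal(size=(2, 18))

def distort(proto, noise, n):
    """Generate n distortions of a prototype."""
    return proto + noise * rng.normal(size=(n, proto.size))

# "Subjects" only ever see distortions, never the prototypes themselves.
train = {c: distort(p, noise=0.5, n=10) for c, p in enumerate(prototypes)}

# A minimal abstraction account: the stored instances are summarized by their mean,
# and new patterns are classified by distance to that abstracted schema.
schemas = {c: x.mean(axis=0) for c, x in train.items()}

def classify(pattern):
    dists = {c: np.linalg.norm(pattern - s) for c, s in schemas.items()}
    return min(dists, key=dists.get), min(dists.values())

for c, p in enumerate(prototypes):
    _, d_proto = classify(p)                        # the never-seen prototype
    _, d_control = classify(distort(p, 0.5, 1)[0])  # a new distortion (control)
    print(f"category {c}: prototype dist {d_proto:.2f}, control dist {d_control:.2f}")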
Article
Full-text available
In order to acquire a lexicon, young children must segment speech into words, even though most words are unfamiliar to them. This is a non-trivial task because speech lacks any acoustic analog of the blank spaces between printed words. Two sources of information that might be useful for this task are distributional regularity and phonotactic constraints. Informally, distributional regularity refers to the intuition that sound sequences that occur frequently and in a variety of contexts are better candidates for the lexicon than those that occur rarely or in few contexts. We express that intuition formally by a class of functions called DR functions. We then put forth three hypotheses: First, that children segment using DR functions. Second, that they exploit phonotactic constraints on the possible pronunciations of words in their language. Specifically, they exploit both the requirement that every word must have a vowel and the constraints that languages impose on word-initial and word-final consonant clusters. Third, that children learn which word-boundary clusters are permitted in their language by assuming that all permissible word-boundary clusters will eventually occur at utterance boundaries. Using computational simulation, we investigate the effectiveness of these strategies for segmenting broad phonetic transcripts of child-directed English. The results show that DR functions and phonotactic constraints can be used to significantly improve segmentation. Further, the contributions of DR functions and phonotactic constraints are largely independent, so using both yields better segmentation than using either one alone. Finally, learning the permissible word-boundary clusters from utterance boundaries does not degrade segmentation performance.
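As a rough illustration of the distributional-regularity idea (not the specific DR functions evaluated in the paper), the sketch below scores candidate substrings of unsegmented toy "utterances" by frequency multiplied by the number of distinct following contexts, and applies the simple phonotactic constraint that every candidate word must contain a vowel. The corpus, the scoring function, and the length limits are invented for illustration.

from collections import Counter, defaultdict

# Toy unsegmented "utterances" written as letter strings standing in for phoneme strings.
utterances = ["lookattheball", "theball", "wheresthedog", "thedog", "niceball"]
VOWELS = set("aeiou")

def dr_scores(corpus, max_len=4):
    """A crude distributional-regularity score: frequency times the number of
    distinct right-contexts, restricted to candidates that contain a vowel."""
    freq = Counter()
    contexts = defaultdict(set)
    for utt in corpus:
        for i in range(len(utt)):
            for j in range(i + 2, min(i + max_len, len(utt)) + 1):
                cand = utt[i:j]
                if VOWELS & set(cand):   # phonotactic constraint: a word needs a vowel
                    freq[cand] += 1
                    contexts[cand].add(utt[j:j + 1] or "#")
    return {c: freq[c] * len(contexts[c]) for c in freq}

for cand, score in sorted(dr_scores(utterances).items(), key=lambda kv: -kv[1])[:8]:
    print(cand, score)

Real DR functions also normalize for candidate length and chance co-occurrence; the point of the sketch is only that frequent, context-diverse, phonotactically legal chunks rise to the top as lexical candidates.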
Article
Full-text available
Infants aged 4-6 months discriminate the fine phonetic differences that distinguish syllables in both their native and unfamiliar languages, but by 10-12 months their perceptual sensitivities are reorganized so that they discriminate only the phonetic variations that are used to distinguish meaning in their native language. It would seem, then, that infants apply their well honed phonetic sensitivities as they advance and begin to associate words with objects, but the question of how speech perception sensitivities are used in early word learning has not yet been answered. Here we use a recently developed technique to show that when they are required to pair words with objects, infants of 14 months fail to use the fine phonetic detail they detect in syllable discrimination tasks. In contrast, infants of 8 months--who are not yet readily learning words--successfully discriminate phonetic detail in the same task in which infants aged 14 months fail. Taken together, these results suggest a second reorganization in infants' use of phonetic detail as they move from listening to syllables to learning words.
Article
Full-text available
Dutch-learning and English-learning 9-month-olds were tested, using the Headturn Preference Procedure, for their ability to segment Dutch words with strong/weak stress patterns from fluent Dutch speech. This prosodic pattern is highly typical for words of both languages. The infants were familiarized with pairs of words and then tested on four passages, two that included the familiarized words and two that did not. Both the Dutch- and the English-learning infants gave evidence of segmenting the targets from the passages, to an equivalent degree. Thus, English-learning infants are able to extract words from fluent speech in a language that is phonetically different from English. We discuss the possibility that this cross-language segmentation ability is aided by the similarity of the typical rhythmic structure of Dutch and English words.
Conference Paper
Full-text available
Human-computer interaction based on recognition of speech, gestures, and other natural modalities is on the rise. Recognition technologies are typically developed in a statistical framework and require large amounts of training data. The cost of collecting manually annotated data is usually the bottleneck in developing such systems. We explore the idea of learning from unannotated data by leveraging information across multiple modes of input. A working system inspired by infant language learning, which learns from untranscribed speech and images, is presented.
Conference Paper
Full-text available
This paper describes a probabilistic object recognition technique which does not require correspondence matching of images. This technique is an extension of our earlier work (1996) on object recognition using matching of multi-dimensional receptive field histograms. In the earlier paper we have shown that multi-dimensional receptive field histograms can be matched to provide object recognition which is robust in the face of changes in viewing position and independent of image plane rotation and scale. In this paper we extend this method to compute the probability of the presence of an object in an image. The paper begins with a review of the method and previously presented experimental results. We then extend the method for histogram matching to obtain a genuine probability of the presence of an object. We present experimental results on a database of 100 objects showing that the approach is capable of recognizing all objects correctly by using only a small portion of the image. Our results show that receptive field histograms provide a technique for object recognition which is robust, has low computational cost, and has a computational complexity which is linear in the number of pixels.
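The probabilistic formulation can be sketched as follows: per-object histograms of a local feature are treated as likelihoods, and a handful of local measurements are combined under an independence assumption to yield a posterior over objects. The feature space, the Gaussian simulation of the local measurements, and the object names below are stand-ins invented for this sketch; the original work builds multidimensional receptive field histograms directly from images.

import numpy as np

rng = np.random.default_rng(3)

# For each known object, store a normalised histogram of a 2-d local feature,
# here simulated with a different Gaussian per object.
BINS = (8, 8)
EDGES = [np.linspace(-3, 3, b + 1) for b in BINS]

def histogram(features):
    h, _, _ = np.histogram2d(features[:, 0], features[:, 1], bins=EDGES)
    return (h + 1e-6) / (h.sum() + 1e-6 * h.size)   # normalised, lightly smoothed

def simulate(mean, n):
    return rng.normal(loc=mean, scale=0.7, size=(n, 2))

object_means = {"cup": (-1.0, 0.5), "car": (1.2, -0.8), "dog": (0.0, 1.5)}
histograms = {name: histogram(simulate(m, 5000)) for name, m in object_means.items()}

def posterior(measurements, histograms):
    """Probability of each object given local measurements, assuming the
    measurements are independent and object priors are uniform."""
    logp = {name: 0.0 for name in histograms}
    for x, y in measurements:
        i = np.clip(np.digitize(x, EDGES[0]) - 1, 0, BINS[0] - 1)
        j = np.clip(np.digitize(y, EDGES[1]) - 1, 0, BINS[1] - 1)
        for name, h in histograms.items():
            logp[name] += np.log(h[i, j])
    z = np.logaddexp.reduce(list(logp.values()))
    return {name: float(np.exp(lp - z)) for name, lp in logp.items()}

# A small number of local measurements already yields a confident decision,
# echoing the observation that a small portion of the image suffices.
print(posterior(simulate(object_means["car"], 12), histograms))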
Article
Full-text available
Performance of even the best current stochastic recognizers severely degrades in an unexpected communications environment. In some cases, the environmental effect can be modeled by a set of simple transformations and, in particular, by convolution with an environmental impulse response and the addition of some environmental noise. Often, the temporal properties of these environmental effects are quite different from the temporal properties of speech. We have been experimenting with filtering approaches that attempt to exploit these differences to produce robust representations for speech recognition and enhancement and have called this class of representations relative spectra (RASTA). In this paper, we review the theoretical and experimental foundations of the method, discuss the relationship with human auditory perception, and extend the original method to combinations of additive noise and convolutional noise. We discuss the relationship between RASTA features and the nature of the recognition models that are required and the relationship of these features to delta features and to cepstral mean subtraction. Finally, we show an application of the RASTA technique to speech enhancement
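The idea of filtering the temporal trajectories of spectral parameters can be illustrated with a generic band-pass filter. Note that this is an approximation assumed for the sketch, not the exact RASTA transfer function, and it assumes SciPy is available: a constant channel offset (standing in for a fixed convolutional distortion) is largely removed while a speech-rate modulation passes through.

import numpy as np
from scipy.signal import butter, lfilter

def rasta_like_filter(log_spectrum, frame_rate=100.0, band=(1.0, 12.0)):
    """Band-pass filter the temporal trajectory of each log-spectral channel.

    log_spectrum: (num_frames, num_channels) array of log filterbank energies.
    A Butterworth band-pass is used here as a stand-in; the point is that
    components varying much more slowly or much faster than typical speech
    modulations are suppressed.
    """
    nyq = frame_rate / 2.0
    b, a = butter(2, [band[0] / nyq, band[1] / nyq], btype="band")
    return lfilter(b, a, log_spectrum, axis=0)

# Toy trajectory: a 4 Hz speech-like modulation plus a constant channel offset.
t = np.arange(300) / 100.0
channel = 2.0 + np.sin(2 * np.pi * 4.0 * t)
filtered = rasta_like_filter(channel[:, None])
print("mean before:", round(float(channel.mean()), 2),
      "mean after:", round(float(filtered[50:, 0].mean()), 2))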
Article
Full-text available
This paper presents an application of recurrent networks for phone probability estimation in large vocabulary speech recognition. The need for efficient exploitation of context information is discussed
Article
Full-text available
This paper presents a model-based, unsupervised algorithm for recovering word boundaries in a natural-language text from which they have been deleted. The algorithm is derived from a probability model of the source that generated the text. The fundamental structure of the model is specified abstractly so that the detailed component models of phonology, word-order, and word frequency can be replaced in a modular fashion. The model yields a language-independent, prior probability distribution on all possible sequences of all possible words over a given alphabet, based on the assumption that the input was generated by concatenating words from a fixed but unknown lexicon. The model is unusual in that it treats the generation of a complete corpus, regardless of length, as a single event in the probability space. Accordingly, the algorithm does not estimate a probability distribution on words; instead, it attempts to calculate the prior probabilities of various word sequences that could underlie the observed text. Experiments on phonemic transcripts of spontaneous speech by parents to young children suggest that this algorithm is more effective than other proposed algorithms, at least when utterance boundaries are given and the text includes a substantial number of short utterances. Keywords: Bayesian grammar induction, probability models, minimum description length (MDL), unsupervised learning, cognitive modeling, language acquisition, segmentation
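The search over possible word sequences that could underlie an unsegmented text can be illustrated with a standard dynamic program over a hypothesized lexicon. The sketch below is not the paper's model, which also learns the lexicon and treats the whole corpus as a single event; the lexicon, probabilities, and example string are invented, and a simple unigram word model replaces the paper's fuller source model.

import math

# A hypothesized lexicon with unigram probabilities (illustrative numbers only).
lexicon = {"look": 0.1, "at": 0.1, "the": 0.3, "ball": 0.2, "dog": 0.2, "a": 0.1}

def best_segmentation(text, lexicon, max_word=6):
    """Most probable word sequence underlying an unsegmented string under a
    unigram model over a fixed lexicon (dynamic programming over all splits)."""
    best = [0.0] + [-math.inf] * len(text)
    back = [0] * (len(text) + 1)
    for j in range(1, len(text) + 1):
        for i in range(max(0, j - max_word), j):
            w = text[i:j]
            if w in lexicon:
                score = best[i] + math.log(lexicon[w])
                if score > best[j]:
                    best[j], back[j] = score, i
    if best[-1] == -math.inf:
        return None   # no segmentation into lexicon words exists
    words, j = [], len(text)
    while j > 0:
        words.append(text[back[j]:j])
        j = back[j]
    return words[::-1]

print(best_segmentation("lookattheball", lexicon))   # ['look', 'at', 'the', 'ball']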
Article
Among the earliest and most frequent words that infants hear are their names. Yet little is known about when infants begin to recognize their own names. Using a modified version of the head-turn preference procedure, we tested whether 4.5-month-olds preferred to listen to their own names over foils that were either matched or mismatched for stress pattern. Our findings provide the first evidence that even these young infants recognize the sound patterns of their own names. Infants demonstrated significant preferences for their own names compared with foils that shared the same stress patterns, as well as foils with opposite patterns. The results indicate when infants begin to recognize sound patterns of items frequently uttered in the infants' environments.
Article
The assumption that language acquisition is relatively independent of the amount and kind of language input must be assessed in light of information about the speech actually heard by young children. The speech of middle-class mothers to 2-year-old children was found to be simpler and more redundant than their speech to 10-year-old children. The mothers modified their speech less when talking to children whose responses they could not observe, indicating that the children played some role in eliciting the speech modifications. Task difficulty did not contribute to the mothers' production of simplified, redundant speech. Experienced mothers were only slightly better than nonmothers in predicting the speech-style modifications required by young children. These findings indicate that children who are learning language have available a sample of speech which is simpler, more redundant, and less confusing than normal adult speech.
Article
This paper presents an implemented computational model of word acquisition which learns directly from raw multimodal sensory input. Set in an information theoretic framework, the model acquires a lexicon by finding and statistically modeling consistent cross-modal structure. The model has been implemented in a system using novel speech processing, computer vision, and machine learning algorithms. In evaluations the model successfully performed speech segmentation, word discovery and visual categorization from spontaneous infant-directed speech paired with video images of single objects. These results demonstrate the possibility of using state-of-the-art techniques from sensory pattern recognition and machine learning to implement cognitive models which can process raw sensor data without the need for human transcription or labeling. © 2002 Cognitive Science Society, Inc. All rights reserved.
Article
We ask if certain dimensions of perceptual similarity are weighted more heavily than others in determining word extension. The specific dimensions examined were shape, size, and texture. In four experiments, subjects were asked either to extend a novel count noun to new instances or, in a nonword classification task, to put together objects that go together. The subjects were 2-year-olds, 3-year-olds, and adults. The results of all four experiments indicate that 2- and 3-year-olds and adults all weight shape more heavily than they do size or texture. This observed emphasis on shape, however, depends on the age of the subject and the task. First, there is a developmental trend. The shape bias increases in strength and generality from 2 to 3 years of age and more markedly from early childhood to adulthood. Second, in young children, the shape bias is much stronger in word extension than in nonword classification tasks. These results suggest that the development of the shape bias originates in language learning--it reflects a fact about language--and does not stem from general perceptual processes.
Article
16 mothers and 16 fathers were recorded in dyadic sessions with their children (8 5-years-olds, 8 2-year-olds; half boys, half girls) and with an adult. The noise-free questions and declaratives were analyzed separately by a real-time spectrum analyzer for fundamental frequency (pitch) and frequency range. Analysis revealed that mothers raised their pitch (from adult-addressed levels) equally for both ages of child listeners, but increased their ranges more when speaking to the younger children. Fathers increased their pitch and ranges even more than mothers, when addressing the younger children, but did not differentiate between 5-year-old and adult listeners. In adult-addressed speech, mothers used greater frequency ranges than fathers. These results suggest that both fundamental frequency and frequency range are significantly influenced by sex-role values. Furthermore, the hypothesis that mothers and fathers supply essentially redundant linguistic input to the language learning child was only partially supported.
Article
Four studies using a variant of the conditioned head turning procedure, in which response latencies to extraneous noises occurring at different junctures within synthetic syllable strings served as the dependent variable, investigated 6- and 9-month-old infants’ representations of familiar and novel syllable pairs manifesting diverse rhythmic patterns. Familiar bisyllables tended to be perceived similarly by both age groups: These bisyllables were perceived as being cohesive, without regard to whether they manifested trochaic (longer–shorter) or iambic (shorter–longer) rhythm. Novel bisyllables were perceived differently by the two age groups. Six-month-olds appeared to perceive segmentally novel, but rhythmically familiar, bisyllables as cohesive, whereas they failed to perceive rhythmically novel, but segmentally familiar bisyllables in the same fashion. Similar patterns of results were obtained with trochaic and iambic bisyllables. Nine-month-olds, however, were differentially sensitive to different rhythmic patterns. Older infants appeared to perceive novel bisyllables as cohesive only when they manifested trochaic rhythm, whether the bisyllables were segmentally or rhythmically novel. Nine-month-olds’ behavior is consistent with the possibility that they have adopted a metrical strategy for segmentation, as A. Cutler (1990 and elsewhere) has argued is the case for novice and expert English speakers alike. The plausibility of such a strategy with respect to the nature of child-directed speech is discussed, as are possible consequences for acquisition of such a strategy.
Chapter
There are many speech recognition applications that require only partial information to be extracted from a speech utterance. These applications include human-machine interactions where it may be difficult to constrain users’ utterances to be within the domain of the machine. Other types of applications that are of interest are those where speech utterances arise from human-human interaction, interaction with speech messaging systems, or any other domain that can be characterized as being unconstrained or spontaneous. This chapter is concerned with the problem of spotting keywords in continuous speech utterances. Many important speech input applications involving word spotting will be described. The chapter will also discuss Automatic Speech Recognition (ASR) problems that are particularly important in word spotting applications. These problems include rejection of out-of-vocabulary utterances, derivation of measures of confidence, and the development of efficient and flexible search algorithms.
Article
Phoneme substrings that are recurrent within training data are detected and logged using dynamic programming procedures. The resulting keystrings (cluster centroids) are awarded a usefulness rating based on smoothed occurrence probabilities in wanted and unwanted data. The rankings of the keystrings by usefulness measured on training, development test and final test data for three language-pairs from the OGI multi-language corpus are highly consistent, showing that language-specific features are being found. Statistical measures of local association also suggest that keystring occurrences can be correlated in a manner similar to that of keywords for a particular topic. With improved recognition accuracy it should be possible to exploit this information in order to enhance performance in topic identification.
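The usefulness rating can be illustrated as a smoothed log-odds of a candidate keystring occurring in "wanted" versus "unwanted" data. The exact rating used in the paper is not reproduced here; the scoring function, smoothing, and toy phoneme strings below are assumptions made for the sketch.

import math

def keystring_usefulness(wanted, unwanted, keystrings, alpha=1.0):
    """Rate candidate keystrings by a smoothed log-odds of occurring in 'wanted'
    versus 'unwanted' phoneme sequences (a simple stand-in for the paper's
    usefulness rating)."""
    def occurrence_prob(corpus, s):
        hits = sum(s in seq for seq in corpus)
        return (hits + alpha) / (len(corpus) + 2 * alpha)   # smoothed
    return {
        s: math.log(occurrence_prob(wanted, s) / occurrence_prob(unwanted, s))
        for s in keystrings
    }

# Toy pseudo-phonemic transcriptions for two "languages".
wanted = ["silvuple", "zhemapel", "bonzhur", "silvuple"]
unwanted = ["helothere", "howareyu", "thankyu", "gudmorning"]
candidates = ["zh", "sil", "yu", "the", "el"]
for s, score in sorted(keystring_usefulness(wanted, unwanted, candidates).items(),
                       key=lambda kv: -kv[1]):
    print(s, round(score, 2))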
Article
Three studies investigated the perception of rhythmic units in speech to explore the potential role of rhythm in word-level segmentation. Exp I with 31 undergraduates investigated whether adults expect trochaic (strong-weak) units to cohere. The prediction that poststress pauses that violated the coherence of the trochee would be unexpected and therefore more noticeable was supported. Ss were consistently more likely to identify post- than prestress pauses in trisyllabic speech stimuli. Exp 2 used a variant of a head-turn preference procedure to test a similar question with 64 7- and 9-mo-old infants. Nine mo olds showed a significant preference for stimuli in which the coherence of the trochee was preserved, suggesting that they perceived those stimuli as more natural than stimuli in which trochaic sequences were disrupted by pauses. Among 7-mo olds, no preferences were observed. Exp 3 with 32 9 mo old infants tested the possibility that 9-mo olds use the trochaic stress pattern in word-level segmentation. Ss distinguished a trochaic sequence that had previously been embedded in a 4-syllable string from a novel trochaic sequence, suggesting that they recognized the previously heard trochaic unit as familiar. There was no such discrimination for iambic targets and distracters. (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
For the purposes of the present discussion, the term structure will be used in the following non-rigorous sense: A set of phonemes or a set of data is structured in respect to some feature, to the extent that we can form in terms of that feature some organized system of statements which describes the members of the set and their interrelations (at least up to some limit of complexity). In this sense, language can be structured in respect to various independent features. And whether it is structured (to more than a trivial extent) in respect to, say, regular historical change, social intercourse, meaning, or distribution — or to what extent it is structured in any of these respects — is a matter decidable by investigation. Here we will discuss how each language can be described in terms of a distributional structure, i.e. in terms of the occurrence of parts (ultimately sounds) relative to other parts, and how this description is complete without intrusion of other features such as history or meaning. It goes without saying that other studies of language — historical, psychological, etc.—are also possible, both in relation to distributional structure and independently of it.
Article
Conducted 9 experiments with a total of 663 undergraduates using the technique of priming to study the nature of the cognitive representation generated by superordinate semantic category names. In Exp I, norms for the internal structure of 10 categories were collected. In Exps II, III, and IV, internal structure was found to affect the perceptual encoding of physically identical pairs of stimuli, facilitating responses to physically identical good members and hindering responses to identical poor members of a category. Exps V and VI showed that the category name did not generate a physical code (e.g., lines or angles), but rather affected perception of the stimuli at the level of meaning. Exps VII and VIII showed that while the representation of the category name which affected perception contained a depth meaning common to words and pictures which enabled Ss to prepare for either stimulus form within 700 msec, selective reduction of the interval between prime and stimulus below 700 msec revealed differentiation of the coding of meaning in preparation for actual perception. Exp IX suggested that good examples of semantic categories are not physiologically determined, as the effects of the internal structure of semantic categories on priming (unlike the effects for color categories) could be eliminated by long practice. (57 ref) (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Book
The textbook deals essentially with discrete-time signals for which both time and amplitude are discrete. The discussion covers the representation of linear shift-invariant systems using the convolution sum for the time domain and the Fourier transforms for the frequency domain; the characteristics of z transforms and system function representation of linear shift-invariant systems; the implementation of this class of systems as digital networks composed of adders, delay elements, and coefficient multipliers, leading to a theory of digital networks and digital filter design methods; and the discrete Hilbert transforms which are necessary for homomorphic signal processing. The computation of the discrete Fourier transform with fast algorithms for both one-dimensional and multidimensional sequences is outlined. Basic concepts concerning discrete random signals are introduced, emphasizing the representation of quantization effects in terms of additive noise and analysis of these effects for both digital filtering and fast Fourier transform algorithms. Several applications of homomorphic signal processing are examined along with the estimation of the spectrum of a random signal.
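For readers unfamiliar with the convolution sum mentioned above, a direct evaluation for finite-length sequences looks as follows. This is a deliberately naive implementation for clarity; practical systems use FFT-based or otherwise optimized routines.

def convolve(x, h):
    """Direct evaluation of the convolution sum y[n] = sum_k x[k] * h[n - k]
    for finite-length sequences (output length len(x) + len(h) - 1)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n in range(len(y)):
        for k in range(len(x)):
            if 0 <= n - k < len(h):
                y[n] += x[k] * h[n - k]
    return y

# A three-point moving-average filter applied to a short input sequence.
print(convolve([1, 2, 3, 4], [1/3, 1/3, 1/3]))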
Article
One critical aspect of language acquisition is the development of a lexicon that associates sounds and meanings; but developing a lexicon first requires that the infant segment utterances into individual words. How might the infant begin this process? The present study was designed to examine the potential role that sensitivity to predominant stress patterns of words might play in lexical development. In English, by far the majority of words have stressed (strong) initial syllables. Experiment 1 of our study demonstrated that by 9 months of age American infants listen significantly longer to words with strong/weak stress patterns than to words with weak/strong stress patterns. However, Experiment 2 showed that no significant preferences for the predominant stress pattern appear with 6-month-old infants, which suggests that the preference develops as a result of increasing familiarity with the prosodic features of the native language. In a third experiment, 9-month-olds showed a preference for strong/weak patterns even when the speech input was low-pass filtered, which suggests that their preference is specifically for the prosodic structure of the words. Together the results suggest that attention to predominant stress patterns in the native language may form an important part of the infant's process of developing a lexicon.
Chapter
Information theory answers two fundamental questions in communication theory: what is the ultimate data compression (answer: the entropy H), and what is the ultimate transmission rate of communication (answer: the channel capacity C). For this reason some consider information theory to be a subset of communication theory. We will argue that it is much more. Indeed, it has fundamental contributions to make in statistical physics (thermodynamics), computer science (Kolmogorov complexity or algorithmic complexity), statistical inference (Occam's Razor: “The simplest explanation is best”) and to probability and statistics (error rates for optimal hypothesis testing and estimation). The relationship of information theory to other fields is discussed. Information theory intersects physics (statistical mechanics), mathematics (probability theory), electrical engineering (communication theory) and computer science (algorithmic complexity). We describe these areas of intersection in detail.
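The two quantities named above are easy to compute for simple cases; the snippet below evaluates the entropy of a small distribution and the capacity of a binary symmetric channel, C = 1 - H(p), as a worked example (the example numbers are arbitrary).

import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def bsc_capacity(flip_prob):
    """Capacity in bits per use of a binary symmetric channel: C = 1 - H(p)."""
    return 1.0 - entropy([flip_prob, 1.0 - flip_prob])

print(entropy([0.5, 0.25, 0.25]))   # 1.5 bits
print(bsc_capacity(0.11))           # about 0.5 bits per channel use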
Article
The number of digits it takes to write down an observed sequence x1, …, xN of a time series depends on the model with its parameters that one assumes to have generated the observed data. Accordingly, by finding the model which minimizes the description length one obtains estimates of both the integer-valued structure parameters and the real-valued system parameters.
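The description-length idea can be illustrated with a crude two-part code for polynomial trend models. The criterion below (a parameter cost of (k/2) log2 N bits plus a Gaussian residual cost) is a common textbook approximation rather than the paper's exact formulation, and the synthetic data and candidate orders are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic time series: a quadratic trend plus noise.
N = 200
t = np.linspace(0, 1, N)
x = 1.0 + 2.0 * t - 3.0 * t**2 + rng.normal(scale=0.1, size=N)

def description_length(order):
    """Two-part code length (bits, up to constants): cost of encoding the
    k = order + 1 coefficients plus cost of the residuals under a Gaussian
    noise model."""
    coeffs = np.polyfit(t, x, order)
    rss = np.sum((x - np.polyval(coeffs, t)) ** 2)
    k = order + 1
    return 0.5 * k * np.log2(N) + 0.5 * N * np.log2(rss / N)

dl = {k: description_length(k) for k in range(1, 8)}
print({k: round(v, 1) for k, v in dl.items()})
print("order minimizing description length:", min(dl, key=dl.get))  # typically 2 here
```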
Article
This paper explores the number of word boundaries which can be detected from sequences of phonemes and broad classes in continuous speech transcriptions. In the first part of the paper, word boundaries are detected from sequences of three phonemes which occur across word boundaries but which are excluded word internally. When such sequences are matched against phonemic transcriptions of 145 utterances, it is shown that around 37% of all word boundaries can be correctly identified. When the same transcriptions are represented by broad classes rather than phonemes, a knowledge of sequences which span word boundaries but which do not occur word internally is almost completely ineffective for the purpose of word boundary detection. Instead, it is shown that a version of the model discussed in Cutler & Norris 1988 based on the distinction between “strong” and “weak” vowels enables over 40% of word boundaries to be correctly located at the broad class level, although many word boundaries are also incorrectly inserted.
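A toy sketch of the first method in this abstract: build the set of phoneme trigrams that can span a word boundary but never occur word-internally, then scan an unsegmented phoneme string for them. The miniature lexicon and phoneme symbols below are hypothetical and serve only to illustrate the bookkeeping.

```python
from itertools import product

# Hypothetical miniature phonemic lexicon (phonemes as single symbols).
lexicon = {
    "the":  ["dh", "ax"],
    "dog":  ["d", "ao", "g"],
    "sees": ["s", "iy", "z"],
    "a":    ["ax"],
    "ball": ["b", "ao", "l"],
}

# 1. Collect all phoneme trigrams that occur word-internally.
internal = set()
for phones in lexicon.values():
    for i in range(len(phones) - 2):
        internal.add(tuple(phones[i:i + 3]))

# 2. Collect trigrams that can straddle the junction of two words but
#    never occur word-internally.
spanning = set()
for w1, w2 in product(lexicon.values(), repeat=2):
    joined = w1 + w2
    for i in range(max(0, len(w1) - 2), len(w1)):
        tri = tuple(joined[i:i + 3])
        if len(tri) == 3:
            spanning.add(tri)
boundary_cues = spanning - internal

# 3. Scan an unsegmented phoneme string and posit a boundary wherever a
#    boundary-only trigram is found.
utterance = lexicon["the"] + lexicon["dog"] + lexicon["sees"] + lexicon["a"] + lexicon["ball"]
for i in range(len(utterance) - 2):
    if tuple(utterance[i:i + 3]) in boundary_cues:
        print("boundary cue at phoneme index", i, tuple(utterance[i:i + 3]))
```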
Article
This thesis proposes a computational model of how children may come to learn the meanings of words in their native language. The proposed model is divided into two separate components. One component produces semantic descriptions of visually observed events while the other correlates those descriptions with co-occurring descriptions of those events in natural language. The first part of this thesis describes three implementations of the correlation process whereby representations of the meanings of whole utterances can be decomposed into fragments assigned as representations of the meanings of individual words. The second part of this thesis describes an implemented computer program that recognizes the occurrence of simple spatial motion events in simulated video input.
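The correlation component can be caricatured as cross-situational narrowing of word-meaning hypotheses. The sketch below is not the thesis's algorithm, merely an illustration in which each word's candidate meanings are intersected across the situations in which it occurs; the utterances and semantic symbols are invented.

```python
# Hypothetical paired input: each utterance (as words) with the set of
# semantic symbols describing the co-occurring visually observed event.
observations = [
    (["the", "ball", "rolls"],   {"BALL", "ROLL"}),
    (["the", "ball", "bounces"], {"BALL", "BOUNCE"}),
    (["the", "cup", "falls"],    {"CUP", "FALL"}),
    (["a", "cup", "rolls"],      {"CUP", "ROLL"}),
]

# Cross-situational tally: for each word, intersect the meaning sets of all
# situations in which it occurs (a crude necessity-based filter).
possible = {}
for words, meanings in observations:
    for w in set(words):
        possible[w] = possible.get(w, set(meanings)) & meanings

for w, m in sorted(possible.items()):
    print(w, "->", m)   # e.g. "ball" -> {"BALL"}, "rolls" -> {"ROLL"}
```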
Article
When presented with two-element geometric stimuli, with one element enclosed within the other, infants under 2 months of age do not appear to detect the internal element and fail to respond to a change in that element. Infants over 2 months of age experience no difficulty in this respect. However, it is established that this “externality effect” does break down under certain circumstances, for example, where there is independent modulation of the internal element. Evidence is presented which appears to run counter to explanations of this effect in terms of information processing limitations, capture of attention by dominant salience, or level of arousal. An alternative explanation based on figure-ground organization is given partial support.
Article
Human infants' discrimination of changes in internal and external elements of compound visual patterns was investigated in four experiments employing a familiarization-novelty paradigm in which visual reinforcing patterns were presented contingent upon rate of high-amplitude nonnutritive sucking. In Experiment 1, 4-month infants discriminated changes in the shape of internal, external and both internal and external figures. One-month infants discriminated external changes in both internal and external figures, but failed to show reliable response recovery when only internal figures were changed. Experiments 2 and 3 failed to explain the 1-month results on the basis of poor resolution of internal figures by showing comparable discrimination of small and large singly-presented figures and by failing to find improved internal discrimination with large separation between internal and external figures. In Experiment 4, 1-month infants showed response recovery to figure additions made adjacent to the initial figure, but not to internal additions. The results are interpreted in terms of attentiveness by young infants to external pattern elements and may indicate early processing of figure-ground information. The developmental differences observed suggest an increased breadth of attention to pattern elements.
Article
Infants' sensitivity to correlations or co-occurrences among attributes may play a role in abilities ranging from pattern or object recognition to category formation. The present set of experiments investigated 4-, 7-, and 10-month-old infants' ability to perceive and base novelty responses on correlations among perceptual attributes in a category-like context (i.e., with the correlation embedded in a set of discriminable stimuli). In a habituation-dishabituation paradigm, 10-month-old infants clearly responded on the basis of the correlation among attributes. In contrast, 4- and 7-month-old infants responded primarily on the basis of specific featural information, but did not respond reliably to the correlation. It is suggested that the sensitivity to correlated attributes demonstrated by 10-month-old infants may have implications for the processes underlying the infants' categorization abilities.
Article
Learners rely on a combination of experience-independent and experience-dependent mechanisms to extract information from the environment. Language acquisition involves both types of mechanisms, but most theorists emphasize the relative importance of experience-independent mechanisms. The present study shows that a fundamental task of language acquisition, segmentation of words from fluent speech, can be accomplished by 8-month-old infants based solely on the statistical relationships between neighboring speech sounds. Moreover, this word segmentation was based on statistical learning from only 2 minutes of exposure, suggesting that infants have access to a powerful mechanism for the computation of statistical properties of the language input.
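The statistic implicated by this result is the forward transitional probability between neighboring syllables, which dips at the word boundaries of a concatenated artificial language. The sketch below generates such a stream and computes those probabilities; the syllable inventory and stream length are arbitrary, and this is not the authors' experimental procedure.

```python
import random
from collections import Counter

random.seed(0)

# Artificial language in the spirit of the study: three trisyllabic "words"
# concatenated in random order with no pauses between them.
words = [("pa", "bi", "ku"), ("ti", "bu", "do"), ("go", "la", "tu")]
stream = [s for _ in range(300) for s in random.choice(words)]

# Forward transitional probability TP(y | x) = count(x, y) / count(x).
pair_counts = Counter(zip(stream, stream[1:]))
syll_counts = Counter(stream[:-1])
tp = {(x, y): c / syll_counts[x] for (x, y), c in pair_counts.items()}

# Within-word transitions are deterministic (TP = 1.0); transitions across a
# word boundary are split among the possible following words (TP ~ 1/3).
print("within-word:", tp[("pa", "bi")], tp[("ti", "bu")])
print("after 'ku':", {k: round(v, 2) for k, v in tp.items() if k[0] == "ku"})
```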
Conference Paper
We are developing a system which learns words from co-occurring spoken and visual input. The goal is to automatically segment continuous speech at word boundaries without a lexicon, and to form visual categories which correspond to spoken words. Mutual information is used to integrate acoustic and visual distance metrics in order to extract an audio-visual lexicon from raw input. We report results of experiments with a corpus of infant-directed speech and images. We are developing systems which learn words from co-occurring audio and visual input [5, 4]. Input consists of naturally spoken multiword utterances paired with visual representations of object shapes (Figure 1). Output of the system is an audio-visual lexicon of sound-shape associations which encode acoustic forms of words (or phrases) and their visually grounded referents. We assume that, in general, the audio and visual signals are uncorrelated in time. However, when a word is spoken, its visual representation...
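One way to read "mutual information is used to integrate acoustic and visual distance metrics" is as scoring a candidate sound-shape pairing by the mutual information between binary indicators of acoustic match and visual match across co-occurrence events. The sketch below computes that score for hypothetical indicator data; the events, thresholds, and framing are assumptions for illustration, not the paper's exact procedure.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Mutual information (bits) between two binary indicator sequences,
    given as a list of (a, v) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    pa = Counter(a for a, _ in pairs)
    pv = Counter(v for _, v in pairs)
    mi = 0.0
    for (a, v), c in joint.items():
        p_av = c / n
        mi += p_av * math.log2(p_av / ((pa[a] / n) * (pv[v] / n)))
    return mi

# Hypothetical co-occurrence events: each records whether the speech segment
# fell within an acoustic-distance threshold of a candidate prototype (a) and
# whether the image fell within a visual-distance threshold (v); 1 = match.
events = [(1, 1), (1, 1), (0, 0), (1, 0), (0, 0), (1, 1), (0, 0), (0, 1)]
print(round(mutual_information(events), 3))  # higher scores suggest a genuine pairing
```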
Article
This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described
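The first of the three fundamental HMM problems, evaluating the likelihood of an observation sequence, is solved by the forward algorithm. The sketch below implements it for a toy two-state, two-symbol model loosely in the spirit of the coin-tossing example; the parameter values are arbitrary.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward algorithm: probability of an observation sequence under an HMM
    with initial distribution pi, transition matrix A, and emission matrix B
    (rows index states, columns index observation symbols)."""
    alpha = pi * B[:, obs[0]]          # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # induction step
    return alpha.sum()                 # termination

# Toy two-state, two-symbol model (e.g., two biased coins).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],    # state 0 mostly emits symbol 0 ("heads")
              [0.2, 0.8]])   # state 1 mostly emits symbol 1 ("tails")
obs = [0, 0, 1, 0, 1, 1]
print(forward(pi, A, B, obs))   # likelihood of the observation sequence
```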