Harald Baayen
University of Tübingen | EKU Tübingen
Doctor of Philosophy
About
416
Publications
180,833
Reads
38,787
Citations
Introduction
I am interested in words: their internal structure, their meaning, their distributional properties, how they are used in different speech communities and registers, and how they are processed in language comprehension and speech production.
Skills and Expertise
Additional affiliations
July 2007 - August 2011
July 1980 - September 1980
Summer Institute of Linguistics
Position
- Teaching Assistant
Description
- Fulltime, Non Tenure Track
September 1985 - December 1988
Education
February 1985 - May 1989
July 1983 - January 1985
Publications (416)
Recently, deep learning models have increasingly been used in cognitive modelling of language. This study asks whether deep learning can help us to better understand the learning problem that needs to be solved by speakers, above and beyond linear methods. We utilise the Discriminative Lexicon Model (DLM, Baayen et al., 2019), which models comprehe...
In Standard Chinese, Tone 3 (the dipping tone) becomes Tone 2 (rising tone) when followed by another Tone 3. Previous studies have noted that this sandhi process may be incomplete, in the sense that the assimilated Tone 3 is still distinct from a true Tone 2. While Mandarin Tone 3 sandhi is widely studied using carefully controlled laboratory speec...
Using distributional semantics, we show that English nominal pluralization exhibits semantic clusters. For instance, the change in semantic space from singulars to plurals differs depending on whether a word denotes, e.g., a fruit, or an animal. Languages with extensive noun classes such as Swahili and Kiowa distinguish between these kinds of words...
Günther et al. (2022) investigated the relationship between words and images and concluded that a direct link between words and embodied experience is possible. In their study, participants were presented with a target noun and a pair of images, one chosen by their model and another chosen randomly. Participants were asked to select the...
Word frequency is a strong predictor in most lexical processing tasks. Thus, any model of word recognition needs to account for how word frequency effects arise. The Discriminative Lexicon Model (DLM) models lexical processing with mappings between words' forms and their meanings. Comprehension and production are modeled via linear mappings between...
Grounding language in vision is an active field of research seeking to construct cognitively plausible word and sentence representations by incorporating perceptual knowledge from vision into text-based representations. Despite many attempts at language grounding, achieving an optimal equilibrium between textual representations of the language and...
Trial-to-trial effects have been found in a number of studies, indicating that processing a stimulus influences responses in subsequent trials. A special case is that of priming effects, which have been modelled successfully with error-driven learning (Marsolek, 2008), implying that participants are continuously learning during experiments. This study inve...
Distributional semantics offers new ways to study the semantics of morphology. This study focuses on the semantics of noun singulars and their plural inflectional variants in English. Our goal is to compare two models for the conceptualization of plurality. One model (FRACSS) proposes that all singular-plural pairs should be taken into account when...
Word frequency is a strong predictor in most lexical processing tasks. Thus, any model of word recognition needs to account for how word frequency effects arise. The Discriminative Lexicon Model (DLM; Baayen et al., 2018a, 2019) models lexical processing with linear mappings between words' forms and their meanings. So far, the mappings can either b...
This study investigates the phenomenon of defectiveness in Russian case and number noun paradigms from the perspective of distributional semantics. We made use of word embeddings, high-dimensional vectors trained from large text corpora, and compared the observed paradigms of nouns that are defective in the genitive plural, as suggested by Zaliznja...
Finnish nouns are characterized by rich inflectional variation, with obligatory marking of case and number, with optional possessive suffixes and with the possibility of further cliticization. We present a model for the conceptualization of Finnish inflected nouns, using pre-compiled fasttext embeddings (300-dimensional semantic vectors that approx...
In this study, we analyzed ultrasound images at the middle of the stem vowel [a:] of German inflected verbs, using GAMMs. Rather than focusing on tracked tongue contour shapes, we modeled the brightness of each pixel as a function of x- and y-coordinates. The results showed higher tongue body and lower tongue tip positions for high frequency words compa...
Indonesian has two noun-forming prefixes, PE- and PEN-, that often stand in a paradigmatic relation to verbal base words with the prefixes BER- and MEN-. The central question addressed in the present study is whether the form similarities between PEN- and MEN- make PEN- easier to learn compared to PE-. To address this question, we made use of a com...
In this study, we found that paradigmatic probability of and semantic support for interfixes of German compounds were correlated quantitatively and qualitatively to some extent. Paradigmatic probability of an interfix of a compound is based on the concepts of morphemes and morphological paradigms and represents how likely a particular interfix of a...
In this poster, we showed that greater semantic support to an interfix of a German compound word predicts phonetically enhanced acoustic and articulatory realisations (i.e., longer duration and clearer articulation). The measure of semantic support was derived from the Discriminative Lexicon Model (DLM; Baayen et al., 2019) without any information...
Current computational models capturing words' meaning mostly rely on textual corpora. While these approaches have been successful over the last decades, their lack of grounding in the real world is still an ongoing problem. In this paper, we focus on visual grounding of word embeddings and target two important questions. First, how can language ben...
Language grounding to vision is an active field of research aiming to enrich text-based representations of word meanings by leveraging perceptual knowledge from vision. Despite many attempts at language grounding, it is still unclear how to effectively inject visual knowledge into the word embeddings of a language in such a way that a proper balanc...
Thul et al. (2020) called attention to problems that arise when chronometric experiments implementing specific factorial designs are analysed with the generalized additive mixed model (GAMM), using factor smooths to capture trial-to-trial dependencies. From a series of simulations incorporating such dependencies, they conclude that GAMMs are inappr...
Semantic differentiation of nominal pluralization is grammaticalized in many languages. For example, plural markers may only be relevant for human nouns. English does not appear to make such distinctions. Using distributional semantics, we show that English nominal pluralization exhibits semantic clusters. For instance, pluralization of fruit words...
In this study, we showed that inflected words tend to have greater semantic support for their word-final triphones and that the amount of semantic support for word-final triphones successfully predicts the hyper-articulation effect of a morpheme boundary. Semantic support from words' meanings to words' constituent phones was modeled and calculated with Lin...
In this study, we showed, based on the tongue movement data from a spontaneous speech corpus of German (Arnold et al., 2016), that articulations of [a(:)] are articulatorily more reduced (centralized) for the vowel without a following morpheme boundary and more enhanced (hyper-articulated) for the same vowel with a following morpheme boundary, as f...
Naive discriminative learning (NDL) and linear discriminative learning (LDL) are simple computational algorithms for lexical learning and lexical processing. Both NDL and LDL assume that learning is discriminative, driven by prediction error, and that it is this error that calibrates the association strength between input and output representations...
This study addresses a series of methodological questions that arise when modeling inflectional morphology with Linear Discriminative Learning. Taking the semi-productive German noun system as an example, we illustrate how decisions made about the representation of form and meaning influence model performance. We clarify that for modeling frequency ef...
N Prep N constructions such as Sp. bicicleta de montaña ‘mountain bike’ are very productive and frequent in Romance languages. They commonly have been classified as syntagmatic compounds that show no orthographic union and exhibit an internal structure that resembles free syntactic structures, such as Sp. libro para niños ‘book for children’. There...
This paper presents three case studies of modeling aspects of lexical processing with Linear Discriminative Learning (LDL), the computational engine of the Discriminative Lexicon model (Baayen et al., 2019). With numeric representations of word forms and meanings, LDL learns to map one vector space onto the other, without being informed about any m...
Does morphological structure affect articulation when segmental similarity is strictly controlled? To address this question, we used electromagnetic articulography to study the articulatory trajectories of tongue tip and tongue body during the articulation of German words containing [a(:)] as stem vowels followed by [t] that in roughly half of the...
Anticipatory coarticulation has been reported to be affected by word form frequency. However, it remains unclear whether the frequency effect also modulates carry-over (perseverative) coarticulation. To investigate the interaction between the word form frequency effect and carry-over/anticipatory coarticulation, ultrasound imaging was performed on the articula...
We describe an inference principle for speech resynthesis using the vocal tract simulator VocalTractLab (VTL). Our method generates smooth and plausible motor trajectories controlling the vocal tract simulator. The method utilizes a differentiable forward model approximation of the VTL, namely, an LSTM that learned the involved temporal motor-acous...
Thul et al. (2020) called attention to problems that arise when chronometric experiments implementing specific factorial designs are analysed with the generalized additive mixed model (henceforth GAMM), using factor smooths to capture trial-to-trial dependencies. From a series of simulations using sine waves representing such dependencies, Thul et...
A computational model for the comprehension of single spoken words is presented that builds on an earlier model using discriminative learning. Real-valued features are extracted from the speech signal instead of discrete features. Vectors representing word meanings using one-hot encoding are replaced by real-valued semantic vectors. Instead of incr...
Indonesian has two prefixes, PE- and PEN-, that are similar in form and meaning, but are probably not allomorphs. In this study, we applied a distributional vector space model to clarify whether these prefixes have discriminable semantics. Comparisons of pairs of words within and across morphologically defined sets of words revealed that cosine si...
This paper introduces the generalized additive mixed model (GAMM) and the quantile generalized additive mixed model (QGAMM) through reanalyses of bilinguals’ lexical decision data from Dijkstra et al. (2010) and Miwa et al. (2014). We illustrate how regression splines can be used to test for nonlinear effects of cross-language similarity in form as...
(*This paper will appear in the Proceedings of the 12th International Seminar on Speech Production (ISSP). Its details (e.g. pages) will be added here then.)
Anticipatory coarticulation has been reported to be affected by word form frequency. However, it remains unclear whether the frequency effect also modulates carry-over (perseverative) coarticulation....
In structuralist linguistics, compounds are argued not to constitute morphological categories, due to the absence of systematic form-meaning correspondences. This study investigates subsets of compounds for which systematic form-meaning correspondences are present: adjective–noun compounds in Mandarin. We show that there are substantial differences...
Frequency has been found to be predictive for articulatory realization. Yet it remains unclear what frequency reflects in the speech production process. The present study derived and applied semantic measures from a computational model based on a discriminative learning mechanism, namely a Linear Discriminative Learning model, in order to pr...
The dialectical changes seen across the course of individual lives are typically thought to reflect the attritional influence of standard languages on native dialects. However, the distributional properties of natural languages, which guarantee that lexical knowledge continuously increases across the lifespan, suggest these changes might simply ref...
Many theories of word structure in linguistics and morphological processing in cognitive psychology are grounded in a compositional perspective on the (mental) lexicon in which complex words are built up during speech production from sublexical elements such as morphemes, stems, and exponents. When combined with the hypothesis that storage in the l...
We replicated the word form frequency effect on the articulation of the German vowel(s) [a(:)], using ultrasound imaging. The effect was found to be modulated by carry-over as well as anticipatory coarticulation. In addition, this study introduced a new analysis method for ultrasound images: the whole-image analysis with GAMs.
This study shows that the relative semantic importance of stem triphones and suffix (transitional) triphones (Relative Functional Load) is highly predictive of tongue (tip/body) movements. This semantic measure is derived from a Linear Discriminative Learning (LDL) model, which enables a direct mapping between forms and semantics without any intermediate con...
This study addresses whether there is anything special about learning a third language, as compared to learning a second language, that results solely from the order of acquisition. We use a computational model based on the mathematical framework of Linear Discriminative Learning to explore this question for the acquisition of a small trilingual vo...
This work is a corpus study on articulations of word-final syllables with the structures VC or VCC in German, e.g. "sagt". Tongue movement data were collected from the KEC corpus, where tongue movements are recorded by Electromagnetic Articulography (EMA).
We found optimized tongue movements (co-articulation patterns) when the informativities of t...
Hyphenated compounds have largely been neglected in studies of compounding, which have seldom analysed compounds in context. In this study, we argue that hyphen use in compounds is strongly motivated. Hyphenation is used when words form a unit, which reduces the possibility of parsing them into separate units or other forms. The current stu...
Pseudowords have long served as key tools in psycholinguistic investigations of the lexicon. A common assumption underlying the use of pseudowords is that they are devoid of meaning: Comparing words and pseudowords may then shed light on how meaningful linguistic elements are processed differently from meaningless sound strings. However, pseudoword...
Neural computation relies on the integration of synaptic inputs across a neuron’s dendritic arbour. However, it is far from understood how different cell types tune this process to establish cell-type specific computations. Here, using two-photon imaging of dendritic Ca2+ signals, electrical recordings of somatic voltage and biophysical modelling,...
Both localist and connectionist models, based on experimental results obtained for English and French, assume that the degree of semantic compositionality of a morphologically complex word is reflected in how it is processed. Since priming experiments using English and French morphologically related prime-target pairs reveal stronger priming when c...
Over the last decades, a growing body of evidence on the mechanisms governing lexical storage, access, acquisition and processing has questioned traditional models of language architecture and word usage based on the hypothesis of a direct correspondence between modular components of grammar competence (lexicon vs. rules), processing correlates (me...
Coarticulation of the vowels in the stems of inflected German verbs according to the preceding pronouns and the upcoming suffixes, as possible evidence that morphological information is already visible in the stems of verbs.
Neural computation relies on the integration of synaptic inputs across a neuron's dendritic arbour. However, the fundamental rules that govern dendritic integration are far from understood. In particular, it is still unclear how cell type-specific differences in dendritic integration arise from general features of neural morphology and membrane pro...
This study examines two nominalizing prefixes in Indonesian: PE- and PEN-, which derive nouns from verbs with a range of meanings similar to that of the -er suffix in English. The prefix PE- is form-invariant, whereas PEN- has several nasal allomorphs. Given their similarity in form and function, the question arises of whether PE- and PEN- are al...
Nonwords are often used to clarify how lexical processing takes place in the absence of semantics. This study shows that nonwords are not semantically vacuous. We used Linear Discriminative Learning [2] to estimate the meanings of nonwords in the MALD database [14] from the speech signal. We show that measures gauging nonword semantics significant...
This study investigates the geographical distribution of pronunciation variation of voiceless dental and retroflex sibilants in Taiwan Mandarin. Previous studies indicated that the merging of the two sibilants is geographically dependent [17, 6]. However, the geographical effects in these studies are not easy to interpret due to the limited numbe...
We present the Naive Discriminative Reading Aloud (ndra) model. The ndra differs from existing models of response times in the reading aloud task in two ways. First, a single lexical architecture is responsible for both word and non-word naming. As such, the model differs from dual-route models, which consist of both a lexical route and a sub-lexic...
Using computational simulations, this work demonstrates that it is possible to learn a systematic relation between words' sound and their meanings. The sound-meaning relation was learned from a corpus of phonologically transcribed child-directed speech by using the linear discriminative learning (LDL) framework (Baayen, Chuang, Shafaei-Bajestan, &...
Linear Discriminative Learning (LDL) is a computational theory of how speakers produce and listeners understand words. LDL is developed with the aim of providing a functional characterisation of the cognitive skills that allow speakers to express their thoughts in words, and that allow listeners to decode the intended message from these words. In p...
Recent research on the acoustic realization of affixes has revealed differences between phonologically homophonous affixes, e.g. the different kinds of final [s] and [z] in English (Plag, Homann & Kunter 2017, Zimmermann 2016a). Such results are unexpected and unaccounted for in widely accepted post-Bloomfieldian item-and-arrangement models (Hocket...
This article provides a tutorial for analyzing pupillometric data. Pupil dilation has become increasingly popular in psychological and psycholinguistic research as a measure to trace language processing. However, there is no general consensus about procedures to analyze the data, with most studies analyzing extracted features from the pupil dilatio...
The field of cognitive aging has seen considerable advances in describing the linguistic and semantic changes that happen during the adult life span to uncover the structure of the mental lexicon (i.e., the mental repository of lexical and conceptual representations). Nevertheless, there is still debate concerning the sources of these changes, incl...
According to Word and Paradigm Morphology (Matthews, 1974; Blevins, 2016), the word is the basic cognitive unit over which paradigmatic analogy operates to predict form and meaning of novel forms. Baayen et al. (2019b, 2018) introduced a computational formalization of word and paradigm morphology which makes it possible to model the production and...
Recent research on the acoustic realization of affixes has revealed differences between phonologically homophonous affixes, for example the different kinds of final [s] and [z] in English (Plag et al. 2017, Zimmermann 2016). Such results are unexpected and unaccounted for in widely-accepted post-Bloomfieldian item-and-arrangement models (Hockett, 1954...
Indonesian has two prefixes which express a range of semantic functions (e.g. agent, instrument, patient). One prefix, PEN-, has six allomorphs (peng-, peny-, pe-, pen-, pem-, penge-). A second prefix, PE-, is described as having similar form and meaning as pe-. In this study, we used computational models of distributional semantics to clarify whet...
Using computational simulations, this work demonstrates that it is possible to learn a systematic relation between words’ sound and their meanings. The sound-meaning relation was learned from a corpus of phonologically transcribed child-directed speech by using the Linear Discriminative Learning (LDL) framework (Baayen, Chuang, Shafaei-Bajestan, &...
Corpus-based research has formed the backbone of linguistic research in recent decades. Large text corpora are used for solving various kinds of linguistic problems, including those of quantitative linguistics, cognitive linguistics, and psycholinguistics. This paper reports the creation of two corpora of contemporary Vietnamese. It also describes...