Harald Baayen

University of Tuebingen | EKU Tübingen

Doctor of Philosophy

About

396
Publications
136,240
Reads
31,600
Citations
Introduction
I am interested in words: their internal structure, their meaning, their distributional properties, how they are used in different speech communities and registers, and how they are processed in language comprehension and speech production.
Additional affiliations
July 2007 - August 2011
University of Alberta
Position
  • Professor
Description
  • Full-time, tenured Professor
January 2006 - June 2007
Radboud University Nijmegen
Position
  • Professor
Description
  • Tenure Track
October 1998 - December 2005
Radboud University
Position
  • Postdoc
Description
  • Non Tenure Track
Education
February 1985 - May 1989
Vrije Universiteit Amsterdam
Field of study
  • Linguistics
July 1983 - January 1985
Vrije Universiteit Amsterdam
Field of study
  • Linguistics


Publications (396)
Preprint
Distributional semantics offers new ways to study the semantics of morphology. This study focuses on the semantics of noun singulars and their plural inflectional variants in English. Our goal is to compare two models for the conceptualization of plurality. One model (FRACSS) proposes that all singular-plural pairs should be taken into account when...
Preprint
This study investigates the phenomenon of defectiveness in Russian case and number noun paradigms from the perspective of distributional semantics. We made use of word embeddings, high-dimensional vectors trained from large text corpora, and compared the observed paradigms of nouns that are defective in the genitive plural, as suggested by Zaliznja...
Preprint
Current computational models capturing words' meaning mostly rely on textual corpora. While these approaches have been successful over the last decades, their lack of grounding in the real world is still an ongoing problem. In this paper, we focus on visual grounding of word embeddings and target two important questions. First, how can language ben...
Preprint
Language grounding to vision is an active field of research aiming to enrich text-based representations of word meanings by leveraging perceptual knowledge from vision. Despite many attempts at language grounding, it is still unclear how to effectively inject visual knowledge into the word embeddings of a language in such a way that a proper balanc...
Article
Thul et al. (2020) called attention to problems that arise when chronometric experiments implementing specific factorial designs are analysed with the generalized additive mixed model (GAMM), using factor smooths to capture trial-to-trial dependencies. From a series of simulations incorporating such dependencies, they conclude that GAMMs are inappr...
Preprint
Full-text available
Semantic differentiation of nominal pluralization is grammaticalized in many languages. For example, plural markers may only be relevant for human nouns. English does not appear to make such distinctions. Using distributional semantics, we show that English nominal pluralization exhibits semantic clusters. For instance, pluralization of fruit words...
Presentation
Full-text available
In this study, we showed that inflected words tend to have greater semantic support for their word-final triphones and that the amount of semantic support for word-final triphones successfully predicts the hyper-articulation effect of a morpheme boundary. Semantic support from words' meanings to words' constituent phones was modeled and calculated with Lin...
Presentation
Full-text available
In this study, we showed, based on the tongue movement data from a spontaneous speech corpus of German (Arnold et al., 2016), that articulations of [a(:)] are articulatorily more reduced (centralized) for the vowel without a following morpheme boundary and more enhanced (hyper-articulated) for the same vowel with a following morpheme boundary, as f...
Chapter
Full-text available
Naive discriminative learning (NDL) and linear discriminative learning (LDL) are simple computational algorithms for lexical learning and lexical processing. Both NDL and LDL assume that learning is discriminative, driven by prediction error, and that it is this error that calibrates the association strength between input and output representations...
Article
Full-text available
This study addresses a series of methodological questions that arise when modeling inflectional morphology with Linear Discriminative Learning. Taking the semi-productive German noun system as an example, we illustrate how decisions made about the representation of form and meaning influence model performance. We clarify that for modeling frequency ef...
Article
N Prep N constructions such as Sp. bicicleta de montaña ‘mountain bike’ are very productive and frequent in Romance languages. They commonly have been classified as syntagmatic compounds that show no orthographic union and exhibit an internal structure that resembles free syntactic structures, such as Sp. libro para niños ‘book for children’. There...
Preprint
Full-text available
This paper presents three case studies of modeling aspects of lexical processing with Linear Discriminative Learning (LDL), the computational engine of the Discriminative Lexicon model (Baayen et al., 2019). With numeric representations of word forms and meanings, LDL learns to map one vector space onto the other, without being informed about any m...
Conference Paper
Full-text available
Does morphological structure affect articulation when segmental similarity is strictly controlled? To address this question, we used electromagnetic articulography to study the articulatory trajectories of tongue tip and tongue body during the articulation of German words containing [a(:)] as stem vowels followed by [t] that in roughly half of the...
Conference Paper
Full-text available
Anticipatory coarticulation has been reported to be affected by word form frequency. However, it remains unclear whether the frequency effect also modulates carry-over (perseverative) coarticulation. To investigate the interaction of the word form frequency effect with carry-over and anticipatory coarticulation, ultrasound imaging was performed on the articula...
Conference Paper
Full-text available
We describe an inference principle for speech resynthesis using the vocal tract simulator VocalTractLab (VTL). Our method generates smooth and plausible motor trajectories controlling the vocal tract simulator. The method utilizes a differentiable forward model approximation of the VTL, namely, an LSTM that learned the involved temporal motor-acous...
Preprint
Full-text available
This study addresses a series of methodological questions that arise when modeling inflectional morphology with Linear Discriminative Learning. Taking the semi-productive German noun system as an example, we illustrate how decisions made about the representation of form and meaning influence model performance. We clarify that for modeling frequency ef...
Preprint
Full-text available
Thul et al. (2020) called attention to problems that arise when chronometric experiments implementing specific factorial designs are analysed with the generalized additive mixed model (henceforth GAMM), using factor smooths to capture trial-to-trial dependencies. From a series of simulations using sine waves representing such dependencies, Thul et...
Article
Full-text available
A computational model for the comprehension of single spoken words is presented that builds on an earlier model using discriminative learning. Real-valued features are extracted from the speech signal instead of discrete features. Vectors representing word meanings using one-hot encoding are replaced by real-valued semantic vectors. Instead of incr...
Article
Full-text available
Many theories of word structure in linguistics and morphological processing in cognitive psychology are grounded in a compositional perspective on the (mental) lexicon in which complex words are built up during speech production from sublexical elements such as morphemes, stems, and exponents. When combined with the hypothesis that storage in the l...
Article
Full-text available
Indonesian has two prefixes, PE- and PEN-, that are similar in form and meaning, but are probably not allomorphs. In this study, we applied a distributional vector space model to clarify whether these prefixes have discriminable semantics. Comparisons of pairs of words within and across morphologically defined sets of words revealed that cosine si...
Article
Full-text available
This paper introduces the generalized additive mixed model (GAMM) and the quantile generalized additive mixed model (QGAMM) through reanalyses of bilinguals’ lexical decision data from Dijkstra et al. (2010) and Miwa et al. (2014). We illustrate how regression splines can be used to test for nonlinear effects of cross-language similarity in form as...
Preprint
Full-text available
(*This paper will appear in the Proceedings of the 12th International Seminar on Speech Production (ISSP). Its details (e.g. pages) will be added here then.) Anticipatory coarticulation has been reported to be affected by word form frequency. However, it remains unclear whether the frequency effect also modulates carry-over (perseverative) coarticulation....
Article
Full-text available
In structuralist linguistics, compounds are argued not to constitute morphological categories, due to the absence of systematic form-meaning correspondences. This study investigates subsets of compounds for which systematic form-meaning correspondences are present: adjective–noun compounds in Mandarin. We show that there are substantial differences...
Poster
Full-text available
Frequency has been found to be predictive of articulatory realization. Yet it remains unclear what frequency reflects in the speech production process. The present study derived and applied semantic measures from a computational model based on a discriminative learning mechanism, namely a Linear Discriminative Learning model, in order to pr...
Chapter
Full-text available
The dialectical changes seen across the course of individual lives are typically thought to reflect the attritional influence of standard languages on native dialects. However, the distributional properties of natural languages, which guarantee that lexical knowledge continuously increases across the lifespan, suggest these changes might simply ref...
Poster
Full-text available
We replicated the word form frequency effect on the articulation of the German vowel(s) [a(:)], using ultrasound imaging. The effect was found to be modulated by carry-over as well as anticipatory coarticulation. In addition, this study introduced a new analysis method for ultrasound images: whole-image analysis with GAMs.
Poster
Full-text available
This study shows that the relative semantic importance of stem triphones and suffix (transitional) triphones (Relative Functional Load) is highly predictive of tongue (tip/body) movements. This semantic measure is derived from a Linear Discriminative Learning (LDL) model, which enables a direct mapping between forms and semantics without any intermediate con...
Article
Full-text available
This study addresses whether there is anything special about learning a third language, as compared to learning a second language, that results solely from the order of acquisition. We use a computational model based on the mathematical framework of Linear Discriminative Learning to explore this question for the acquisition of a small trilingual vo...
Presentation
Full-text available
This work is a corpus study on articulations of word-final syllables with the structures VC or VCC in German, e.g. "sagt". Tongue movement data were collected from the KEC corpus, where tongue movements are recorded by Electromagnetic Articulography (EMA). We found optimized tongue movements (co-articulation patterns) when the informativities of t...
Article
Full-text available
Hyphenated compounds have largely been neglected in the studies of compounding, which have seldom analysed compounds in context. In this study, we argue that the hyphen use in compounds is strongly motivated. Hyphenation is used when words form a unit, which reduces the possibility of parsing them into separate units or other forms. The current stu...
Article
Full-text available
Pseudowords have long served as key tools in psycholinguistic investigations of the lexicon. A common assumption underlying the use of pseudowords is that they are devoid of meaning: Comparing words and pseudowords may then shed light on how meaningful linguistic elements are processed differently from meaningless sound strings. However, pseudoword...
Article
Full-text available
Neural computation relies on the integration of synaptic inputs across a neuron’s dendritic arbour. However, it is far from understood how different cell types tune this process to establish cell-type specific computations. Here, using two-photon imaging of dendritic Ca2+ signals, electrical recordings of somatic voltage and biophysical modelling,...
Article
Full-text available
Both localist and connectionist models, based on experimental results obtained for English and French, assume that the degree of semantic compositionality of a morphologically complex word is reflected in how it is processed. Since priming experiments using English and French morphologically related prime-target pairs reveal stronger priming when c...
Chapter
Full-text available
Over the last decades, a growing body of evidence on the mechanisms governing lexical storage, access, acquisition and processing has questioned traditional models of language architecture and word usage based on the hypothesis of a direct correspondence between modular components of grammar competence (lexicon vs. rules), processing correlates (me...
Presentation
Full-text available
Coarticulation of the vowels in the stems of inflected German verbs as a function of the preceding pronouns and the upcoming suffixes, as possible evidence that morphological information is already visible in the stems of verbs.
Preprint
Full-text available
Neural computation relies on the integration of synaptic inputs across a neuron's dendritic arbour. However, the fundamental rules that govern dendritic integration are far from understood. In particular, it is still unclear how cell type-specific differences in dendritic integration arise from general features of neural morphology and membrane pro...
Article
Full-text available
This study examines two nominalizing prefixes in Indonesian: PE- and PEN-, which derive nouns from verbs with a range of meanings similar to that of the -er suffix in English. The prefix PE- is form-invariant, whereas PEN- has several nasal allomorphs. Given their similarity in form and function, the question arises of whether PE- and PEN- are al...
Conference Paper
Full-text available
Nonwords are often used to clarify how lexical processing takes place in the absence of semantics. This study shows that nonwords are not semantically vacuous. We used Linear Discriminative Learning [2] to estimate the meanings of nonwords in the MALD database [14] from the speech signal. We show that measures gauging nonword semantics significant...
Conference Paper
Full-text available
This study investigates the geographical distribution of pronunciation variation of voiceless dental and retroflex sibilants in Taiwan Mandarin. Previous studies indicated that the merging of the two sibilants is geographically dependent [17, 6]. However, the geographical effects in these studies are not easy to interpret due to the limited numbe...
Article
Full-text available
We present the Naive Discriminative Reading Aloud (ndra) model. The ndra differs from existing models of response times in the reading aloud task in two ways. First, a single lexical architecture is responsible for both word and non-word naming. As such, the model differs from dual-route models, which consist of both a lexical route and a sub-lexic...
Article
Full-text available
Using computational simulations, this work demonstrates that it is possible to learn a systematic relation between words' sound and their meanings. The sound-meaning relation was learned from a corpus of phonologically transcribed child-directed speech by using the linear discriminative learning (LDL) framework (Baayen, Chuang, Shafaei-Bajestan, &...
Article
Full-text available
Linear Discriminative Learning (LDL) is a computational theory of how speakers produce and listeners understand words. LDL is developed with the aim of providing a functional characterisation of the cognitive skills that allow speakers to express their thoughts in words, and that allow listeners to decode the intended message from these words. In p...
Article
Full-text available
Recent research on the acoustic realization of affixes has revealed differences between phonologically homophonous affixes, e.g. the different kinds of final [s] and [z] in English (Plag, Homann & Kunter 2017, Zimmermann 2016a). Such results are unexpected and unaccounted for in widely accepted post-Bloomfieldian item-and-arrangement models (Hocket...
Article
Full-text available
The field of cognitive aging has seen considerable advances in describing the linguistic and semantic changes that happen during the adult life span to uncover the structure of the mental lexicon (i.e., the mental repository of lexical and conceptual representations). Nevertheless, there is still debate concerning the sources of these changes, incl...
Preprint
According to Word and Paradigm Morphology (Matthews, 1974; Blevins, 2016), the word is the basic cognitive unit over which paradigmatic analogy operates to predict form and meaning of novel forms. Baayen et al. (2019b, 2018) introduced a computational formalization of word and paradigm morphology which makes it possible to model the production and...
Article
Full-text available
This article provides a tutorial for analyzing pupillometric data. Pupil dilation has become increasingly popular in psychological and psycholinguistic research as a measure to trace language processing. However, there is no general consensus about procedures to analyze the data, with most studies analyzing extracted features from the pupil dilatio...
Preprint
Recent research on the acoustic realization of affixes has revealed differences between phonologically homophonous affixes, for example the different kinds of final [s] and [z] in English (Plag et al. 2017, Zimmermann 2016). Such results are unexpected and unaccounted for in widely-accepted post-Bloomfieldian item-and-arrangement models (Hockett, 1954...
Conference Paper
Full-text available
Indonesian has two prefixes which express a range of semantic functions (e.g. agent, instrument, patient). One prefix, PEN-, has six allomorphs (peng-, peny-, pe-, pen-, pem-, penge-). A second prefix, PE-, is described as having similar form and meaning as pe-. In this study, we used computational models of distributional semantics to clarify whet...
Preprint
Full-text available
Using computational simulations, this work demonstrates that it is possible to learn a systematic relation between words’ sound and their meanings. The sound-meaning relation was learned from a corpus of phonologically transcribed child-directed speech by using the Linear Discriminative Learning (LDL) framework (Baayen, Chuang, Shafaei-Bajestan, &...
Article
Corpus-based research has formed the backbone of linguistic research in recent decades. Large text corpora are used for solving various kinds of linguistic problems, including those of quantitative linguistics, cognitive linguistics, and psycholinguistics. This paper reports the creation of two corpora of contemporary Vietnamese. It also describes...
Article
Full-text available
Research on age-related changes in speech has primarily focused on comparing “young” vs. “elderly” adults. Yet, listeners are able to guess talker age more accurately than a binary distinction would imply, suggesting that acoustic characteristics of speech change continually and gradually throughout adulthood. We describe acoustic properties of vow...
Conference Paper
Full-text available
We are addressing the challenge of learning an inverse mapping between acoustic features and control parameters of a vocal tract simulator. As a first step, we synthesize an articulatory corpus consisting of control parameters and wave forms using VocalTractLab (VTL; [1]) as the vocal tract simulator. The basis for the synthesis is a concatenative...
Chapter
Age effects in experimental psychology are typically interpreted as evidence for cognitive decline. Alternatively, age-related decreases in performance on cognitive tasks could be a result of increased linguistic experience (Ramscar et al. 2014). We present the results of a paired associate learning experiment in which we tested old and young Germa...
Article
Full-text available
This special issue introduces a series of papers that make available new methods to the phonetic and linguistic community and reflect upon existing data analysis practices. In our introduction, we highlight three themes that we consider pressing issues in data analysis and that run across the contributions to this special issue: the difference betw...
Article
Full-text available
The discriminative lexicon is introduced as a mathematical and computational model of the mental lexicon. This novel theory is inspired by word and paradigm morphology but operationalizes the concept of proportional analogy using the mathematics of linear algebra. It embraces the discriminative perspective on language, rejecting the idea that words...
Article
Full-text available
In the past few years, speech recognition has become a new standard for state-of-the-art technology. We now talk to our phones as much as we talk on them. How can helping machines learn to listen improve our understanding of how our own brains work? Dr Harald Baayen at Eberhard Karls University Tübingen and his collaborators work at the intersectio...
Article
Most dictionary definitions for the term compound word characterize it as a word that itself contains two or more words. Thus, a compound word such as goldfish is composed of the constituent words gold and fish. In this report, we present evidence that compound words such as goldfish might not contain the words gold and fish, but rather positiona...
Article
Full-text available
This methodological study provides a step-by-step introduction to a computational implementation of word and paradigm morphology using linear mappings between vector spaces for form and meaning. Taking as starting point the linear regression model, the main concepts underlying linear mappings are introduced and illustrated with R code. It is then s...
Article
Full-text available
When multiple correlated predictors are considered jointly in regression modeling, estimated coefficients may assume counterintuitive and theoretically uninterpretable values. We survey several statistical methods that implement strategies for the analysis of collinear data: regression with regularization (the elastic net), supervised component gen...
Conference Paper
Full-text available
In this study, we report on two nominalizing prefixes in Indonesian which are similar in form and meaning: pe- and peN-. Pe-, the invariant form, attaches to verbs that carry the prefix ber-. The other prefix, peN-, combines with verbs with the prefix meN- or one of its allomorphs. With regard to their similar form and function, we investigated wh...
Article
Many studies report shorter acoustic durations, more coarticulation and reduced articulatory targets for frequent words. This study investigates a factor ignored in discussions on the relation between frequency and phonetic detail, namely, that motor skills improve with experience. Since frequency is a measure of experience, it follows that frequen...
Article
Full-text available
We present the Chinese Lexical Database (CLD): a large-scale lexical database for simplified Chinese. The CLD provides a wealth of lexical information for 3913 one-character words, 34,233 two-character words, 7143 three-character words, and 3355 four-character words, and is publicly available through http://www.chineselexicaldatabase.com. For each...