About
271
Publications
86,235
Reads
5,027
Citations
Introduction
Additional affiliations
August 2015 - present
Publications (271)
We present a novel approach to identifying individual pairs of phonetic correspondences in a dataset of dialect pronunciations. This continues work identifying shibboleths (i.e., characteristic features of a given dialect), a category that has interested dialectology and that dialectometrical research has examined mostly in the form of categorical...
The grammatical ambition to distinguish well-formed from ill-formed structures very often leads to more complicated analyses, which in turn can impede the use of analyses in further studies. We argue thus that less ambitious and less complicated analyses can often provide more scientific insight. Two concrete cases are presented where less discrimi...
The Scandinavian languages are so alike that their speakers often communicate, each using their own language, which Haugen (1966) dubbed SEMICOMMUNICATION. The success of semicommunication depends on the languages involved and, moreover, can be asymmetric: for example, Swedish is more easily understood by a Dane than Danish by a Swede. It...
Jean Séguy and Hans Goebl were the founders both of Romance dialectometry and of dialectometry in general, which focused largely on Romance languages in its early years. While other attention to dialects had appealed to scholarly intuition to adduce the principles behind the geographic distribution of linguistic variation, dialectometry insisted on...
Variation Rolls the Dice: A worldwide collage in honour of Salikoko S. Mufwene aims to celebrate Mufwene’s ground-breaking contribution to linguistics in the past four decades. The title also encapsulates his approach to language as both systemic and socio-cultural practices, and the role of variation in determining particular evolutionary trajecto...
Gabon is an African country located very close to the homeland of Bantu languages (Cameroon). Starting about 5,000 years ago, Bantu-speaking populations diffused into almost all sub-Saharan Africa. By processing with computational linguistic methods (Levenshtein distance) two independently collected lexical data sets recording the pronunciation of...
In dialectology we often encounter irreducible variation in its data, i.e., multiple responses to its probes about the form of a word or phrase. Dialectometry seeks to measure the differences between dialects and has developed several ways to measure the difference between responses when one or both of them is non-unique. We introduce here BILBAO D...
The Handbook of Dialectology provides an authoritative, up-to-date and unusually broad account of the study of dialect, in one volume. Each chapter reviews essential research, and offers a critical discussion of the past, present and future development of the area. The volume is based on state-of-the-art research in dialectology around the world, p...
In this study, we investigate which factors influence the linguistic distance of Catalan dialectal pronunciations from standard Catalan. We use pronunciations from three regions where the northwestern variety of the Catalan language is spoken (Catalonia, Aragon and Andorra). In contrast to Aragon, Catalan has an official status in both Catalonia an...
Editors' introduction to 'The Handbook of Dialectology'
The Handbook of Dialectology provides an authoritative and up-to-date survey of dialectology from around the world. Each chapter reviews essential research and offers a critical discussion of the past, present and future development of the area.
• Incorporates state-of-the-art research in dialectology from around the world, providing the most curre...
In this work, we demonstrate the application of statistical measures from dialectometry to the study of accented English speech. This new methodology enables a more quantitative approach to the study of accents. Studies on spoken dialect data have shown that a combination of representativeness (the difference between pronunciations within the lang...
We have documented language varieties (either Turkic or Indo-European) spoken in 23 test sites by 88 informants belonging to the major ethnic groups of Kyrgyzstan, Tajikistan and Uzbekistan (Karakalpaks, Kazakhs, Kyrgyz, Tajiks, Uzbeks, Yaghnobis). The recorded linguistic material concerns 176 words of the extended Swadesh list and will be made pub...
In this study we investigate which factors affect the degree of non-native accent of L2 speakers of English who learned English in school and mostly lived for some time in an anglophone setting. We use data from the Speech Accent Archive containing over 700 speakers speaking almost 160 different native languages. We show that besides several import...
The usual focus in authorship studies is on authorship attribution, i.e. determining which author (of a given set) wrote a piece of unknown provenance. The usual setting involves a small number of candidate authors, which means that the focus quickly revolves around a search for features that discriminate among the candidates. Whether the features...
Gabmap is a freely available, open-source web application that analyzes the data of language variation, e.g. varying words for the same concepts, varying pronunciations for the same words, or varying frequencies of syntactic constructions in transcribed conversations. Gabmap is an integrated part of CLARIN (see e.g. http://portal.clarin.nl). This a...
Modern information processing enables the examination of linguistic variation in samples requiring 10^8 or more comparisons of individual sound segments, such as /a/ vs. /o/. This frees researchers from the need to focus narrowly on a small set of contrasts, enabling larger scale aggregate comparisons, with a number of advantages (Nerbonne 2009). W...
The title of this paper indirectly refers to the theme of the 2014 Conference on Digital Humanities in German-Speaking Areas (Digital Humanities – methodischer Brückenschlag oder ›feindliche Übernahme‹?). I assume from this title that computer science is viewed as the originator of such a deliberate conquest. The first section of this paper explore...
Dialectometry applies computational and statistical analyses within dialectology, making work more easily replicable and understandable. This survey article first reviews the field briefly in order to focus on developments in the past five years. Dialectometry no longer focuses exclusively on aggregate analyses, but rather deploys various technique...
Wieling et al. (2011) combined generalized additive modeling (GAM) with mixed-effects regression modeling to identify the influence of social, lexical, and geographical variables on the variation of Dutch dialect pronunciations. The conclusion of their study was that the pronunciation distance from standard Dutch became greater for locations with a...
This study uses a generalized additive mixed-effects regression model to predict lexical differences in Tuscan dialects with respect to standard Italian. We used lexical information for 170 concepts used by 2,060 speakers in 213 locations in Tuscany. In our model, geographical position was found to be an important predictor, with locations more dis...
This study uses a generalized additive mixed-effects regression model to predict lexical differences in Tuscan dialects with respect to standard Italian. We used lexical information for 170 concepts used by 2060 speakers in 213 locations in Tuscany. In our model, geographical position was found to be an important predictor, with locations more dist...
This is a great book that aims to popularize the study of how function words such as pronouns, but also articles, prepositions and auxiliary verbs reveal personality traits and roles within relationships. James Pennebaker is a social psychologist who has made major contributions in understanding how people who've gone through traumatic experience...
In this study we develop pronunciation distances based on naive discriminative learning (NDL). Measures of pronunciation distance are used in several subfields of linguistics, including psycholinguistics, dialectology and typology. In contrast to the commonly used Levenshtein algorithm, NDL is grounded in cognitive theory of competitive reinforceme...
With an eye toward measuring the strength of foreign accents in American English, we evaluate the suitability of a modified version of the Levenshtein distance for comparing (the phonetic transcriptions of) accented pronunciations. Although this measure has been used successfully inter alia to study the differences among dialect pronunciations, it...
This paper presents an unsupervised and incremental model of learning segmentation that combines multiple cues whose use by children and adults was attested by experimental studies. The cues we exploit in this study are predictability statistics, phonotactics, lexical stress and partial lexical information. The performance of the model presented i...
This study presents the results from a computer-assisted language learning (CALL) system of Runyakitara (RU_CALL). The major objective was to provide an electronic language learning environment that can enable learners with mother tongue deficiencies to enhance their knowledge of grammar and acquire writing skills in Runyakitara. The system current...
The calculation of aggregate linguistic distances can compensate for some of the drawbacks inherent to the isogloss bundling method used in traditional dialectology to identify dialect areas. Synchronic aggregate analysis can also point out differences with respect to a diachronically based classification of dialects. In this study the Levenshtein...
This article investigates several linguistic changes which are ongoing in north-western Catalan using a contemporary corpus. We take advantage of a range of dialectometric methods that allow us to calculate and analyse the linguistic distance between varieties in apparent time from an aggregate perspective. Specifically, we pay attention to the pro...
This study explores the linguistic application of bipartite spectral graph partitioning, a graph-theoretic technique that simultaneously identifies clusters of similar localities as well as clusters of features characteristic of those localities. We compare the results using this approach with previously published results on the same dataset using...
A careful investigation of synchronic patterns of linguistic variation with underlying linguistic features can lead to important insights into the comprehension of diachronic phonetic processes. In this article, we showed that the method of spectral partitioning of bipartite graphs applied to synchronic dialectal data can effectively and reliably b...
Dialectology is one of the sub-disciplines in the humanities that embraced digital techniques early on. The use of computational and quantitative techniques in dialectology is known as 'dialectometry'. The present collection of articles contains several which proudly continue working within dialectometry's usual assumptions and toward its establishe...
Traditional Estonian dialect classifications are based on the phonology, morphology, and lexis, and there are very few studies about syntax available. The present article is the first quantitative syntactic study of Estonian dialects. We concentrate on constructions consisting of finite and non-finite verbs, and we apply contemporary statistical me...
In order to realize the idea of document enrichment we developed a tool called TermPedia which predicts and defines technical terms in educational text. The definitions are extracted from Wikipedia, and the technical terms are also linked to contextually relevant Wikipedia articles which provide further explanation for the definitions. This paper p...
The current paper examines the role of regiolects in the Dutch language of the Netherlands and Flanders. Because regiolects are difficult to study, as they may not constitute a linguistic variety in the usual sense of the word, we focus on the speech of professional announcers employed by regional radio stations. We examine their speech in light of...
It is a useful premise to assume that every document in a collection and every query issued to an information retrieval (IR) system are geography-dependent. If one can determine what area an article is about (i.e., its geographical scope), this information can be used to improve the accuracy with which people, places and organisations named in the...
The range of dialectometric methods suggests the need for validation work. We propose a gold standard, based on the consensual classification of a well-studied area. Fidelity to the gold standard is assessed via matrix overlap measures (Rand and Fowlkes/Mallows). Word-based techniques in which varieties are compared to each other directly emerge as...
A shibboleth is a pronunciation, or, more generally, a variant of speech that betrays where a speaker is from (Judges 12:6). We propose a generalization of the well-known precision and recall scores to deal with the case of detecting distinctive, characteristic variants when the analysis is based on numerical difference scores. We also compare our...
Structuralists famously observed that language is “un système où tout se tient” (Meillet, 1903, p. 407), insisting that the system of relations of linguistic units was more important than their concrete content. This study attempts to derive content from relations, in particular phonetic (acoustic) content from the distribution of alternative pronu...
In recent years, dialectometry has gained interest among Catalan dialectologists. As a consequence, a specific dialectometric approach has been developed at the University of Barcelona, which aims at increasing the accuracy of final groupings by means of discriminating the predictable components of the language from its unpredictable ones. Another...
In this study we attempt to derive phonetic distances from alternative dialectal pronunciations used in different geographical varieties. We use two dialect atlases each containing the phonetic transcriptions of the same set of words at hundreds of sites. We collect the sound correspondences through alignment with the Levenshtein distance algorithm...
In this study we examine linguistic variation and its dependence on both social and geographic factors. We follow dialectometry in applying a quantitative methodology and focusing on dialect distances, and social dialectology in the choice of factors we examine in building a model to predict word pronunciation distances from the standard Dutch lang...
In this study we use bipartite spectral graph partitioning to simultaneously cluster varieties and identify their most distinctive linguistic features in Dutch dialect data. While clustering geographical varieties with respect to their features, e.g. pronunciation, is not new, the simultaneous identification of the features which give rise to the g...
We develop an aggregate measure of syntactic difference for automatically finding typical syntactic differences between collections of text. With the use of this measure it is possible to mine for statistically significant syntactic differences between, for example, the English of learners and natives, or between related dialects, and to find not...
Gabmap is a web application aimed especially to facilitate explorations in quantitative dialectology – or dialectometry – by enabling researchers in dialectology to conduct computer-supported explorations and calculations even if they have relatively little computational expertise. Gabmap creates various views of dialect data, from histograms of...
The primary data on pronunciation variation – e.g., dialect atlas data – is often recorded incommensurably, i.e. in different ways in different atlases, and even in different ways within the same atlas when teams of fieldworkers and transcribers are involved. In particular these data collections differ in the detail in which pronunciations are reco...
We briefly introduce the papers in this special issue of Dialectologia on production, perception and attitude. They are the result of a call for papers issued at an interdisciplinary workshop in Leuven in 2009 organized by Dirk Geeraerts, Stef Grondelaers, Leen Impe and Dirk Speelman.
We examine situations in which linguistic changes have probably been propagated via normal contact as opposed to via conquest, recent settlement and large-scale migration. We proceed then from two simplifying assumptions: first, that all linguistic variation is the result of either diffusion or independent innovation, and, second, that we may opera...
Dialect classification is a classical problem in traditional dialectology. In the course of the last few decades, several quantitative approaches have been suggested as solutions for this problem, one of which uses "Levenshtein distance" for measuring linguistic distances between dialects. In the present paper we shall introduce the Levenshtein algo...
In this study we apply hierarchical spectral partitioning of bipartite graphs to a Dutch dialect dataset to cluster dialect varieties and determine the concomitant sound correspondences. An important advantage of this clustering method over other dialectometric methods is that the linguistic basis is simultaneously determined, bridging the gap betw...
We investigate language contact effects between Bulgarian dialects on the one hand, and the languages of the countries bordering Bulgaria on the other. The Bulgarian data comes from Stojkov's Bulgarian Dialect Atlases. We investigate three techniques to detect contact effects, the phone frequency method and the feature frequency method, both o...
We proceed from the view that linguistic variation must be examined from an aggregate perspective, i.e. from a perspective which encompasses as much of the variation between language varieties as possible rather than concentrating on single linguistic features. We review the motivation for this position in the first section of the article, and then...
We apply a computational measure of pronunciation difference to a database of 36 word pronunciations from 490 sites throughout Stoykov's Bulgarian Dialect Atlases. The result is a comprehensive view of the aggregate pronunciation differences among the 490 sites. This study aims to contribute therefore to Bulgarian dialectology, as well as to the develo...
This article surveys recent developments furthering dialectometric research which the authors have been involved in, in particular techniques for measuring large numbers of pronunciations (in phonetic transcription) of comparable words at various sites. Edit distance (also known as Levenshtein distance) has been deployed for this purpose, for whic...
This is the report of a panel discussion held in connection with the special session on computational methods in dialectology at Methods XIII: Methods in Dialectology on 5 August, 2008 at the University of Leeds. We scheduled this panel discussion in order to reflect on what the introduction of computational methods has meant to our subfield of lin...
This book is a computational study of dialects and social differences in language. © Edinburgh University Press and the Association for History and Computing 2009.
The volume we are introducing here contains a selection of the papers presented at a special track on computational techniques for studying language variation held at The Thirteenth International Conference on Methods in Dialectology in Leeds on Aug. 4-5, 2008. We are grateful to the organizers, Nigel Armstrong, Joan Beal, Fiona Douglas, Barry Hesel...