Jonathan Dunn

University of Canterbury | UC · Department of Linguistics

PhD

55
Publications
12,133
Reads
299
Citations
Introduction
I am a computational linguist in the Linguistics Department at the University of Canterbury in Christchurch, New Zealand. I work across both linguistic theory and natural language processing. My research models the emergence of grammatical structure within individuals and its diffusion across global populations. Site: https://www.jdunn.name
Additional affiliations
June 2018 - present
University of Canterbury
Position
  • Lecturer
September 2015 - May 2018
National Geospatial-Intelligence Agency
Position
  • Researcher
September 2015 - May 2018
Illinois Institute of Technology
Position
  • Professor (Assistant)

Publications (55)
Article
Full-text available
This paper develops a construction-based dialectometry capable of identifying previously unknown constructions and measuring the degree to which a given construction is subject to regional variation. The central idea is to learn a grammar of constructions (a CxG) using construction grammar induction and then to use these constructions as features f...
Conference Paper
Full-text available
A usage-based Construction Grammar (CxG) posits that slot-constraints generalize from common exemplar constructions. But what is the best model of constraint generalization? This paper evaluates competing frequency-based and association-based models across eight languages using a metric derived from the Minimum Description Length paradigm. The expe...
Conference Paper
Full-text available
This paper evaluates global-scale dialect identification for 14 national varieties of English as a means for studying syntactic variation. The paper makes three main contributions: (i) introducing data-driven language mapping as a method for selecting the inventory of national varieties to include in the task; (ii) producing a large and dynamic set...
Article
Full-text available
The goal of this paper is to provide a complete representation of regional linguistic variation on a global scale. To this end, the paper focuses on removing three constraints that have previously limited work within dialectology/dialectometry. First, rather than assuming a fixed and incomplete set of variants, we use Computational Construction Gra...
Conference Paper
Full-text available
While text corpora have been steadily increasing in overall size, even very large corpora are not designed to represent global population demographics. For example, recent work has shown that existing English gigaword corpora over-represent inner-circle varieties from the US and the UK (Dunn, 2019c). To correct implicit geographic and demographic b...
Conference Paper
Full-text available
This paper provides language identification models for low- and under-resourced languages in the Pacific region with a focus on previously unavailable Austronesian languages. Accurate language identification is an important part of developing language resources. The approach taken in this paper combines 29 Austronesian languages with 171 non-Austron...
Conference Paper
Full-text available
This paper simulates a low-resource setting across 17 languages in order to evaluate embedding similarity, stability, and reliability under different conditions. The goal is to use corpus similarity measures before training to predict properties of embeddings after training. The main contribution of the paper is to show that it is possible to predi...
Preprint
Full-text available
This paper provides language identification models for low- and under-resourced languages in the Pacific region with a focus on previously unavailable Austronesian languages. Accurate language identification is an important part of developing language resources. The approach taken in this paper combines 29 Austronesian languages with 171 non-Austro...
Preprint
Full-text available
This paper simulates a low-resource setting across 17 languages in order to evaluate embedding similarity, stability, and reliability under different conditions. The goal is to use corpus similarity measures before training to predict properties of embeddings after training. The main contribution of the paper is to show that it is possible to predi...
Article
Full-text available
This paper experiments with frequency-based corpus similarity measures across 39 languages using a register prediction task. The goal is to quantify (i) the distance between different corpora from the same language and (ii) the homogeneity of individual corpora. Both of these goals are essential for measuring how well corpus-based linguistic analys...
Book
Full-text available
Corpus analysis can be expanded and scaled up by incorporating computational methods from natural language processing. This Element shows how text classification and text similarity models can extend our ability to undertake corpus linguistics across very large corpora. These computational methods are becoming increasingly important as corpora grow...
Article
Full-text available
This paper measures the impact of increased exposure on whether learned construction grammars converge onto shared representations when trained on data from different registers. Register influences the frequency of constructions, with some structures common in formal but not informal usage. We expect that a grammar induction algorithm exposed to d...
Preprint
Full-text available
This paper measures the impact of increased exposure on whether learned construction grammars converge onto shared representations when trained on data from different registers. Register influences the frequency of constructions, with some structures common in formal but not informal usage. We expect that a grammar induction algorithm exposed to di...
Preprint
Full-text available
This paper asks whether a distinction between production-based and perception-based grammar induction influences either (i) the growth curve of grammars and lexicons or (ii) the similarity between representations learned from independent sub-sets of a corpus. A production-based model is trained on the usage of a single individual, thus simulating t...
Article
Full-text available
This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single unde...
Preprint
Full-text available
This paper formulates and evaluates a series of multi-unit measures of directional association, building on the pairwise ΔP measure, that are able to quantify association in sequences of varying length and type of representation. Multi-unit measures face an additional segmentation problem: once the implicit length constraint of pairwise meas...
Preprint
Full-text available
The goal of this paper is to provide a complete representation of regional linguistic variation on a global scale. To this end, the paper focuses on removing three constraints that have previously limited work within dialectology/dialectometry. First, rather than assuming a fixed and incomplete set of variants, we use Computational Construction Gra...
Preprint
Full-text available
Computational measures of linguistic diversity help us understand the linguistic landscape using digital language data. The contribution of this paper is to calibrate measures of linguistic diversity using restrictions on international travel resulting from the COVID-19 pandemic. Previous work has mapped the distribution of languages using geo-refe...
Preprint
Full-text available
This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single unde...
Preprint
Full-text available
This paper develops a construction-based dialectometry capable of identifying previously unknown constructions and measuring the degree to which a given construction is subject to regional variation. The central idea is to learn a grammar of constructions (a CxG) using construction grammar induction and then to use these constructions as features f...
Article
Full-text available
This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used (e.g., English, Arabic, Russian) together with consistently collected data for each variety. Second, the pape...
Conference Paper
Full-text available
Computational measures of linguistic diversity help us understand the linguistic landscape using digital language data. The contribution of this paper is to calibrate measures of linguistic diversity using restrictions on international travel resulting from the COVID-19 pandemic. Previous work has mapped the distribution of languages using geo-refe...
Conference Paper
Full-text available
Active Video Watching (AVW-Space) is an online platform for video-based learning which supports engagement via note-taking and personalized nudges. In this paper, we focus on the quality of the comments students write. We propose two schemes for assessing the quality of comments. Then, we evaluate these schemes by computing the inter-coder agreemen...
Conference Paper
Full-text available
AVW-Space is an online video-based learning platform which aims to improve engagement by providing a note-taking environment and personalized support. This paper presents a PhD project focusing on the nudges about the quality of comments learners make on the videos in AVW-Space. We first automated the quality assessment of comments using machine le...
Article
Full-text available
Computational measures of linguistic diversity help us understand the linguistic landscape using digital language data. The contribution of this paper is to calibrate measures of linguistic diversity using restrictions on international travel resulting from the COVID-19 pandemic. Previous work has mapped the distribution of languages using geo-refe...
Preprint
This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets. The goal is to determine (i) which dataset best represents population demographics; (ii) in what parts of the world the datasets are most representative of actual populations; and (iii...
Preprint
This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used (e.g., English, Arabic, Russian) together with consistently collected data for each variety. Second, the pape...
Chapter
Full-text available
Traditional approaches view metaphor as a semantic/pragmatic phenomenon that occurs at a conceptual level as mappings between independent concepts. These conceptual mappings are then lexicalized into observed metaphoric expressions. In this view, the lexical and grammatical structure of a metaphoric expression is not relevant to the underlying meta...
Conference Paper
Full-text available
This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets. The goal is to determine (i) which dataset best represents population demographics; (ii) in what parts of the world the datasets are most representative of actual populations; and (iii...
Conference Paper
Full-text available
This paper evaluates large georeferenced corpora, taken from both web-crawled and social media sources, against ground-truth population and language-census datasets. The goal is to determine (i) which dataset best represents population demographics; (ii) in what parts of the world the datasets are most representative of actual populations; and (iii...
Article
Full-text available
The goal of this paper is to provide a complete representation of regional linguistic variation on a global scale. To this end, the paper focuses on removing three constraints that have previously limited work within dialectology/dialectometry. First, rather than assuming a fixed and incomplete set of variants, we use Computational Construction Gra...
Preprint
Full-text available
This paper uses the Minimum Description Length paradigm to model the complexity of CxGs (operationalized as the encoding size of a grammar) alongside their descriptive adequacy (operationalized as the encoding size of a corpus given a grammar). These two quantities are combined to measure the quality of potential CxGs against unannotated corpora, s...
Preprint
Full-text available
A usage-based Construction Grammar (CxG) posits that slot-constraints generalize from common exemplar constructions. But what is the best model of constraint generalization? This paper evaluates competing frequency-based and association-based models across eight languages using a metric derived from the Minimum Description Length paradigm. The expe...
Preprint
Full-text available
This paper evaluates global-scale dialect identification for 14 national varieties of English as a means for studying syntactic variation. The paper makes three main contributions: (i) introducing data-driven language mapping as a method for selecting the inventory of national varieties to include in the task; (ii) producing a large and dynamic set...
Article
Full-text available
This paper formulates and evaluates a series of multi-unit measures of directional association, building on the pairwise ΔP measure, that are able to quantify association in sequences of varying length and type of representation. Multi-unit measures face an additional segmentation problem: once the implicit length constraint of pairwise measures is...
Conference Paper
Full-text available
This paper uses the Minimum Description Length paradigm to model the complexity of CxGs (operationalized as the encoding size of a grammar) alongside their descriptive adequacy (operationalized as the encoding size of a corpus given a grammar). These two quantities are combined to measure the quality of potential CxGs against unannotated corpora, s...
Article
Full-text available
The strength of Construction Grammar (CxG) is its descriptive power; its weakness is the learnability and falsifiability of its unconstrained representations. Learnability is the degree to which the optimum set of constructions can be consistently selected from the large set of potential constructions; falsifiability is the ability to make testable...
Conference Paper
Full-text available
Construction Grammar (CxG) views language as a network of constraint-based slot-filler constructions at different levels of representation and abstraction that emerge from observed usage. These constructions are self-similar in the sense that the same processes and forms are posited to repeat themselves at multiple levels of representation: rather...
Article
Full-text available
This paper presents an algorithm for learning the construction grammar of a language from a large corpus. This grammar induction algorithm has two goals: first, to show that construction grammars are learnable without highly specified innate structure; second, to develop a model of which units do or do not constitute constructions in a given datase...
Article
Full-text available
This paper presents and evaluates a model of how the abstractness of source and target concepts influences metaphoricity, the property of how metaphoric a linguistic metaphoric expression is. The purpose of this is to investigate the long-standing claim that metaphoric mappings are from less abstract concepts to more abstract concepts. First, abstr...
Article
Full-text available
Text classification systems are capable of predicting certain characteristics of a text’s author (e.g., gender and age) using only linguistic properties. This paper asks why such predictions are possible and how they can be interpreted. There are three factors: (1) the nature of the features used by the system; (2) the robustness of the predictions...
Article
Full-text available
This article argues that there are three types of metaphoric utterances that can be defined by (a) the contextual stability of the utterance’s interpretation and (b) the presence or absence of a conceptual source–target mapping. Evidence for these three types of metaphoric utterances comes from introspective evidence about metaphor-in-language, fro...
Article
Full-text available
This article presents a profile-based authorship analysis method which first categorizes texts according to social and conceptual characteristics of their author (e.g. Sex and Political Ideology) and then combines these profiles for two authorship analysis tasks: (1) determining shared authorship of pairs of texts without a set of candidate authors...
Conference Paper
Full-text available
This paper presents the first computationally-derived scalar measurement of metaphoricity. Each input sentence is given a value between 0 and 1 which represents how metaphoric that sentence is. This measure achieves a correlation of 0.450 (Pearson's R, p < 0.01) with an experimental measure of metaphoricity involving human participants. While far fr...
Conference Paper
Full-text available
Metaphor is a cognitive process that shapes abstract target concepts by mapping them to concrete source concepts. Thus, many computational approaches to metaphor make reference, directly or indirectly, to the abstractness of words and concepts. The property of abstractness, however, remains theoretically and empirically unexplored. This paper imple...
Conference Paper
Full-text available
This study first examines the implicit and explicit premises of four systems for identifying metaphoric utterances from unannotated input text. All four systems are then evaluated on a common data set in order to see which premises are most successful. The goal is to see if these systems can find metaphors in a corpus that is mostly non-metaphoric...
Article
Full-text available
This paper argues that two properties of the linguistic structure of an utterance influence and partially determine whether the utterance has a metaphoric meaning that results in a stable interpretation: (i) degree of metaphoricity and (ii) degree of metaphoric saturation. A majority of metaphoric utterances in a corpus study (66%) were unsaturat...
Article
Full-text available
Metaphoric expressions are not all equal, in the sense that some are intuitively more or less metaphoric than others. Part of this intuition is influenced by the underlying metaphor, but another part is influenced by the linguistic expression which carries that metaphor. This paper puts forward a system, first, of dividing the two important element...
