Andrew Hardie’s research while affiliated with Lancaster University and other places


Publications (72)


A theory for words in Georgian: traditional constructs versus corpus annotation
  • Article
  • Full-text available

December 2024

·

8 Reads

Corpus Linguistics and Linguistic Theory

Andrew Hardie

·

Sophiko Daraselia

Part-of-speech annotation, as an exercise in categorisation, necessitates a category schema, based on some model or theory of the grammar of the language. Such a model may (sometimes, must) deviate from traditional approaches for human understanding, as exploration of theoretical issues arising from a Georgian POS schema illustrates. Consistency on classifying by form versus function is problematised by difficult pronoun/demonstrative and adjective/noun distinctions. Adverb subcategorisation illustrates exclusion of semantic/derivational distinctions that traditional approaches readily admit. Variation in plural inflection has implications for how diachronicity is handled, as does “zero case”. Postpositions make necessary a specific approach to cliticisation in which enclitic elements are handled as separate tokens bearing their own analysis. Suffixaufnahme provides a case study in inclusion versus exclusion of a rare but current phenomenon. Verb morphology illustrates how simplifying assumptions help favour abstraction of categories over descriptive exhaustiveness. Divergence between the resulting model and traditional characterisations does not invalidate either, but evidences how a model’s design is inseparable from its purpose. With regard to these select issues of Georgian grammar, this discussion aims both to demonstrate the overall argument regarding theorisation/schematisation for a specific, practical purpose (POS annotation) and to justify solutions proposed to problems at hand.


Design and construction of an openly available Urdu web corpus

November 2024

·

1 Read

Corpora

Urdu corpus linguistics is in its infancy, partly because the field lacks large, openly and freely accessible corpora. General purpose Urdu corpora created to date are unsuitable as shared reference data for the field due to barriers of cost or copyright. The novel Lancaster Urdu Web Corpus (LUWC) is designed to fill this gap. It encompasses data from three news websites and an online chat forum. The corpus contains 24 million tokens, and is part-of-speech (POS) tagged. To overcome problems with distributing a corpus whose texts’ intellectual property belongs to other parties, the LUWC is available through a CQPweb server, disallowing access to full underlying data. However, the accessibility of source URLs as text-level metadata gives users a means by which to see the full original context. In spite of issues of balance/representativeness the LUWC can fulfil the role of a shared reference point for Urdu corpus analysis.


The interpretation of topic models for scholarly analysis: An evaluation and critique of current practice

December 2022

·

117 Reads

·

18 Citations

Digital Scholarship in the Humanities

Topic modelling is a method of statistical data mining of a corpus of documents, popular in the digital humanities and, increasingly, in social sciences. A critical methodological issue is how ‘topics’ (groups of co-selected word types) can be interpreted in analytically meaningful terms. In the current literature, this is typically done by ‘eyeballing’; that is, cursory and largely unsystematic examination of the ‘top’ words in each algorithmically identified word group. We critically evaluate this approach in a dual analysis, comparing the ‘eyeballing’ approach with an alternative using sample close reading across the corpus. We used MALLET to extract two topic models from a test corpus: one with stopwords included, another with stopwords excluded. We then used the aforementioned methods to assign labels to these topics. The results suggest that a close-reading approach is more effective not only in level of detail but even in terms of accuracy. In particular, we found that: assigning labels via eyeballing yields incomplete or incorrect topic labels; removing stopwords drastically affects the analysis outcome; topic labelling and interpretation depend considerably on the analysts’ specialist knowledge; and differences of perspective or construal are unlikely to be captured through a topic model. We conclude that an interpretive paradigm founded in close reading may make topic modelling more appealing to humanities researchers.
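As a rough illustration of what the 'eyeballing' step operates on, the sketch below ranks the words of a single topic by weight and shows how the visible 'top' words change once stopwords are dropped. It is not taken from the paper and does not call MALLET: the function name `top_words`, the weights and the stopword list are all hypothetical. Note also that it filters one weight table after the fact, whereas the study built two separate models, so it only hints at why stopword handling matters.

```python
from typing import Dict, FrozenSet, List


def top_words(weights: Dict[str, float], n: int = 3,
              stopwords: FrozenSet[str] = frozenset()) -> List[str]:
    """Return the n highest-weighted words of one topic, skipping stopwords.

    Ties are broken alphabetically so the result is deterministic.
    """
    candidates = [w for w in weights if w not in stopwords]
    ranked = sorted(candidates, key=lambda w: (-weights[w], w))
    return ranked[:n]


# Hypothetical topic-word weights, invented purely for illustration.
topic = {"the": 0.30, "of": 0.20, "and": 0.15,
         "court": 0.10, "judge": 0.08, "trial": 0.05}

with_stopwords = top_words(topic)
without_stopwords = top_words(topic, stopwords=frozenset({"the", "of", "and"}))
```

With the invented weights above, the list an analyst would 'eyeball' consists entirely of function words until the stopwords are excluded.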



Pre-suasive and persuasive strategies in the tweets of the Saudi Ministry of Health during the 2020 coronavirus pandemic: A corpus linguistic exploration

September 2022

·

87 Reads

Frontiers in Communication

In this study, we assess the applicability and usefulness of a particular theoretical framework for qualitative analysis of communicative strategies in discourses from beyond the English language. The theory in question is Cialdini's model of persuasion (and the related concept of pre-suasion). We present an operationalisation of this framework in terms of concrete linguistic features, which is implemented using the computer-assisted methods of corpus linguistics. As a case study, we explore a particular type of Arabic-language online public discourse surrounding an issue of pressing contemporary concern, namely the COVID-19 Pandemic. Specifically, we use a large collection of texts produced by the Ministry of Health of Saudi Arabia via the medium of the Ministry's official Twitter account. The tweets in question were produced in the context of a campaign to persuade the public to modify their behavior to comply with policies on protective measures. While the use of corpus-assisted linguistic approaches to examine public discourses around socially or culturally prominent issues is well-developed in the Anglosphere, it remains much more rarely utilized in the Arab World context, and especially in application to discourses in the Arabic language itself. In addition to the contribution arising from the improvements generated in our understanding of the particular issue at hand, this paper aims to contribute to the broader field of Arabic linguistics by modeling a suitable approach—albeit one whose use we show to be subject to some complicating factors—to address other questions in the study of persuasive language in Arabic.


Making use of transcription data from qualitative research within a corpus-linguistic paradigm: issues, experiences and recommendations

April 2022

·

26 Reads

·

9 Citations

Corpora

In this paper, we reflect on the process of re-operationalising transcript data generated in an ethnographic study for the purposes of corpus analysis. We present a corpus of patient–provider interactions in the context of Emergency Departments in hospitals in Australia, to discuss the process through which ethnographic transcripts were manipulated to generate a searchable corpus. We refer to the types of corpus analysis that this conversion enables, facilitated by the rich metadata collected alongside the transcribed audio recordings, augmenting the findings of prior qualitative analyses. Subsequently, we offer guidance for spoken data transcription, intended to ‘future proof’ such data for subsequent reformatting for corpus linguistic analysis.


Exploring and categorising the Arabic copula and auxiliary kāna through enhanced part-of-speech tagging

November 2021

·

12 Reads

·

3 Citations

Corpora

Arabic syntax has yet to be studied in detail from a corpus-based perspective. The Arabic copula kāna (‘be’), functions also as an auxiliary, creating periphrastic tense–aspect constructions; but the literature on these functions is far from exhaustive. To analyse kāna within the one-million-word Corpus of Contemporary Arabic, part-of-speech tagging (using novel, targeted enhancements to a previously described program which improves the accessibility for linguistic analysis of the output of Habash et al.’s [2012] MADA disambiguator for the Buckwalter Arabic morphological analyser) is applied to disambiguate copula and auxiliary at a high rate of accuracy. Concordances of both are extracted, and 10 percent samples (499 instances of copula kāna and 387 of auxiliary kāna) are analysed manually to identify surface-level grammatical patterns and meanings. This raw analysis is then systematised according to the more general patterns’ main parameters of variation; special descriptions are developed for specific, apparently fixed-form expressions (including two phraseologies which afford expression of verbal and adjectival modality). Overall, we uncover substantial new detail, not mentioned in existing grammars (e.g., the quantitative predominance of the past imperfect construction over other uses of auxiliary kāna). There exists notable potential for these corpus-based findings to inform and enhance not only grammatical descriptions but also pedagogy of Arabic as a first or second/foreign language.


Figure 1: VARD2 in interactive mode, showing work-in-progress on Henry V
Figure 3: Post-editing of POS tagger output. This screenshot shows the beginning of Macbeth in the format used for post-editing (raw CLAWS output with minor readability tweaks), open in Notepad++. All but one visible token is correctly tagged. The exception is meet, which should be tagged VVI (infinitive) not VV0 (present tense).
The plays constituting ESC:Folio
Social status categories
Supporting the corpus-based study of Shakespeare’s language: Enhancing a corpus of the First Folio

May 2021

·

367 Reads

·

5 Citations

ICAME Journal

·

Andrew Hardie

·

Jane Demmen

·

[...]

·

This article explores challenges in the corpus linguistic analysis of Shakespeare’s language, and Early Modern English more generally, with particular focus on elaborating possible solutions and the benefits they bring. An account of work that took place within the Encyclopedia of Shakespeare’s Language Project (2016–2019) is given, which discusses the development of the project’s data resources, specifically, the Enhanced Shakespearean Corpus. Topics covered include the composition of the corpus and its subcomponents; the structure of the XML markup; the design of the extensive character metadata; and the word-level corpus annotation, including spelling regularisation, part-of-speech tagging, lemmatisation and semantic tagging. The challenges that arise from each of these undertakings are not exclusive to a corpus-based treatment of Shakespeare’s plays but it is in the context of Shakespeare’s language that they are so severe as to seem almost insurmountable. The solutions developed for the Enhanced Shakespearean Corpus – often combining automated manipulation with manual interventions, and always principled – offer a way through.


Figure 1. Individual relative frequency values for "Affect" terms.
Figure 2. Individual relative frequency values for "Control" terms.
Figure 4. Individual relative frequency values for "Loudness" and "Strength" terms.
A linguistic approach to the psychosis continuum: (dis)similarities and (dis)continuities in how clinical and non-clinical voice-hearers talk about their voices

November 2020

·

90 Reads

·

16 Citations

Cognitive Neuropsychiatry

Introduction: “Continuum” approaches to psychosis have generated reports of similarities and differences in voice-hearing in clinical and non-clinical populations at the cohort level, but have not typically examined overlap or degrees of difference between groups. Methods: We used a computer-aided linguistic approach to explore reports of voice-hearing by a clinical group (Early Intervention in Psychosis service-users; N = 40) and a non-clinical group (spiritualists; N = 27). We identify semantic categories of terms statistically overused by one group compared with the other, and by each group compared to a control sample of non-voice-hearing interview data (log likelihood (LL) value 6.63+ = p < .01; effect size measure: log ratio 1.0+). We consider whether individual values support a continuum model. Results: Notwithstanding significant cohort-level differences, there was considerable continuity in language use. Reports of negative affect were prominent in both groups (p < .01, log ratio: 1.12+). Challenges of cognitive control were also evident in both cohorts, with references to “disengagement” accentuated in service-users (p < .01, log ratio: 1.14+). Conclusion: A corpus linguistic approach to voice-hearing provides new evidence of differences between clinical and non-clinical groups. Variability at the individual level provides substantial evidence of continuity with implications for cognitive mechanisms underlying voice-hearing.
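The two thresholds quoted in the Methods (LL of 6.63 or more, log ratio of 1.0 or more) can be made concrete with a small sketch. The abstract does not say which formulation or software was used, so the code below assumes the two-cell log-likelihood calculation and the binary-log ratio of relative frequencies that are commonly used for keyness in corpus linguistics. The function names, the zero-frequency adjustment of 0.5 and the example counts are all assumptions made for illustration, not the study's data.

```python
import math


def log_likelihood(freq1: int, freq2: int, size1: int, size2: int) -> float:
    """Two-cell log-likelihood for one item in two corpora (assumed formulation)."""
    total = freq1 + freq2
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    ll = 0.0
    if freq1 > 0:
        ll += freq1 * math.log(freq1 / expected1)
    if freq2 > 0:
        ll += freq2 * math.log(freq2 / expected2)
    return 2 * ll


def log_ratio(freq1: int, freq2: int, size1: int, size2: int,
              zero_adjust: float = 0.5) -> float:
    """Binary log of the ratio of relative frequencies.

    A zero frequency is replaced by zero_adjust (an assumed convention)
    so that the ratio stays defined.
    """
    f1 = freq1 if freq1 > 0 else zero_adjust
    f2 = freq2 if freq2 > 0 else zero_adjust
    return math.log2((f1 / size1) / (f2 / size2))


def is_overused(freq1: int, freq2: int, size1: int, size2: int,
                ll_min: float = 6.63, lr_min: float = 1.0) -> bool:
    """Apply both thresholds quoted in the abstract to one category."""
    return (log_likelihood(freq1, freq2, size1, size2) >= ll_min
            and log_ratio(freq1, freq2, size1, size2) >= lr_min)


# Hypothetical counts: a category occurring 200 times in one 100,000-token
# sample and 100 times in another sample of the same size.
ll = log_likelihood(200, 100, 100_000, 100_000)
lr = log_ratio(200, 100, 100_000, 100_000)
```

A log ratio of 1.0 corresponds to a category being twice as frequent, in relative terms, in the first sample as in the second, which is why it serves as an effect-size floor alongside the significance cut-off.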


A survey of grammatical variability in Early Modern English drama

August 2020

·

35 Reads

·

2 Citations

Language and Literature

Grammar is one of the levels within the language system at which authorial choices of one mode of expression over others must be examined to characterise in full the style of the author. Such choices must however be assessed in the context of an understanding of the extent of variability that exists generally in the language. This study investigates a set of grammatical features to understand their variability in Early Modern English drama, and the extent to which Shakespeare’s grammatical style is distinct from or similar to that of his contemporaries in so far as these features are concerned. A review of prior works on Shakespeare’s grammar establishes that the quantitatively informed corpus linguistic approach utilised in this study is innovative to this topic. Using two of the grammatically annotated corpora created by the Encyclopedia of Shakespeare’s Language project, one made up of Shakespeare’s plays, one of plays by other playwrights of the period, we present a method which steers a course between the narrow focus of close reading and the naïvely quantitative metrics of authorship analysis. For a set of 15 grammatical features of stylistic interest, we retrieve all instances of each feature in each play via complex corpus search patterns and calculate its relative frequency. These results are then considered, in aggregate and at the text level, to assess the differences across plays, across dramatic genre, and between Shakespeare and the other dramatists, via both statistical summary and visual representation of variability. We find that Shakespeare’s grammatical style tends (especially in comedies and tragedies) to disprefer informationally dense noun phrases relative to the other playwrights; and, moreover, to prefer tense, aspect and pronoun features which suggest a greater degree of narrative focus in his style. Furthermore, we find Shakespeare to be highly distinct in his preferences regarding verb complement subordinate clause types. 
These findings point the way both to a novel methodology and to further as yet unconsidered questions on the subject of Shakespeare’s grammatical style.
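The per-play relative frequency step described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' procedure: the normalisation base of 1,000 tokens, the function names and the play labels and counts are all assumed for the example, since the abstract states only that a relative frequency is calculated for each feature in each play.

```python
from typing import Dict


def relative_frequency(hits: int, tokens: int, per: int = 1000) -> float:
    """Frequency of a feature normalised to a fixed number of tokens."""
    if tokens <= 0:
        raise ValueError("token count must be positive")
    return hits / tokens * per


def by_play(hit_counts: Dict[str, int], token_counts: Dict[str, int],
            per: int = 1000) -> Dict[str, float]:
    """Relative frequency of one feature in every play.

    Plays with no recorded hits are treated as having zero instances.
    """
    return {play: relative_frequency(hit_counts.get(play, 0), tokens, per)
            for play, tokens in token_counts.items()}


# Hypothetical counts for one grammatical feature in three invented plays.
hits = {"Play A": 50, "Play B": 30}
sizes = {"Play A": 20_000, "Play B": 10_000, "Play C": 15_000}
rates = by_play(hits, sizes)
```

Normalising in this way lets plays of different lengths be compared directly, which is what makes aggregate and text-level comparison across authors and genres possible.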


Citations (55)


... A precise definition of a collocation has been elusive in the L2 literature (Boers & Webb, 2018 for a concise review on the phraseology vs. corpus-based approaches to defining collocations). Following Baker, Hardie, and McEnery's (2006) definition in corpus linguistics, this study defines collocation as "the phenomenon surrounding the fact that certain words are more likely to occur in combination with other words in certain contexts" (p. 36). ...

Reference:

Multi- or Single-Word Units? The Role of Collocation Use in Comprehensible and Contextually Appropriate Second Language Speech
Corpus-building for South Asian languages
  • Citing Chapter
  • August 2006

... If a very coarse analysis of semantic contexts is sufficient for a given research context, including such methods may possibly be worthwhile. Even then, it should be borne in mind that the success rate of human experts in identifying topics on the basis of statistically associated sets of words is very modest (Gillings and Hardie 2023; Gillings, Learmonth and Mautner 2024). As a comparable example from the natural sciences, consider a recent biology study in which 246 scientists were given the same dataset with the same research question (Gould, Fraser, Parker, Nakagawa, Griffith, Vesk, Fidler, Hamilton, Abbey-Lee, Abbott et al. 2023). ...

The interpretation of topic models for scholarly analysis: An evaluation and critique of current practice
  • Citing Article
  • December 2022

Digital Scholarship in the Humanities

... Another major revolution in linguistics is the emergence of corpus linguistics. McEnery & Hardie (2013), maintained that with the aid of AI, scholars possess the capacity to amass, retain, and scrutinize extensive collections of linguistic data, spanning a wide range of mediums such as written literature, oral discourse, and even communication that incorporates multiple modes of expression. These vast and comprehensive datasets offer linguists with invaluable resources to explore and analyze linguistic phenomena across a wide range of languages, dialects, and historical periods. ...

The History of Corpus Linguistics
  • Citing Article
  • March 2013

... Writing argumentative paragraphs used in problem-based learning involves a series of systematic steps to test and analyze the effectiveness of the PBL model in the context of learning to write argumentative paragraphs. Research Methodology: namely a qualitative method with a descriptive approach, a method for identifying, describing and explaining research subjects (Coleman, 2021; Collins, 2022). ...

Making use of transcription data from qualitative research within a corpus-linguistic paradigm: issues, experiences and recommendations
  • Citing Article
  • April 2022

Corpora

... Another CL issue specific to the language of a corpus is the difficulty of processing Arabic corpora (Abumalloh et al., 2016). This is due to the fact that the Arabic language is more complex for a software to process and code, due to its morphological and syntactic features (Abumalloh et al., 2016; McEnery et al., 2019). An additional problem is the lack of Arabic taggers for corpus linguistic studies. ...

1 Introducing Arabic Corpus Linguistics
  • Citing Chapter
  • January 2019

... al., 2017). For Vietnamese and Arabic, research of specific grammatical elements as well as textbook information was used substantially (Thompson, 1987; Ngo, 2006; Shaqra, 2007; Pham & Kohnert, 2009; Lyovin et al., 2017; Nguyen & Dutta 2017; McEnery et al., 2019). To introduce the basic linguistic features of Haitian Creole, a holistic approach comparing it with French was used (Valdman, 1988; Coffman Crocker, 2009). ...

Arabic Corpus Linguistics
  • Citing Book
  • January 2019

... If the corpus is created for ASR purposes, then ASR quality can also be used as a corpus quality indicator. Descriptions of corpus types can be found in various books or articles (see [59] for example); thus, they are not discussed here. Moreover, the decision to create a corpus is usually dictated by the need to obtain a specific type of corpus; therefore, the corpus type is characterized at the very beginning of the corpus creation process. ...

A Glossary of Corpus Linguistics
  • Citing Article
  • May 2006

... Arabic Discourse Studies is a (sub-)discipline of linguistics only now fully emerging. While multiple theoretically-grounded discourse-analytic frameworks have been established, and proven profoundly influential, since the 1980s, the application of these frameworks to texts in the Arabic language is a much more recent trend (to which the author of this paper has contributed: see , Hardie & Ibrahim 2021, Ibrahim 2014, 2019, 2021a, 2021b, Ibrahim & Hardie 2019, and Ibrahim, Abaalalaa & Hardie. 2022. ...

Exploring and categorising the Arabic copula and auxiliary kāna through enhanced part-of-speech tagging
  • Citing Article
  • November 2021

Corpora

... A corpus could be something as broad as the British National Corpus 2014 (BNC2014) (Love et al., 2017; Brezina, Hawtin and McEnery, 2021), which aims to represent British English as it stood throughout the early 2010s; but it could equally have more modest ambitions and aim to represent a smaller and more specialised variety – something like the works of Shakespeare (Culpeper et al., 2021) or the language of business meetings (Handford, 2010). These smaller corpora are incredibly useful to the discourse analyst, allowing 'a much closer link between the corpus and the contexts in which the texts in the corpus were produced' (Koester, 2022: 49). ...

Supporting the corpus-based study of Shakespeare’s language: Enhancing a corpus of the First Folio

ICAME Journal

... Academic definition occasionally distinguished between 'corpus-driven' and 'corpus-based' approaches (Tognini-Bonelli, 2002). Yet, as Hardie and Dorst (2020), and Collins et al. (2020) observed, these distinctions were often overstated, with people from both schools frequently collaborating across diverse academic disciplines. ...

A linguistic approach to the psychosis continuum: (dis)similarities and (dis)continuities in how clinical and non-clinical voice-hearers talk about their voices

Cognitive Neuropsychiatry