All content in this area was uploaded by Brian Clancy on May 11, 2021
Content may be subject to copyright.
6 Building a corpus to represent a variety of a language
Brian Clancy
1. What is a variety of a language?
In the literature, a variety of a language is, to say the least, broadly defined. Crystal (2001: 6-7)
maintains that in its most general sense, the notion of a variety includes ‘speech and writing, regional
and class dialects, occupational genres (such as legal and scientific language), creative linguistic
expression (as in literature), and a wide range of other styles of expression.’ Similarly, McEnery et al.
(2006: 90) suggest that varieties of a language are equally expansive, covering, for example, ‘the
standard language (standardised for the purposes of education and public performance), dialects
(geographically defined), sociolects (socially defined), idiolects (unique to individual speakers) and
jargons (particular to specific domains)’. These approaches to the definition of the term are in
themselves problematic, especially for the corpus builder(s). However, it does appear that a language
variety is, in principle, systematic, coherent and predictable. Therefore, it is possible to say, with some
degree of certainty in a given language: how people from a particular region will speak (for example,
‘British English’ or ‘Liverpool English’), how lawyers will write (‘legal French’), or how television
commentators will present a type of sport (‘sports commentary’).
Quirk (1995) refers to the confusion that this slew of language varieties can cause. He cites the example
of the word ‘English’ preceded by a specific adjective or noun to designate a specific variety. This list
of varieties of English includes, but is certainly not limited to, varieties such as ‘American English’,
‘legal English’, ‘BBC English’, ‘working-class English’ or ‘Chicano English’. Although each is
referred to as a variety, they are all formed on ‘desperately different taxonomic bases’ (ibid.: 22). For
example, academic discourse (spoken and/or written) is a variety that may be used equally by speakers
of both American English and British English, and speakers of other languages such as Spanish. This,
in turn, raises the question of whether or not there exists a variety of American Academic English as
opposed to one of British Academic English as opposed to Spanish Academic English. The answer to
this question can be found in the distinction between register and dialect. In corpus linguistics, registers
are delimited in non-linguistic terms, defined by Biber et al. (1999) as situational characteristics such
as mode, interactiveness, domain, communicative purpose or topic (cf. Halliday, 1978). This results in
varieties such as legal English and academic discourse, but also varieties such as political discourse,
workplace discourse, classroom discourse, crime fiction, etc. In other words, these varieties are
use-related. Dialect, on the other hand, has been defined as a variety associated with different groups of
individuals, identifiable through pronunciation, lexical and grammatical variation (Biber et al. 1999).
This variation is often attributed to social characteristics such as gender, age, social class, ethnicity or
religion and, therefore, accounts not only for national varieties but also for those such as BBC English, working-class
English or Chicano English. These varieties are user-related.
Accordingly, the starting point for the building of a corpus for a variety of a language could usefully be
based on a fundamental decision: is the proposed corpus being built to represent a Variety of a language,
such as American English or British English, or is it representing a variety of a language such as legal
English or academic English? A Variety is defined geographically and is user-related, where an
individual is in a sense ‘tied’ to and identified by it (Quirk, 1995). Therefore, Irish people speak Irish
English, and this includes its corresponding dialects. On the other hand, a variety is defined situationally
and is use-related; therefore, it involves the discourse activity the individual is involved in or the purpose
for which they are using language. For example, a conversation between two academics could feature
two language Varieties, say American English and British English, but one language variety, academic
English. Indeed, many corpora constructed to represent a Variety of a language are built using a range
of varieties of that language (see, for example, the Cambridge and Nottingham Corpus of Discourse in
English (CANCODE) or the first British National Corpus (BNC1994); see also Chapters 3 and 4, this
volume). The decision made to choose between Variety and variety will be largely based on the research
questions the corpus is expected to answer. This fundamental choice also has defining repercussions in
relation to issues of corpus design such as the construction of the corpus sampling frame, which in turn
has implications for size, diversity of texts, and representativeness and balance (see Chapters 2, 3 and 4,
this volume).
2. Issues of corpus design for a variety of a language
Building a corpus for a variety of a language involves building something that is representative of a
whole; therefore, the design of the corpus is of particular importance. Many of the decisions made by
the corpus builder(s) in the design stage are based on the proposed uses of the corpus and on the research
questions that these entail (see Chapters 2, 3 and 4, this volume). However, as McEnery et al. (2006:
73) caution, ‘corpus building is of necessity a marriage of perfection and pragmatism’. Although corpus
builder(s) should always strive to build the perfectly representative corpus, issues such as corpus size,
text diversity and number and length of texts, as outlined in this section, may result in decisions based
on factors outside of the control of corpus designers.
Issue 1: Address corpus size
In general, the primary issue connected to corpus size is that of resources, and it is here that the corpus
builder(s) may have to ‘cut their coat according to their cloth’. In the design stage, a number of
fundamental decisions must be made. A key decision is whether the corpus will consist of written texts,
spoken texts, computer mediated texts or a combination of some or all of these. The timeframe for
collection, computerisation, annotation and, if required, tagging and parsing, has to be considered.
Spoken corpora are generally smaller in size than written corpora primarily due to the transcription
process. Chafe et al. (1991) observe that it takes six person-hours to transcribe one minute of speech for
the Santa Barbara Corpus. McCarthy (1998) notes that it takes, on average, twenty hours to transcribe
one hour of recorded spoken data. Estimates for the American component of the International Corpus
of English (ICE-USA) range from ten hours to transcribe a 2,000-word carefully prepared monologue
to twenty hours for a dialogue containing numerous speaker overlaps (Meyer 2002). For this reason,
spoken corpora such as CANCODE, the spoken component of BNC1994 and the Spoken British
National Corpus 2014 (Spoken BNC2014) have required considerable funding both from universities
and major publishing houses. Written texts can also prove problematic when building a corpus,
especially when issues of copyright are considered (see Atkins et al. 1992: 4; McEnery et al. 2006: 77-
9; see also Chapter 4, this volume).
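The transcription rates quoted above translate directly into a resource estimate at the design stage. The following is a minimal sketch, not a planning tool: the function name and the 50-hour recording target are hypothetical, and only the two rates are taken from the text.

```python
# Transcription rates quoted in the text, expressed as person-hours of work
# per hour of recorded speech.
RATES_PER_RECORDED_HOUR = {
    "Chafe et al. (1991)": 6 * 60,  # six person-hours per minute of speech
    "McCarthy (1998)": 20,          # twenty hours per recorded hour
}


def transcription_hours(recorded_hours: float, rate: float) -> float:
    """Estimate the person-hours needed to transcribe the recordings."""
    if recorded_hours < 0:
        raise ValueError("recorded_hours must be non-negative")
    return recorded_hours * rate


# Hypothetical target: 50 hours of recorded speech.
PLANNED_RECORDING_HOURS = 50

for source, rate in RATES_PER_RECORDED_HOUR.items():
    hours = transcription_hours(PLANNED_RECORDING_HOURS, rate)
    print(f"{source}: {hours:,.0f} person-hours")
```

The gap between the two estimates shows how strongly the chosen level of transcription detail affects the size of spoken corpus that a given budget can support.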
The corpora for the study of geographical, user-related language varieties differ in size, largely
dependent upon the nature of the texts which they contain. These include, but are not limited to, the
following corpora:
• The Global Web-based English Corpus (GloWbE), a 1.9-billion-word corpus of web-based
materials (blogs, newspapers, company websites, etc.) from twenty countries which allows for
an investigation of a wide range of phenomena in different varieties of English;
• The Corpus of Contemporary American English (COCA), a corpus of 560 million words (to
date), evenly divided between the years 1990-2017 (20 million words per year);
• The BNC1994, a 100-million-word corpus designed to be representative of both written and
spoken British English, and the companion Spoken BNC2014, an 11.5-million-word spoken
corpus;
• The International Corpus of English (ICE) project, fourteen one-million-word corpora, each
representing a different variety of English;
• CANCODE, a five-million-word corpus of spoken British (and some Irish) English;
• The Limerick Corpus of Irish English (LCIE), a one-million-word corpus designed to be
representative of spoken Irish English.
It should be pointed out that there is also a range of non-English language corpora designed to
represent a number of different languages. For example, a number of languages, amongst them
Czech, Hungarian, German, Polish, Russian and Welsh, have embraced the national corpus concept. In
addition, the TenTen Corpus Family, available through Sketch Engine, comprises comparable
10-billion-word, web-based corpora. These corpora are currently available in 30+ different languages.
There are also a number of corpora that have been designed to facilitate the examination of situational,
use-related varieties. For example, COCA is divided into a number of component parts – spoken
language, fiction, magazines, newspapers and academic texts. Similarly, CANCODE is subdivided into
spoken texts from intimate, pedagogical, professional, socialising and transactional settings (McCarthy,
1998). There are also a number of corpora specifically designed to represent use-related varieties.
Examples of these include written academic corpora such as the British Academic Written English
Corpus (BAWE), a 6.5-million-word corpus of undergraduate and postgraduate written assessments;
spoken academic corpora such as the 1.8-million-word Michigan Corpus of Spoken Academic English
(MICASE); computer mediated communication corpora such as the Birmingham Blog Corpus, a
628.5-million-word corpus of both formal and informal blog posts; or workplace corpora such as the
one-million-word Business Letter Corpus. These corpora, amongst others, are available online through a
range of search interfaces.
When considering corpus size, the purpose of the corpus should also be addressed. A comprehensive
examination of the variation between lexical features or less frequent grammatical features generally
requires a larger corpus, for example, GloWbE or the BNC. In contrast, Vaughan and Clancy (2013)
point out that use-related corpora in the 20,000-200,000-word range are more than adequate for the
investigation of high frequency pragmatic variation in spoken language. However, ultimately, the size
of a corpus might be better determined ‘not by focusing too intently on the overall length of the corpus
but by focusing more on the internal structure of the corpus’ (Meyer 2002: 34). The internal structure
of a corpus refers to matters such as diversity of texts and length and number of texts to include, and it
is to these matters that attention now turns.
Issue 2: Consider the diversity of texts to include
Many corpora designed to be representative of a language variety are, in essence, multipurpose. They
can be used, for example, to describe lexico-grammatical features, to study differences between
different national varieties or to study variation within the different registers that comprise the corpus.
Therefore, a corpus of this type necessarily requires a wide range of texts which can be selected
according to a range of different criteria. The Brown Corpus used a sampling frame derived from the
collection of books and periodicals in the Brown University Library and Providence Athenaeum in
1961. The Lancaster-Oslo-Bergen (LOB) corpus used two sampling frames: for books, the
publications listed for 1961 in The British National Bibliographic Cumulated Subject Index, 1960–
1964, and for periodicals and newspapers, those listed in Willing’s Press Guide (1961) (see Johansson
et al. 1978). The BNC1994 collected written texts under three criteria: domain, time and medium.
Domain refers to the context-type of the text (nine different context-types, for example, leisure, applied
science, world affairs, were identified), time refers to when the texts were produced (texts were sampled
in the period 1960-1993) and medium refers to the type of text publication (book, journal, newspaper,
etc.). In relation to the spoken component of the BNC1994, 4.2 million words of the corpus were
collected by a process of demographic sampling. Recordings were collected from individuals and
demographic information such as name, age, occupation, gender and social class was noted (Crowdy
1993). This was further subdivided into region and interaction type (monologue or dialogue). The
demographically sampled spoken corpus was complemented by texts collected on context-governed
criteria. These texts related to more formal speech contexts such as those encountered in educational or
business settings (see Aston and Burnard 1998 for a full description of the design).
In their initial corpus design phase the CANCODE team developed a set of spoken text-types to
correspond to existing text typologies for the written language. McCarthy (1998) describes the corpus
design as a ‘genre-based’ approach where not only is a population of speakers targeted, but the context
and environment in which the speech is produced is also taken into consideration. The framework used
for CANCODE sought to combine the nature of speaker relationship with goal-types prevalent in
everyday, spoken interaction. The nature of the speaker relationship was divided into five broad
contexts: transactional, professional, pedagogical, socialising and intimate. For each of these contexts,
three goal-types were identified: information provision, collaborative task and collaborative idea (see
McCarthy 1998: 9–10 for a definition of the terms). Therefore, for example, a university lecture is
defined as a pedagogical context with an information provision goal-type, whereas a family cooking
together would be an example of an intimate collaborative task. This, according to McCarthy (ibid.: 9)
‘offers the possibility of linking their [the data] contextual and social features directly with the lexico-
grammatical “nuts and bolts” of their step-by-step creation’. This design matrix was, in turn, adopted
by LCIE which allows for a direct comparison between these two corpora (Clancy 2016).
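The CANCODE design matrix described above can be sketched as the cross-product of the five context types and three goal-types. This is an illustrative sketch only; the `classify` helper is a hypothetical name, and the two example classifications are those given in the text.

```python
from itertools import product

# The five speaker-relationship contexts and three goal-types named in the text.
CONTEXTS = ["transactional", "professional", "pedagogical", "socialising", "intimate"]
GOAL_TYPES = ["information provision", "collaborative task", "collaborative idea"]

# Every pairing of a context with a goal-type is one cell of the matrix.
CELLS = list(product(CONTEXTS, GOAL_TYPES))


def classify(context: str, goal_type: str) -> tuple[str, str]:
    """Return the matrix cell for a recording, rejecting unknown labels."""
    cell = (context, goal_type)
    if cell not in CELLS:
        raise ValueError(f"not a cell of the design matrix: {cell}")
    return cell


# The two examples given in the text.
lecture = classify("pedagogical", "information provision")
family_cooking = classify("intimate", "collaborative task")

print(len(CELLS), "cells")
print(lecture, family_cooking)
```

Because LCIE adopted the same matrix, recordings from both corpora can be assigned to the same set of cells, which is what makes cell-by-cell comparison possible.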
The ICE suite of corpora also contains a range of spoken and written texts collected according to
specifically delineated criteria (see Greenbaum 1991; Nelson 1996; Meyer 2002). The written
component contains both printed and nonprinted (for example, student essays, social letters) material,
although the printed material accounts for 75 per cent of the written corpus. From a spoken viewpoint,
ICE contains 60 per cent dialogic material and 40 per cent monologic; again, the criteria for the
inclusion of this material are thoroughly specified, with dialogues divided into public and private and
monologues into scripted and unscripted. In the ICE corpora, the speakers chosen were adults of
eighteen years of age or older who had received a formal education through the medium of English to
at least secondary school level (however, this design proved to be flexible in the case of well-known,
established political leaders and radio or television broadcasters whose public status made their
inclusion appropriate). Information was also recorded about gender, ethnic group, region, occupation
and status in occupation and role in relation to other participants. MICASE also employed context-
governed criteria in collecting the data. The corpus contains speech events across the major academic
disciplines in a university, for example biological and health sciences, physical sciences and
engineering, and humanities and the arts. However, the professional disciplines of law, medicine and
dentistry were excluded. Demographic information such as age, gender, academic role and first
language were recorded.
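The ICE proportions given above can be turned into approximate word targets for a single one-million-word corpus. The sketch below assumes that each percentage applies to word counts within its component, and it takes the 60/40 spoken-to-written split and the one-million-word total from figures reported elsewhere in this chapter; the `share` helper is a hypothetical name.

```python
TOTAL_WORDS = 1_000_000  # size of each ICE corpus


def share(words: int, percent: int) -> int:
    """Return the given whole-number percentage of a word count."""
    return words * percent // 100


# Assumption: percentages are applied to word counts, not to numbers of texts.
spoken = share(TOTAL_WORDS, 60)
written = TOTAL_WORDS - spoken

breakdown = {
    "spoken dialogue": share(spoken, 60),
    "spoken monologue": share(spoken, 40),
    "written printed": share(written, 75),
    "written non-printed": share(written, 25),
}

for component, words in breakdown.items():
    print(f"{component}: {words:,} words")
```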
Issue 3: Address text length and number
In determining how ‘long’ a text should be in order to warrant inclusion in the corpus, the issue of
corpus size must be returned to. Both spoken and written texts range dramatically in size from a few
words (for example, a quick note to a friend) to millions of words (for example, a long novel) (see
Chapters 2, 3 and 4, this volume). It is important to remember that the results generated from a small
corpus of texts can be skewed by one relatively long text. Traditionally, the best solution to this dilemma
is to build a corpus large enough to dilute even the lengthiest text. In other words, the bigger the corpus,
the better. However, Biber (1990) maintains that the underlying parameters of linguistic variation can
be replicated in a relatively small corpus, if that corpus represents the full range of variation. In contrast,
larger corpora are not adequate for overall analyses of textual variation if they fail to represent the range
of variation. Biber (1993) examines statistical formulae for determining sample size based on a normal
distribution of grammatical features. He found that, for nouns, a sample of 59.8 × 2,000-word texts
(approximately 120,000 words) is required for representativeness; for less common grammatical
features such as conditional clauses, a sample of 1,190 × 2,000-word texts (approximately 2.4 million
words) is required. In the ICE suite of corpora, each text contains approximately 2,000 words with the
ending occurring at a suitable discourse break (Greenbaum 1991). These 2,000-word samples are
problematic, however: if the text is longer than 2,000 words, the question arises as to what part
of the text should be omitted. In selecting samples to be included in a corpus, attention must also be
paid to ensure that text initial, middle and end samples are balanced (McEnery et al. 2006).
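The word totals quoted above for Biber (1993) follow from multiplying the number of text samples by the 2,000-word sample length. The sketch below reproduces that arithmetic and, separately, shows a standard textbook sample-size formula for estimating a mean under a normal approximation. Whether that formula matches Biber's exact procedure is not confirmed here, and the standard deviation and tolerable error passed to it are hypothetical values, not figures from his study.

```python
import math

TEXT_LENGTH = 2_000  # words per text sample, as in the figures quoted above


def words_required(n_texts: float, text_length: int = TEXT_LENGTH) -> int:
    """Convert a required number of text samples into a word total."""
    return round(n_texts * text_length)


def required_texts(std_dev: float, tolerable_error: float, z: float = 1.96) -> int:
    """Textbook sample size for estimating a mean: n = (z * s / E) ** 2.

    Assumption: a generic normal-approximation formula, shown for
    illustration only.
    """
    if tolerable_error <= 0:
        raise ValueError("tolerable_error must be positive")
    return math.ceil((z * std_dev / tolerable_error) ** 2)


# Figures quoted in the text.
print(words_required(59.8))   # nouns
print(words_required(1190))   # conditional clauses

# Hypothetical feature: standard deviation of 30 per text, tolerable error of 5.
print(required_texts(std_dev=30, tolerable_error=5))
```

The formula makes the underlying logic visible: the more a feature varies from text to text relative to the error one is prepared to tolerate, the more samples are needed.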
In terms of the number of texts to be included, corpora differ greatly in their approach. The ICE
compilers decided that each regional corpus would be one million words. They also decided on ten texts
(20,000 words) as the minimum for each text category. Texts in the Brown and LOB corpora are also
2,000 words long, and therefore each corpus contains 500 texts. The demographically sampled
component of the Spoken BNC1994 consists of 153 texts and approximately 4.2 million words, giving
an average text length of approximately 27,500 words. In contrast, the 11.5-million-word Spoken
BNC2014 contains 1,251 texts, resulting in an average text length of over 9,000 words. The CANCODE
matrix of speech-genres yields fifteen cells and the initial target was to gather approximately 65,000
words per cell. MICASE contains a total of 152 speech events ranging in type from lectures to meetings
to dissertation defences to service encounters and, therefore, seeks to cover all speech which occurs in
an academic setting. These speech events range in length from 19 to 178 minutes and in word count
from 2,805 to 30,328 words (see Simpson-Vlach and Leicher 2006). The 2.6-million-word Michigan
Corpus of Upper-Level Student Papers (MICUSP) contains 829 undergraduate and
postgraduate written assignments of different types (research report, research proposal, argumentative essay, etc.) across
16 academic disciplines. What is obvious here, and is indeed a theme running throughout the chapter,
is that there is no ‘one size fits all’ approach to corpus design and, therefore, it is important that corpora
are transparent in their design decisions in order that any results generated can be critically assessed.
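The figures in this paragraph can be checked with simple arithmetic. The following sketch uses only the word and text counts quoted above; the variable and function names are illustrative.

```python
def average_text_length(total_words: int, n_texts: int) -> float:
    """Return the mean number of words per text."""
    if n_texts <= 0:
        raise ValueError("n_texts must be positive")
    return total_words / n_texts


# (total words, number of texts) as quoted in the paragraph above.
spoken_corpora = {
    "Spoken BNC1994 (demographic)": (4_200_000, 153),
    "Spoken BNC2014": (11_500_000, 1_251),
}

for name, (words, texts) in spoken_corpora.items():
    print(f"{name}: {average_text_length(words, texts):,.0f} words per text")

# Brown and LOB: 500 texts of 2,000 words each.
brown_total = 500 * 2_000

# CANCODE: fifteen cells with an initial target of 65,000 words per cell.
cancode_initial_target = 15 * 65_000

print(f"Brown/LOB total: {brown_total:,} words")
print(f"CANCODE initial target: {cancode_initial_target:,} words")
```

The CANCODE figure of just under one million words is consistent with the initial one-million-word stage of that corpus.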
3. Assessing the representativeness and balance of a corpus
Although a ‘Hail Mary’ might be a more apt metaphor in this instance, representativeness is referred to
as a ‘holy grail’ of the corpus linguistic enterprise (see Chapters 2, 3 and 4, this volume). For specific
purpose corpora, it can be a relatively straightforward matter – a corpus of the writing of Jane Austen
is representative of her work if it contains all of her published novels, for example. However, for corpora
for general use, this goal is obviously more difficult to achieve given the range of texts that might be
required in order to represent said variety. According to Leech (1991: 27), a corpus is representative if
‘findings based on its contents can be generalised to a larger hypothetical corpus’, which, although
helpful from a theoretical viewpoint, is rather vague when it comes to the practicalities of corpus design
and compilation. Sinclair (2005) outlines six practical steps towards achieving as representative a
corpus as possible (see Chapter 3, this volume). Biber’s (1993) seminal article outlines in detail how to
construct a statistically representative corpus; however, it appears that for designers of general-purpose
corpora, in most cases a compromise is reached regarding representativeness. Recent corpora designed
for the study of a language variety such as the Spoken BNC2014 have used an opportunistic approach
to data collection (see Love et al. 2017), where participants are invited to contribute to the corpus using
their own digital audio recorders and are responsible for obtaining consent and providing the required
metadata. This approach facilitates the collection of large amounts of data in relatively short periods of
time for a relatively low cost. The corpus designers, freed from the onerous responsibilities of, for
example, actively seeking out participants, providing audio equipment, furnishing and completing
consent forms and recording metadata, are instead able to concentrate on issues of representativeness
and balance.
A balanced corpus relies heavily on intuition and best estimates (Atkins et al. 1992; Sinclair 2005;
McEnery et al. 2006). Therefore, balance, in common with representativeness, is also a rather vague,
but important, notion. In relation to corpora built to represent a language variety, when assessing the
balance of a corpus it is useful to examine other corpora, and it is becoming increasingly popular, ‘for
good or ill’ (McEnery et al. 2006: 17), to adopt an existing corpus model and, in doing so, to assume
that issues of balance have been addressed. Written corpora like the Brown Corpus and the LOB are
generally accepted as balanced corpora. The spoken component of the BNC1994 sought balance
through a design that combined both demographic and context-governed data. The Spoken BNC2014
has a ‘core’ of data with evenly balanced demographic categories (gender, age, accent/dialect and socio-
economic status) surrounded by a large ‘mantle’ of additional data (see Love et al. 2017). When
building a corpus that represents both spoken and written language, it is important to consider the
balance between the two. The BNC1994 has a 90%-10% split between written and spoken texts
respectively. In contrast, the ICE suite of corpora opted for a 60%-40% split in favour of spoken
language. Neither of these is likely to sufficiently capture the everyday linguistic experience of most
people, who experience more speech than writing in their day-to-day lives. The Longman Spoken and
Written English (LSWE) corpus contains four core registers (or varieties): conversation, newspaper
language, fiction and academic prose. According to Biber et al. (1999: 25), these four were selected on
the basis of balance in that they cover ‘much of the range of variation in English’. For example,
conversation is the register most commonly encountered by native speakers, whereas academic prose
is a highly specialised register that native speakers encounter infrequently. Between these two extremes
are the popular registers of newspapers and fiction. The corpus was designed to contain 5,000,000 words
per register. In all of these corpora, balance has been associated with proportionality; however, an effort
should be made to ensure that this proportionality represents the variety’s ‘textual universe as a whole’
(Leech, 2007: 136).
CANCODE, whose genre-based design was successfully adapted in the creation of LCIE, is also
considered a balanced corpus; however, this notion of balance was arrived at in a slightly different way
from corpora such as the BNC. The initial target for the CANCODE team was a figure of 65,000 words
per cell. It was found that certain data, for example intimate conversation and business meetings, were
more difficult to collect than other types because of their sensitive nature. At an early stage in the
construction of the corpus, some cells were found to contain more data than others. The progress from
the initial one million words to the final target of five million addressed these imbalances and attempted,
where possible, to equally cover all the context types in the corpus. McCarthy (1998) maintains that a
fluid corpus design is more important than a rigid adherence to numerical targets within a fixed sample
frame. This notion of corpus design as fluid, organic and cyclical in order to maintain balance is referred
to by Biber (1993: 255) as the ‘bottom-line’ in corpus design. Returning to an earlier point, Hunston
(2002) stresses that it is key to consider balance when interpreting data from a corpus. Many corpora
built to represent a variety of a language are, by nature, multi-purpose; therefore, the builder(s) cannot
predict all the queries that may be made of it. Thus, it is necessary to document all decisions made in
regard to the design criteria. Researchers who access the corpus can then check this documentation to
ensure that it is suitable for the proposed purpose. In this way, the responsibility for the integrity of
corpus literacy is shared, and should be guided by Hunston’s (ibid.: 23) principle that ‘a statement about
evidence in a corpus is a statement about that corpus, not about the language or register of which the
corpus is a sample’.
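The kind of cell-by-cell monitoring described above for CANCODE, in which some cells fill faster than others against a 65,000-word target, can be sketched as follows. The word counts are hypothetical and are not CANCODE figures; only the target comes from the text.

```python
TARGET_WORDS_PER_CELL = 65_000  # initial CANCODE target quoted in the text


def shortfalls(cell_counts: dict, target: int = TARGET_WORDS_PER_CELL) -> dict:
    """Return the words still needed for every cell below the target."""
    return {
        cell: target - words
        for cell, words in cell_counts.items()
        if words < target
    }


# Hypothetical word counts at an early stage of collection.
collected = {
    ("intimate", "collaborative task"): 20_000,
    ("socialising", "collaborative idea"): 90_000,
    ("professional", "information provision"): 41_500,
}

for cell, missing in shortfalls(collected).items():
    print(f"{cell}: {missing:,} words short of target")
```

Recording such counts alongside the design criteria is one way of documenting balance decisions so that later users of the corpus can assess them.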
4. What can a corpus tell us about a language Variety? Irish English
Although long recognised as a variety of English in its own right, as illustrated by studies such as Joyce
(1910) and Bliss (1979), the description of Irish English (IrE) has benefitted considerably from the
study of a number of language corpora. There is a range of corpora of both written and spoken IrE
available to researchers interested in studying the variety (for more on these corpora see Vaughan and
Clancy, 2016). Two of the more commonly consulted corpora are LCIE and the Ireland component of
ICE, ICE-Ireland. LCIE is a one-million-word corpus of naturally occurring spoken IrE, whereas ICE-
Ireland, in keeping with the ICE project, represents both spoken (600,000 words) and written (400,000
words) IrE (Clancy 2016). The corpora allow the description of IrE as a variety in itself rather than in terms of how
it is similar to or different from other varieties of English such as British English (BrE). Both contain
detailed demographic information such as age and gender and also information about where the
speakers were born and where they lived at the time of recording (geographical information) and level
of education. ICE-Ireland also details the religious background of the participants in the corpus which
is relevant as the corpus contains speech from both the Republic of Ireland and Northern Ireland. LCIE,
on the other hand, was collected exclusively in the Republic. The spoken component of ICE-Ireland has
also been both prosodically and pragmatically tagged and made available as SPICE-Ireland (see Kirk,
2016).
The size of both LCIE and ICE-Ireland, in addition to the emergence of SPICE-Ireland, has resulted in
a primary focus on the realm of linguistic pragmatics, given that it is possible to examine patterns of
linguistic form and function in light of their interactional and situational context of occurrence. This
has allowed researchers to provide some insightful findings in relation to linguistic representations of
socio-cultural norms in Irish society. Some of these studies are presented here for illustrative purposes.
In common with the growth of corpus pragmatic studies more generally, many of these studies have
focussed on the use of pragmatic markers (PMs). Although ‘like’ is by no means unique to IrE (see, for
example, Andersen 2001; Tagliamonte 2005), it can be considered a relatively new development in
IrE (Amador-Moreno 2010). The marker has emerged from corpus studies as a prominent item in the
socio-pragmatic system of Irish English (see O’Keeffe et al. 2020). Kallen (2013) and Schweinberger
(2015) demonstrate how clause- or sentence-final ‘like’ is more frequent in ICE-Ireland than in ICE-Great
Britain. The marker ‘now’ has also been identified as a key item in IrE. Clancy and Vaughan (2012) have
shown that ‘now’ is more frequent in the spoken Irish English represented in LCIE than in other corpora
such as the spoken component of the BNC1994 or the Scottish Corpus of Texts and Speech (SCOTS).
In IrE, ‘now’ has emerged as highly polysemous, functioning as a temporal adverb, a discourse marker,
a pragmatic marker and a deictic presentative. The socio-pragmatic function of both ‘like’ and ‘now’ is
pivotal in understanding the behaviour of these PMs. In IrE, ‘like’ and ‘now’ frequently occupy final
position in an utterance to soften or mitigate face threatening behaviour such as disagreement, challenge
or evaluation, a function that is less frequent in corpora of British English (see also Vaughan et al.
2017).
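Comparisons of this kind, across corpora of very different sizes, depend on normalised rather than raw frequencies. The sketch below shows the usual per-million-words normalisation; the raw counts and corpus sizes are hypothetical placeholders and are not results from the studies cited.

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalise a raw frequency to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


# Hypothetical counts of one marker in two corpora of different sizes.
corpus_a = {"raw_count": 5_000, "size": 1_000_000}
corpus_b = {"raw_count": 30_000, "size": 10_000_000}

freq_a = per_million(corpus_a["raw_count"], corpus_a["size"])
freq_b = per_million(corpus_b["raw_count"], corpus_b["size"])

print(f"Corpus A: {freq_a:,.0f} per million words")
print(f"Corpus B: {freq_b:,.0f} per million words")
```

In this invented case the smaller corpus has the lower raw count but the higher normalised frequency, which is why raw counts alone cannot support a claim that an item is more frequent in one variety than in another.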
Corpus studies in Irish English have also strived for a better understanding of pragmatic phenomena
beyond the sphere of pragmatic markers. One of these areas is the link between grammar and
pragmatics. O’Keeffe and Amador-Moreno (2009) and Filppula (2012) have examined the frequency
of occurrence of aspectual forms in IrE through a focus on the structures be after + V-ing and MODAL
+ be V-ing respectively. Both of these forms are more frequent in IrE than in British English (see also
Kirk, 2017). The be after + V-ing structure was shown to have important politeness functions in IrE,
whereas more frequent use of MODAL + be V-ing has been tentatively associated with a predisposition
toward high levels of personal commitment in IrE. Thus, these pragmatic functions of the progressive
are indicative of membership of the IrE speaking community. Another area that has received some
attention in IrE corpus research is the turn taking system. O’Keeffe and Adolphs (2008) and McCarthy
(2015) focus primarily on non-minimal lexical items at turn-initial position and their distribution in
IrE and BrE. While there is an overlap between the form and function of these items across the two
varieties, each has its own distinct core of adjectives, adverbs and intensifiers at turn opening that are
used with greater frequency and convey different shades of meaning. IrE shows a greater preference for
items that convey intensification and informality in turn initial position which could help define the
variety.
These corpus studies of IrE are essential to our understanding of socio-pragmatic competence in the
variety. They are illustrative of a wider corpus linguistic bent to the study of IrE which is fast amounting
to a considerable body of research. In addition to corpora such as LCIE and ICE-Ireland, emerging
corpora of IrE such as CORIECOR (see Amador-Moreno and McCafferty, 2012), Millar’s (2015) blog
corpus and Vaughan and Moriarty’s array of corpora of performed discourses (see, for example,
Vaughan and Moriarty, 2018) are making important contributions to the description of the variety.
5. What can a corpus tell us about a language variety? The case of Irish intimate discourse in
LCIE
The Limerick Corpus of Irish English (LCIE) was designed to parallel the CANCODE corpus and,
therefore, adopted a framework that, in terms of data collection, features a number of cells, for example,
intimate collaborative idea or pedagogical information provision, that serve as a guide for achieving
representativeness and balance. In common with many spoken corpora, LCIE also adopted an
opportunistic approach to data collection which, given that the corpus design team were housed in a
university setting, resulted in much of the data collection, in the more informal context-types such as
intimate and socialising at least, being undertaken by university students. In terms of the distribution of
data in LCIE, 56% of the data collected for the corpus is from an intimate context (see Clancy, 2016).
In this sense, LCIE cannot be considered a balanced corpus; however, given that balance in a spoken
corpus designed to represent a variety of a language can be difficult to achieve, this is not necessarily
problematic. In fact, the data collected in the intimate context type might be considered representative
of a particular demographic – university students in their early to mid-twenties and their families and
close friends (also predominantly university students). This context-type, in essence, forms the core of
the LCIE corpus.
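The share of a corpus taken up by each context-type, such as the 56% figure cited above, can be checked with a simple token count. The following Python sketch is illustrative only: the transcripts are invented, whitespace tokenisation is a simplifying assumption, and context_shares is a hypothetical helper rather than part of any LCIE tooling.

```python
from collections import Counter


def context_shares(texts):
    """Return each context-type's share of all word tokens, as a percentage.

    `texts` is an iterable of (context_type, transcript) pairs.
    Tokens are approximated by splitting on whitespace (an assumption).
    """
    totals = Counter()
    for context_type, transcript in texts:
        totals[context_type] += len(transcript.split())
    grand_total = sum(totals.values())
    return {ctx: 100 * n / grand_total for ctx, n in totals.items()}


# Invented toy transcripts, labelled with two context-types named in the text.
toy = [
    ("intimate", "are you coming home for dinner"),
    ("intimate", "yeah I am"),
    ("socialising", "mine's a tea"),
]

shares = context_shares(toy)
for ctx, pct in shares.items():
    print(f"{ctx}: {pct:.1f}% of tokens")
```

A table of this kind, produced during data collection, makes any imbalance across cells visible early, so that it can be either corrected or, as with LCIE, explicitly acknowledged.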
In keeping with studies that have used LCIE more generally, research on the intimate sub-corpus of
LCIE has also seen a focus on the understanding of pragmatic phenomena such as reference, turn taking,
(im)politeness and pragmatic markers. For example, the PMs kind/sort of and I think have been shown
to occur with less frequency in intimate discourse than in other varieties such as radio phone-in or
spoken academic discourse (Clancy, 2005, 2016). The function of kind/sort of has been linked to the
reduction of social distance between speakers and the creation of a relaxed relationship between
conversational participants (Holmes, 1993), functions that are arguably unnecessary in the intimate
sphere. Other items that are associated with the creation of group solidarity, such as the pronoun we,
have also been shown to occur with a lower frequency in intimate discourse than in other spoken
contexts such as the workplace or media discourse (Clancy, 2016). Intimates use we less frequently due
to a pre-existing, well-established group identity that may need less maintenance than in other groups
(see also Clancy, 2020). Similarly, I think, amongst other functions, has been shown to preface, and
thereby mitigate, disagreement sequences (see, for example, Muntigl and Turnbull, 1998). However, a
feature of IrE as a variety seems to be that mitigation occurs towards the closing of a speaker turn,
and this has also been shown to be true of intimate discourse, specifically family discourse.
Disagreement sequences in IrE family discourse are characterised by low levels of mitigators such as I
think in preface position, as family members can become quickly involved in disagreements, frequently
using a yeah but turn preface, safe in the knowledge that mitigation can be performed later in the speaker
turn by PMs such as like or you know (Clancy, 2018). Similarly, Vaughan et al. (2017) found that Irish
intimates, in comparison to their British counterparts, perform mitigation in the form of indirectness
and imprecision in different ways, frequently through a clustering of vague category markers such as
or something like that with traditional PMs such as you know.
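Comparisons of this kind rest on normalised rather than raw frequencies, since the sub-corpora being compared differ in size. A minimal Python sketch of the calculation is given below. The two transcripts are invented, the function names are hypothetical, and the simple tokeniser is an assumption. String matching alone cannot separate pragmatic-marker uses of items such as like or I think from their other uses, so in practice the hits would need to be checked manually in concordance lines.

```python
import re


def tokenize(text):
    """Lowercase the text and return word tokens, keeping internal apostrophes."""
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def count_phrase(tokens, phrase):
    """Count occurrences of a (possibly multi-word) phrase in a token list."""
    target = tokenize(phrase)
    n = len(target)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target)


def per_million(tokens, phrase):
    """Normalise the raw count of a phrase to occurrences per million words."""
    return count_phrase(tokens, phrase) * 1_000_000 / len(tokens)


# Invented toy data standing in for two sub-corpora of different sizes.
intimate_tokens = tokenize("yeah but you never rang me you know")
other_tokens = tokenize("I think we should sort of wait I think")

for marker in ("I think", "sort of", "you know"):
    print(marker,
          round(per_million(intimate_tokens, marker)),
          round(per_million(other_tokens, marker)))
```

With real sub-corpora, the two normalised figures for each marker can be compared directly, and a significance test such as log-likelihood is typically applied before any claim about difference is made.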
This is not to suggest that the pragmatic features of Irish intimate discourse are fundamentally distinct from
those of families and friendship groups in other cultures. Clancy and McCarthy (2015) have explored the
phenomenon of turn co-construction in the intimate and socialising context-types in IrE and BrE and
found that 10% of when-clauses at turn-initial position were second speaker completions in both corpora
and, in the case of turn-initial if-clauses, the figure for co-construction rose to 19% in LCIE and 22%
in CANCODE. Co-construction has a pragmatic function in that speakers, in addition to viewing syntax
as a shared interactional resource (see Rühlemann, 2007), co-construct utterances through a process of
comment, qualification or evaluation. This process demonstrates that, for intimates in both IrE and BrE,
the concept of meaning is also viewed as a shared resource between speakers where utterance meaning
can be negotiated and modified unproblematically. This study, and, indeed, all of the studies outlined
in these sections, represent the opportunities offered by corpora for the description of both user- and
use-related linguistic varieties.
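Percentages such as those reported for co-construction are proportions of manually coded concordance lines. The Python sketch below shows the calculation under stated assumptions: each turn-initial clause has already been coded by an analyst as a second speaker completion or not, the coded examples are invented, and completion_rates is a hypothetical helper.

```python
from collections import defaultdict


def completion_rates(clauses):
    """Percentage of turn-initial clauses, per subordinator, coded as completions.

    `clauses` is an iterable of (subordinator, is_completion) pairs, where
    is_completion is the analyst's manual coding decision (True or False).
    """
    totals = defaultdict(int)
    completions = defaultdict(int)
    for subordinator, is_completion in clauses:
        totals[subordinator] += 1
        if is_completion:
            completions[subordinator] += 1
    return {s: 100 * completions[s] / totals[s] for s in totals}


# Invented coding decisions, not data from LCIE or CANCODE.
coded = [
    ("when", True), ("when", False), ("when", False), ("when", False),
    ("if", True), ("if", False),
]

rates = completion_rates(coded)
for subordinator, pct in rates.items():
    print(f"{subordinator}-clauses: {pct:.0f}% second speaker completions")
```

Running the same calculation over each corpus separately gives the per-variety figures that make a comparison between IrE and BrE possible.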
Further reading
Biber, D. (1993) ‘Representativeness in corpus design’, Literary and Linguistic Computing 8(4): 243-
57. (In this article, Biber outlines how to construct a statistically representative corpus. In common with
many seminal texts, this work has proven to be both inspirational and contentious.)
Meyer, C. (2002) English Corpus Linguistics: An Introduction. Cambridge: Cambridge University
Press. (This book provides an accessible introduction to corpus linguistics in addition to a step-by-step
guide to corpus design, construction and analysis. Meyer draws heavily on corpora such as the BNC
and ICE in order to illustrate each stage.)
Kučera, K. (2002) ‘The Czech National Corpus: Principles, design, and results’, Literary and Linguistic
Computing 17(2): 245-257.
Douglas, F. (2003) ‘The Scottish Corpus of Texts and Speech: Problems of corpus design’, Literary
and Linguistic Computing 18(1): 23-37.
(Both of these articles explore, and discuss practical solutions to, the problems encountered during the
design and construction of large, non-English language corpora representative of language varieties.)
References
Amador-Moreno, C. (2010) An Introduction to Irish English. London: Equinox.
Amador-Moreno, C. and McCafferty, K. (2012) ‘A Corpus of Irish English Correspondence
(CORIECOR): A tool for studying the history and evolution of Irish English’, in B. Migge and M. Ní
Chiosáin (eds) New Perspectives on Irish English. Amsterdam: John Benjamins, pp. 265-288.
Andersen, G. (2001) Pragmatic Markers and Sociolinguistic Variation: A Relevance-Theoretic
Approach to the Language of Adolescents. Amsterdam: John Benjamins.
Aston, G. and Burnard, L. (1998) The BNC Handbook: Exploring the British National Corpus with
SARA. Edinburgh: Edinburgh University Press.
Atkins, S., Clear, J. and Ostler, N. (1992) ‘Corpus design criteria’, Literary and Linguistic Computing
7(1): 1-16.
Biber, D. (1990) ‘Methodological issues regarding corpus-based analyses of linguistic variation’,
Literary and Linguistic Computing 5(4): 257-69.
Biber, D. (1993) ‘Representativeness in corpus design’, Literary and Linguistic Computing 8(4): 243-
57.
Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. (1999) The Longman Grammar of
Spoken and Written English. London: Longman.
Bliss, A. (1979) Spoken English in Ireland 1600-1740. Dublin: Dolmen Press.
Chafe, W., Du Bois, J. and Thompson, S. (1991) ‘Towards a new corpus of American English’, in K.
Aijmer and B. Altenberg (eds) English Corpus Linguistics. London: Longman, pp. 64-82.
Clancy, B. (2005) ‘You’re fat. You’ll eat them all: Politeness strategies in family discourse’, in A.
Barron and K. Schneider (eds) The Pragmatics of Irish English. Berlin: Mouton de Gruyter, pp. 177-
99.
Clancy, B. (2016) Investigating Intimate Discourse: Exploring the Spoken Interaction of Families,
Couples and Close Friends. London: Routledge.
Clancy, B. (2018) ‘Conflict in corpora: Investigating family conflict sequences using a corpus
pragmatic approach’, Journal of Language Aggression and Conflict 6(2): 228-247.
Clancy, B. (2020) ‘Intimacy and identity in Irish English: A corpus pragmatic approach to the study of
personal pronouns’, in R. Hickey and C. Amador-Moreno (eds) Irish Identities: Sociolinguistic
Perspectives. Berlin: Walter De Gruyter, pp. 153-172.
Clancy, B. and McCarthy, M. (2015) ‘Co-constructed turn-taking’, in K. Aijmer and C. Rühlemann
(eds) Corpus Pragmatics: A Handbook. Cambridge: Cambridge University Press, pp. 430-453.
Clancy, B. and Vaughan, E. (2012) ‘“It’s lunacy now”: A corpus-based pragmatic analysis of the use
of ‘now’ in contemporary Irish English’, in B. Migge and M. Ní Chiosáin (eds) New Perspectives on
Irish English. Amsterdam: John Benjamins, pp. 225-246.
Crowdy, S. (1993) ‘Spoken corpus design’, Literary and Linguistic Computing 8(4): 259-265.
Crystal, D. (2001) Language and the Internet. Cambridge: Cambridge University Press.
Filppula, M. (2012) ‘Exploring grammatical differences between Irish and British English’, in B. Migge
and M. Ní Chiosáin (eds) New Perspectives on Irish English. Amsterdam: John Benjamins, pp. 85-100.
Greenbaum, S. (1991) ‘The development of the International Corpus of English’, in K. Aijmer and B.
Altenberg (eds) English Corpus Linguistics. London: Longman, pp. 83-91.
Halliday, M.A.K. (1978) Language as a Social Semiotic: The Social Interpretation of Language and
Meaning. London: Edward Arnold.
Holmes, J. (1993) ‘“New Zealand women are good to talk to”: An analysis of politeness strategies in
interaction’, Journal of Pragmatics 20: 91-116.
Hunston, S. (2002) Corpora in Applied Linguistics. Cambridge: Cambridge University Press.
Joyce, P.W. (1910) English as We Speak it in Ireland. Dublin: M.H. Gill.
Johansson, S., Leech, G. and Goodluck, H. (1978) Manual of Information to Accompany the
Lancaster/Oslo-Bergen Corpus of British English, for Use with Digital Computers. Oslo: Department
of English, University of Oslo.
Kallen, J. (2013) Irish English Volume 2: The Republic of Ireland. Berlin: Walter De Gruyter.
Kirk, J. (2016) ‘The pragmatic annotation scheme of the SPICE-Ireland corpus’, International Journal
of Corpus Linguistics 21(3): 299-322.
Kirk, J. (2017) ‘The present perfect in Irish English’, World Englishes 36(2): 239-253.
Leech, G. (1991) ‘The state of the art in corpus linguistics’, in K. Aijmer and B. Altenberg (eds) English
Corpus Linguistics. London: Longman, pp. 8-30.
Leech, G. (2007) ‘New resources, or just better old ones? The Holy Grail of representativeness’, in M.
Hundt, N. Nesselhauf and C. Biewer (eds) Corpus Linguistics and the Web. Amsterdam: Rodopi, pp.
133-149.
Love, R., Dembry, C., Hardie, A., Brezina, V. and McEnery, T. (2017) ‘The Spoken BNC2014:
Designing and building a spoken corpus of everyday conversations’, International Journal of Corpus
Linguistics 22(3): 319-344.
McCarthy, M. (1998) Spoken Language and Applied Linguistics. Cambridge: Cambridge University
Press.
McCarthy, M. (2015) ‘“’Tis mad, yeah”: Turn openers in Irish and British English’, in C. Amador-
Moreno, K. McCafferty, and E. Vaughan (eds) Pragmatic Markers in Irish English. Amsterdam: John
Benjamins, pp. 156-175.
McEnery, T., Xiao, R. and Tono, Y. (2006) Corpus-based Language Studies: An Advanced Resource
Book. London: Routledge.
Meyer, C. (2002) English Corpus Linguistics: An Introduction. Cambridge: Cambridge University
Press.
Millar, S. (2015) ‘Blathering Beauties: The use of pragmatic markers in an Irish beauty blog’, in C.
Amador-Moreno, K. McCafferty, and E. Vaughan (eds) Pragmatic Markers in Irish English.
Amsterdam: John Benjamins, pp. 292-317.
Muntigl, P. and Turnbull, W. (1998) ‘Conversational structure and facework in arguing’, Journal of
Pragmatics 29(3): 225-256.
Nelson, G. (1996) ‘The design of the corpus’, in S. Greenbaum (ed.) Comparing English Worldwide:
The International Corpus of English. Oxford: Oxford University Press, pp. 27-36.
O’Keeffe, A. and Adolphs, S. (2008) ‘Response tokens in British and Irish discourse: Corpus, context
and variational pragmatics’, in K. Schneider and A. Barron (eds) Variational Pragmatics: A Focus on
Regional Varieties in Pluricentric Languages. Amsterdam: John Benjamins, pp. 69-98.
O’Keeffe, A. and Amador-Moreno, C. (2009) ‘The pragmatics of the be + after + V-ing construction
in Irish English’, Intercultural Pragmatics 6(4): 517-534.
O’Keeffe, A., Clancy, B. and Adolphs, S. (2020) Introducing Pragmatics in Use. London: Routledge.
Quirk, R. (1995) Grammatical and Lexical Variance in English. London: Longman.
Rühlemann, C. (2007) Conversation in Context. London: Continuum.
Schweinberger, M. (2015) ‘A comparative study of the pragmatic marker like in Irish English and in
south-eastern varieties of British English’, in C. Amador-Moreno, K. McCafferty and E. Vaughan (eds)
Pragmatic Markers in Irish English. Amsterdam: John Benjamins, pp. 114-134.
Simpson-Vlach, R. and Leicher, S. (2006) The MICASE Handbook: A Resource for Users of the
Michigan Corpus of Academic Spoken English. Ann Arbor, MI: University of Michigan Press.
Sinclair, J. (2005) ‘Corpus and text – basic principles’, in M. Wynne (ed.) Developing Linguistic
Corpora: A Guide to Good Practice. Oxford: Oxbow Books, pp. 1-16.
Tagliamonte, S. (2005) ‘So who? Like how? Just what? Discourse markers in the conversation of young
Canadians’, Journal of Pragmatics 37: 1896-1915.
Vaughan, E. and Clancy, B. (2013) ‘Small corpora and pragmatics’, Yearbook of Corpus Linguistics
and Pragmatics 1: 53-73.
Vaughan, E. and Clancy, B. (2016) ‘Sociolinguistic information and Irish English corpora’, in R.
Hickey (ed.) Sociolinguistics in Ireland. London: Palgrave, pp. 365-388.
Vaughan, E. and Moriarty, M. (2018) ‘Voicing the ‘knacker’: Analysing the comedy of the
Rubberbandits’, in D. Villabuena-Romero, C. Amador-Moreno and M. Sánchez-García (eds) Voice and
Discourse in the Irish Context. London: Palgrave, pp. 13-45.
Vaughan, E., Clancy, B. and McCarthy, M. (2017) ‘Vague category markers as turn-final items in Irish
English’, World Englishes 36(2): 208-223.