Article

Review of Love (2020): Overcoming challenges in corpus construction: The Spoken British National Corpus 2014

Article
Full-text available
This paper investigates changes in swearing usage in informal speech using large-scale corpus data, comparing the occurrence and social distribution of swear words in two corpora of informal spoken British English: the demographically-sampled part of the Spoken British National Corpus 1994 (BNC1994) and the Spoken British National Corpus 2014 (BNC2014); the compilation of the latter has facilitated large-scale, diachronic analyses of authentic spoken data on a scale which has, until now, not been possible. A form and frequency analysis of a set of 16 ‘pure’ swear word lemma forms is presented. The findings reveal that swearing occurrence is significantly lower in the Spoken BNC2014 but still within a comparable range to previous studies. Furthermore, FUCK is found to overtake BLOODY as the most popular swear word lemma. Finally, the social distribution of swearing across gender and age groups generally supports the findings of previous research: males still swear more than females, and swearing still peaks in the twenties and declines thereafter. However, the distribution of swearing according to socio-economic status is found to be more complex than expected in the 2010s and requires further investigation. This paper also reflects on some of the methodological challenges associated with making comparisons between the two corpora.
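A comparison of swearing frequency across the BNC1994 and the Spoken BNC2014 depends on normalising raw counts to a common base, since the two corpora differ in size. A minimal sketch of per-million-word normalisation follows; the counts and corpus sizes below are illustrative placeholders, not figures from the paper.

```python
def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalise a raw frequency to occurrences per million words."""
    return raw_count / corpus_size * 1_000_000

# Illustrative values only (not the paper's actual counts or sizes).
bnc1994_size = 4_000_000
bnc2014_size = 11_000_000

freq_1994 = per_million(5_000, bnc1994_size)   # 1250.0 per million words
freq_2014 = per_million(10_000, bnc2014_size)  # lower despite a higher raw count
```

Normalisation of this kind is what makes diachronic claims such as "swearing occurrence is significantly lower in the Spoken BNC2014" comparable across corpora of different sizes.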
Article
Full-text available
On the surface, it appears that conversational language is produced in a stream of spoken utterances. In reality conversation is composed of contiguous units that are characterized by coherent communicative purposes. A large number of important research questions about the nature of conversational discourse could be addressed if researchers could investigate linguistic variation across functional discourse units. To date, however, no corpus of conversational language has been annotated according to functional units, and there are no existing methods for carrying out this type of annotation. We introduce a new method for segmenting transcribed conversation files into discourse units and characterizing those units based on their communicative purposes. In this paper, the development and piloting of this method is described in detail and the final framework is presented. We conclude with a discussion of an ongoing project where we are applying this coding framework to the British National Corpus Spoken 2014.
Preprint
Full-text available
Reliable high-quality transcription and/or annotation (a.k.a. 'coding') is essential for research in a variety of areas in the Humanities and Social Sciences which make use of qualitative data such as interviews, focus groups, classroom observations or any other audio/video recordings. A good tool can facilitate the work of transcription and annotation because the process is notoriously time-consuming and challenging. However, our survey indicates that few existing tools can accommodate the requirements for transcription and annotation (e.g. audio/video playback, spelling checks, keyboard shortcuts, adding annotation tags) in one place, so a user must constantly switch between multiple windows, for example an audio player and a text editor. 'Transcribear' (https://transcribear.com) was therefore developed as an easy-to-use, browser-based tool which facilitates transcription and annotation on a single interface; it operates offline, so a user's recordings and transcripts remain secure and confidential. To minimize human errors, tag validation functionality has also been added. Originally designed for the multimodal corpus project UNNC CAWSE, this browser-based application can be customized for individual users' needs in terms of the annotation scheme and corresponding shortcut keys. This paper explains how this new tool can make tedious and repetitive manual work faster and easier while improving the quality of outputs, as the process of transcription and annotation tends to be prone to human error. The limitations of Transcribear and future work are also discussed.
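The tag validation described here can be sketched as a check of every annotation tag in a transcript against a known scheme. The `<tag>` syntax and the tag set below are assumptions for illustration, not Transcribear's actual annotation format.

```python
import re

# Hypothetical annotation scheme; a real project would load its own tag set.
VALID_TAGS = {"laugh", "pause", "overlap", "unclear"}

def validate_tags(transcript: str) -> list:
    """Return annotation tags in the transcript that are not in the scheme."""
    tags = re.findall(r"<(\w+)>", transcript)
    return [t for t in tags if t not in VALID_TAGS]

# 'pausee' is a typo and gets flagged; 'laugh' passes silently.
errors = validate_tags("so I said <laugh> well <pausee> maybe")
```

A check like this catches mistyped tags at entry time, which is exactly the class of human error the abstract says the tool is designed to minimize.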
Article
Full-text available
This article focuses on how register considerations informed and guided the design of the spoken component of the British National Corpus 2014 (Spoken BNC2014). It discusses why the compilers of the corpus sought to gather recordings from just one broad spoken register – 'informal conversation' – and how this and other design decisions afforded contributors to the corpus much freedom with regard to the selection of situational contexts for the recordings. This freedom resulted in a high level of diversity in the corpus for situational parameters such as recording location and activity type, each of which was captured in the corpus metadata. Focussing on these parameters, this article provides evidence for functional variation among the texts in the corpus and suggests that differences such as those observed could be analysable within existing frameworks for the analysis of register variation in spoken and written language, such as multidimensional analysis.
Article
Full-text available
This study investigates whether "I'm sure" seems to be on the same grammaticalization trajectory as "I think." It does so by tracking the frequency of these two constructions over time to explore (i) their distribution across clausal positions (syntagmatic variability) and (ii) the extent to which the complementizer "that" is omitted (paradigmatic variability). The study uses spoken data from the BNC and the newly compiled Spoken BNC2014. The results show that the two constructions exhibit remarkable similarity, not only in terms of their proportional distribution across clausal positions, but also in terms of their propensity for "that"-omission. For example, both constructions show adverb-like behavior with regard to clausal positions. Furthermore, even though the time span covered is relatively short, a clear increase in "that"-omission was noted for "I'm sure", mirroring the frequencies for "I think" very closely. It thus seems that "I'm sure" is on the same path as "I think", despite differences in frequency and entrenchment.
Article
Full-text available
'Corpus Design Criteria' begins (Section 1) by defining the object to be created, a corpus, and its constituents, the texts themselves, noting briefly the pragmatic constraints on the sort of documents which will actually be available, spoken as well as written. It then (Section 2) reviews the practical stages in the process of establishing a corpus, from selection of sources through to mark-up, assigning annotations to the texts assembled. This is followed by a consideration of copyright problems (Section 3). Section 4 points out the major difficulties in defining the population of texts that the corpus will sample, contrasting the sets of texts received versus those produced by a target group, and internal (linguistic) versus external (social) means of defining such groups. The next three sections look at the sets of markers which can be useful at different levels. Section 5 begins at the highest level, considering the different types of corpus there may be. Section 6 is intermediate, considering how to distinguish the different types of text occurring within a corpus. Then, for the intra-text level, Section 7 reviews considerations governing mark-up, distinguishing those markers useful for written and spoken texts. Of these three sections, Section 6 is the most fully explicit, listing twenty-nine significant attributes assignable to a text. Sections 8 and 9 turn away from the corpus design itself, to focus on its social context and function, both of the corpus design process and of the corpus when implemented: to what extent are there now accepted standards relevant to the criteria reviewed in preceding sections? And what are the major classes of potential users and uses for corpora, both now and in the future?
Article
Full-text available
This article looks at how a comprehensive list of one category of idioms, that of 'core idioms', was established. When the criteria defining a core idiom were strictly applied to a dictionary of idioms, the large number of 'idioms' was reduced to a small number of 'core idioms'. The original list from the first source dictionary was then expanded by applying the same criteria to other idiom dictionaries and other sources of idioms. Once the list was complete, a corpus search of the final total of 104 'core idioms' was carried out in the British National Corpus (BNC). The search revealed that none of the 104 core idioms occurs frequently enough to merit inclusion in the 5,000 most frequent words of English.
Conference Paper
Full-text available
The UCREL semantic analysis system (USAS) is a software tool for undertaking the automatic semantic analysis of English spoken and written data. This paper describes the software system, and the hierarchical semantic tag set containing 21 major discourse fields and 232 fine-grained semantic field tags. We discuss the manually constructed lexical resources on which the system relies, and the seven disambiguation methods including part-of-speech tagging, general likelihood ranking, multi-word-expression extraction, domain of discourse identification, and contextual rules. We report an evaluation of the accuracy of the system compared to a manually tagged test corpus on which the USAS software obtained a precision value of 91%. Finally, we make reference to the applications of the system in corpus linguistics, content analysis, software engineering, and electronic dictionaries.
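The precision figure reported for USAS is, in essence, the proportion of automatically assigned semantic tags that agree with a manually tagged gold standard. A minimal sketch of that token-level calculation; the tag sequences below are invented for illustration, not USAS output.

```python
def precision(system_tags, gold_tags):
    """Proportion of system-assigned tags matching a manually tagged gold standard."""
    if len(system_tags) != len(gold_tags):
        raise ValueError("tag sequences must be aligned token-for-token")
    correct = sum(s == g for s, g in zip(system_tags, gold_tags))
    return correct / len(system_tags)

# Illustrative aligned tag sequences: 3 of 4 tokens agree.
p = precision(["A1", "Z5", "S1.2", "B2"],
              ["A1", "Z5", "S1.2", "E4"])  # 0.75
```

A precision value of 91%, as reported for USAS, would correspond to roughly 91 out of every 100 tokens receiving the same semantic tag as the human annotators assigned.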
Book
Register, Genre, and Style, by Douglas Biber (Cambridge University Press)
Article
This paper concerns the relationship between 'dubitative' questions, which include a lexical marker of uncertainty (for example, the adverb maybe), and the questioner's epistemic position those questions come from. Our main aim is to find out, from an epistemic stance perspective, (1) why polar questions (polar interrogatives, tag questions and declarative questions) and alternative questions can be made dubitative, (2) why wh-questions cannot, and (3) whether the presence of a lexical marker of uncertainty in polar and alternative questions changes anything in the questioner's epistemic commitment in comparison with the corresponding plain questions (and, if so, what changes can be identified). The results concerning (1) and (2) show that polar and alternative questions can be made dubitative because they come from the questioner's uncertain position (which ranges from the Not Knowing Whether pole to the Believing one), while wh-questions cannot be made dubitative, since they come from the unknowing position. The results concerning (3) show that, when added to questions coming from the Believing pole (tag and declarative questions), the presence of maybe raises the degree of uncertainty, while, when added to questions coming from the Not Knowing Whether pole (alternative questions and polar interrogatives), the adverb makes them, paradoxically, less uncertain.
Article
This paper presents practices in the compilation of FOLK, the Research and Teaching Corpus of Spoken German, a large collection of spontaneous verbal interaction from diverse discourse domains. After introducing the aims and organisational circumstances of the construction of FOLK, the general idea is discussed that good practices cannot be developed without considering methodological, technological and organisational aspects on an equal footing. Starting from this idea, this paper inspects more closely some actual practices in FOLK, namely the handling of legal (especially privacy protection) issues, the decisions taken for the transcription and annotation workflow, and the question of how best to disseminate a corpus like FOLK. The final section sketches some possible future improvements for practices in FOLK.
Article
Everyday spoken language has a long tradition of being seen as the poor relation of the written language. The use of certain terminologies in corpus linguistic studies of conversational grammar reveals that this tradition is continuing. This paper argues that an alternative view is possible, a view which recognises the inherent value of conversation, which lies in the adaptedness of conversational language to constraints set by the conversational 'situation type' (Halliday 1978). The use of I goes is examined as a case in point. The form is investigated in terms of its distribution across registers, its morphosyntax, and the discourse and situational factors that bear on its use. The discourse and situational factors are discussed on the basis of a detailed analysis of a sample of 90 occurrences of I goes in the context of 100 words each. It is shown that I goes acts both as a multi-turn quotative, that is, as a reporting clause in presentations of extended stretches of anterior conversation with frequent occurrences of speaker change, and as a speech-economic device freeing processing resources that the narrator can bring to bear on the achievement of the underlying purpose of storytelling, namely to indicate 'the point' of the narrative (Labov 1972). In this perspective, I argue, I goes can be seen as a skilled adaptation to two constraints set by the conversational situation: the fundamental scarcity of time and its relational goal-orientation. In the concluding section, I argue that a situation-based approach may foster a tradition of acknowledging the value of conversational language as adapted language, an acknowledgment which is needed particularly in EFL teaching, where the status of Standard English as the unrivalled model for teaching both writing and speech is preventing important corpus linguistic insights from trickling into EFL classrooms.
Finally, I also stress the usefulness of relating corpus linguistic findings to theories derived from non-corpus linguistic research.
Article
In recent years, the use of large corpora has revolutionized the way we study language. There are now numerous well-established corpus projects, which have set the standard for future corpus-based research. As more and more corpora are developed and technology continues to offer greater and greater scope, the emphasis has shifted from corpus size to establishing norms of good practice. There is also an increasingly critical appreciation of the crucial role played by corpus design. Corpus design can, however, present peculiar problems for particular types of source material. The Scottish Corpus of Texts and Speech (SCOTS) is the first large-scale corpus project specifically dedicated to the languages of Scotland, and therefore it faces many unanswered questions, which will have a direct impact on the corpus design. The first phase of the project will focus on the language varieties Scots and Scottish English, varieties that are themselves notoriously difficult to define. This paper outlines the complexities of the Scottish linguistic situation, before going on to examine the problematic issue of how to construct a well-balanced and representative corpus in what is largely uncharted territory. It argues that a well-formed corpus cannot be constructed in a linguistic vacuum, and that familiarity with the overall language population is essential before effective corpus sampling techniques, methodologies, and categorization schema can be devised. It also offers some preliminary methodologies that will be adopted by SCOTS.
Article
Based on a large set of data from one of the biggest available corpora of spoken British English (the 10-million word spoken component of the BNC), this article explores central lexical-grammatical aspects of progressive forms with future time reference. Among the phenomena investigated are verb preferences, adverbial co-selection, subject types, and negation. It is demonstrated that future time progressives in spoken British English are patterned to a considerable extent (for example that it is individual verbs, rather than semantic groups of verbs, that preferably occur in such constructions) and that actual language use often runs counter to claims that can be found in traditional grammatical descriptions of the construction. A number of general and often neglected issues in the analysis of lexical-grammatical patterns are also addressed, in particular the notion of pattern frequency.
Article
The present paper addresses a number of issues related to achieving ‘representativeness’ in linguistic corpus design, including: discussion of what it means to `represent’ a language, definition of the target population, stratified versus proportional sampling of a language, sampling within texts, and issues relating to the required sample size (number of texts) of a corpus. The paper distinguishes among various ways that linguistic features can be distributed within and across texts; it analyzes the distributions of several particular features, and it discusses the implications of these distributions for corpus design. The paper argues that theoretical research should be prior in corpus design, to identify the situational parameters that distinguish among texts in a speech community, and to identify the types of linguistic features that will be analyzed in the corpus. These theoretical considerations should be complemented by empirical investigations of linguistic variation in a pilot corpus of texts, as a basis for specific sampling decisions. The actual construction of a corpus would then proceed in cycles: the original design based on theoretical and pilot-study analyses, followed by collection of texts, followed by further empirical investigations of linguistic variation and revision of the design.
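The contrast between stratified and proportional sampling discussed in this abstract can be sketched as follows. The register names, population sizes, and sample sizes below are invented for illustration; a real corpus design would derive them from the situational parameters the paper describes.

```python
import random

# Hypothetical population of candidate texts, grouped by register stratum.
population = {
    "conversation": list(range(800)),  # 800 candidate texts
    "broadcast":    list(range(150)),
    "lectures":     list(range(50)),
}

def proportional_sample(pop, n, seed=0):
    """Sample sizes mirror each stratum's share of the population."""
    rng = random.Random(seed)
    total = sum(len(texts) for texts in pop.values())
    return {name: rng.sample(texts, round(n * len(texts) / total))
            for name, texts in pop.items()}

def stratified_sample(pop, per_stratum, seed=0):
    """Equal-size samples from each stratum, regardless of population share."""
    rng = random.Random(seed)
    return {name: rng.sample(texts, per_stratum)
            for name, texts in pop.items()}

prop = proportional_sample(population, 100)   # e.g. 80 / 15 / 5 texts
strat = stratified_sample(population, 30)     # 30 / 30 / 30 texts
```

Proportional sampling preserves the population's register distribution, while stratified sampling guarantees enough texts per register for within-stratum analysis; the paper's argument is that theoretical and pilot-study work should inform which trade-off a corpus design adopts.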