Article

Corpus linguistics with BNCweb—A practical guide


... At the same time, work with current speech corpora often requires a considerable level of specialist knowledge and tailor-made solutions. On a practical level, we present a new feature of BNCweb (Hoffmann et al. 2008), a user-friendly interface to the British National Corpus, which gives users access to audio and phonemic transcriptions of more than five million words of spontaneous speech. With the help of a pilot study on the variability of intrusive r we illustrate the scope of the new possibilities. ...
... The aim of this paper is threefold: First, on a rather basic level, the paper is intended to provide an overview of the functionality of a new feature of BNCweb (Hoffmann et al. 2008), a user-friendly interface to the 100-million-word British National Corpus (BNC, Burnard 2007). This feature gives users access to the audio and the phonemic transcriptions of more than five million words of spontaneous speech. ...
... collocations, distribution across metatextual categories, sorting of query results). Originally developed at the University of Zurich (see Lehmann et al. 2000), the functionality of BNCweb was further extended by the first author and Stefan Evert in the first decade of the 21st century (see Hoffmann and Evert 2006, Hoffmann et al. 2008). Since then, the tool has seen little development, because Andrew Hardie created CQPweb, a more modern version of the interface that is not restricted to the BNC but can search any corpus that meets the necessary format requirements (see Hardie 2012). ...
Article
Full-text available
In spite of the wide agreement among linguists as to the significance of spoken language data, actual speech data have not formed the basis of empirical work on English as much as one would think. The present paper is intended to contribute to changing this situation, on a theoretical and on a practical level. On a theoretical level, we discuss different research traditions within (English) linguistics. Whereas speech data have become increasingly important in various linguistic disciplines, major corpora of English developed within the corpus-linguistic community, carefully sampled to be representative of language usage, are usually restricted to orthographic transcriptions of spoken language. As a result, phonological phenomena have remained conspicuously understudied within traditional corpus linguistics. At the same time, work with current speech corpora often requires a considerable level of specialist knowledge and tailor-made solutions. On a practical level, we present a new feature of BNCweb (Hoffmann et al. 2008), a user-friendly interface to the British National Corpus, which gives users access to audio and phonemic transcriptions of more than five million words of spontaneous speech. With the help of a pilot study on the variability of intrusive r we illustrate the scope of the new possibilities.
... Several online tools also offer powerful search capabilities. The addition of the CQP (Corpus Query Processor) syntax to the BNCweb interface to the British National Corpus (hereafter BNC; Burnard 2007) greatly improved the flexibility with which it can be used (see Hoffmann and Evert 2006; Hoffmann et al. 2008). Previously, BNCweb was based on the SARA server software and suffered from the limitation that searches needed to be based on a specific lexical item, making it impossible to search for grammatical patterns based solely on POS tags. With the introduction of the CQP syntax, BNCweb now offers highly complex searches: for one thing, searches can combine different layers of annotation such as part of speech, lemma and word. For another, the CQP syntax also allows for regular expressions, including the | operator, meaning "or", thereby enabling different patterns to be represented by one search string. ...
... For another, the CQP syntax also allows for regular expressions, including the | operator, meaning "or", thereby enabling different patterns to be represented by one search string. In addition, BNCweb allows the user to specify intervals of different ranges between different tags, as well as whether a particular tag should occur at the beginning or at the end of a sentence (see Hoffmann et al. 2008). As a consequence, for those working with the BNC, BNCweb is now one of the most powerful search interfaces available. ...
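The kind of query described in the two snippets above (annotation layers combined within a token, the | operator for alternatives, and a bounded interval between tokens) can be sketched with a small Python helper that assembles a CQP-style query string. This is only an illustration: the helper names token and gap are hypothetical, and the attribute names pos and word as well as the CLAWS5 tags AJ0 and AJC are assumptions about how the corpus is annotated, so they should be checked against the interface actually in use.

```python
def token(**attrs):
    """Render one CQP token expression, e.g. [pos="AJ0" & word="vast"].

    Each keyword is an annotation layer (assumed names: pos, word);
    several layers on the same token are joined with '&'.
    """
    if not attrs:
        return "[]"  # an unconstrained token
    parts = [f'{name}="{value}"' for name, value in attrs.items()]
    return "[" + " & ".join(parts) + "]"


def gap(min_tokens, max_tokens):
    """Render an interval of arbitrary tokens, e.g. []{0,2}."""
    return f"[]{{{min_tokens},{max_tokens}}}"


# An adjective (general or comparative), up to two intervening tokens,
# then either of two nouns; '|' inside a value means "or".
query = " ".join([
    token(pos="AJ0|AJC"),
    gap(0, 2),
    token(word="majority|minority"),
])

# The same pattern anchored at the start of a sentence region.
sentence_initial = "<s> " + query

print(query)
print(sentence_initial)
```

Building the string from small pieces keeps each constraint readable; the resulting text is what would be pasted into a CQP query box.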
... The association scores whose cognitive realism is at issue in this paper cover a spectrum of diverse yet widely established metrics. We compare the six association scores that are readily computed on the BNC CQP-Edition (Hoffmann 2008). All measures quantify the strength of mutual attraction between words on the basis of corpus data, with higher scores indicating stronger attraction, i.e., higher mutual expectation. ...
... All association scores were extracted directly from the British National Corpus (BNC), accessed via the online CQP-Edition (Hoffmann 2008). For each bigram, this was done in the following way: ...
Article
In the following self-paced reading study, we assess the cognitive realism of six widely used corpus-derived measures of association strength between words (collocated modifier–noun combinations like vast majority): MI, MI3, Dice coefficient, T-score, Z-score, and log-likelihood. The ability of these collocation metrics to predict reading times is tested against predictors of lexical processing cost that are widely established in the psycholinguistic and usage-based literature, respectively: forward/backward transition probability and bigram frequency. In addition, the experiment includes the treatment variable of task: it is split into two blocks which only differ in the format of interleaved comprehension questions (multiple choice vs. typed free response). Results show that the traditional corpus-linguistic metrics are outperformed by both backward transition probability and bigram frequency. Moreover, the multiple-choice condition elicits faster overall reading times than the typed condition, and the two winning metrics show stronger facilitation on the critical word (i.e. the noun in the bigrams) in the multiple-choice condition. In the typed condition, we find an effect that is weaker and, in the case of bigram frequency, longer lasting, continuing into the first spillover word. We argue that insufficient attention to task effects might have obscured the cognitive correlates of association scores in earlier research.
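The six association measures named in this abstract can be illustrated with a short Python sketch based on their common textbook definitions for an adjacent word pair. The function name association_scores, its parameter names, and the example counts are hypothetical; the study itself extracted its scores from the BNC interface, whose exact computation (for instance the treatment of collocation windows) may differ from this simplified version.

```python
import math


def association_scores(o11, f1, f2, n):
    """Textbook association scores for a bigram (w1, w2).

    o11: observed frequency of the bigram (must be > 0)
    f1:  frequency of w1, f2: frequency of w2
    n:   corpus size in tokens
    """
    if o11 <= 0:
        raise ValueError("the bigram must occur at least once")
    e11 = f1 * f2 / n  # expected frequency under independence

    # Observed and expected cells of the 2x2 contingency table.
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    expected = [
        e11,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    # Log-likelihood ratio; empty cells contribute nothing to the sum.
    log_likelihood = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )

    return {
        "MI": math.log2(o11 / e11),
        "MI3": math.log2(o11 ** 3 / e11),
        "Dice": 2 * o11 / (f1 + f2),
        "T-score": (o11 - e11) / math.sqrt(o11),
        "Z-score": (o11 - e11) / math.sqrt(e11),
        "log-likelihood": log_likelihood,
    }


# Hypothetical counts, chosen only to show the call.
scores = association_scores(o11=10, f1=100, f2=50, n=10_000)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

All six scores are derived from the same four numbers, which is why they tend to correlate while still ranking word pairs differently: MI favours rare pairs, whereas the T-score and log-likelihood reward frequent ones.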
... The usage of corpora in cognitive studies is well described in (Brezina, 2018; Hoffmann et al., 2008; Stefanowitsch, 2020; Wallis, 2021). Nowadays, corpus managers provide plenty of statistical methods and a strong way to verify any hypothesis (Arppe, 2009; Cameron & Panović, 2014; Cohen et al., 2014; Flowerdew, 2012; Glynn & Fisher, 2010; Glynn, 2010; Haider, 2019). ...
... Though it was mentioned earlier in (Hoffmann et al., 2008) that a collocation study becomes more reliable if more complex statistical measures are used (like log-likelihood or mutual information criteria), we limit ourselves to raw frequency data in this exploratory analysis. The more in-depth study will use complex criteria if necessary. ...
... Despite this, lexical verb frequency measures were included, since semantic familiarity arguably plays a role in the reconstruction of the propositional meaning expressed by the gapped sentence, including its temporal, aspectual and voice meanings. Frequency information comes from the spoken and written components of the British National Corpus (henceforth BNC-S and BNC-W; Hoffmann et al. 2008) and SUBTLEXus (Brysbaert and New 2009). Several related measures were employed (features 37-54 in Table 4). ...
... The linguistic features tested in the present study were extracted as follows. Features 37-46 were extracted using BNCweb (Hoffmann et al. 2008 ...
Article
Full-text available
Link: https://rdcu.be/bBHpL Advances in computer technology and artificial intelligence create opportunities for developing adaptive language learning technologies which are sensitive to individual learner characteristics. This paper focuses on one form of adaptivity in which the difficulty of learning content is dynamically adjusted to the learner’s evolving language ability. A pilot study is presented which aims to advance the (semi-)automatic difficulty scoring of grammar exercise items to be used in dynamic difficulty adaptation in an intelligent language tutoring system for practicing English tenses. In it, methods from item response theory and machine learning are combined with linguistic item analysis in order to calibrate the difficulty of an initial exercise pool of cued gap-filling items (CGFIs) and isolate CGFI features predictive of item difficulty. Multiple item features at the gap, context and CGFI levels are tested and relevant predictors are identified at all three levels. Our pilot regression models reach encouraging prediction accuracy levels which could, pending additional validation, enable the dynamic selection of newly generated items ranging from moderately easy to moderately difficult. The paper highlights further applications of the proposed methodology in the area of adapting language tutoring, item design and second language acquisition, and sketches out issues for future research.
... Because of the variability and complexity of real language in a corpus, the abstract specification of a VAC pattern in terms of grammatical dependency relations is likely to miss some unanticipated but common patterns. We carried out recall analyses using a very simple search, i.e. just the single word about for the V about n VAC, in the BNCweb interface (Hoffmann et al. 2008). Each of two annotators carried out this search and each reviewed 500 separate results, which were randomly selected. ...
... This is the reverse of the usual practice adopted in corpus linguistic research (e.g. Hoffmann et al. 2008), which advocates an approach that produces wide coverage (i.e. high recall) and the use of manual correction, such as would be the case in a KWIC-centered analysis (e.g. ...
... Scientific literature review reveals that corpus linguistics has already established itself as a self-sufficient scientific field of knowledge (Gvishiani, 2008; Plungyan et al., 2009; Suvorina, 2011; Hoffmann et al., 2008). Furthermore, it continues to be used not only in research (Sinclair, 2006), but also in educational areas (Wu & Peng, 2016; Wang et al., 2013; Kennedy, 2003). ...
... (McEnery et al., 2012). The outcomes of the corpus processing by BNCweb can be presented in the form of tables of summary quantitative characteristics of the statistical data, which clearly and systematically show the total results of the material, its digital characteristics, the state of the phenomenon, and are the basis for formulating assumptions and conclusions (Hoffmann et al., 2008). ...
Article
Full-text available
The article deals with the application of corpus-based direction in English language teaching of university students, suggested by Ukrainian scholars. The most representative corpus for English language teaching (ELT) is the British National Corpus (BNC), which offers many opportunities (e.g. search for specific word forms, search for word forms by lemmas, search for groups of word forms in the form of syntagms, etc.). The article presents the methodological algorithm of university students' work with the BNC during English classes based on the verbs denoting human emotional states. The methodology of work with the BNC consists of three stages: 1) a student has to compile the initial lexicographic register of basic verbs denoting emotional states; 2) a student has to measure the frequency of each unit in the corpus usage; and 3) a student has to analyse, describe and record all corpus calculations. The main benefits of the findings for the future relevant studies may be described in the following way: the work with corpus tools in ELT is aimed at students performing the following successive steps: 1) processing concordances, 2) calculating the absolute frequency, 3) analysing the left and right valence, and 4) modelling clusters to build cognitive-semantic profiles of the studied units, which will allow university students to understand the essence of every grammatical, lexical, and syntactical unit.
... To measure DURATION as well as NUCLEUS, the recordings for each turn in the 800 10-word turn sample were accessed through BNCweb, an online interface for the BNC (cf. Hoffmann, Evert, Smith, Lee, & Berglund Prytz, 2008), exported and analyzed in Praat, a phonetic analysis tool (Boersma & Weenink, 2012). Each sound file was listened to repeatedly by a research assistant who was unaware of the research questions of this study; spectral waveforms ('sonograms') were inspected and zoomed in on to determine word boundaries based on 'valleys' in the waveform. ...
... While contextual variables may include, inter alia, discourse, syntax, culture, and world knowledge (Piantadosi et al., 2011), these variables are not easily quantified. We followed the practice of approximating surprisal by computing the negative binary log of the conditional probability of a word given the previous word based on a frequency list of all individual words and file-internal bigrams of the complete spoken component of the British National Corpus (c. 10 million words; cf. Hoffmann et al., 2008). Then, two kinds of SURPRISAL values were computed: ...
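The bigram-based approximation of surprisal described in this snippet, the negative binary log of the conditional probability of a word given the previous word, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name bigram_surprisal and the toy token list are hypothetical, counts come from a single token sequence rather than from file-internal bigrams of the spoken BNC, and the sketch does not reproduce the two kinds of SURPRISAL values the authors go on to compute.

```python
import math
from collections import Counter


def bigram_surprisal(tokens):
    """Map each attested bigram (prev, word) to -log2 P(word | prev).

    P(word | prev) is estimated as count(prev, word) / count(prev),
    where count(prev) only counts positions that have a following
    word, so the probabilities for each context sum to 1.
    """
    context_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {
        # log2(count(prev) / count(prev, word)) == -log2 P(word | prev)
        (prev, word): math.log2(context_counts[prev] / count)
        for (prev, word), count in bigram_counts.items()
    }


# Toy data standing in for a tokenised corpus file.
tokens = "the cat sat on the mat".split()
surprisal = bigram_surprisal(tokens)
for (prev, word), bits in surprisal.items():
    print(f"{prev} -> {word}: {bits:.2f} bits")
```

A word that always follows its predecessor carries 0 bits of surprisal, while less predictable continuations receive higher values, which is what makes the measure usable as a predictor of word duration.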
Article
Turn transition in talk-in-interaction is achieved with remarkable precision, most commonly following a gap of no more than 200 ms (e.g., Stivers et al., 2009). How the precision is achieved is a complex issue given the wide range of variables co-participants to talk-in-interaction deploy to project (as speakers) and predict (as listeners) turn completion. This paper aims to contribute to a deeper understanding of one such variable used by speakers to project turn-completion: changes in word duration in turns-at-talk. As word duration varies significantly due to influences from a large number of confounds, we approach the challenges inherent in “[p]roviding robust, quantified, comparative measures of duration” (Local & Walker, 2012: 259) by fitting mixed-effects models based on naturally occurring corpus data. Contrary to previous research, which hailed the turn-final drawl as a turn-yielding cue, the models indicate that drawling, or rallentando, affects not just the turn-final syllable/word but large portions of the turn. Rallentando appears to be, not a one-off cue marking the turn’s end-point upon its occurrence, but an extended process advance-projecting the turn’s durational envelope. Also, as a graded advance-projecting resource, rallentando is in and of itself insufficient to signal turn completion reliably; listeners are likely to rely on turn rallentando in unison with other, preferably discrete cues marking the turn-completion point upon its occurrence, for “recogniz[ing] that a turn is definitely coming to an end” (Levinson & Torreira, 2015: 12) and triggering the launch of the next turn.
... CLiC has since been further expanded with version 2.0 (used for Table 10 of this article), so if readers of this article run searches on live CLiC there might be small quantitative differences, which, however, will not affect the overall results. For comparisons with real spoken language, we used the 'original' BNC1994 (XML version http://www.natcorp.ox.ac.uk/) to generate cluster counts and, for concordance searches, we used the BNCweb (Hoffmann et al., 2008). For each results table below, we indicate the respective data source. ...
Article
Full-text available
We propose a lexico-grammatical approach to speech in fiction based on the centrality of ‘fictional speech-bundles’ as the key element of fictional talk. To identify fictional speech-bundles, we use three corpora of 19th-century fiction that are available through the corpus stylistic web application CLiC (Corpus Linguistics in Context). We focus on the ‘quotes’ subsets of the corpora, i.e. text within quotation marks, which is mostly equivalent to direct speech. These quotes subsets are compared across the fiction corpora and with the spoken component of the British National Corpus 1994. The comparisons illustrate how fictional speech-bundles can be described on a continuum from lexical bundles in real spoken language to repeated sequences of words that are specific to individual fictional characters. Typical functions of fictional speech-bundles are the description of interactions and interpersonal relationships of fictional characters. While our approach crucially depends on an innovative corpus linguistic methodology, it also draws on theoretical insights into spoken grammar and characterisation in fiction in order to question traditional notions of realism and authenticity in fictional speech.
... In this thesis, I prefer the term 'structure', because it is broader and describes the entirety of the formal dimension rather than only a part of it (i.e., the syntactic component). 8 Accessed via BNCweb (Hoffmann, Evert, Smith, Lee, & Berglund Prytz, 2008). ...
Thesis
Full-text available
This thesis is concerned with spoken dialogue and the dynamic negotiation of meaning in English conversation. It serves two aims, one theoretical and the other practical. The theoretical aim is to further our understanding of the kinds of properties that influence the meaning of constructions in spoken dialogue and the role of underlying socio-cognitive processes. The practical aim is to compile a new corpus of spoken British English, the London–Lund Corpus 2, modelled on the same principles as the first London–Lund Corpus from 50 years prior. The aims are addressed in the four articles included in the thesis. The first article focuses on a very common construction in English, namely I think COMPLEMENT and the family of complement-taking predicate constructions. It questions the rigid treatment of the constructions in APPRAISAL theory as always having the same dialogic meaning. For example, I think is considered to always open up the space for dialogic alternatives. By combining data from the London–Lund Corpus 1 with a laboratory experiment, we show that I think COMPLEMENT serves not only to expand the dialogic space, but it may also close it down. The factors that influence the dialogic meaning of the construction are not only semantic but also prosodic, collocational and social. The second article draws on data from the London–Lund Corpus 2 to shed new light on the interaction of intersubjective processes and priming mechanisms in dialogic resonance, which emerges when speakers reproduce constructions from prior turns. It does so by investigating the intersubjective functions that resonance has in discourse and the time it takes for speakers to resonate with each other. The results show that resonance is often used to express divergent views, which are produced very quickly. We argue that, while priming reduces the gap between speaker turns, intersubjective processes give the speakers the motivation to respond early.
This is due to the increased sense of interpersonal solidarity that resonance is assumed to evoke. The third and the fourth articles are both concerned with the reactive what-x construction, which has not received any attention in the literature so far. The aim of the third article is to define and describe the constructional properties of the construction based on data from the London–Lund Corpus 2. The constructional representation includes not only lexical–semantic information but also essential dialogic and prosodic information, which are mostly missing in Construction Grammar. The fourth article combines data from the London–Lund Corpora to demonstrate the complex interplay between social motivations and cognitive mechanisms in the diachronic development of constructions in spoken dialogue. It shows that the development of the reactive what-x construction is triggered by the pragmatic strengthening of discourse-structuring and turn-taking inferences, and proceeds through metonymic micro-adjustments of the conceptual structure of the construction itself. In sum, the thesis provides a systematic and empirically grounded account of the dynamic negotiation of meaning in spoken dialogue. It contributes new knowledge to our understanding of the broad and interactive nature of constructional meaning and the complex interaction of underlying socio-cognitive processes. The compilation of the London–Lund Corpus 2 will facilitate many more investigations of this kind.
... In summary, the text search results prefer plural pronoun de + body part, even when the search input does not have -de. Regarding the reliability of grammaticality judgments from corpus-based studies, Schlüter (2006) argues that whether the application of modern software guarantees the quality of the research results is still quite difficult to answer, for example, the size of the machine-readable, mostly written, language database, the overall frequency of the data, and the temporal specification of the corpora affect the value of the software employed (see also Fillmore 1992;Biber et al. 1998;Hoffmann et al. 2008). Regarding the inconsistency in grammaticality judgments between linguists and naïve speakers, Achimova et al. (2015) discuss the potential effects of scale adjustment and unconscious accommodation via lexical substitution. ...
Article
Full-text available
In this paper, I first introduce what inalienable possession structure (IPS) is cross-linguistically as well as how to form an IPS in Mandarin Chinese, i.e., pronoun + body part or kinship term, etc. With the help of postverbal IPS, I relate the lack of plural pronominal possessor in IPS, which is never discussed in the literature, to the prohibition of distributivity over distributivity, i.e., the semantic anomaly of distributive plural possessor over the stubborn distributivity inherent to Chinese IPS nouns. I also argue that the requirement of a plural pronominal possessor seen in the IPS of public places, spatial directions, and professional titles is a result of stubborn collectivity shared by these nouns. In the end, I discuss the association between the distinction of inalienable and alienable nouns and that of active and stative verbs.
... As for gender, intensifiers have often been associated with female usage (for early views antedating modern sociolinguistic methodology, see, e.g., Stoffel 1901: 101; Jespersen 1922: 249-50). Generalizable overarching results on gender-specific usage of established forms remain modest (Ito & Tagliamonte 2003; Nevalainen 2008; Tagliamonte 2008), although Fuchs (2017) and Hessner & Gawlitzek (2017), in their studies on data from the British National Corpus (BNC; 1994 and 2014 versions), point to gender being an influential factor, with women leading in the use of intensifiers, including amplifiers, in PDE (for the BNC (1994), see Hoffmann et al. 2008, and for the Spoken BNC2014, see Love et al. 2017). This was also the result that Bernaisch (2014) reached in his study of the Old Bailey Corpus: women used amplifiers more than men, despite some fluctuation in the degree to which they did so across the period studied. ...
Article
Full-text available
Based on an investigation of the Old Bailey Corpus, this article explores the development and usage patterns of maximizers in Late Modern English (LModE). The maximizers to be considered for inclusion in the study are based on the lists provided in Quirk et al. (1985) and Huddleston & Pullum (2002). The aims of the study were to (i) document the frequency development of maximizers, (ii) investigate the sociolinguistic embedding of maximizer usage (gender, class) and (iii) analyze the sociopragmatics of maximizers based on the speakers’ roles, such as judge or witness, in the courtroom. Of the eleven maximizer types focused on in the investigation, perfectly and entirely were found to dominate in frequency. The whole group was found to rise over the period 1720 to 1913. In terms of gender, social class and speaker roles, there was variation in the use of maximizers across the different speaker groups. Prominently, defendants, but also judges and lawyers, maximized more than witnesses and victims; further, male speakers and higher-ranking speakers used more maximizers. The results were interpreted taking into account the courtroom context and its dialogue dynamics.
... BNCweb (Hoffmann et al., 2008) is a web interface allowing browsing of the BNC corpus. It provides functionality such as concordancing, frequency lists, and collocations. ...
Article
Full-text available
This paper reports SuperCAT, a corpus analysis toolkit. It is a radical extension of SubCAT, the Sublanguage Corpus Analysis Toolkit, from sublanguage analysis to corpus analysis in general. The idea behind SuperCAT is that representative corpora have no tendency towards closure, that is, they tend towards infinity. In contrast, non-representative corpora have a tendency towards closure (roughly, finiteness). SuperCAT focuses on general techniques for the quantitative description of the characteristics of any corpus (or other language sample), particularly concerning the characteristics of lexical distributions. Additionally, SuperCAT features a complete re-engineering of the previous SubCAT architecture.
... The analysis was carried out manually on random samples from the written BNC (henceforth BNCw), using BNCweb (Hoffmann et al. 2008). In order to maximize the number of relevant structures in each sample, complex queries were used. ...
Article
Full-text available
It has often been claimed that conditionals have a special relation to modality. This study empirically tests this claim by examining the frequency of modal marking in a number of conditional and non-conditional structures using a corpus-based approach. It then seeks to provide explanations for the emerging frequency patterns in light of the tenets of two linguistic theories: Lexical Grammar and Construction Grammar. This juxtaposition was motivated by the significant overlap in their tenets: both theories take into account meaning (semantic and pragmatic), as well as lexical and grammatical factors.
... Most verbs patterning strongly with poverty as object had meanings associated with the need to end it (e.g., TACKLE, ALLEVIATE, ERADICATE, REDUCE, FIGHT, ESCAPE; see Concordance 2). Here the focus on poverty as a distant and abstract issue was even more accentuated, with references to science (line 4), world migration (lines 6 and 7), the arts (line 1), and world poverty (lines 2, 3, 5 and 8). In the British National Corpus (BNCweb, 2018; Hoffmann et al., 2008), these verbs often co-occur with negative circumstances like diseases (for example, to eradicate an infection, tumour or disease), conflicts and risky situations (such as to fight a battle, war, or blaze) or captivity (for example, to escape death or to escape from prison). Similarly, TACKLE, ALLEVIATE and REDUCE are used in contexts such as tackling problems, issues, crises or alleviating suffering, symptoms, anxiety and reducing costs or risks. ...
... In order to see the general linguistic behaviours of love, the results from the British National Corpus (BNC) were further extracted through the BNCweb (cf. Hoffmann et al., 2008), an online interface with additional analytical functions designed specifically for the BNC. The BNC is a 100-million-word corpus of written (90%) and spoken (10%) language from a variety of sources. The corpus represents British English from the late twentieth century. ...
Article
Full-text available
Corpora are resourceful tools for linguistic observation, as they can provide quick statistics about a word compared with manual analysis. However, in some circumstances where qualitative analysis is needed, the use of a quantitative method may not produce the required results. Based on the keyword love, four types of materials were used to show the plausibility and implausibility of using corpora. First, romantic English songs were used to illustrate the kind of input students received from these songs. Second, we compared the songs to students’ essays on general essay topics. Then, we contrasted the first two sets of data to a native speakers’ corpus. Lastly, we showed how the creativity element can be reproduced by students through extended use of metaphors “Love is X”. While the students may have been exposed to more colloquial uses of love in love songs, their general essays showed a more neutral use, with fewer occurrences of love. In comparing the different texts, this paper will show how a corpus-based approach can be complemented by a qualitative examination of texts. Regarding pedagogical implications, corpora are a good resource of authentic materials and should be encouraged in the classroom as a reference for recurring language patterns.
... Thus, what is inferred as the main focus of the utterance is ... 9 The relatively small number of examples of the construction in the sample prompted one of the reviewers to inquire about its occurrence in datasets similar to LLC. Consequently, we carried out a small-scale exploratory investigation of the construction in one of the few freely available corpora of spoken British English that also provides access to the original sound files, the British National Corpus, accessed via BNCweb (Hoffmann et al., 2008). A random sample of 1,700 examples of what was extracted from face-to-face conversations involving adult speakers only. ...
Article
With data from two comparable corpora of spoken British English, the London-Lund Corpus and the new London-Lund Corpus 2, this study tracks the development of the reactive what-x construction half a century back in time. The study has two goals: (i) to describe the uses of the construction over time and (ii) to establish the motivations and mechanisms related to its development in spoken dialogue. The corpus data show that the reactive what-x construction was already in use in the mid-20th century but has gained ground since then. By combining Invited Inferencing Theory with focus on speaker-initiated decisions in interaction and a Cognitive Semantic approach to meaning shift and change from a Construction Grammar perspective, we demonstrate that the development of a construction has to be explained with reference to both the social motivations in spoken conversational discourse and the cognitive processes that operate at the conceptual level. The development of the reactive what-x construction, which is simultaneously used to express reaction and make a request, was motivated by the interaction of discourse-structuring and turn-taking inferences at the functional level that proceeded through metonymic micro-adjustments of the conceptual structure of the construction itself.
... The broadsheet data amounts to approximately 3 million words (for more on the dataset and how it was collected, see Silvennoinen 2017). The corpus was accessed through the BNCweb interface (for documentation, see Berglund et al. 2002; Hoffmann et al. 2008). Approximately 2,000 tokens of contrastive negation were collected by first searching for negators and then manually finding instances of contrastive negation among all other negative construction types. ...
Article
This paper discusses constructional variation in the domain of contrastive negation in English, using data from the British National Corpus. Contrastive negation refers to constructs with two parts, one negative and the other affirmative, such that the affirmative offers an alternative to the negative in the frame in question (e.g. shaken, not stirred; not once but twice; I don’t like it – I love it). The paper utilises multiple correspondence analysis to explore the degree of synonymy among the various constructional schemas of contrastive negation, finding that different schemas are associated with different semantic, pragmatic and extralinguistic contexts but also that certain schemas do not differ from each other in a significant way.
... The statistical processing of the data follows the recommendations given in the book Corpus Linguistics with BNCweb - a Practical Guide (Hoffmann et al. 2008), which are intended specifically for testing hypotheses about the frequencies of linguistic phenomena in corpora. Because it makes it possible to obtain an exact p-value, testing was carried out in the R software environment by means of the so-called ...
Article
Full-text available
This article aims to test the so-called unique items hypothesis on Czech language data. The hypothesis formulated by Sonja Tirkkonen-Condit presumes the underrepresentation of unique items (target-language elements that have no direct counterparts in the source language) in translated texts compared to non-translations in the same language. A monolingual comparable sample corpus consisting of Czech translations and non-translations served as language data, both subcorpora containing approximately 17 million tokens. The tested Czech unique items were chosen from lexical units, word-formation phenomena, syntactic structures and language use phenomena. The frequency of these items was subjected to statistical testing with R software. The results reveal a general tendency of translated Czech language to contain fewer unique items. However, some of the individual items do not correspond to this tendency.
... In fact, it means that present-day linguistic research is much more dynamic, and more importance (and opportunity) is given to research of an interdisciplinary character. Therefore, some linguists (Lindqvist, 2009; Hoffmann et al., 2008) tend to define corpus linguistics rather as a methodology through which linguistic research in various language areas can be realized. They state that in comparison with the autonomous linguistic disciplines, which describe or explain certain language features, corpus linguistics does not centre on particular features, but can be utilized to explore any area of language use. ...
... The seven texts taken from the BNC were KB3, KB5, KBA, KBN, KBS, KBV and KCR, amounting to 50,858 words. (Footnote 6: A description of the use of these queries can be found in Hoffmann et al. (2008).) ... the more detailed and fine-grained analysis, of both a quantitative and a qualitative nature, all the occurrences from the selected texts in LLC and BNC were examined. ...
Chapter
Question tags in English have received widespread attention in the literature. The discussion has often focused on issues concerning the polarity, intonation, speech functions (illocutionary force) of tags and their semantic-pragmatic properties. By examining spoken samples of two British English corpora, this chapter intends to contribute to the study of question tags from the main perspective of the discourse functions they perform in the context of spontaneous conversation, taking into account aspects of the interaction of information flow, which have been arguably less discussed in relation to tags. It also aims at contributing to studies that have addressed the meanings of linguistic expressions in the so-called right periphery. Question tags are shown to have epistemic, interactive and discourse-structuring meanings, including functions such as seeking agreement and recognition, reinforcing the common ground among the participants, performing a role in the co-construction of discourse, in relation to the expression of topic-shift, acting as dialogical markers and contributing to the spoken discourse segmentation. The study also explores the correlation of tags with evaluative statements and suggests that more attention should be given to two alternative constructions to question tags, which present some superficially-similar properties to canonical question tags but should be seen as constituting different phenomena. The chapter concludes by arguing in favour of the hypothesis that question tags can be seen as one type of discourse marker in English.
... Using the spoken section of the corpus ensures that the resulting compounds have been spontaneously produced by a speaker at least once. The BNCweb interface (Hoffmann et al. 2008) was used to search for strings of two nouns, excluding strings that crossed a sentence boundary or that included a pause or any other form of interruption, such as a cough, between the two nouns. The corpus queries also specified that the word after the second noun should not be another noun, an adjective or a possessive. ...
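The filtering logic described in this excerpt can be sketched offline on already-tagged text. The following is a minimal illustration in Python, not the authors' actual BNCweb query: it assumes each sentence is a list of (word, tag) pairs using CLAWS C5 tags, and the function name, tag sets and example sentences are hypothetical. Working sentence by sentence keeps pairs from crossing a sentence boundary; pauses and other interruptions are assumed to have been handled beforehand.

```python
# Hypothetical sketch: find noun-noun strings in POS-tagged sentences.
# Assumption: tags follow the CLAWS C5 tagset used in the BNC.
NOUN_TAGS = {"NN0", "NN1", "NN2"}  # common nouns (proper nouns left out here)
# Tags that must NOT follow the second noun: nouns, adjectives, possessive marker.
BLOCKED_NEXT = NOUN_TAGS | {"NP0", "AJ0", "AJC", "AJS", "POS"}


def noun_noun_candidates(sentence):
    """Return (noun1, noun2) pairs from one sentence of (word, tag) tuples."""
    hits = []
    for i in range(len(sentence) - 1):
        (w1, t1), (w2, t2) = sentence[i], sentence[i + 1]
        if t1 in NOUN_TAGS and t2 in NOUN_TAGS:
            # Tag of the word after the second noun, or None at sentence end.
            next_tag = sentence[i + 2][1] if i + 2 < len(sentence) else None
            if next_tag not in BLOCKED_NEXT:
                hits.append((w1, w2))
    return hits


# Invented example sentences, for illustration only.
s1 = [("the", "AT0"), ("kitchen", "NN1"), ("table", "NN1"), ("is", "VBZ"), ("old", "AJ0")]
s2 = [("a", "AT0"), ("car", "NN1"), ("park", "NN1"), ("attendant", "NN1"), ("left", "VVD")]

pairs_s1 = noun_noun_candidates(s1)
pairs_s2 = noun_noun_candidates(s2)
```

In the second sentence, "car park" is rejected because another noun follows it, while "park attendant" is kept: the excerpt mentions only a constraint on the following word, so this sketch places no constraint on the preceding one.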
Article
Full-text available
Many studies have shown that syntagmatic and paradigmatic aspects of morphological structure may have an impact on the phonetic realisation of complex words (e.g. Cohen 2014a,b; Kuperman et al. 2007; Lee-Kim et al. 2013; Lõo et al. 2018; Plag et al. 2017; Schuppler et al. 2012; Smith et al. 2012; Sproat and Fujimura 1993; Zimmermann 2016, among many others). The majority of these studies have been concerned with affixes, often focusing on the acoustic properties of segments at a morphological boundary. The present study extends this line of investigation to compounds, exploring the extent to which consonant duration at compound-internal boundaries in English is dependent on morphological structure. Three competing hypotheses about the relationship between fine phonetic detail and morphological structure are tested. According to the Segmentability Hypothesis, greater morphological segmentability, i.e. a stronger morphological boundary, leads to acoustic lengthening (Ben Hedia and Plag 2017; Hay 2003; Plag and Ben Hedia 2018). The Informativity Hypothesis, on the other hand, states that higher informativity leads to lengthening (e.g. Jurafsky et al. 2001; van Son and Pols 2003). Finally, the Paradigmatic Support Hypothesis says that stronger paradigmatic support leads to lengthening (Cohen 2014b; Kuperman et al. 2007). To test these hypotheses, an experimental study was carried out using 62 compound types taken from the British National Corpus. The compounds were spoken by 30 speakers, yielding more than 1500 acoustic tokens overall. The data provide no support for the Segmentability Hypothesis, and only limited support for the Informativity Hypothesis. 
In contrast, the Paradigmatic Support Hypothesis makes correct predictions: consonant duration at compound-internal boundaries is positively correlated with the probability of the relevant consonant following the first noun, and the duration of compound-internal geminate consonants is negatively correlated with the family size of the first noun. In other words, longer durations are associated with lower paradigmatic diversity.
... The intervention was designed to involve the sampled students in carrying out language research projects using free or trial versions of corpus analysis tools such as Voyant Tools (accessible at https://voyanttools.org/), the Linguistic Inquiry and Word Count software (accessible at http://liwc.wpengine.com/), the British National Corpus (BNC) web interface (CQP-Edition) (accessible at http://corpora.lancs.ac.uk/BNCweb/) (Harvey, 2010; Hoffmann et al., 2008) and Raven's Eye (accessible at https://ravens-eye.net/). The intervention lasted one year and four months, from September 2019 to the end of October 2020. ...
Article
The purpose of the study was to explore how technological advances incorporated into the Philology Studies curriculum could impact the students’ research skills and the quality of their research projects, and what students’ and teachers’ impressions of the reshaped research component of the curriculum were. The study used qualitative and quantitative methods, with qualitative methods predominating. It employed a baseline study, a checklist to assess students’ research papers, assessment criteria, and the Triangular Assessment Method to assess the students’ papers. A consensus meeting was held to allow the experts to express their reasoning for the scores. A semi-structured interview was administered to the students and teachers to identify their impressions of the reshaped research component of the philology curriculum. The technological advances incorporated into the Philology Studies curriculum improved the students’ research skills and the quality of their research projects. Both students and teachers appreciated the reshaped research component of the curriculum. Analytical software can be successfully incorporated into corpus-analysis student research. The students found the intervention a challenging experience that ‘pumped up’ their intellectual, research, and technical skills. They reported improvement in interpreting corpus data using correlations, frequencies, and distributions, and in collecting information using software to organise it in a professional way. The lecturers agreed that the technology-based instructional model incorporated into the Philology Studies curriculum improved both students’ research skills and the quality of their research projects.
... An individual word query was performed for the expression [bio*]. The expression was designed using the Simple Query Syntax (Hoffmann et al. 2008), where the symbol <*> functions as a wildcard denoting zero or more characters. For the purposes of this study, the wildcard was positioned after the prefix bio- to request a search for word formations that begin with bio- and are followed by zero or more characters, to return the widest possible range of BID terms. ...
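The behaviour of such a trailing wildcard can be illustrated with a short Python sketch. This is not BNCweb's implementation: it assumes the wildcard is written `*`, that it stands for zero or more characters, and that matching is case-insensitive; the helper names and the word list are invented for the example.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a simple wildcard pattern into a compiled regular expression.

    Assumption: '*' is the only wildcard and means zero or more characters.
    """
    # Escape the literal pieces and join them with '.*' where '*' stood.
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(literal_parts), re.IGNORECASE)


def match_words(pattern, words):
    """Return the words that match the wildcard pattern in full."""
    regex = wildcard_to_regex(pattern)
    return [word for word in words if regex.fullmatch(word)]


# Invented word list, for illustration only.
words = ["bio", "biology", "Biomimetic", "symbiosis", "antibiotic", "bio-inspired"]
matches = match_words("bio*", words)
```

Because the wildcard sits only after the prefix, words that merely contain "bio" (such as "symbiosis" or "antibiotic") are not returned, while the bare form "bio" is, since zero following characters are allowed.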
Article
The research presented in this article tackles the problem of terminological disharmony within biologically informed disciplines (BID). Lexical semantic theories and methods are applied to corpus-based investigations to assess the scope of BID terminology. The results are analysed using statistical and qualitative methods and mapped against known academic domains. The resulting map is evaluated via the analysis and consequent positioning of biologically informed textile research. The findings suggest that the experimental framework embodies an alternative approach to mapping practice within the BID landscape that overrides the need for broad, generic terms. Instead, it presents the work within an established network of theories and concepts with transparent interdisciplinary connections.
... A representative and balanced sample of written and/or spoken texts is compiled in a linguistic corpus, and, thus, observations on linguistic behavior of queried items on this corpus constitute both quantitative and qualitative linguistic findings. These findings lead linguists to make generalizations on typical and central properties of that language overall (see Hoffmann et al. 2008). Otherwise, "without representativeness whatever is found to be true of a corpus, is simply true of that corpus-and cannot be extended to anything else" (Leech 2007, p. 135). ...
Chapter
Full-text available
Usage-based linguistic studies have gained new insights as corpus-based and corpus-driven analyses have advanced in recent years. Linguists working in different domains have turned to corpora as a major source in their study of language at all levels of representation. Currently, corpus linguistics is evolving into a sophisticated methodology in extracting and analyzing data. Building and using corpora in Turkish linguistics is a recent undertaking, initially motivated by work on natural language processing (NLP) research. The number of available corpora is increasing and linguistic research has come to test hypotheses on attested data, or uncover more lexical and grammatical patterns of use that have gone unnoticed in the absence of corpus data. Advances in NLP research and tools provided for corpus building and annotation further contribute to corpus studies in Turkish linguistics. (To cite: Aksan, Mustafa & Yeşim Aksan (2018). Linguistic corpora: A view from Turkish (pp. 301-327). Kemal Oflazer & Murat Saraçlar (Eds.) Studies in Turkish Language Processing. Springer Verlag.)
... Of the available association measures (AMs), such as T-scores and Log Dice, only MI scores were used as a reference to calculate the probability of co-occurrence of the collocations, for several reasons. First, T-scores are considered to be the best indicator for lexical PP-verb collocations among all association measures (Hoffmann et al., 2008), so they were not the best alternative for measuring the strength of adj+noun combinations. Although another AM, Log Dice, has been introduced as an alternative to MI scores, it has not yet been explored enough in language learning research (Gablasova et al., 2017). ...
... Access to the Spoken BNC1994 audio material is free of charge. Moreover, users can choose between two formats: 1) the complete WAV audio files are available for download from Audio BNC (Coleman et al. 2012), and 2) the BNCweb online interface allows users to play back, as well as download, the audio of the query match and its immediate context (Hoffmann et al. 2008; Arndt-Lappe submitted). The only downside is that neither Audio BNC nor BNCweb provides access to the complete dataset. ...
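For orientation, the three association measures named in this excerpt can be computed from four counts: the observed co-occurrence frequency, the frequencies of the two words, and the corpus size. The Python sketch below uses the commonly cited textbook formulations (MI as the base-2 logarithm of observed over expected frequency, the t-score as observed minus expected divided by the square root of observed, and Log Dice as 14 plus the base-2 logarithm of the Dice coefficient). It is an illustration under those assumptions, not the exact calculation of any particular tool, which may additionally adjust the expected frequency for the size of the collocation window. All counts are invented.

```python
import math


def expected_frequency(f_node, f_collocate, corpus_size):
    """Co-occurrences expected by chance if the two words were independent."""
    return f_node * f_collocate / corpus_size


def mi_score(observed, f_node, f_collocate, corpus_size):
    """Mutual information: log2 of observed over expected frequency."""
    expected = expected_frequency(f_node, f_collocate, corpus_size)
    return math.log2(observed / expected)


def t_score(observed, f_node, f_collocate, corpus_size):
    """T-score: (observed - expected) / sqrt(observed)."""
    expected = expected_frequency(f_node, f_collocate, corpus_size)
    return (observed - expected) / math.sqrt(observed)


def log_dice(observed, f_node, f_collocate):
    """Log Dice: 14 + log2(2 * observed / (f_node + f_collocate))."""
    return 14 + math.log2(2 * observed / (f_node + f_collocate))


# Invented counts, for illustration only.
observed, f_node, f_collocate, corpus_size = 50, 1_000, 2_000, 1_000_000

mi = mi_score(observed, f_node, f_collocate, corpus_size)
t = t_score(observed, f_node, f_collocate, corpus_size)
ld = log_dice(observed, f_node, f_collocate)
```

Note that Log Dice does not depend on corpus size, whereas MI and the t-score both do through the expected frequency; MI tends to favour rare, exclusive pairings, and the t-score tends to favour frequent ones.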
Article
Full-text available
This article aims to describe key challenges of preparing and releasing audio material for spoken data and to propose solutions to these challenges. We draw on our experience of compiling the new London-Lund Corpus 2 (LLC-2), where transcripts are released together with the audio files. However, making the audio material publicly available required careful consideration of how to, most effectively, 1) align the transcripts with the audio and 2) anonymise personal information in the recordings. First, audio-to-text alignment was solved through the insertion of timestamps in front of speaker turns in the transcription stage, which, as we show in the article, may later be used as a valuable complement to more robust automatic segmentation. Second, anonymisation was done by means of a Praat script, which replaced all personal information with a sound that made the lexical information incomprehensible but retained the prosodic characteristics. The public release of the LLC-2 audio material is a valuable feature of the corpus that allows users to extend the corpus data relative to their own research interests and, thus, broaden the scope of corpus linguistics. To illustrate this, we present three studies that have successfully used the LLC-2 audio material.
... The study employed two random samples from the written BNC (BNCw), accessed via BNCweb (Hoffmann et al. 2008). The first random sample contained if-conditionals; the second contained sentences from the whole BNCw, and was used to represent written British English as a whole (henceforth, baseline). ...
Article
Full-text available
This paper discusses the frequency distribution of the types of if-conditionals recognised in the corpus-based classification developed in Gabrielatos (2010: 230-265). It is pertinent to mention at the outset that if-conditionals have been estimated to account for about 80 per cent of all conditional constructions in written British English (Gabrielatos 2010: 49). The classification was partly adapted from Quirk et al. (1985: 1072-1097), and was based on two interrelated criteria: a) the nature of the link between the two parts of a conditional, (henceforth, protasis and apodosis, respectively) and b) the modal nature of the apodosis. The quantitative analysis discussed here provides insights into the nature of each type, and the ways that the interaction of the type of link between protasis and apodosis, and the type of modality expressed by the apodosis gives rise to their potential for use in communication.
...
1. have + past participle verb
2. has + past participle verb
3. 've + past participle verb
4. 's + past participle verb
(Hoffmann et al., 2008)
Similar to the analysis of general corpora, the concordance lines with modal verbs found in TB were excluded. After all data from TB were filtered, similar steps were taken in analysis of PP samples in TB, including the use of co-raters, in which case the percentage agreement reaches 84.72%. ...
Article
Full-text available
One of the reasons why EFL learners have difficulties with the English present perfect tense is that little attention has been paid to the relationship between patterns and meanings of the tense (Yoshimura et al., 2014). To fill this gap, the present study takes a corpus-driven approach to the pattern-meaning interface of the present perfect, using British and American English corpora. It is found that the present perfect can express seven groups of meanings: 'accomplishment with relevance to the present', 'continuing from the past to the present', 'change of condition', 'experience', 'recency', 'discovery', and 'possession'. These meanings are found to be associated with distinctive co-occurrence patterns. The corpus-informed insights were then applied to an analysis of present perfect instances presented in textbooks used in Thailand's universities. It is found that the corresponding patterns and meanings can also be found in the sampled textbooks, but the textbooks tend to under-present a core meaning and highlight only a few uses of the present perfect. The study thereby offers a new perspective on the English present perfect and also provides empirical evidence for development of EFL textbooks and teaching materials.
Chapter
In this chapter Ranger studies various different values of the discourse marker “anyway”. After a summary of previous work on the question, the author proposes a schematic form according to which “anyway” specifies that an end-point or conclusion, q, is located indifferently relative to more than one possible path of access, p or p*. This abstract description is illustrated with a series of values labelled “concessive”, “additive” and “corrective” on the one hand, and “resumptive”, “topic-changing”, “conclusion” or “closure”, on the other. These contextually situated values of “anyway” depend upon the nature of the terms related and the nature of the relation, while the referential values for q result from its dialectic confrontation with the different configurations of the p / p* pair.
Chapter
This chapter studies the markers “indeed” and “in fact”. After an overview of previous work on the question, the markers are described in terms of a schematic form according to which both markers determine a proposition q relative to a previous proposition p, the first marking identification, the second differentiation. The schematic forms are illustrated with a series of contextualised examples of different values. A provisional discussion is followed by an exploration of collocational affinities for “indeed” and “in fact” based on targeted corpus queries. These are shown to provide independent support for the proposed modelisation. A number of further non-prototypical cases of each marker are studied before a conclusion which situates the approach relative to that of existing studies cited earlier.
Chapter
In this chapter Ranger looks at three values of “like”. The schematic form for prepositional “like” serves as a template for discourse marking and quotative uses. Prepositional uses of “like” can construct two values: similarity and exemplarity. In both cases a locatum is determined by virtue of a property consensually shared in common with a locator. Values of exemplarity serve as a basis from which to derive discourse marking values of “like”, the difference being that in discourse marking values, the locator refers to linguistic rather than extralinguistic material. Quotative “be like” highlights the quoted material as emblematic of a generic situation or as a plausible report among others. The co-construction of meaning implied in the schematic form of “like” additionally lends itself to the construction of shared linguistic identities.
Chapter
This chapter first reconsiders the questions posed by the term “discourse marker” in the framework of the Theory of Enunciative and Predicative Operations (TEPO). After presenting the epistemological and methodological underpinnings, Ranger introduces the tools of the theory in the form of a limited number of fundamental operations and operands from which more complex polyoperations may be constructed. The multicategorial, multifunctional nature of discourse markers, evoked in Chap. 1, is revisited in the light of the TEPO. The concept of the schematic form is shown to transcend the polysemy versus monosemy debate, favouring a dynamic concept of meaning construction in which the “meaning” of an item corresponds to its specific latitudes of variation, which are configured in context to produce situated values. The chapter concludes with a specifically enunciative description of discourse marking.
Article
Full-text available
In this study, the pragmatic functions of evet ('yes'), which serves as an interactional marker (IM), are identified and examined within the framework of conversation analysis. The aim is to reveal the interactional properties and functions of evet, as determined from its occurrences in the Spoken Turkish Corpus, together with quantitative results and with its pragmatic contributions taken into account. The findings show that evet is used most often with the function of confirmation, followed by the functions of agreement, continuation, answering a question, and digressing from or closing the topic, and that it is used mostly in short encounters. With these functions, evet was found to be used effectively to keep the flow of conversation running smoothly and to keep the interaction in balance.
Article
Full-text available
Worldwide, the corpus-linguistic method has been used in various fields of linguistics for more than 50 years. With the help of general-purpose and special-purpose spoken or written corpora, quantitative and qualitative results on the topic under investigation are obtained on the basis of attested language data, so that the grammatical, semantic or pragmatic properties of the language studied can be described comprehensively (e.g., Biber, Johansson, Leech, Conrad, Finegan, 1999; Aijmer and Stenström, 2004; McEnery, Xiao, Tono, 2006). As examples from around the world show, being able to use corpus tools and the corpus-linguistic method effectively will play an important role in carrying out and spreading corpus-based and corpus-driven research on Turkish as well. In this study, the interface features of the Turkish National Corpus (Türkçe Ulusal Derlemi, TUD) Demo Version are briefly introduced, and the calculation and listing properties of the statistical association measures that the interface uses to build numerically ranked collocation lists for the queried word, together with their place in linguistic analysis, are explained with examples.
Article
Full-text available
Lay abstract: Any thriving society must recognise, accept and celebrate all of its diverse talent. But how accepting is British society towards autism and autistic people? This research addressed this question through the lens of the press since the press both reflects and helps shape public attitudes towards various social categories. We used specialised 'corpus-based' methods to carry out a large-scale study, which examined all articles referring to autism or autistic people in 10 national British newspapers in the period 2011-2020. We first investigated how often newspapers referred to autism. We found that the coverage of autism increased slightly over the years, suggesting that autism was becoming an increasingly newsworthy topic. Furthermore, the rise in autism coverage differed considerably between individual newspapers: it was more pronounced in the broadsheets than tabloids, and in left-leaning than right-leaning newspapers. But what was the focus of these articles? We found that newspapers emphasised the adversities associated with autism and portrayed autism with a lot of negative language. Newspapers also tended to focus on autistic children, and particularly on boys. There were some signs of change in more recent years, with some newspapers now representing autism as a difference and, in addition, referring to more diverse groups of autistic people. However, these changes tended to be confined to broadsheets and left-leaning newspapers. Our findings suggest that representations of autism in the contemporary British press are skewed towards stereotypically negative views, which may well hinder the acceptance of autism and the fostering of a more inclusive society.
Article
Full-text available
Abstract: The imperative and optative inflected forms of bakmak ('to look'), a verb of visual perception, have undergone semantic bleaching, moved away from conveying a perceptual meaning, and come to be used as discourse markers. In her study based on Spoken Turkish Corpus data, Ruhi (2011) establishes that the second person singular imperative forms (bak, bak-sana) primarily serve the speaker as an attention-getting call and a means of emphasis in discourse, while the first person singular (bak-ayım) and first person plural (bak-alım) optative forms are used to initiate a topic and to close a conversation. This study aims to identify, using data from a larger corpus that also has a spoken component (the Turkish National Corpus), the patterns displayed by examples of the imperative form of bakmak, and to examine the variation in its attention and exclamation functions and their contexts of use. Examining the patterns in discourse together with the numerical values will provide basic data sets for more detailed speech-act studies. (To cite: Aksan, Mustafa & Umut Ufuk Demirhan (2017). Bakmak Eylemi ve Söylem İşlevleri: Eşdizimlilik Örüntülerinin Gösterdikleri. Dil ve Edebiyat Dergisi, 14(2), 85-107.)
Article
Full-text available
This paper explored a less-studied preposition, after, and its uses in a corpus. The analysis included an inspection of dictionary senses, followed by the distributional patterns of parts-of-speech, the co-appearing verbs of after, the contextual uses of [-ly after] and [-ly after+NOUN], and finally a pre- and post-teaching comparison investigating the effectiveness of using a corpus. The goal was to see what semantic information is embedded in after in each of these analyses and to explain the contextual uses of after that are not often realized. We used probes and corpus techniques to search for instances in the British National Corpus to retrieve expressions of after that appear in certain contexts. Based on the data, two conceptualizations of after, comprising the relations of two events, were postulated. For the uses of after and its two constructions [-ly after] and [-ly after+NOUN], we also tested their production by EFL students before and after a corpus workshop was introduced to EFL learners. The results showed that the workshop not only was effective, but also helped accelerate students’ speed of generating expressions of after. This paper could serve as a sample work for research on function words, and it also has pedagogical implications regarding the effectiveness of using a corpus for EFL learners.
Article
This special issue focuses on key issues in epicentral research. Against the background of a brief discussion of the epicentre metaphor in the world Englishes paradigm, the various regional constellations in Africa, America, Asia and Australasia relevant to the concept of linguistic epicentres are highlighted and the temporal framing of epicentral influence is discussed with regard to often‐assumed evolutionary prerequisites of potential linguistic epicentres. More fine‐grained methodological perspectives on linguistic epicentres are subsequently provided with regard to: (1) linguistic variables studied; (2) data sources available; and (3) statistical approaches. Before short summaries of the papers featured in this special issue are presented, the role of linguistic epicentres as a component in modelling world Englishes is scrutinised.
Article
Full-text available
Animals exhibit a fascinating variety of skin patterns, but mechanisms underlying this diversity remain largely unknown, particularly for complex and camouflaged colorations. A mathematical model predicts that intricate color patterns can be formed by “pattern blending” between simple motifs via hybridization. Here, I analyzed the skin patterns of 18,114 fish species and found strong mechanistic associations between camouflaged labyrinthine patterns and simple spot motifs, showing remarkable consistency with the pattern blending hypothesis. Genomic analyses confirmed that the coloring on multiple labyrinthine fish species has originated from pattern blending by hybridization, and phylogenetic comparative analyses have further substantiated the pattern blending hypothesis in multiple major fish lineages. These findings provide a plausible mechanistic explanation for the characteristic diversity of animal markings and suggest a novel evolutionary process of complex and camouflaged colorations by means of pattern blending.
Research Proposal
Full-text available
The dissertation is devoted to the study of matrix profiling of the semantics of verbs, their forms and phrasal verbs denoting human emotional states, i.e. emotives (V. I. Shakhovsky), in the British National Corpus (hereafter BNC), which provided structurally processed and annotated material for verifying this inventory of English units and further organising it into a linguistic database modelled on a cognitive-tag semantic matrix. Modern digital technologies, which have considerably expanded the subject matter and tasks of corpus linguistics, open up new opportunities for the development of fundamental science studies in general, especially in the search for an answer to the question of how “natural-language mechanisms work in information and computer systems” (V. I. Zabotkina). The rapid development of corpus linguistics in the 20th and early 21st centuries (T. O. Anokhina, O. O. Boryskina, O. I. Vanivska, N. B. Gvishiani, N. P. Darchuk, O. M. Demska-Kulchytska, V. V. Zhukovska, Ye. A. Karpilovska, N. Ye. Lemish, V. O. Plungian, E. V. Suvorina, W. N. Francis, S. Hoffmann, S. Evert, G. Kennedy, T. McEnery, T. Otlogetswe, J. Sinclair, J. Svartvik, E. Tognini-Bonelli and others) has shown that its new methods will bring researchers closer to uncovering the ontological essence of the cognitive-semantic processes of naming those objects and subjects that not only surround a person in the world but directly affect their emotional state and feelings. It is precisely the integration of the achievements of cognitive semantics with corpus tools that makes it possible to apply a procedure such as matrix profiling (R. Langacker) to the analysis of lexico-grammatical units with an emotive component of meaning, which are unique to each language (L. G. Babenko, I. A. Halutskykh, M. V. Hamziuk, T. V. Larina, O. I. Smashniuk and others). In English, such specific formations include phrasal verbs (N. V. Avdevich, S. B. Berlizon, A. A. Voskres, K. Ye. Golubkova, Yu. O. Zhluktenko, S. Lindner, J. Povey and others).
Article
This study aims to explore the production of British English monophthongs by L2 Thai learners using an impressionistic study as well as to investigate the relationship between the target-like production of these monophthongs with vocabulary size, sex of speakers, target vowel, and L2 experience. Seventy-eight L2 Thai learners in Thailand produced 11 British English monophthongs /iː, ɪ, e, æ, ɒ, ɑː, ɔː, ʊ, uː, ʌ, and ɜː/ in the /b_t/ context. They were also tested for their English vocabulary size. The impressionistic study showed that participants had no difficulty producing most target vowels /ɑː, iː, uː, ɜː, æ, e, ɪ, and ʌ/ but found the other three vowels /ɔː, ɒ, and ʊ/ difficult. Results also showed that vocabulary size, sex of speakers, and L2 experience did not have a significant correlation with the target-like production of the monophthongs but did have an effect on target vowels. The implications of these results were discussed in the context of L1 positive transfer, the influence of spelling on the tested word, the influence of lip rounding of the vowels on the production of the monophthongs, and word frequency.
Article
Because of the ubiquity and importance of collocations in language use/learning, how to effectively and efficiently identify collocations has been a topic of interest. Although some studies have evaluated many of the existing association measures (AMs) used in the automatic identification of collocations, the results so far have been inconsistent and unclear due to various limitations of the existing studies. Hence, this study makes a multi-dimensional evaluation of the effectiveness and efficiency of seven major AMs in the identification of three types of collocations across five genres and seven corpora of different sizes. The results indicate that while a few AMs, such as Log Likelihood Ratio and Cubic Mutual Information (MI3), are consistently more effective and efficient than the other five AMs being examined, no one AM alone may be adequate in the identification of different types of collocations across different genres and corpus sizes. Research implications are also discussed.
Chapter
Corpus linguistics continues to be a vibrant methodology applied across highly diverse fields of research in the language sciences. With the current steep rise in corpus sizes, computational power, statistical literacy and multi-purpose software tools, and inspired by neighbouring disciplines, approaches have diversified to an extent that calls for an intensification of the accompanying critical debate. Bringing together a team of leading experts, this book follows a unique design, comparing advanced methods and approaches current in corpus linguistics, to stimulate reflective evaluation and discussion. Each chapter explores the strengths and weaknesses of different datasets and techniques, presenting a case study and allowing readers to gauge methodological options in practice. Contributions also provide suggestions for further reading, and data and analysis scripts are included in an online appendix. This is an important and timely volume, and will be essential reading for any linguist interested in corpus-linguistic approaches to variation and change.
Article
This paper introduces the Spoken British National Corpus 2014, an 11.5-million-word corpus of orthographically transcribed conversations among L1 speakers of British English from across the UK, recorded in the years 2012-2016. After showing that a survey of the recent history of corpora of spoken British English justifies the compilation of this new corpus, we describe the main stages of the Spoken BNC2014's creation: design, data and metadata collection, transcription, XML encoding, and annotation. In doing so we aim to (i) encourage users of the corpus to approach the data with sensitivity to the many methodological issues we identified and attempted to overcome while compiling the Spoken BNC2014, and (ii) inform (future) compilers of spoken corpora of the innovations we implemented to attempt to make the construction of corpora representing spontaneous speech in informal contexts more tractable, both logistically and practically, than in the past.