Book

Corpus Linguistics: Method, Theory and Practice

Authors: Tony McEnery, Andrew Hardie

Abstract

Corpus linguistics is the study of language data on a large scale – the computer-aided analysis of very extensive collections of transcribed utterances or written texts. This textbook outlines the basic methods of corpus linguistics, explains how the discipline of corpus linguistics developed and surveys the major approaches to the use of corpus data. It uses a broad range of examples to show how corpus data has led to methodological and theoretical innovation in linguistics in general. Clear and detailed explanations lay out the key issues of method and theory in contemporary corpus linguistics. A structured and coherent narrative links the historical development of the field to current topics in 'mainstream' linguistics. Practical tasks and questions for discussion at the end of each chapter encourage students to test their understanding of what they have read and an extensive glossary provides easy access to definitions of technical terms used in the text.
... And with these reflections come questions such as, are these tools being leveraged to their fullest potential, or is there a risk of becoming overly reliant on technology at the expense of deeper linguistic insight? (McEnery and Hardie, 2012). ...
... In the past, the primary goal often centred on academic pursuits, such as understanding language structure, variation, and use. Today, there is an increasing emphasis on applying research findings to real-world issues (McEnery and Hardie, 2012), from enhancing educational materials (Biber et al., 2002; Römer, 2009) to addressing biases in AI systems (Blodgett et al., 2020). In other words, corpus linguistics is no longer confined to purely academic inquiries, as in its beginnings. ...
... Perhaps one of the most significant developments is the 'democratization' of corpus linguistics (McEnery and Hardie, 2012). In the beginning, access to corpora and tools was limited to a select few, often requiring significant institutional backing. ...
Article
This paper explores the evolving landscape of corpus linguistics, focusing on the impact of artificial intelligence (AI) and its social implications. Over the past two decades, the study of language through corpus linguistics has evolved significantly, prompting ongoing reflection on the field's transformation. These reflections naturally give rise to pressing questions about how corpus linguistics will evolve in a world defined by rapid technological progress and changing societal priorities. To validate the suppositions and reflections addressed in this contribution, the study explores a corpus that comprises scholarly papers from scientific journals and a collection of AI-related articles taken from the media. This dual corpus enables a comparative analysis of how AI-driven corpus linguistics is represented, in order to explore how the integration of artificial intelligence is transforming corpus linguistics, and hence the methodological, theoretical, and socio-political implications of this shift. The methodological framework combines quantitative corpus analysis with qualitative discourse analysis. Collocation and keyword frequency retrieval is applied to identify prevalent themes. As expected, the academic literature emphasizes methodological advancements and data-driven rigor, while media discourse highlights ethical concerns and societal implications. These findings support the overview and contribute to understanding how AI is shaping both the practice and perception of corpus linguistics in contemporary society.
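The keyword retrieval step this abstract describes is commonly implemented as a keyness comparison between two corpora. A minimal sketch using Dunning's log-likelihood; the two toy "corpora" and all counts below are invented for illustration, not the study's data:

```python
from collections import Counter
import math

def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Dunning's log-likelihood keyness score for a word observed
    freq_a times in corpus A and freq_b times in corpus B."""
    expected_a = total_a * (freq_a + freq_b) / (total_a + total_b)
    expected_b = total_b * (freq_a + freq_b) / (total_a + total_b)
    ll = 0.0
    if freq_a:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll

# Toy stand-ins for the academic and media corpora (invented data).
academic = "the corpus method shows robust collocation patterns in data".split()
media = "ai ethics concerns ai bias worry the public about ai".split()

fa, fb = Counter(academic), Counter(media)
ta, tb = sum(fa.values()), sum(fb.values())

# Rank media-corpus words by keyness relative to the academic corpus.
keywords = sorted(fb, key=lambda w: log_likelihood(fb[w], tb, fa[w], ta),
                  reverse=True)
print(keywords[0])  # "ai" dominates the media sample
```

Words frequent in one corpus but rare in the other float to the top, which is exactly how the academic/media contrast in themes would surface.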
... To analyse DP deletion and DP movement in passive voice, the current study employed the technique of vicinities to determine how passive voice is used with other syntactic structures in adjacent areas. The term vicinity was originally coined by corpus linguists (Sinclair, 1996; McEnery & Hardie, 2011). The boundaries to the left and to the right of the key are investigated simultaneously. ...
... Once the data were collected, the concept of vicinity, i.e. the areas near or surrounding the key, was applied following McEnery and Hardie (2011). Vicinity here refers to the whole sentential boundaries to the left and to the right that were collocated with passive constructions, the target sentence structure in this study. ...
Article
Full-text available
This study examines the syntax of DP deletion and the pragmatics of DP movement in passive voice from an applied linguistics methodology. Data were collected through purposive sampling, drawn alternately from Q1 SCOPUS publications, Thai national publications (TCI 1), and Thai undergraduate students' independent studies (IS) from a private university. There were 99 tokens. The data analyses were linguistic and inferential-statistical: the linguistic analysis follows generative grammar, whereas the statistical analysis uses inferential statistics in SPSS 29. The results of the study showed similarities in DP by-phrase agent deletion between Q1 SCOPUS, TCI 1, and IS. However, the results in these publications differed pragmatically. The discussion is framed syntactically and pragmatically. The DP arguments in passive voice were omitted because the agents were widely known. Pragmatically, the movement of the DP argument in Q1 SCOPUS and TCI 1 complies with the theory of pragmatic discourse of givenness, while this did not apply in IS. It is expected that the results of this study will be useful for English learners in applying passive voice to write research methodology appropriately.
... The research applies both quantitative and qualitative methods. Corpora may be analyzed with the help of qualitative and quantitative methods (McEnery & Hardie, 2011). The quantitative method included the UAM corpus tool determining the results as frequencies and statistics, while the qualitative method involved the interpretation and analysis of the results. ...
Article
Full-text available
This study explores transitivity in the ideational meta-function of language in the editorials of two Indian English newspapers, namely the Indian Express and the Hindu. This research aimed to examine the representation of Kashmir's special status in Indian newspaper editorials and to find the frequently appearing types of transitivity processes in Indian newspapers about revoking Kashmir's special status. The transitivity analysis was carried out using the UAM corpus analysis tool, through which the text of the editorials was first tagged and then analyzed. The study selected 31 editorials from the Indian Express and the Hindu. This was a clause-level analysis, and all types of processes, participants and circumstances were investigated. The study utilized quantitative and qualitative data analysis methods. The quantitative findings from the UAM corpus tool revealed: transitivity processes = 498, participants = 897, circumstantial elements = 257. The material process accounted for 59.58% of the total (600), the mental process for 5.06% (51), the relational process for 16.39% (165), the verbal process for 6.06% (61), the existential process for 1.49% (15) and the behavioral process for 1.99% (20). Transitivity analysis stood as one of the significant discourse models to unveil the hidden meaning present in a text, and it allowed for the in-depth analysis of the selected editorials. The analysis revealed three major themes: the justification of the decision to revoke Article 370, the regional and international implications of the decision, and the counter-narrative of the decision. These three themes were decoded using transitivity analysis. The Indian editorials strategically utilized participants, processes and circumstances to present the decision about the revocation of Kashmir's special status. The linguistic choices used in the editorials unveiled power dynamics, tension, prospects, consequences, threats and other aspects.
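The percentage breakdown reported in such transitivity studies is a simple share computation over annotated clause counts. A sketch with invented counts (not the editorials' data):

```python
# Hypothetical process-type counts from a transitivity annotation
# (invented numbers, chosen to sum to 100 for readability).
counts = {"material": 50, "relational": 20, "mental": 15,
          "verbal": 10, "existential": 3, "behavioural": 2}

total = sum(counts.values())  # 100 clauses in this toy example
shares = {p: round(100 * n / total, 2) for p, n in counts.items()}
print(shares["material"])  # 50.0
```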
... The methodology of this study is structured around a corpus-assisted discourse analysis approach (Partington et al., 2004) that integrates the principles of corpus linguistics and discourse analysis. This approach is particularly suitable for examining linguistic and communicative shifts within a specified genre or domain, as it combines quantitative precision with contextual depth, enabling a detailed exploration of changes in language use and communicative practices within PILs. The corpus-based component provides a robust empirical foundation by quantifying linguistic features and identifying patterns and trends (McEnery and Hardie, 2012). This quantification is complemented by the discourse analysis component, which allows for an in-depth investigation of how linguistic features function within their specific social and regulatory contexts (Schiffrin et al., 2018). ...
... The tool provided quantitative data on academic and general vocabulary distribution, which enabled the comparison between the two genres, qualitative and quantitative. The vocabulary profiles were essential for determining whether qualitative and quantitative research papers used different levels of specialized academic vocabulary, supporting previous findings in the field (McEnery & Hardie, 2012). ...
Article
Full-text available
This study examines the abstract, introduction, results/discussion, and conclusion sections of quantitative and qualitative academic papers in the field of TEFL, focusing on their lexico-grammatical and move-structure features. Utilizing both quantitative and qualitative methods, the research explores potential differences between these two genre-specific corpora in terms of their linguistic and rhetorical characteristics. The analysis of move-structure was based on Swales' CARS model (2004). A mixed approach of computer-assisted and manual analysis was used to ensure validity. Fifty research articles from ELT journals, representing both quantitative and qualitative approaches, were selected for analysis. Statistical interpretation of the results, including vocabulary profiles, readability statistics, and move-step structures, was conducted using the non-parametric Mann-Whitney U test. The results, with a significance level of P < 0.05, indicated that most lexico-grammatical features across abstracts, introductions, results/discussion, and conclusions in quantitative and qualitative papers were not significantly different. However, move-structure analysis revealed distinct variations between the two genres across these sections. These findings provide valuable insights for academic researchers in the EFL context, suggesting that while research methodology is important, the choice of topic and the researcher's unique perspective may be more critical in shaping the study.
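The Mann-Whitney U statistic used for the group comparisons above can be computed directly: it counts, over all cross-group pairs, how often a value from one group outranks a value from the other. The scores below are invented, not the study's measurements; in practice `scipy.stats.mannwhitneyu` also supplies the p-value:

```python
def mann_whitney_u(xs, ys):
    """U statistic for xs vs ys: one point per pair with x > y,
    half a point per tie. Ranges from 0 to len(xs) * len(ys)."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Invented readability scores for two small groups of papers.
quantitative = [42.1, 39.5, 44.0, 40.2]
qualitative = [41.0, 43.2, 40.5, 44.1]
print(mann_whitney_u(quantitative, qualitative))  # 5.0; 8.0 would mean identical ranks
```

A U far from the null expectation (len(xs) * len(ys) / 2) signals a rank difference between the genres; a U near it, as here, does not.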
... The texts in the corpus do not cover all types of business texts. Involving both written and spoken forms (McEnery & Hardie, 2012; Grygiel, 2015), business discourses include the language used in letters, reports, academic textbooks (Johns, 1980), interviews, negotiations, business meetings, use of electronic media (Grygiel, 2015), as well as conversations of people in business organizations (Boden, 1994). This study excludes phonological and multimodal studies and limits itself to English-written business reports and their Chinese translations. ...
Article
Lexical cohesion involves the continuity of text on the level of lexis achieved through word choices; it embodies repetition, synonymy, and collocation (Halliday, 1985). This study attempts to elucidate the lexical cohesion and examine translation shifts in the English-Chinese translation of business texts. The English business texts are compared with two Chinese translations: one by human translators and the other by ChatGPT. The research is based on Halliday’s (1985) cohesion and Toury’s (2012) descriptive translation studies. The data of lexical cohesion are identified and collected manually in a parallel corpus. The analysis deals with the description of lexical cohesion and translation shifts. The research reveals that semantic meanings of the items of lexical cohesion are largely maintained in the English-Chinese business translation. Additionally, despite using translation methods like literal translation, addition, omission, and conversion in translating lexical cohesion, human translators make more translation shifts of lexical cohesion than ChatGPT.
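The simplest of the lexical-cohesion devices named above, repetition, can be surfaced mechanically by indexing the positions of repeated content words. A sketch with an invented sentence and a toy stop list (real studies, like this one, identify cohesion items manually):

```python
from collections import defaultdict

def repetition_chains(tokens, min_len=2):
    """Group positions of repeated content words: the simplest kind
    of lexical cohesion (repetition) in Halliday's sense."""
    stop = {"the", "a", "of", "and", "to", "is"}
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        if tok not in stop:
            positions[tok].append(i)
    return {w: idx for w, idx in positions.items() if len(idx) >= min_len}

text = "the report shows profit grew and the report links profit to exports".split()
print(repetition_chains(text))  # report and profit each recur
```

Synonymy and collocation, the other two devices, need lexical resources or association measures and are not covered by this sketch.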
... This conclusion can be put to use in teaching, which is the ultimate goal of our entire research. Its applicability is evident not only in the teaching method but also in the curriculum for teaching Italian, as has already been done with other languages [34]. ...
Article
Full-text available
The analysis of learner corpora makes it possible to retrieve recurrent word strings (word clusters) in learners' texts. In recent decades, this method has also been used to identify the lexical sequences that characterise different textual genres. The aim of the present study is to analyse the characteristics of the recurrent word sequences in texts written by students of the Department of Italian Language and Literature of the National and Kapodistrian University of Athens, in order to identify common patterns in their productions from which generalisations can be drawn. To this end, four subsections of SCRICREA, a learner corpus collecting the productions of students in a creative-writing course, were queried through a lexical-cluster search. The preliminary results of this procedure allow a number of observations on the extracted lexical clusters. First, the retrieved and analysed clusters confirm that students use a small but very widespread set of prefabricated patterns. Second, L1 interference can be identified as the main cause of the few but repeated errors in the sequences. These observations can have an immediate impact not only on teaching methods but also on the syllabus for teaching Italian, as has already happened for other languages.
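The lexical-cluster search described above amounts to counting recurrent n-grams across learner texts. A minimal sketch on an invented mini-corpus (not SCRICREA data):

```python
from collections import Counter

def bundles(texts, n=3, min_freq=2):
    """Recurrent n-word sequences (lexical bundles) across learner texts,
    most frequent first."""
    counts = Counter()
    for text in texts:
        toks = text.lower().split()
        counts.update(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return [(b, f) for b, f in counts.most_common() if f >= min_freq]

# Invented mini learner corpus for illustration.
essays = ["in my opinion the film is beautiful",
          "in my opinion the story is long",
          "i think that in my opinion it is good"]
print(bundles(essays, n=3))  # "in my opinion" recurs in all three
```

Prefabricated patterns of the kind the study reports show up as exactly these high-frequency bundles.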
... The study of the keywords of a poetic text using a corpus approach can also be an interesting and productive method of linguistic analysis. Some features of the corpus approach to identifying keywords are the following [8,15,24]: ...
Article
In today's information society, the analysis of textual materials and the identification of their key features are of great importance in various fields of science, in particular in corpus linguistics. It has been established that, despite the great potential of corpus methodologies in various areas of research, there is still a need to master them for practical use. The availability of large-scale computerized text corpora, improved thanks to the better digital infrastructure and the technological advances of the information age, provides a basis for linguistic research. Specialized software with powerful functions for processing and analysing the text corpora required for linguistic research is examined, along with its practical use in various studies. It is also shown that the bottleneck in the development of corpus-based approaches and related disciplines is not the absence of effective, powerful statistical or machine-learning algorithms, but researchers' access to them. The results of a study of the possibilities and methods of using corpus tools to identify and analyse the keywords of texts in corpus linguistics are presented. Software tools such as the corpus manager AntConc and the web system Sketch Engine are of great importance, making it possible to carry out a wide range of linguistic studies, including the analysis of genre features of texts. The study was conducted on a corpus of 35 Ukrainian riflemen's and insurgent songs.
The lexico-semantic features of the keywords are analysed, their roles in language analysis are established, and the functionality of corpus tools for finding and analysing them is studied in detail. The results of the analysis of the methods and tools used to analyse the texts of the riflemen's and insurgent songs, to identify keywords, and to reveal the main thematic and linguistic features of the songs under study are presented. For a comprehensive keyword analysis, the Collocates, N-Grams and Word List functions of the corpus manager AntConc were used, as well as the Keywords function of the web system Sketch Engine. It was found that among the keywords, the most frequent parts of speech are interjections, conjunctions and particles, which is typical of folk songs. Keywords expressed by nouns depict family ties, everyday military life and the soldiers' personal feelings. The share of adjectives and verbs is also quite significant. There is also a large number of word forms with affectionate-diminutive suffixes in songs of this genre, indicating a tender attitude towards the objects described. The results obtained are an important contribution to the advancement of corpus linguistics and to the integrated use of the software tools AntConc and Sketch Engine for keyword analysis.
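Keyword identification of the kind Sketch Engine's Keywords function performs is, at its core, a normalized-frequency comparison against a reference corpus. This sketch uses the "simple maths" ratio Sketch Engine documents, reimplemented on invented toy corpora (the song line and reference text are illustrative, not the study's data):

```python
from collections import Counter

def per_million(counter, size):
    """Normalize raw counts to frequencies per million tokens."""
    return {w: c * 1_000_000 / size for w, c in counter.items()}

def keyness(focus, reference, smooth=1.0):
    """'Simple maths' keyness ratio: (fpm_focus + k) / (fpm_ref + k).
    High values mark words typical of the focus corpus."""
    f = per_million(Counter(focus), len(focus))
    r = per_million(Counter(reference), len(reference))
    return {w: (f[w] + smooth) / (r.get(w, 0.0) + smooth) for w in f}

# Toy focus corpus (a folk-song-like line) and reference corpus.
songs = "oi the rifleman left his dear mother oi oi".split()
news = "the government said the economy grew".split()

scores = keyness(songs, news)
print(max(scores, key=scores.get))  # the interjection "oi" stands out
```

This mirrors the study's finding that interjections and other folk-song markers surface as keywords against a general reference corpus.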
... Aijmer (2009) proposes that corpus-based research holds significant potential in the preparation of textbooks and course books. A corpus is a collection of computer-readable texts that can be used for quantitative linguistic analysis, which supports the empirical, statistical investigation of language features (McEnery & Hardie, 2012, as cited in Zhou & Prado, 2024). A corpus can help textbook developers evaluate texts automatically on a large and systematic scale. ...
Article
Full-text available
Textbooks and exams are important aspects of English education as they essentially determine or influence what students learn. Previous studies have compared English language teaching (ELT) materials and proficiency exams to verify and ensure consistency and reliability in both testing and teaching materials designed for the same target students; a mismatch might lead to tests negatively affecting teaching and vice versa. However, the Chinese ELT context remains under-investigated. This research aims to provide researchers with hands-on suggestions for constructing and analysing ELT-specialised corpora, which can make future ELT material development and assessment practices more objective and statistics-based.
... One growing field in corpus linguistics is the learner corpus, which allows researchers to compare learners' authentic language with that of native speakers [7,8,11,12]. Learner corpora are particularly useful in error analysis, as they provide insights into common learner errors and their frequencies. ...
Article
Full-text available
Background The perceived language barrier in English is said to hinder, and in certain instances impede, the global dissemination of knowledge, including medical information, to non-native English speakers within medical institutions. As English for medical purposes instructors, we contend that the issue persists in medical universities across various EFL contexts. Medical students face the challenge of presenting their research findings in English for international journals and conferences. Given this, the present research study aimed to compile a comprehensive catalog of high-frequency errors and examine them in recurring linguistic patterns commonly found in the writing of Iranian medical students. Methods In conducting the present study, we developed a learner corpus of 1,040 essays (339,040 words and 18,235 sentences in total). Using the results obtained from WordSmith Tools 8 and sifting the learner corpus, we identified 11 high-frequency errors and five commonly used linguistic patterns. Results Only five out of 11 high-frequency errors account for 61% of the total number of errors. Results also showed that a majority of errors were grammatical in nature. In this regard, cohesion and cohesive devices (16%) were the most prevalent errors, followed by omission/misuse of articles/determiners (14%). Additionally, results showed that discourse markers were extensively used in the corpus (22.07%), followed by hedges (11.42%). Conclusions The outcomes of this study are expected to assist English for medical purposes instructors in designing focused lesson plans and classroom activities. Ultimately, these efforts might contribute to enhancing medical education in non-English speaking universities.
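The "five of 11 error types account for 61%" figure above is a cumulative-share computation over an error tally. A sketch with invented tags and counts chosen to mirror that pattern (not the study's actual catalog):

```python
from collections import Counter

# Hypothetical per-type error counts (invented; total of 100 tokens
# across 11 types so the shares read as percentages directly).
counts = Counter({"cohesion": 16, "article": 14, "agreement": 12,
                  "preposition": 10, "tense": 9, "spelling": 8,
                  "word_order": 8, "plural": 8, "punctuation": 6,
                  "capitalisation": 5, "run_on": 4})

total = sum(counts.values())
top5 = counts.most_common(5)
share = 100 * sum(n for _, n in top5) / total
print(f"top 5 of {len(counts)} types cover {share:.0f}% of errors")
```

Concentration like this is what makes a short list of error types a useful basis for focused lesson plans.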
... Such a linguistic corpus, derived from theoretical foundations and survey data, provides a rich dataset for analysing trends, patterns, and evolving perspectives in curriculum design from theoretical and practical viewpoints. McEnery and Hardie (2012) highlighted that a linguistic corpus offers an invaluable resource for empirical research, allowing for detailed linguistic analysis across diverse contexts. The global scope of the responses ensures a wide-ranging perspective, aligning with the principles outlined by Baker (2006) for corpus representativeness. ...
Article
Gender differences have been found in the way parents communicate online, however it is unclear whether these differences apply in the context of postnatal depression (PND). This research aimed to evaluate online discourses surrounding PND and explore gender differences in communication style associated with PND. X (formerly Twitter) data (15,850 posts) was identified and collected based on a key term search (e.g. ‘PND’) and analysed using corpus linguistic analysis. Results showed that female X users were more likely to discuss PND using words with a negative connotation or to use self-referent items, compared to male users who discussed PND more generally. X content related to PND was mostly created by female users and generally revolved around the experiences of mothers. The limited discussion regarding paternal PND suggests a lack of acknowledgement and insufficient online resources available for fathers.
Article
Full-text available
Applied Linguistics is a dynamic and interdisciplinary field that examines the practical applications of linguistic theories and concepts in real-world contexts. This study delves into the diverse dimensions of applied linguistics, aiming to provide an in-depth analysis of its historical roots, theoretical frameworks, methodological approaches, and practical implications. By examining key areas such as language acquisition, language teaching, sociolinguistics, and discourse analysis, this study aims to shed light on the profound impact of applied linguistics on various aspects of human communication and language use. The current study relies on a wide range of academic sources, including influential works, contemporary research articles, and theoretical frameworks. The historical development of applied linguistics is traced back to its roots in the mid-20th century, highlighting the major contributions and milestones that have shaped the field. Furthermore, the study explores the theoretical frameworks that underpin applied linguistics, including behaviorist, sociocultural, connectionist, and Universal Grammar perspectives, providing a deeper understanding of the conceptual foundations of the discipline.
Article
The current study analyzed Ernest Hemingway's short story "Hills Like White Elephants". The study investigated the transitivity process in the short story genre, for which Halliday's ideational meta-function was utilised as the theoretical framework. The transcript of the short story was taken as the research material. The research was corpus-assisted in nature, and since the basic unit for analysing transitivity is the clause, clauses from the text were used for the analysis. The researcher used the UAM corpus tool, and elements of transitivity, such as participants, processes, and circumstances, were examined. The results exhibited that the writer used the material process most recurrently to characterize actions and goings-on in the text. It allowed the writer to vividly depict events, making the text more engaging and informative and illustrating Hemingway's minimalist, iceberg-style writing. This study also helped to emphasize how linguistic analysis can aid understanding of the discourse of a text and the ideology of its writer, and it would also be valuable for researchers and students analysing and interpreting texts of diverse genres from the standpoint of systemic functional linguistics.
Keywords: Systemic functional linguistics, ideational meta-function, transitivity analysis, corpus-based.
References
Ahmad, S. (2019). Transitivity analysis of the short story "The Happy Prince" written by Oscar Wilde. IJOHMN (International Journal Online of Humanities), 5(2). https://doi.org/10.24113/ijohmn.v5i2.90
Um-e-Ammara, Anjum, R. Y., & Javed, M. (2019). A corpus-based Halliday's transitivity analysis of "To the Lighthouse". Linguistics and Literature Review, 5(2), 139–162.
Bloor, T., & Bloor, M. (2004). The functional analysis of English. Routledge. https://doi.org/10.4324/9780203774854
Burton, G. M. (1982). Writing numerals: Suggestions for helping children. Intervention in School and Clinic, 17(4). https://doi.org/10.1177/105345128201700405
Byrnes, H. (2009). Systemic-functional reflections on instructed foreign language acquisition as meaning-making: An introduction. Linguistics and Education, 20(1). https://doi.org/10.1016/j.linged.2009.01.002
Byrnes, H. (2012). The Routledge encyclopedia of second language acquisition (pp. 622–624). Routledge. https://doi.org/10.4324/9780203135945
Emayakre, N. N. (2021). Using transitivity analysis to reveal the nature of characters in Louisa May Alcott's Little Women. East African Journal of Education and Social Sciences, 2(4). https://doi.org/10.46606/eajess2021v02i04.0139
Ezzina, R. (2015). Transitivity analysis of "The Crying of Lot 49" by Thomas Pynchon. International Journal of Humanities and Cultural Studies, 2(3).
Gerot, L., & Wignell, P. (1994). Making sense of functional grammar. Antipodean Educational Enterprises.
Halliday, M. A. K. (1985). An introduction to functional grammar (1st ed.). London: Edward Arnold.
Halliday, M. A. K. (2009). The essential Halliday (J. Webster, Ed.). Bloomsbury Publishing.
Halliday, M. A. K., & Matthiessen, C. M. I. M. (2004). An introduction to functional grammar (3rd ed.). Hodder Arnold.
Halliday, M. A. K., & Matthiessen, C. M. I. M. (2014). Halliday's introduction to functional grammar (4th ed.). Routledge. https://doi.org/10.4324/9780203783771
Isti'anah, A. (2019). Transitivity analysis of Afghan women in Åsne Seierstad's The Bookseller of Kabul. LiNGUA: Jurnal Ilmu Bahasa dan Sastra, 14(2). https://doi.org/10.18860/ling.v14i2.6966
Jawaid, A., Batool, M., Arshad, W., ul Haq, M. I., Kaur, P., & Sanaullah, S. (2025). AI and English language learning outcomes. Contemporary Journal of Social Science Review, 3(1), 927–935. https://contemporaryjournal.com/index.php/14/article/view/387
Jawaid, A., Batool, M., Arshad, W., ul Haq, M. I., Kaur, P., & Arshad, S. (2025). English language vocabulary building trends in students of higher education institutions and a case of Lahore, Pakistan. Contemporary Journal of Social Science Review, 3(1), 730–737. https://contemporaryjournal.com/index.php/14/article/view/360
Jawaid, A., Mukhtar, J., Mahnoor, D. P. K., Arshad, W., & ul Haq, M. I. (2025). English language learning of challenging students: A university case. Journal of Applied Linguistics and TESOL (JALT), 8(1), 679–686. https://jalt.com.pk/index.php/jalt/article/view/370
Mahmood, M. I., & Hashmi, M. A. (2020). A corpus-based transitivity analysis of Nilopher's character in The Stone Woman. SJESR, 3(4). https://doi.org/10.36902/sjesr-vol3-iss4-2020(351-361)
Martin, J. R., & Rose, D. (2008). Genre relations: Mapping culture. Language in Society, 39(3).
McEnery, T., & Hardie, A. (2011). Corpus linguistics: Method, theory and practice. Cambridge University Press.
Mehmood, A., Amber, R., Ameer, S., & Faiz, R. (2014). Transitivity analysis: Representation of love in Wilde's The Nightingale and the Rose. European Journal of Research in Social Sciences, 2(4).
Meyer, C. F., Halliday, M. A. K., & Hasan, R. (1987). Language, context, and text: Aspects of language in a social-semiotic perspective. TESOL Quarterly, 21(2). https://doi.org/10.2307/3586740
Mushtaq, M., Saleem, T., Afzal, S., & Saleem, A. (2021). A corpus-based ideational meta-functional analysis of Pakistan Prime Minister Imran Khan's speech at the United Nations General Assembly. Cogent Social Sciences, 7(1). https://doi.org/10.1080/23311886.2020.1856999
Nguyen, H. T. (2012). Transitivity analysis of "Heroic Mother" by Hoa Pham. International Journal of English Linguistics, 2(4). https://doi.org/10.5539/ijel.v2n4p85
Qasim, H. M., Sabtin, M., & Talaat, M. (2018). A transitivity analysis of How to Get Filthy Rich in Rising Asia. ELF Annual Research Journal, 20, 181–200.
Walliman, N. (2010). Research methods: The basics. Routledge. https://doi.org/10.4324/9780203836071
Wang, J. (2010). A critical discourse analysis of Barack Obama's speeches. Journal of Language Teaching and Research, 1(3). https://doi.org/10.4304/jltr.1.3.254-261
Webster, J., & Watson, R. T. (2002). Analyzing the past to prepare for the future: Writing a literature review. MIS Quarterly, 26(2).
Xu, L. (2012). A comparative analysis of "The Little Mermaid" and "The Little Mer-persun": An SFG approach. Journal of Literature and Art Studies, 2(7), 723–729.
Yule, G. (2010). The study of language (4th ed.). Cambridge University Press.
Article
Full-text available
The current study analysed Ernest Hemingway's short story "Hills Like White Elephants", investigating the transitivity processes in the short story genre with Halliday's ideational metafunction as the theoretical framework. The transcript of the short story served as the research material. The research was corpus-assisted in nature, with the clause as the basic unit of transitivity analysis, so clauses from the text were extracted for examination. Using the UAM Corpus Tool, the researcher examined the elements of transitivity: participants, processes, and circumstances. The results showed that the writer used the material process most frequently to characterise actions and goings-on in the text. This allowed the writer to depict events vividly, making the text more engaging and informative and reflecting Hemingway's minimalist, iceberg-style writing. The study also emphasises how linguistic analysis can help to uncover the discourse of a text and the ideology of its writer, and it will be valuable for researchers and students analysing and interpreting texts of diverse genres from the standpoint of systemic functional linguistics.
Article
Full-text available
Word frequency is a fundamental concept in linguistics, computational linguistics, natural language processing (NLP) and language education, playing a critical role in understanding the characteristics and usage patterns of a word. This study introduces the "Turkish Word Frequency Tool" (TWFT), developed as part of the LexiTR Project, along with its features. TWFT is based on a balanced corpus of over 193 million words drawn from four distinct text types: academic, social media, fictional, and informative texts. TWFT serves as a scalable online platform that allows researchers to examine word usage trends across different text types. It enables comprehensive analyses through real-time querying, graphical data representation, and both raw and normalized frequency values. It also provides API support, presenting word frequency information in a structured format. By filling a significant gap in the existing literature, TWFT aims to establish a consistent, transparent, and comprehensive foundation for linguistic research and natural language processing applications.
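The distinction between raw and normalized frequency that the abstract mentions can be sketched as follows. This is a minimal illustration of the standard per-million-words scaling, not TWFT's actual implementation; the counts and subcorpus sizes are invented for demonstration.

```python
# Illustrative sketch: raw vs normalized word frequency across text types.
# All counts below are invented; TWFT's real data and internals differ.

def normalized_frequency(raw_count: int, corpus_size: int, per: int = 1_000_000) -> float:
    """Scale a raw count to occurrences per `per` tokens, so that
    frequencies from subcorpora of different sizes can be compared."""
    return raw_count * per / corpus_size

# Hypothetical counts of one word in two subcorpora of different sizes.
subcorpora = {
    "academic": {"raw": 1_200, "size": 50_000_000},
    "social_media": {"raw": 900, "size": 20_000_000},
}

for name, sub in subcorpora.items():
    nf = normalized_frequency(sub["raw"], sub["size"])
    print(f"{name}: raw={sub['raw']}, per-million={nf:.1f}")
```

Note that the word above is rarer in absolute terms in the social-media subcorpus (900 vs 1,200 hits) but more frequent once normalized (45.0 vs 24.0 per million), which is why comparisons across text types need normalized values.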
Article
Full-text available
The question of sovereignty was at the heart of the debates during the referendum on the United Kingdom's membership of the European Union (EU). The slogan "Take Back Control" of the official Leave campaign group, Vote Leave, encapsulates this idea. The objective was twofold. Thanks to Brexit, the UK would be able to take back control of its finances, in order to allocate to the struggling British health system the £350 million supposedly sent to the EU. More importantly, Brexit would curb uncontrolled immigration from the EU, since the UK would finally regain control of its borders. The notion of "border(s)" thus became central to the rhetoric of Vote Leave members, providing a frame for the pro-Leave discourse. That is precisely what this article examines. The analysis of a corpus of official documents allows us to study how the concept of "border(s)" was used by Vote Leave members to defend their pro-Brexit position. We can then see how the national-populist discourse of the leading Brexiteers contributed to the construction of a collective, and biased, representation of British identity.
Article
Establishing the credibility of scientific research involves several related but significantly different concerns. One potential problem in surveying different approaches to these concerns is that of terminology, as some of the basic terms used in the discussion — reproducibility, replicability, robustness, and generalizability — are often used in inconsistent or contradictory ways. This paper proposes to resolve such confusion by providing a terminological framework for discussing what kind of confirmation is necessary for a scientific study to be deemed credible. A study is said to be ‘reproducible’ if we can obtain identical results by performing an identical analysis on identical data, ‘replicable’ if we can obtain consistent results using the same analysis on different data, ‘robust’ if we can obtain consistent results from identical data using a different analysis, and ‘generalizable’ if we can obtain consistent results from different data using a different analysis.
Chapter
The World Health Organization (WHO 2023) claims that global warming is contributing to more intense and frequent heatwaves, leading to health problems such as heat-related illnesses and the exacerbation of pre-existing conditions such as cardiovascular and respiratory diseases. Mental health problems, including post-traumatic stress disorder (PTSD), have also been linked to climate change. Mitigating and adapting measures to climate change are considered crucial to safeguard public health (Abbas et al. 2022). In other words, it could be fruitful to view the intersecting challenges of planetary and human health through a polycrisis lens, which advocates for polysolutions (Henig and Knight 2023) that redress the interconnectedness of these domains. As O'Regan (2023) reminds us, though, polycrisis is not experienced in the same way by every individual. Consequently, organisations like the WHO, whose role is to raise awareness about climate change, need to tailor their communications to different audiences whose experiences of polycrisis differ.
This study examines how the WHO communicates the health impacts of climate change to different audiences. Using corpus linguistics and artificial intelligence, in particular generative pre-trained transformers (GPTs) from the MedAlpaca collection (Han et al. 2023a) trained on medical data, and drawing on CDA, the study analyses the WHO's regional Web platforms to identify the medical discourse in the context of the regional climate debate and to assess its significance. In addition, a multimodal analysis using Scikit-learn (Hackeling 2017) will identify image similarities and show how images support health communication goals. The study aims to demonstrate how climate change discourse adapts to different audiences in different regions to promote proactive policymaking, raise awareness, and encourage collective action: in Stibbe's (2014) words, a cluster of linguistic and multimodal features is used to convey a particular worldview.
Chapter
Goal 13 of the UNO Agenda 2030 highlights the urgent threat of climate change and its far-reaching consequences for both nature and society. To encourage action on climate change, the UNO Agenda 2030 redirects to the ACT NOW campaign site and the Climate Change website. Drawing on corpus linguistics, critical discourse analysis, and specifically the Discourse-Historical Approach (DHA), the aim of this chapter is to analyse if, and how, crises deriving from climate change are discursively communicated in official 2023 UNO documents. In order to do so, I concentrate on the linguistic analysis of three key terms (crisis and crises, challenge(s), and burden(s)): by taking into consideration aspects of nomination and predication, I will see if, in official UN documents, awareness of the various climate change crises is raised, and which discursive strategies are employed when they are discussed. Results seem to indicate that responsibility for the processes indicated in UNO documents is never specified. Instances of personification, passivization, and nominalization contribute to making the texts vaguer in terms of social actors' responsibility, thus hiding agency and systematically relegating responsibility for actions to the background.
Article
Full-text available
Introduction: This study analyses the lexical influence of French on nineteenth-century English culinary texts written by women, focusing on the identification and analysis of culinary terms of French origin. Methodology: Using a sub-corpus drawn from the Corpus of Women's Instructive Writing (1550-1899), a lexical analysis was carried out with the AntConc tool to examine the occurrences and co-occurrences of French terms. The Oxford English Dictionary was used to determine the etymology of the words. Results: The results show a significant presence of French terminology, mainly nouns related to ingredients, utensils and culinary techniques, underlining the importance of French influence on nineteenth-century English cuisine. Discussion: These lexical borrowings reveal a culinary and cultural evolution, enriching the English vocabulary and playing a key role in defining the culinary practices of the period. Conclusions: The findings show that French borrowings not only enriched the English lexicon but also reflected and promoted changes in culinary practices and in the social perception of gastronomy.
Article
Full-text available
Collocations typically refer to habitual word combinations, which not only occur in texts but also constitute an essential component of the mental lexicon. This study focuses on the mental lexicon of Chinese learners of English as a foreign language (EFL), investigating the representation of collocations and the influence of input frequency and L2 proficiency by employing a phrasal decision task. The findings reveal the following: (1) collocations elicited faster response times and higher accuracy rates than non-collocations; (2) higher input frequency improved the accuracy of judgments; (3) high-proficiency Chinese EFL learners exhibited better accuracy and faster response times in collocation judgment tests. Additionally, input frequency and L2 proficiency interactively affected both response time and accuracy rate. These results indicate that L2 learners have a processing advantage for collocations, which function as independent entries in the mental lexicon. Both input frequency and L2 proficiency are crucial factors in collocational representation, with increased input frequency and proficiency shifting the representation from analytic retrieval toward holistic recognition in a continuum pattern.
Article
This study, conducted within the lexical approach paradigm, examines the influence of context and corpus-based instructional practices on enhancing lexical proficiency. The research specifically focuses on the instruction of the polysemous verb "al-" (to take/get). The quantitative aspect of the study was designed using a pretest-posttest matched control group; 21 students at the B1-B2 level constituted the experimental group, while 20 students comprised the control group. In the qualitative aspect, interview and error analysis techniques were used; the qualitative results were used to elaborate the quantitative results. A retention test was conducted for the experimental group nine weeks after the post-test. Interview data were analyzed through inductive qualitative analysis, while activity booklets were examined through error analysis. The findings indicated that the lexical approach for the instruction of the polysemous verb 'al-' was effective in enhancing student achievement. The results of the achievement test showed that the mean scores of the experimental group on the post-test were higher than those of the control group in literal, connotative, figurative, and idiomatic meaning sub-dimensions, and this difference was statistically significant. According to the error analysis, the students showed a decreasing trend in the number of errors made in linguistic units containing the verb "al-" throughout the process. The results of the interviews support the positive outcomes of the implementation. It has been concluded that instructional practices prepared within the lexical approach are effective in enhancing vocabulary proficiency.
Article
Full-text available
If we consider corpus linguistics as the study of a language through its samples, we should credit its contribution to the advancement of various sub-fields of linguistics: lexicography, translation studies, applied linguistics, diachronic studies and contrastive linguistics. The latter can be regarded as a special case of linguistic typology, distinguished from other typological approaches by a small sample size and a high degree of granularity (Gast 2011: 2-3). Nowadays, corpus-based contrastive studies can be treated as a growing research area that focuses on two or more languages. The present paper discusses the usefulness of a specialized combined parallel-comparable corpus for dealing with legalese. Its effectiveness is demonstrated with the example of the legal institution fiducie. The research methodology comprises a comparative analysis as well as a corpus-based analysis of the terms related to the fiducie as presented in three varieties of French: France's, Canadian and Luxembourgish. The research reveals juridical-semantic differences and the problems of verbally realizing the concepts related to the three fiducie-s; these hinder proper interpretation and complicate the process of translation. The major solution is found through the specification of meaning by renaming.
Article
Full-text available
This study was motivated by the need to understand the ways users perform speech acts on social media platforms, specifically Twitter, Facebook, and Instagram, and how these acts differ between public and private contexts. The purpose was to analyse the frequencies and types of speech acts (requests, apologies, and compliments) and identify the linguistic and pragmatic strategies employed. Using a mixed-methods research design, a corpus of 3 million posts was collected and analysed. Stratified random sampling ensured a balanced representation of speech acts, and both manual annotation and machine learning techniques were used for classification. Three major findings emerged: first, requests were significantly more frequent and direct in private messages than in public posts across all platforms; second, public apologies were more formal and detailed, while private apologies were concise and personal; third, Instagram had the highest frequency of compliments, with public posts being more explicit and enthusiastic compared to private messages. The study concluded that context and platform-specific features heavily influence communication strategies. These insights advance theoretical understanding and offer practical applications for optimizing social media communication.
Article
Full-text available
This article is devoted to issues related to the definition of stable word combinability in speech. The relevance of the research lies in the need for deeper linguistic knowledge about the factors that determine the formation of stable relationships between the elements of a word combination. The English Web Corpus (enTenTen) and its subcorpora are chosen as the source. The authors consider bigrams consisting of the verb take and an adjacent word. In addition to a critical examination of the measures used to determine word cohesion, the nature of the relationships between collocation elements is analysed. Particular attention is paid to the comparison of collocations across subcorpora containing texts of different genres and topics. More than 100 bigrams obtained through the association measures t-score, MI-score and Log Dice are analysed. The t-score measure differs across the investigated subcorpora, which demonstrates that its findings correlate with subcorpus size; it is concluded that the degree of stability of the associative relationship in bigrams with the verb take cannot be determined from this measure alone. The data obtained using the MI-score and Log Dice measures show little difference between subcorpora, demonstrating their independence of corpus size. The variable nature of the relationships between collocation elements is shown to lie in the dependency of the degree of coherence of words in a word combination on the frequency of their occurrence in texts of different genres, registers and modalities. Special attention is given to identifying how effective these measures are at extracting verb collocations and to their application to specific professional tasks.
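The three association measures compared in that study have standard textbook definitions; a minimal sketch under that assumption (using the logDice formula popularised by Sketch Engine), with invented counts rather than the study's enTenTen data:

```python
import math

def association_measures(f_xy: int, f_x: int, f_y: int, n: int):
    """Compute three common collocation measures from corpus counts:
    f_xy = co-occurrence frequency of the bigram,
    f_x, f_y = frequencies of each word, n = corpus size in tokens."""
    expected = f_x * f_y / n                            # frequency expected by chance
    t_score = (f_xy - expected) / math.sqrt(f_xy)       # grows with corpus/sample size
    mi = math.log2(f_xy / expected)                     # favours rare, exclusive pairs
    log_dice = 14 + math.log2(2 * f_xy / (f_x + f_y))   # independent of corpus size
    return t_score, mi, log_dice

# Invented counts for a bigram like "take part" in a 10-million-token corpus.
t, mi, ld = association_measures(f_xy=500, f_x=8_000, f_y=2_000, n=10_000_000)
print(f"t-score={t:.2f}, MI={mi:.2f}, logDice={ld:.2f}")
```

Because t-score contains the raw observed frequency while logDice uses only relative proportions, recomputing these values on subcorpora of different sizes shifts the t-score but leaves logDice stable, which is the behaviour the abstract reports.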
Article
This study analyses the representation of China in the Spanish press during the COVID-19 pandemic, using an analytical framework that combines appraisal theory, the transitivity system and corpus linguistics. Our research is based on a specialised corpus of 271 articles drawn from the Spanish newspapers El País, El Mundo, La Vanguardia and ABC. Several analytical techniques, such as keywords, collocations and concordances, are applied in order to identify patterns and trends in the representation of China. The results indicate that the Spanish newspapers conveyed a multifaceted image of China, employing a variety of linguistic resources to express their attitudinal stance. The evaluative prosody towards China is variable, with instances of both positive and negative appraisal. On the one hand, the effectiveness of the anti-COVID measures implemented by China and the country's capacity for crisis management are evaluated positively. On the other hand, the Chinese health system is criticised, some of its measures are described as draconian, and the country is at times portrayed as ambitious and arrogant. This study provides relevant information on how the Spanish media presented the image of China during the COVID-19 pandemic and contributes to an understanding of the importance and feasibility of studying image from a discursive approach.
Chapter
The entire world continues to experience challenges such as climate change, pandemics and wars. This trickles down to Africa, a continent that continues to be ravaged by governance deficits evident in political challenges such as institutional coups, military coups and corruption. Achieving political governance is becoming a pipe dream in Africa due to the diverse nature of the continent, characterized by a mesh of cultural diversity, language differences being a crucial facet. The political class has used languages as tools for political polarization, evident in the deliberate radicalization of ethnic political groups, formation of ethnically oriented political parties, and further barring sensitive messaging from diverse language groups in Africa through the use of the vernacular in political rallies. For decades, the implication of language and communication has been ignored in political transformation. Hence, this work adopts the theory of conversational implicature, which states that what an expression means should not be confused with what orators mean when they use that expression. This theory centers on the co-operative principle in language and communication, encompassing four maxims: quality, quantity, relation and manner. Most often, when politicians use language as a tool to advance their agenda, they tend to flout these maxims, either intentionally or unintentionally. Therefore, this study is guided by the co-operative principle in unearthing the political polarization due to language abuse by most sub-Saharan politicians, adopting a corpus-based approach to analysis of their selected audio-visual and written texts.
Article
Full-text available
This study investigates the effects of teacher-led questioning within blended synchronous learning environments (BSLEs) on formative assessment practices in Teaching English as a Foreign Language (TEFL) classrooms. Integrating principles from formative assessment, sociocultural, and constructivist theories, a comprehensive framework has been developed to understand how strategic questioning enhances learning outcomes. Employing a mixed-method approach, including classroom observations, conversation analysis, and interviews with teachers and students, this study examines effective communication and assessment strategies in culturally diverse educational settings in China. Analysis tools such as the GFIP model (Gap, Feedback, Involvement, Progression) and the ESRU cycle (Elicit, Student response, Recognition of student response, Use of Information) reveal inconsistencies in leveraging student responses for pedagogical adjustments and emphasize the impact of cultural and linguistic factors on assessment efficacy. The proposed Culturally Responsive Formative Assessment through Engaged Dialog (CRFAED) framework advocates for customized questioning techniques that integrate culturally sensitive practices with technology to enhance learning outcomes. Results indicate that strategic teacher-led questioning in BSLE settings substantially improves student engagement and learning outcomes. The critical role of culturally responsive pedagogy in optimizing formative assessment practices is also highlighted. The CRFAED framework demonstrates effectiveness in bridging cultural gaps, facilitating better teacher–student interactions, and promoting an inclusive and responsive learning environment. This study offers insights for improving educational practices through culturally responsive pedagogy and technology integration in BSLE settings, contributing valuable knowledge to the global TEFL community.
Article
The use of English as lingua franca in mainland Europe’s higher education and research sectors has rapidly expanded over the past several decades. Despite these expansions, critics claim that there is a conspicuous absence of European Union (EU) language policy concerned with the use of English as a “contact” language in education and research contexts. This study attempts to assess these claims by conducting a comparative analysis of English as a lingua franca discourse in the policy declarations and communications of two EU actors with remits for higher education and research – the European Higher Education Area (EHEA) and the European Research Area (ERA). Using methods that draw on critical discourse analysis and corpus linguistics, I compiled specialised corpora consisting of the policy documents published by the EHEA and the ERA over the past twenty years. I then used the corpus analysis tool Sketch Engine and a reference corpus of general EU discourse to assess how often and in what ways language policy was considered. The analysis reveals intra-institutional conflicts in how language is conceptualised between policy actors and across policy portfolios in the EU. The findings also confirm the absence of explicit discourse on the use of English in academic and research contexts over the institutional lifetimes of two actors with remits for these sectors. The study highlights the persistence of a “top-down/bottom-up” conceptual gap at the heart of EU language policymaking regarding English as a lingua franca in European higher education and research.
Article
How do city-level policymakers build support for substantive action in policy domains characterized by low levels of national salience and limited local capacity, and which evidentiary resources support as well as reflect these uses? Despite much attention to policymakers’ engagement with evidence, existing work tends to focus on domains where the issues at stake attract high levels of input and influence from central governments. This limits empirical and theoretical understanding of how local efforts to implement potentially contentious policies arise, and through which means. In response, we examine how municipal actors in 12 cities and regions across the UK have devised and communicated policies on immigrant integration—an area that lacks national policy inputs yet is locally consequential—through the mechanism of “action planning.” Drawing on 6 years’ worth of documentary evidence generated through a university-initiated collaboration with these municipalities, we show how action plans gather attention for objectives and propagate examples of practice to other cities—what we call “case-making.” This serves as a micro-foundation for the action planning mechanism, which links symbolic statements about immigrant integration with substantive intended actions.
Article
Full-text available
Unveiling the cognitive patterns that underpin linguistic expressions, conceptual metaphor serves not only as an effective means for speakers to convey their values but also as a crucial tool for listeners to comprehend unfamiliar topics. This study undertakes a corpus-based analysis of conceptual metaphor expressions within the European Union's Artificial Intelligence Act. Utilizing a corpus derived from the European Union Artificial Intelligence Act and employing both Conceptual Metaphor Theory and Critical Metaphor Analysis Theory, this research examines metaphors in terms of their types, orientations, and underlying rationales. The study identifies Journey, Human, War, and Object metaphors as the most frequently used semantic domains, indicating that the overall orientations are characterized by Tool, Dependency, Human, and Risk, reflecting both the aspirations and concerns of humanity. This study addresses a gap in metaphor research regarding the European Union's Artificial Intelligence Act, offering valuable insights for policymakers and AI developers in understanding and shaping public perception of AI technologies.
Article
Full-text available
Corpus compilation is a critical process in linguistics that involves gathering and organizing large datasets for language analysis and model training. This article examines key aspects of corpus compilation, with a particular focus on data collection. It explores the sources of data, strategies for ensuring representativeness, and challenges such as copyright constraints and data quality issues. Ethical considerations, such as anonymization and consent, are also discussed. By understanding these factors, researchers can build effective and ethically sound corpora for linguistic research and computational applications.
Article
Full-text available
Automatic authorship identification is a challenging task that has been the focus of extensive research in natural language processing. Despite the progress made in authorship attribution, the lack of corpora for under-resourced languages impedes the advancement and evaluation of existing methods. To address this gap, we investigate the problem of authorship attribution in Albanian. We introduce a newly compiled corpus of Albanian newsroom columns and literary works and analyze machine-learning methods for detecting authorship. We create a set of hand-crafted features targeting various categories (lexical, morphological, and structural) relevant to Albanian and experiment with multiple classifiers using two different multiclass classification strategies. Furthermore, we compare our results to those obtained using deep learning models. Our investigation focuses on identifying the best combination of features and classification methods. The results reveal that lexical features are the most effective set of linguistic features, significantly improving the performance of various algorithms in the authorship attribution task. Among the machine learning algorithms evaluated, XGBoost demonstrated the best overall performance, achieving an F1 score of 0.982 on literary works and 0.905 on newsroom columns. Additionally, deep learning models such as fastText and BERT-multilingual showed promising results, highlighting their potential applicability to specific scenarios in Albanian writings. These findings contribute to the understanding of effective methods for authorship attribution in low-resource languages and provide a robust framework for future research in this area. The careful analysis of the different scenarios and the conclusions drawn from the results provide valuable insights into the potential and limitations of the methods and highlight the challenges of detecting authorship in Albanian.
Promising results are reported, with implications for improving the methods used in Albanian authorship attribution. This study provides a valuable resource for future research and a reference for researchers in this domain.
Article
Full-text available
Even if, for mainstream psychologists, fear is one of the seven universal emotions (discrete, measurable and with clearly distinct features), in the humanities we consider fear a widespread concept associated with prompts more complex than the physiological response to a hazard. This research explores the various ways FEAR is described in Spanish. For this we have used digital humanities tools and methods, mainly corpus linguistics and natural language processing, which enable us to explore, present and visualize the linguistic elements that define fear in Mexican society. We have thus explored this emotion (and its family: anxiety, horror, apprehension, dread, panic, terror) by examining the way it is verbalized in an ad hoc corpus covering four genres/domains: chronicle, essay, the press, and social media, specifically tweets. A set of semantic similarity and ranking metrics was applied to the texts to identify each genre's characteristics in association with fear. The results show that fear is an emotion that, even if it differs by genre, responds to the prompts of a modern society in which danger is still represented by illness, violence, power, or an out-group.
Article
Full-text available
Major political events become a dominant topic of discussion in the countries involved, and various actors, often other than political parties, take a stance towards them. The withdrawal of the UK from the European Union as a result of the 2016 UK referendum was one such event, and a quite polarised atmosphere between supporters of both sides was observed in the UK and the rest of Europe from the moment the referendum was announced in 2015. Given the impact of Brexit on the economy, this study focused on the financial domain of the UK and the ways that UK financial services referred to it in their financial disclosures before, during and just after the referendum. For that purpose, I collected the 2015, 2016 and 2017 annual reports from five UK-based financial companies (Barclays, HSBC, Lloyds, Royal Bank of Scotland and Santander UK) and used thematic keywords to identify all the texts referring to the 2016 UK referendum, its outcome, and the response of the financial services to the UK's process of exiting the European Union. This set of texts composed the Brexit-related data set, on which different analytical tasks were performed. I explored the context in which the thematic keywords are found, compared the three yearly subsets of the corpus statistically, and ranked the significant words of each subset by the strength of their keyness. This case study revealed that the discourse around a major political event such as Brexit cannot be considered neutral or objective, and that the financial companies clearly expressed negative opinions regarding the 2016 referendum and the UK's decision to exit the European Union.
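Keyness comparisons of the kind described above are commonly computed with the log-likelihood (G2) statistic; the abstract does not name the exact statistic used, so the following is a sketch under that assumption, with invented frequencies rather than the study's annual-report data:

```python
import math

def log_likelihood_keyness(a: int, b: int, c: int, d: int) -> float:
    """Log-likelihood (G2) keyness of a word:
    a = frequency in the study corpus (total size c tokens),
    b = frequency in the reference corpus (total size d tokens).
    Higher G2 means stronger evidence the word is 'key' in the study corpus."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

# Invented example: a word appears 150 times in a 100,000-token subset for one
# year versus 30 times in a 200,000-token reference built from the other years.
print(round(log_likelihood_keyness(a=150, b=30, c=100_000, d=200_000), 2))
```

A word whose G2 exceeds the conventional critical values (e.g. 3.84 for p < 0.05) would be reported as significantly key for that yearly subset; ranking the vocabulary by G2 gives the "keyness strength" ordering the study relies on.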
Article
This study employs a corpus-driven approach to analyze the lexical profile of a Turkish verb that communicates sadness, which is considered a negative emotion. The study focuses on data obtained from the Turkish National Corpus (TNC) within the framework of Extended Lexical Units offered by Stubbs (2005). Based on the concordance lines of the verb dertlen-, its syntactic properties, collocational tendencies, and context-dependent semantic and pragmatic qualifications were identified to understand its probable semantic value in terms of the cognitive traits of sadness. It was found that dertlen- refers to a sadness derived from others' troubles rather than self-related issues, and this finding was discussed within the framework of the empathy-altruism hypothesis (Batson, 1991). In this respect, the verb gives some clues about Turkish culture, which is characterized by 'altruistic' behavior. The study also shows that dertlen- refers to past/present negativities more frequently than to future-related sadness concepts. Underlining the validity of corpus-driven approaches, the corpus data demonstrated that this sadness verb has its own cognitive patterns, and this schematic nature of dertlen- dictates a particular linguistic environment which also shapes the semantic and pragmatic load of the verb. Finally, thanks to this research, we were able to place dertlen- correctly within Scherer's (2001) cognitive evaluation patterns for sadness.
Article
This study provides a semantic and pragmatic analysis of the Turkish verb "hüzünlen-" which conveys the meaning of sadness. Following a corpus-driven approach, query sequences from the Turkish National Corpus (TNC) were analysed according to the Extended Lexical Units model proposed by Stubbs (2005). In this context, 284 query sequences found in real spoken and written language samples were examined to determine the lexical profile of the verb "hüzünlen-." The aim of the study is to reveal the semantic and pragmatic usage differences of the verb "hüzünlen-" compared to other verbs expressing sadness. In this regard, the syntactic features, collocational tendencies, and context-dependent semantic and pragmatic characteristics of the verb "hüzünlen-" were thoroughly investigated. As a corpus-driven study, this research is the result of an inductive examination. The study demonstrated that the lexical framework of the verb "hüzünlen-" implies a sudden, intense sadness often related to past events and circumstances. The findings contribute to a better understanding of the cultural and linguistic features of expressing sadness in Turkish, as well as to comprehending the cognitive aspects of the emotion of sadness. Furthermore, the results of the study offer significant insights into how expressions of sadness are shaped in linguistic and cultural contexts in Turkish, allowing for a broader perspective on the linguistic representation of sadness. This research suggests that the verb "hüzünlen-" should be examined not only in Turkish but also in comparison with similar emotional expressions in other languages. Thus, it can contribute to understanding the universal and culture-specific features of language and emotion.
Article
There is an extensive body of work on taboo language, including metaphor and metonymy, but particular attention needs to be paid to (i) serious genres (and especially op-eds) and (ii) non-English speaking (or non-western) cultures. The present study uses the Sketch Engine search tool on a corpus of 1844 op-ed articles (967,715 words) by columnist Dandrawi El-Hawari of Egypt’s private pro-government newspaper Youm el-Saba. Questions about how taboo words are used in Arabic op-eds (or the selected corpus as a sample of the Egyptian population) thus arise. Other questions include: How frequently do vulgar, profane, discriminatory, threatening or potentially libellous words, cases where impoliteness (or rather hate speech) is genuine or presumably intended, occur in this serious discourse genre? Which taboo words feature more prominently in Egyptian opinion articles (and especially in the op-eds under investigation)? And what implications do our findings have for cross-cultural understanding and impoliteness research? The analysis of taboo words in this discourse genre can make a useful contribution not only to socio-cognitive and cross-cultural pragmatics, but also to forensic linguistics. We found frequent breaches of the expected conduct by the op-ed columnist in question, because private newspapers define different contexts. A general taxonomy of seven practices has been proposed.
Article
Full-text available
Article
Full-text available
This paper is based on a nascent project on Lancashire dialect grammar, which aims to describe the relevant features of this dialect and to engage with related theoretical and methodological debates. We show how corpora allow one to arrive at more precise descriptions of the data than was previously possible. But we also draw attention to the need for other methods, in particular modern elicitation tasks and attitude questionnaires developed in perceptual dialectology. Combining these methods promises to provide more insight into both more general theoretical issues and the exact nature of the object of study, namely Lancashire dialect.
Article
Full-text available
Article
Full-text available
This paper seeks to explain why some semantically-opposed word pairs are more likely to be seen as canonical antonyms (for example, cold/hot) than others (icy/scorching, cold/fiery, freezing/hot, etc.). Specifically, it builds on research which has demonstrated that, in discourse, antonyms are inclined to favour certain frames, such as 'X and Y alike', 'from X to Y' and 'either X or Y' (Justeson and Katz, 1991; etc.), and to serve a limited range of discourse functions (Jones, 2002). Our premise is that the more canonical an antonym pair is, the greater the fidelity with which it will occupy such frames. Since an extremely large corpus is needed to identify meaningful patterns of co-occurrence, we turn to Internet data for this research. As well as enabling the notion of antonym canonicity to be revisited from a more empirical perspective, this approach also allows us to evaluate the appropriateness (and assess the risks) of using the World Wide Web as a corpus for studies into certain types of low-frequency textual phenomena.
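The frame-based retrieval described in this abstract can be sketched roughly as follows; the three frames come from the abstract itself, while the function name, word-boundary matching, and example text are invented for illustration:

```python
import re

# Contrastive frames that antonym pairs tend to occupy (from the
# abstract); the retrieval logic below is a hypothetical sketch.
FRAMES = ["{x} and {y} alike", "from {x} to {y}", "either {x} or {y}"]

def frame_hits(text, x, y):
    """Count how often a word pair fills the frames, in either order."""
    total = 0
    for frame in FRAMES:
        for a, b in ((x, y), (y, x)):
            pattern = r"\b" + re.escape(frame.format(x=a, y=b)) + r"\b"
            total += len(re.findall(pattern, text, re.IGNORECASE))
    return total

text = "Served hot and cold alike, the menu runs from cold to hot dishes."
hits = frame_hits(text, "hot", "cold")
```

In practice such counts would be gathered from web-scale data rather than a single string, and canonicity would be estimated from how consistently a given pair fills the frames relative to its component words' frequencies.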
Article
Full-text available
The chi-squared test is used to find the vocabulary most typical of seven different ICAME corpora, each representing the English used in a particular country. In a closely related study, Leech and Fallon (1992, Computer corpora - what do they tell us about culture? ICAME Journal, 16: 29-50) found differences in the vocabulary used in the Brown Corpus of American English and the Lancaster-Oslo/Bergen Corpus of British English. They were mainly interested in those vocabulary differences which they assumed to be due to cultural differences between the United States and Britain, but we are equally interested in vocabulary differences which reveal linguistic preferences in the various countries in which English is spoken. Whether vocabulary differences are cultural or linguistic in nature, they can be used for the automatic classification of texts of unknown provenance according to variety of English. The extent to which the vocabulary differences between the corpora represent vocabulary differences between the varieties of English as a whole depends on the extent to which the corpora represent the full range of topics typical of their associated cultures. There is thus a need for corpora designed to represent the topics and vocabulary of cultures or dialects, rather than stratified across a set range of topics and genres. This will require methods to determine the range of topics addressed in each culture, and then methods to sample adequately from each topical domain.
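The per-word keyness computation behind such comparisons can be sketched as a 2x2 chi-squared test; this is a minimal illustration with invented counts, not the authors' implementation:

```python
# Chi-squared keyness for a single word, comparing its frequency in
# two corpora; all counts here are invented for illustration.

def chi_squared_keyness(freq_a, total_a, freq_b, total_b):
    """2x2 chi-squared: [word, other words] x [corpus A, corpus B]."""
    n = total_a + total_b
    word_total = freq_a + freq_b
    observed = [freq_a, total_a - freq_a, freq_b, total_b - freq_b]
    expected = [
        total_a * word_total / n,
        total_a * (n - word_total) / n,
        total_b * word_total / n,
        total_b * (n - word_total) / n,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# A word seen 50 times in a 100,000-word corpus vs 10 times in
# another corpus of the same size.
score = chi_squared_keyness(50, 100_000, 10, 100_000)
```

Ranking all words in corpus A by this score against a comparison corpus yields the vocabulary most typical of A; the statistic is symmetric, so the same list ranked from corpus B's side identifies B's typical vocabulary.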
Article
Full-text available
Word alignment in bilingual or multilingual parallel corpora has been a challenging issue for natural language engineering. An efficient algorithm for automatically aligning word translation equivalents across different languages will be of use for a number of practical applications such as multilingual lexical construction, machine translation, etc. This paper presents a hybrid algorithm for English–Chinese word alignment, which incorporates co‐occurrence association measures, word distribution distances, English word lemmatization, and part‐of‐speech information. Eleven co‐occurrence association coefficients and eight distance measures of word distribution are explored to compare their efficiency for word alignment. The paper also describes an experiment in which the algorithm is evaluated on sentence‐aligned English–Chinese parallel corpora. In the experiment, the algorithm produced encouraging success rates on two test corpora, with the highest success rate of 89.37 per cent. It provides a practical tool for extracting word translation equivalents from English–Chinese parallel corpora.
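One commonly used co-occurrence association coefficient of the kind explored in such work is the Dice coefficient; the sketch below, over an invented three-sentence parallel corpus, shows the general idea only, not the paper's hybrid algorithm (which also incorporates distribution distances, lemmatization and part-of-speech information):

```python
# Dice-coefficient scoring of candidate word translation pairs from
# sentence-aligned parallel text; the toy corpus below is invented.
from collections import Counter
from itertools import product

def dice_scores(aligned_pairs):
    """Score every (source, target) word pair: 2*joint / (f(s)+f(t))."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sent, tgt_sent in aligned_pairs:
        for w in set(src_sent):
            src_freq[w] += 1
        for w in set(tgt_sent):
            tgt_freq[w] += 1
        for s, t in product(set(src_sent), set(tgt_sent)):
            pair_freq[(s, t)] += 1
    return {
        (s, t): 2 * c / (src_freq[s] + tgt_freq[t])
        for (s, t), c in pair_freq.items()
    }

corpus = [
    (["the", "cat"], ["mao"]),
    (["the", "dog"], ["gou"]),
    (["a", "cat"], ["mao"]),
]
scores = dice_scores(corpus)
```

High-scoring pairs ("cat"/"mao" here) are taken as likely translation equivalents, while function words that co-occur with everything ("the") score lower.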
Article
Full-text available
The British Academic Written English (BAWE) corpus is a collection of texts produced by undergraduate and Master's students in a wide range of disciplines, for assessment as part of taught degree programmes undertaken in the UK. The majority of the contributors to the corpus are mother tongue speakers of English, but, in order to be included in the corpus, each assignment had to be judged proficient by assessors in the contributor's discipline, regardless of the writer's mother tongue. The corpus contains, therefore, only texts that have met departmental requirements for the given level of study. University writing programmes are typically aimed at undergraduate and Master's students, and it would be useful for writing tutors to know more about student assignment genres and the linguistic features of successful writing at undergraduate and Master's level. However, most large-scale descriptive studies of academic writing focus on published or publicly accessible texts, or learner essays on general academic topics, probably because there are practical difficulties associated with collecting large amounts of well-documented student output. This paper charts the experience of collecting data for the BAWE corpus, highlighting the problems we encountered and the solutions we chose, with a view to facilitating the task of future developers of academic student writing corpora.
Article
Full-text available
This paper describes the investigation of a small corpus of writing in English for academic purposes produced by L1 speakers of Mandarin. The investigation involved the development of a tagset for the identification of formal errors in the corpus, and the subsequent analysis of these errors with a view to creating remedial grammar materials for Chinese students studying in the medium of English. Some prior approaches to error analysis are discussed, the process of developing the tagging system is described, and error types are identified, categorised, quantified, described and (as far as possible) explained.
Chapter
The Information Retrieval (IR) (Manning et al., 2008) domain can be viewed, to a certain extent, as a successful applied domain of NLP. The speed and scale of Web take-up around the world have been made possible by freely available and effective search engines. These tools are used by around 85% of Web surfers when looking for some specific information (Wolfram et al., 2001).
Chapter
Natural language generation (NLG) is the process by which thought is rendered into language. It has been studied by philosophers, neurologists, psycholinguists, child psychologists, and linguists. Here, we examine what generation is to those who look at it from a computational perspective: people in the fields of artificial intelligence and computational linguistics.
Chapter
BioNLP, also known as biomedical language processing or biomedical text mining, is the application of natural language processing techniques to biomedical data. The biomedical domain presents a number of unique data types and tasks, but simultaneously has many aspects that are of interest to the “mainstream” natural language processing community. Additionally, there are ethical issues in BioNLP that necessitate an attention to software quality assurance beyond the normal attention (or lack thereof) that is paid to it in the mainstream academic NLP community.
Article
The construction as far as NP is a common topic restrictor in modern English, but its verbal coda (goes/is concerned) is often omitted. We examine potential constraints on this variation and find significant effects for syntactic, phonological, discourse mode, and social variables. The internal effects are also relevant to 'Heavy NP Shift' and other weight-related phenomena. Diachronic data on the as far as construction, and the evidence of synchronic age distributions and usage commentators, suggest that the verbless variant has become markedly more frequent in recent decades, allowing us a rare opportunity to study syntactic change in progress. In addition to documenting the nature of variation and change in this construction, our study has larger implications for the study of syntax and sociolinguistic variation, and demonstrates the value of integrating methods from different linguistic subfields (in this case, sociolinguistics and variation theory, historical linguistics, corpus linguistics, and syntax).
Article
Verb complementation has long been neglected as an area in which second-language varieties of English deviate from native Englishes. The present paper focuses on Indian English as the largest institutionalised second-language variety of English and investigates differences between Indian English and British English at the level of ditransitive verbs and ditransitive verb complementation. By using various corpora, including large databases obtained from the World Wide Web, we show (1) that ditransitive verbs like GIVE are associated to different extents with individual complementation patterns in present-day Indian and British English, (2) that the range of verbs used in the basic ditransitive pattern (with two object noun phrases) is different between present-day Indian and British English, and (3) that the "new ditransitives" in Indian English do not represent cases of superstrate retention but rather genuinely innovative forms that Indian English users create on grounds of analogy. From a theoretical perspective, we argue that the concept of verb-complementational profile is a useful framework for comparative studies of varieties of English.
Article
Grammaticalization is an important concept in general and typological linguistics and a prominent type of explanation in historical linguistics. For historical corpus linguists, grammaticalization theory provides a frame of orientation in their effort to analyze and systematize a fast-accumulating mass of data. Students of grammaticalization have become increasingly aware of the potential of existing corpora and established corpus-linguistic methodology for their work. This book continues and develops the dialogue between the two fields. All the contributions are based on extensive use of various electronic corpora. Relating corpus practices to recent theoretical concerns of grammaticalization studies they deal with grammaticalization and historical sociolinguistics, lexicalization and grammaticalization, layering, frequency, grammaticalization and dialects, degrammaticalization and grammaticalization in a contrastive perspective. The papers show that a synthesis of corpus methodology and grammaticalization studies leads to new and interesting insights about the mechanisms of language change and the communicative functions of language.
Article
This article compares two approaches to genre analysis: Biber’s multidimensional analysis (MDA) and Tribble’s use of the keyword function of WordSmith. The comparison is undertaken via a case study of conversation, speech, and academic prose in modern American English. The terms conversation and speech as used in this article correspond to the demographically sampled and context-governed spoken data in the British National Corpus. Conversation represents the type of communication we experience every day whereas speech is produced in situations in which there are few producers and many receivers (e.g., classroom lectures, sermons, and political speeches). Academic prose is a typical formal-written genre that differs markedly from the two spoken genres. The results of the MDA and keyword approaches both on similar genres (conversation vs. speech) and different genres (the two spoken genres vs. academic prose) show that a keyword analysis can capture important genre features revealed by MDA.
Article
This article explores negation in Chinese on the basis of spoken and written corpora of Mandarin Chinese. The use of corpus data not only reveals central tendencies in language based on quantitative data, it also provides typical examples attested in authentic contexts. In this study we will first introduce the two major negators bu and mei (meiyou) and discuss their semantic and genre distinctions. Following this is an exploration of the interaction between negation and aspect marking. We will then move on to discuss the scope and focus of negation, transferred negation, and finally double negation and redundant negation.
Article
The following paper applies Douglas Biber's linguistic stylistic model to the complete prose works of the Australian Aboriginal author Mudrooroo Nyoongah. The aim of this application is twofold. Firstly, to critically analyze Nyoongah's prose for a perceived diachronic stylistic shift. Throughout his career he has closely identified with the unique problems facing the Australian Aborigines. I argue that as Nyoongah has come to more openly recognise his Aboriginality he has shifted in style towards a more oral and abstract form of expression. Secondly, to critically examine Biber's multifeatured/multidimensional model. I argue that this model is not without faults when Nyoongah's work is compared against Biber's corpus, but that it is a very useful model for offering broad generalizations on an author's style and for suggesting further stylistic investigations using narrower constraints of quantification.
Article
The present paper addresses several issues relating to the research goals and methodological techniques used in multi-dimensional analyses of register variation (e.g., Biber, 1988). The paper is prompted by the allegations of inadequacy and error published in Watson (1994). In my discussion, I point out that Watson fails to substantiate his claimed inadequacies with empirical evidence, and that previous investigations have shown that many of these criticisms are factually incorrect. In addition, I attempt to correct several fundamental misunderstandings of the multi-dimensional approach and the relationships among computational, statistical, and interpretive methodologies within the approach.
Article
This paper is a response to Douglas Biber's article, 'On the role of computational, statistical, and interpretative techniques in multi-dimensional analyses of register variation'. I dispute Biber's claim that I have misunderstood his model. I discuss aspects of functional interpretations of statistical co-occurrence, tagging procedures and quantitative versus qualitative methods of investigation.
Article
The prevalence of formulaicity in naturally occurring language use points to an important role in the way language is acquired, processed, and used. It is widely recommended that second-language instruction should ensure that learners develop a rich repertoire of formulaic sequences. If this is justified, it follows that learner failure to use formulaic sequences should present some barrier to communication. However, it seems that few researchers have sought to objectively evaluate how learner deviations from the target-language (formulaic or otherwise) impact on online processing. Operationalizing formulaic sequence through collocation, this article reports the combination of corpus-based approaches and psycholinguistic experimentation to investigate the processing by native speakers of learner collocations that deviate from target-language norms. Results show that such deviations are associated with an increased and sustained processing burden. These findings support the widely asserted claim that formulaic sequences offer processing advantages and provide empirical support for the importance of formulaic sequences in language learning. Usage-based models form the basis for some hypotheses concerning cognitive processes that underlie the increase in processing demands.
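Collocation strength of the kind used to operationalize formulaic sequences is often measured with pointwise mutual information; the sketch below uses invented frequencies and is not the authors' procedure:

```python
# Pointwise mutual information: how much more often two words co-occur
# than chance predicts. All frequencies below are invented.
import math

def pmi(pair_freq, w1_freq, w2_freq, corpus_size):
    """log2 of observed vs expected co-occurrence probability."""
    observed = pair_freq / corpus_size
    expected = (w1_freq / corpus_size) * (w2_freq / corpus_size)
    return math.log2(observed / expected)

# e.g. a pair seen together 30 times in a 1,000,000-word corpus, with
# the component words at 1,000 and 500 tokens respectively.
score = pmi(30, 1_000, 500, 1_000_000)
```

Learner word combinations scoring near zero (no association in the target language) would then be the deviant collocations whose processing cost the study measures in native readers.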
Article
In this paper we argue that corpus linguistics needs to expand to cover a wider set of languages. While the reasons that some languages have not been provided with corpus data to date are clear, the intellectual and moral imperative to extend the range of corpus linguistics is strong. However, there are technical problems to be faced in such an extension of corpus linguistics. These problems are reviewed here and possible solutions to them explored. Following on from this, we consider what possible benefits the provision of appropriate corpus data may bring to languages currently untouched by the development of corpus linguistics.
Article
This paper presents a discourse-functional account of English inversion, based on an examination of a large corpus of naturally-occurring tokens. It is argued that inversion serves an information-packaging function, and that felicitous inversion depends on the relative discourse-familiarity of the information represented by the preposed and postposed constituents. The data moreover indicate that evoked elements and inferrable elements are treated alike with respect to inversion; both are treated as discourse-old information. Finally, it is suggested that discourse-familiarity correlates not with subjecthood, but rather with relative sentence position.
Article
Two theses concerning word order in Mandarin Chinese have been investigated through a quantified study of written and spoken contemporary Mandarin. It is found, first, that Mandarin is synchronically a typical VO language, in terms of text distribution of VO and OV orders. OV appears at the level of 10% or lower in text, and this is true for both definite and indefinite objects. Further, the functional distribution of OV in both texts suggests that it is an emphatic/contrastive discourse device, having little to do with the contrast between definite and indefinite object. Finally, neither the evidence from our text distribution data nor a comparison with a recent study of the acquisition of Mandarin by native children suggests the existence of a diachronic drift toward SOV order.
Article
Two research questions are examined in this work regarding the uses of 'market' in Mandarin, Malay and English. The first question asks whether the use of 'market' in these three languages is similar or different. The second question asks whether the collocates of 'market' are similar or different across these languages when used in different grammatical relations. Implications of the similarities and differences will be discussed. In order to answer these two questions, 'market' metaphors used by different communities are laid out based on the frequency counts of its source domains and the collocates according to different grammatical roles (subject, object, modifier, etc.) of 'market'. The results show that certain source domains have preferences for different grammatical roles for 'market'. In addition to this finding, the choice of source domains by different speech communities may also reflect their perspectives regarding their country's economy. Therefore, through using quantitative data, this paper is able to infer the perspectives of these speech communities when referring to 'market' in their languages. This can be done not only through analyzing the semantic meanings of the metaphors, but also through their interface with grammatical relations.
Article
The lexical semantic structures of change-of-state verbs are explored via linguistic theory, corpus analysis, and psycholinguistic experimentation. The data support the idea that these verbs can be divided into two classes, those for which the change of state is internally caused and those for which it is externally caused (Levin & Rappaport Hovav 1995, cf. Smith 1970). External causation change-of-state verbs have been hypothesized to denote two subevents, internal causation change-of-state verbs only one event. Consistent with this difference, the psycholinguistic data indicate that, in both transitive AND intransitive constructions, sentences with external causation verbs take longer to comprehend than sentences with internal causation verbs.
Article
Do languages die gradually or abruptly? Spontaneous speech samples were elicited from 40 speakers of Trinidad Bhojpuri in rural Caroni, Trinidad, ranging in age from 95 to 26. The oldest 28 subjects were representative of their age group; the youngest 10 were distinctly unrepresentative. Data corpora were pro-rated to equal length, and the errors and test features in each corpus were counted to distinguish statistically between native and non-native competence. Speakers formed two internally age-independent competence clusters, with the 10 youngest speakers and one older speaker forming a group apart, signaling a discontinuity between native and post-native competence.
Chapter
Until fairly recently, linguistics has been classified as a 'science' by definition, averral, and ideology rather than because of the uniformity of its practices across its many schools of thought. It is seldom the case in any discipline that a particular phenomenon begins to question that discipline's raison d'etre, withdraw the option and luxury of its often directionless and eclectic practices and proceed to force unwelcome and sweeping changes upon the discipline by beginning to dictate its method. This paper re-states its author's earlier proofs as claims that collocation as instrumentation for meaning is a scientific fact. The burden of this proof has acquired renewed urgency of an interdisciplinary nature that makes this paper both timely and necessary. The claim for collocation as science is reinforced by a number of new discoveries: the fact that all devices are brought about by relexicalisation as a marked form rather than the purported markedness that is mentalist and hence, merely averred.
Article
Findings from corpus-based research are increasingly being exploited for the creation of teaching materials. Although such studies provide valuable insights into language use, they are mostly based on 'expert' corpora and therefore cannot tell us about what is not known by students. In order for materials to address students' deficiencies fully, insights gleaned from interlanguage corpora could usefully be used to complement those from 'expert' corpora for course design. Several valuable studies on interpersonal strategies in scientific discourse have been undertaken and the area of pragmatic failure is also well documented in the literature on contrastive rhetoric. However, it would also be useful to examine these strategies from a learner corpus-based perspective. This paper outlines the findings from an error analysis of interpersonal strategies of a 200,000-word sub-section of the HKUST Learner Corpus. These findings not only provide useful insights into students' deficiencies in this area but have also been used to shape and refine classroom materials.
Article
Article
The increasing availability of electronic text and text analysis tools has made it possible to analyse vast amounts of data in a short amount of time. However, natural language processing is not a solved problem, and even large research systems representing decades of development do not perform at the level of human language processors. Since such systems are not sufficiently robust for general use, most literary and linguistic corpus analysts make use of heuristics and simple tools for text analysis. But while such 'shallow' approaches offer improvements in speed and accuracy over traditional manual methods, there are many pitfalls for the unwary. In this paper we consider some pitfalls and temptations that attend the automated analysis of large text corpora: sample size, the recall problem, analysing only what is easy to find, and counting what is easiest to count. We suggest that, given the state of the art in text processing tools, such tools must be used with a full awareness of their limitations, and should be coupled with or replaced by manual methods when appropriate.
Article
The TOSCA (Tools for Syntactic Corpus Analysis) working group at Nijmegen University develops a computerised method for interactive syntactic analysis of unprepared text material. This method yields a database of syntactically analysed material in the form of trees. There is a program available for the exploitation of these syntax trees for linguistic research. In this paper we sketch the consequences of this approach. Since the method involves no language-specific programs, but instead uses formal descriptions of the lexicon, morphology, and syntax of the target language, all in the same grammatical formalism, the whole working environment is language-independent.
Article
Youth-orientated social networking sites, like MySpace, are important venues for socialising and identity expression. Analysing such sites can, therefore, provide a timely insight into otherwise hidden aspects of contemporary culture. In this paper, MySpace member home pages are used to analyse swearing in the US and UK. The results indicate that almost all young MySpaces, and about half of middle-aged MySpaces, contain some swearing, for both males and females. There was no significant gender difference in the UK for strong swearing, especially for younger users (16–19). This is perhaps the first significant evidence of gender equality in strong swearing frequency in any informal English-language context. By contrast, US male MySpaces contain significantly more strong swearing than those of females. The assimilation by UK females of traditional male swearing in the informal context of MySpace suggests deeper changes in gender roles in society, possibly related to the recent rise in 'ladette culture'.
Article
The information contained in a document is only partly represented by the wording of the text; in addition, features of formatting and layout can be combined to lend specific functionality to chunks of text (e.g., section headings, highlighting, enumeration through list formatting, etc.). Such functional features, although based on the ‘objective’ typographical surface of the document, are often inconsistently realised and encoded only implicitly, i.e., they depend on deciphering by a competent reader. They are characteristic of documents produced with standard text-processing tools. We discuss the representation of such information with reference to the British Academic Written English (BAWE) corpus of student writing, currently under construction at the universities of Warwick, Reading and Oxford Brookes. Assignments are usually submitted to the corpus as Microsoft Word documents and make heavy use of surface-based functional features. As the documents are to be transformed into XML-encoded corpus files, this information can only be preserved through explicit annotation, based on interpretation. We present a discussion of the choices made in the BAWE corpus and the practical requirements for a tagging interface.