Article

Getting into bed with embeddings? A comparison of collocations and word embeddings for corpus-assisted discourse analysis

Article
Full-text available
Newer forms of racism in the context of right-wing extremism are characterised by an apparent distancing from overt racist devaluations. In addition or even beyond biological features, it is now cultural characteristics attributed to social groups which serve as grounds for practices of othering and social exclusion. This paper analyses racist discourse in the comment sections of the influential far-right blog pi-news.com where these practices can be observed in detail. With reference to discourse analytical approaches to racism and using corpus-linguistic, data-driven methods, especially word embeddings and collocations, it is shown how racism is linguistically and discursively expressed. Next to both overt and more implicit racist nominations and predications, the notion of Heimat (‘homeland’) is analysed; it is used to draw racist demarcations without relying on overtly racialising terms.
Article
Full-text available
Despite the wide diffusion and use of embeddings generated through Word2Vec, there are still many open questions about the reasons for its results and about its real capabilities. In particular, to our knowledge, no author seems to have analysed in detail how learning may be affected by the various choices of hyperparameters. In this work, we try to shed some light on various issues focusing on a typical dataset. It is shown that the learning rate prevents the exact mapping of the co-occurrence matrix, that Word2Vec is unable to learn syntactic relationships, and that it does not suffer from the problem of overfitting. Furthermore, through the creation of an ad-hoc network, it is also shown how it is possible to improve Word2Vec directly on the analogies, obtaining very high accuracy without damaging the pre-existing embedding. This analogy-enhanced Word2Vec may be convenient in various NLP scenarios, but it is used here as an optimal starting point to evaluate the limits of Word2Vec.
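The analogy evaluation discussed in this abstract reduces to vector arithmetic plus a nearest-neighbour search over the vocabulary. A minimal sketch with hand-made 2-dimensional vectors (all values invented for illustration; a trained model would supply real embeddings):

```python
import math

# Invented 2-d vectors standing in for learned embeddings (illustrative only).
vecs = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.2],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.2],
    "apple": [0.5, 0.5],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def analogy(a, b, c):
    """Return the vocabulary word closest to vec(a) - vec(b) + vec(c),
    excluding the three query words themselves."""
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vecs[w], target))

result = analogy("king", "man", "woman")  # expected: "queen"
```

Benchmark suites such as the one used in the Word2Vec papers score thousands of such quadruples; the arithmetic itself is no more than the three-line `analogy` function above.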
Article
Full-text available
This paper examines how the UK print media represents risk in reporting about obesity. Using corpus linguistics methods (keywords, collocations and consideration of concordance lines) combined with qualitative discourse analysis, references to risk were analysed in a 36-million-word corpus of articles from the national British press about obesity, published between 2008 and 2017. Two main analytical directions were followed: differences between newspapers (in terms of political affiliation and format) and change over time. Obesity was found to be both a risk factor for diseases like cancer but also itself the consequence of risk factors such as over-eating or not getting enough sleep. When talking about risk, tabloid newspapers tended to discuss the former type of risk, whereas broadsheets focussed on the latter. Left-leaning newspapers tended to focus on the role of powerful institutions, while right-leaning newspapers wrote more about risk in terms of individuals, either focussing on personal responsibility or the role of biological factors in determining an individual’s risk. References to risks relating to obesity increased both in terms of raw frequency and proportional frequency over the decade examined, with the largest increase occurring between 2016 and 2017. The year 2017 was characterised by more reference to scientific research and risks of health conditions that were referred to in dramatic terms (e.g. as a deadly risk), as well as containing more personalised language (e.g. more use of the second person pronoun your). The analysis indicates how notions of risk intersect with neoliberal principles of illness and self-management. In addition, readers receive different messages about risks relating to obesity depending on which newspapers they read, and there is evidence for an increasing reliance on a discourse of fear around obesity in the British national press overall.
Article
Full-text available
The current study provides a new level of empirical evidence for the nature of ethnic stereotypes in news content by drawing on a sample of more than 3 million Dutch news items. The study’s findings demonstrate that universally accepted dimensions of stereotype content (i.e., low-status and high-threat attributes) can be replicated in news media content across a diverse set of ingroup and outgroup categories. Representations of minorities in newspapers have become progressively remote from factual integration outcomes, and are therefore rather an artifact of news production processes than a true reflection of what is actually happening in society.
Chapter
Full-text available
In this chapter, we provide an overview of quantitative approaches to co-occurrence data. We begin with a brief terminological overview of different types of co-occurrence that are prominent in corpus-linguistic studies and then discuss the computation of some widely-used measures of association used to quantify co-occurrence. We present two representative case studies, one exploring lexical collocation and learner proficiency, the other creative uses of verbs with argument structure constructions. In addition, we highlight how most widely-used measures actually all fall out from viewing corpus-linguistic association as an instance of regression modeling and discuss newer developments and potential improvements of association measure research such as utilizing directional measures of association, not uncritically conflating frequency and association-strength information in association measures, type frequencies, and entropies.
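A widely used association measure of the kind surveyed in this chapter is pointwise mutual information (PMI), which compares a word pair's observed co-occurrence frequency with the frequency expected if the two words were independent. A minimal sketch with invented toy counts:

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information for a word pair:
    log2 of observed co-occurrence over the count expected under independence."""
    expected = (f_x * f_y) / n
    return math.log2(f_xy / expected)

# Toy counts: 'strong' occurs 100 times, 'tea' 50 times,
# and they co-occur 20 times in a 10,000-token corpus.
score = pmi(20, 100, 50, 10_000)  # log2(20 / 0.5) = log2(40) ≈ 5.32
```

PMI is known to inflate scores for rare pairs, which is one motivation for the alternative and refined measures (directional measures, frequency-aware variants) the chapter discusses.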
Article
Full-text available
As a way of comparing qualitative and quantitative approaches to critical discourse analysis (CDA), two analysts independently examined similar datasets of newspaper articles in order to address the research question ‘How are different types of men represented in the British press?’. One analyst used a 41.5 million word corpus of articles, while the other focused on a down-sampled set of 51 articles from the same corpus. The two ensuing research reports were then critically compared in order to elicit shared and unique findings and to highlight strengths and weaknesses between the two approaches. This article concludes that an effective form of CDA would be one where different forms of researcher expertise are carried out as separate components of a larger project, then combined as a way of triangulation.
Article
Full-text available
We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.
Article
Full-text available
In this paper, I examine the representation of men and women in the British National Corpus (BNC) by focussing on the collocational and grammatical behaviour of the noun lemmas MAN and WOMAN (i.e., the nouns man/men and woman/women). Using Sketch Engine (a powerful corpus query tool, which is described) I explore the functional distribution of the target lemmas, and reveal the structured and systematic nature of the differences in the way these terms for adult male and female human beings pattern with other word forms in different grammatical relations.
Article
The public understanding of science has produced a large body of research about general attitudes toward science. However, most studies of science attitudes have been carried out via surveys or in experimental conditions, and few make use of the growing contexts of online science communication to investigate attitudes without researcher intervention. This study adopted corpus-based discourse analysis to investigate the negative attitudes held toward science by users of the social media website Reddit, specifically the forum r/science. A large corpus of comments made to r/science was collected and mined for keywords. Analysis of keywords identified several sources of negative attitudes, such as claims that scientists can be corruptible, poor communicators, and misleading. Research methodologies were negatively evaluated on the basis of small sample sizes. Other commenters negatively evaluated social science research, especially psychology, as being pseudoscientific, and several commenters described science journalism as untrustworthy or sensationalized.
Article
Topic modelling is a method of statistical data mining of a corpus of documents, popular in the digital humanities and, increasingly, in social sciences. A critical methodological issue is how ‘topics’ (groups of co-selected word types) can be interpreted in analytically meaningful terms. In the current literature, this is typically done by ‘eyeballing’; that is, cursory and largely unsystematic examination of the ‘top’ words in each algorithmically identified word group. We critically evaluate this approach in a dual analysis, comparing the ‘eyeballing’ approach with an alternative using sample close reading across the corpus. We used MALLET to extract two topic models from a test corpus: one with stopwords included, another with stopwords excluded. We then used the aforementioned methods to assign labels to these topics. The results suggest that a close-reading approach is more effective not only in level of detail but even in terms of accuracy. In particular, we found that: assigning labels via eyeballing yields incomplete or incorrect topic labels; removing stopwords drastically affects the analysis outcome; topic labelling and interpretation depend considerably on the analysts’ specialist knowledge; and differences of perspective or construal are unlikely to be captured through a topic model. We conclude that an interpretive paradigm founded in close reading may make topic modelling more appealing to humanities researchers.
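The ‘eyeballing’ approach this abstract critiques amounts to ranking a topic's words by weight and reading off the top few. A toy sketch (weights invented, not MALLET output) that also shows why stopword removal changes the apparent topic label so drastically:

```python
# Invented topic-word weights, of the shape a topic model assigns to one topic.
topic_weights = {
    "the": 0.30, "of": 0.25,
    "health": 0.12, "obesity": 0.10, "risk": 0.09, "cancer": 0.07,
}

def top_words(weights, k=3, stopwords=frozenset()):
    """Return the k heaviest words, optionally filtering stopwords first."""
    kept = {w: p for w, p in weights.items() if w not in stopwords}
    return sorted(kept, key=kept.get, reverse=True)[:k]

with_stops = top_words(topic_weights)                            # function words dominate
without_stops = top_words(topic_weights, stopwords={"the", "of"})  # a 'health risk' topic emerges
```

With stopwords kept, the top words are uninformative function words; once removed, a plausible label appears, which is exactly the sensitivity to preprocessing the study documents.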
Article
Political communication researchers studying the news media coverage often distinguish between broadsheets and tabloids when sampling relevant news outlets. But recent work has pointed towards a ‘tabloidization’ of news coverage, complicating the empirical distinction between the two. Computational methods for text analysis can help us better understand how distinct the news coverage between these two types of news outlets is. We take the Brexit referendum as a case study illustrating various aspects in which broadsheets and tabloids cover an issue permeated by othering and divisive rhetoric. We focus on Brexit-related news coverage before and after the referendum (N = 32,946) and use word embeddings to analyze the portrayal of different groups of citizens that can generate an in- and outgroup divide. First, we document the presence of media-based othering in the form of overly similar migrant and European Union citizen representations that are, in turn, very dissimilar to the UK citizen representation. Second, we show partial convergence between tabloid and broadsheet newspapers, as differences in the degree and characteristics of media coverage are rather small and specific.
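One common way to operationalise ‘similar group representations’ with embeddings is to average the vectors of terms referring to each group and compare the resulting centroids by cosine similarity. A minimal sketch with invented 3-dimensional vectors (not the study's actual data or exact method):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def centroid(vectors):
    """Average a set of word vectors into one group representation."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

# Invented vectors for terms referring to each group of citizens.
migrant_terms = [[0.80, 0.10, 0.30], [0.70, 0.20, 0.40]]
eu_terms      = [[0.75, 0.15, 0.35], [0.70, 0.10, 0.30]]
uk_terms      = [[0.10, 0.90, 0.20], [0.20, 0.80, 0.10]]

sim_migrant_eu = cosine(centroid(migrant_terms), centroid(eu_terms))
sim_migrant_uk = cosine(centroid(migrant_terms), centroid(uk_terms))
```

Here the migrant and EU-citizen centroids are near-identical while both sit far from the UK-citizen centroid, which is the similarity pattern the abstract describes as media-based othering.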
Article
Machine learning is a rapidly growing research paradigm. Despite its foundationally inductive mathematical assumptions, machine learning is currently developing alongside traditionally deductive inferential statistics but largely orthogonally to inductive, qualitative, cultural, and intersectional research—to its detriment. I argue that we can better realize the full potential of machine learning by leveraging the epistemological alignment between machine learning and inductive research. I empirically demonstrate this alignment through a word embedding model of first-person narratives of the nineteenth-century U.S. South. Situating social categories in relation to social institutions via an inductive computational analysis, I find that the cultural and economic spheres discursively distinguished by race in these narratives, the domestic sphere distinguished by gender, and Black men were afforded more discursive authority compared to white women. Even in a corpus over-representing abolitionist sentiment, I find white identities were afforded a status via culture not allowed Black identities.
Article
It has often been noted that corpus-assisted discourse analysis is inherently comparative (e.g. Partington 2009) but in this paper I want to emphasise that such comparison does not exclusively entail the analysis of difference and that the analysis of similarity can be productively incorporated into the framework. As Baker (2006: 182) notes, the way that differences and similarities interact with each other is ‘an essential part of any comparative corpus-based study of discourse’. In this paper, first, I outline why the search for similarity is relevant to the analysis of discourse using corpus linguistics, I then go on to survey some possible ways of doing this, and finally I take the representation of boy/s and girl/s in British broadsheet newspapers as an example.
Article
The view of academic discourse as a rhetorical activity involving interactions between writers and readers is now central to most perspectives on EAP, but these interactions are conducted differently in different disciplinary and generic contexts. In this paper I use the term proximity to refer to a writer's control of those rhetorical features which display both authority as an expert and a personal position towards issues in an unfolding text. Examining a corpus of texts in two very different genres, research papers and popular science articles, I attempt to highlight some of the ways writers manage their display of expertise and interactions with readers through rhetorical choices which textually construct both the writer and the reader as people with similar understandings and goals.
How can a corpus be used to explore patterns?
  • Hunston
Word embeddings: a survey
  • F Almeida
  • G Xexéo
Comparative study of LSA vs Word2vec embeddings in small corpora: a case study in dreams database
  • Altszyler
text2vec: Modern Text Mining Framework for R
  • Selivanov