
Changing Language Ideological Concepts in the Human-Machine Era. Questions, Themes and Topics



This is a joint project of the members of the working group 'Ideologies, Beliefs, Attitudes' of the COST network 'Language in the Human-Machine Era' and discusses current themes and topics related to changing language ideologies in the context of digital technologies.
Schneider, Britta, Bettina Migge (eds.)
Changing Language Ideological Concepts in the
Human-Machine Era.
Questions, Themes and Topics
Doris Dippold, University of Surrey, UK
Iker Erdocia, Dublin City University, Ireland
Marie-Theres Fester-Seeger, European University Viadrina, Germany
Sviatlana Höhn, University of Luxemburg, Luxemburg
Ledia Kazazi, University of Elbasan, Albania
Mandy Lau, York University, Canada
Didem Leblebici, European University Viadrina, Germany
Barbara Lewandowska-Tomaszczyk, University of Konin, Poland
Miriam Lind, University of Mainz, Germany
Morana Lukač, University of Greifswald, Germany
Philipp Meer, University of Münster, Germany
Bettina Migge, University College Dublin, Ireland
Susanne Mohr, Norwegian University of Science and Technology, Norway
Martina Podboj, University of Rijeka, Croatia
Agnese Sampietro, University of Jaume I, Spain
Beatrice Savoldi, Università degli Studi di Trento, Italy
Britta Schneider, European University Viadrina, Germany
Auli Viidalepp, University of Tartu, Estonia
(with editorial support from Victoria Kozlova, European University Viadrina, Germany)
Working Paper, November 2022
This working paper is an outcome of joint work and discussions in the COST Network
Language in the Human-Machine Era, Working Group 'Ideologies, Beliefs, Attitudes' and
was edited by the chairs of the working group, Britta Schneider and Bettina Migge.
It introduces general thoughts on language ideological reformulations in the age of digital
technologies as well as exemplary empirical studies that display language ideological
changes, conducted by individual members of the Working Group. In this sense, it functions
like an informal 'mini-handbook in progress' and gives an overview of current and emerging
topics in the field for future study and discussion. Note that the text makes no claim to be
exhaustive and that the separate sections range from short notes to more elaborate
discussions or presentations. All sections have been peer-reviewed by the editors and by
the authors involved.
1) Introduction – Language Ideologies in the Human-Machine Era (Britta
Schneider, Bettina Migge)
2) Theory Background – Language Ideologies, Media Ideologies and Their Social Relevance
2.1. Language ideologies and media ideologies (Britta Schneider, Bettina Migge,
Susanne Mohr)
2.2. Excursus and terminology disclaimer: concepts of harms and bias in NLP and
how they relate to language ideologies (Beatrice Savoldi)
3) New Forms of Language Policing in Digital Technologies – Content
Moderation, Content Ranking, Commercial Corpus-building, Bias Treatment,
Digital Dictionaries
3.1. Language policy and multilingualism in digital contexts (Iker Erdocia)
3.2. Corpus-building and bias treatment in the industry as ideological act (Bettina
Migge, Britta Schneider)
3.3. Excursus on language ideological effects in corpus-based tools 1: Gender bias
in language technologies (Beatrice Savoldi)
3.4. Excursus on language ideologies and corpus-based tools 2: Language
technologies and their socio-cultural embedding (Marie-Theres Fester-Seeger)
3.5. Commercial content moderation (Mandy Lau)
3.6. Commercial content ranking (Sviatlana Höhn)
3.7. Digital dictionaries as changing concepts of language standard (Bettina Migge)
3.8. Normative enforcement through grammar checkers (Agnese Sampietro)
4) Human-to-Human Interaction in Platform Society, New Indexical Functions of
Language and New Discourse Genres
4.1. Interactivity and the blurring of speaker and addressee roles (Barbara
4.2. Identity construction in social media (Ledia Kazazi)
4.3. Interactive practices in gaming (Barbara Lewandowska-Tomaszczyk)
4.4. How affordances of digital platforms shape the way people use language online
(Martina Podboj)
4.5. Conflictual interaction on social media platforms (Bettina Migge)
5) Sound-Based Speech Technology – Effects on Language Practice, Language
Ideologies, Social Bonding and Linguistic Research
5.1. Voice-based human-to-human communication (Agnese Sampietro)
5.2. How sound-based technology works (Mandy Lau)
5.3. Sound-based corpora – histories, sources, biases (Mandy Lau, Didem
Leblebici, Britta Schneider)
5.4. Language ideologies in human-voice assistant interaction (Miriam Lind, Doris
5.5. Excursus: User expectations regarding chatbots (Sviatlana Höhn, Doris
5.5. The role of sound in anthropomorphisation and agency attribution (Miriam Lind,
Marie-Theres Fester-Seeger)
5.6. Ideologies and biases in linguistic speech processing tools (Philipp Meer)
6) Final thoughts (Britta Schneider)
1) Introduction – Language Ideologies in the
Human-Machine Era (Britta Schneider, Bettina Migge)
Many ideas we have about language, language change, languages' distribution in territorial
space, and language norming are not conceivable without the cultural practices of writing
and printing (see e.g. Ong 1982, Linell 2005). In the human-machine era, we observe a
change and diversification of media technologies. Besides print media and writing, there
are, for example, language technologies that co-produce human language like grammar
checkers, machine translation, automatic speech recognition, automatic text generation or
voice assistants. In addition, there are social media platforms, which allow for new forms of
human-to-human communication. Powerful printing presses and educational institutions are
no longer the only ones who control written language (Schneider 2022). Since the advent of
Web 2.0, or the so-called democratization of communication, social media applications and
the internet allow different stakeholders, including lay people, commercial entities and
official institutions, to make available or generate language in the public domain. They thus
set new trends that may become conventionalised. Although there are many attempts by
governments and media companies to control communication and language use via human
and technological means, we already see that social media has changed our understanding
of written language. While earlier views assumed that written norms and orthography in
particular were fixed and neutral, we now see that orthography is frequently manipulated to
express stances of truthfulness or alignments with minoritised varieties, for example (e.g.
Androutsopoulos 2006, Moll 2015). While mindful of traditional norms, official and
commercial entities increasingly also adapt their texts in various ways to facilitate
interaction with their customers such as through juxtaposing written text with spoken
language, visual options and colloquial speech practices. All this suggests that earlier views
that privileged written, homogeneous and normative representations of language are
changing. At the same time, the sheer volume of information that constantly needs to be
made available in different languages for an ever greater number of
users/consumers requires innovative options such as machine translation and automatic
text generation (that is, text created by computer programmes). These types of language
practice heavily rely on existing digitized text whose quality is quite variable and rarely
audited (see Kreutzer et al. 2021). Taken together, the diversity of language practices in
digital environments immediately opens up the question of whether traditional concepts and
norms of language and attitudes towards standardized and non-standardized language are
changing in the context of digital language practices.
In the remainder of this text, when we talk about language technologies, we include the
following types of tools and applications:
machine translation
content moderation
content ranking
grammar and spell checkers, assisted writing
social media use
voice assistants/ automatic speech recognition (ASR)
digital dictionaries
video games
forced alignment (see 5.6.)
automatic vowel formant prediction (see 5.6.)
coreference resolution (checks pronoun/noun relations, see 3.3.)
We discuss how digital technologies impact language and affect our concepts of language,
language genres, normative discourses about language and language attitudes. We use
the term language ideologies as an overarching concept that understands languages as
culturally, discursively constructed phenomena and that refers to normative, value-oriented
discourses about language in general (see section 2). The insights we present here were
developed within the working group 'Ideologies, Beliefs, Attitudes' that started as a
sub-group of the COST Action 'Language in the Human-Machine Era' in autumn 2020. In
our joint explorations, we set out to examine how the notion of 'language' may change in a
digital age. How does digital technology impact linguistic concepts, models and beliefs
about language and communication? How are languages discursively and technologically
constructed, managed and policed in the realm of digital language technologies and AI
tools? Are ideas of what language entails reconfigured in the multimodal practices of digital
communication that light-heartedly combine linguistic signs with images, videos and other
forms of sign-making? And what do online user practices tell us about changing genres of
language and language ideologies? Our discussions led us to conclude that understanding
language practices and their conceptualization is central to a more nuanced
understanding of who we are as humans and social beings.
In the following, we first introduce the theoretical concepts underlying these questions
(section 2). The section first elaborates on the notion of language ideologies and relates it
to media ideologies. This is followed by an excursus on the discussion of social
problems associated with digital language technologies as it is currently taking place in the
field of natural language processing (NLP). In the third section, we focus on both the active
and also tacit processes of the regulation of language in the realm of digital space and
language technology. This includes overt regulation by governments but also practices of
programmers and of commercial actors. The cultural discourses and technological
decisions of the latter group, for example, in defining corpora, regulating contents or
rankings, or building dictionaries, have a clear impact on how language is normed and
understood. Section 4 focuses on digital human-to-human language practices, which can
be observed in social media applications, and inspects how the affordances of digital tools
impact on how people use language. We go on to ask what this tells us about potentially
changing language attitudes, genres of language, and concepts of conflict and linguistic
identity construction. In the fifth section, we illustrate some of the specific practices and
discourses that appear in the human-machine era by focusing on tools that exploit the
sounds of human language, as in voice-controlled devices and speech processing tools. In
our final discussion, we summarize our findings and suggest some ways of rethinking
current approaches to language, to language technologies and to studying language and
technology. Given that human language and communication have central functions in
constructing social belonging and social hierarchies, we argue that our discussion is an
important addition to understanding language in the human-machine era and to contributing
to a critical monitoring of the effects of digital language technologies on human culture,
social structure and social inequality.
2) Theory Background - Language Ideologies,
Media Ideologies and Their Social Relevance
In this section, we introduce fundamental previous research on the concepts of language
ideologies and language attitudes. We also elaborate on how media technologies
interrelate with language ideological concepts.
2.1. Language ideologies and media ideologies (Britta Schneider, Bettina Migge,
Susanne Mohr, Agnese Sampietro)
Language Ideologies
Language ideology is a technical term first coined by linguistic anthropologists (see e.g.
Woolard & Schieffelin 1994). The term ideology here needs to be clearly distinguished from
its usage in the lay community, where it refers to a distorted or misleading belief. Rather,
scholars argue that any (!) idea related to language is normative and can in this sense be
referred to as an 'ideology', that is, a socially grounded concept (Gal & Irvine 2019: 11-14).
The interests of language ideology research overlap to a certain extent with language
attitude research. While the study of language attitudes is mostly interested in attributions
of value to particular languages or varieties, and treats languages as more or less fixed
entities, language ideology research covers a wider area of research that also implies the
study of the ontology of language and of lay and academic concepts of language. It
maintains that views, perceptions, etc., are dynamic and shifting with interactional and
social contingencies. In a nutshell, language ideology research is interested in discourses
about language, that is, culturally conditioned concepts of language. A very prominent
language ideology is the idea that human language appears in languages that are tied to
culturally homogenous and territorially-bound groups. This is contested by some scholars,
for instance by Makoni (2011) who puts forth an “integrative sociolinguistic” approach,
suggesting that, for example, many African languages are a colonial construct. In contrast
to the questioning and often deconstructive approach of language ideology research, the
study of language attitudes typically takes an empirical approach and studies particular
groups of speakers, for example with an interest in which social attributes are associated
with which forms of speech (see e.g. Garrett 2010). As Kroskrity (2018) observes:
Language attitudes, as a concept, is generally associated with an objectivist
concern with quantitative measurement of speakers’ reactions. This
concern is surely related to its conceptual origins in social psychology,
quantitative sociolinguistics, and educational linguistics. In contrast, the
concept of language ideologies is associated with qualitative methods such
as ethnography, conversational analysis, and discourse analysis, …
In the context of US American linguistic anthropology, language ideology research
developed throughout the 1980s, 1990s and early 2000s and was, roughly speaking,
concerned with the role of discourses and concepts on language in scientific
epistemologies and in social contexts (Bauman & Briggs 2003; Heath 1989; Jaworski et al.
2004; Joseph & Taylor 1990; Kroskrity 2000; Kroskrity 2001; Kroskrity et al. 1992;
Schieffelin et al. 1998; Woolard 1998; Woolard & Schieffelin 1994). Particularly the work of
Gal, Irvine and Woolard is associated with the study of how languages, as distinct
categories, come into being and how this process is associated with Western concepts of
the social as appearing in nation-states and national publics (Gal & Irvine 1995; Gal &
Woolard 1995; Gal & Woolard 2001; Irvine & Gal 2009). Language ideologies are usually
studied using participant observation and interviews rather than questionnaires and
judgment tasks, as is frequently the case in language attitude studies. Essentially, people’s
discourses as part of their work, social interactions or in semi-guided interviews are
analyzed together with an investigation of unelicited representations of language use
including written documents, signs, videos and official discourses from a range of
The realization that languages as standardized and relatively stable entities are an
outcome and not a precondition of social discourse has become increasingly prominent
with the rise of digital media and in times of globalization (see e.g. Pennycook 2004;
Pennycook 2010). The notions of languaging and translingualism (Canagarajah 2013;
García & Wei 2014; Jørgensen 2008) similarly consider languages and their place in
society as discursively constructed categories and are therefore also interested in the
language ideological concepts that co-constitute separate and standardized languages.
Critical perspectives on constructions of standard language are also found in related
cognitive research and research on embodiment, specifically, in the writing of those
concerned with distributed language. Contrary to classical views in linguistics that treat
language as a form-based system which consists of arbitrary, a priori repeatable entities
(also commonly referred to as ‘words’), proponents of the distributed language perspective
give due weight to linguistic bodily activity. In Cowley’s (2019) words: “human beings
happen in language” (p. 2). Therefore, rather than treating language as innate, i.e., as
something that can be ‘acquired,’ ‘processed,’ and ‘stored’ in people's 'minds', this perspective
holds that language emerges as living human beings engage with and in the world. While orthodox
linguists detach language from life, the distributed language perspective emphasizes the
value of paralinguistic aspects (such as vocal dynamics, gestures, facial expressions) to
language. Having emerged from the distributed cognition perspective (Hutchins 1995) and
integrational linguistics (Harris 1981), the distributed language perspective directs
empirical observation towards how people language together. Such understandings of language
are in stark contrast to how language is typically understood, constructed and experienced
in many social and in digital realms. Overall, related to a rise in interest in the agency of
speakers, the study of language as embodied activity and the denaturalization of languages
as standardized entities, language ideology research has moved from a rather marginalized
field to a central concern of socially interested disciplines of linguistics and is foundational
to understanding the changes of language culture in the age of human-machine interaction.
Media ideologies
The popularization of language ideological research in fields such as sociolinguistics,
educational linguistics and linguistic anthropology is linked to changes in media technology
practice. Before the advent of digital communication, print literacy co-defined what was
perceived as ‘correct’ language in national publics (see e.g. Anderson 1983; Gal 2001).
One can thus say that writing and printing have had an epistemological impact on language
and language ideologies in society and in linguistic disciplines (discussed e.g. in Coupland
2010; Linell 1982; Linell 2005; Ong 1982). Writing, printing and reading as cultural
phenomena have been studied from different angles, as, for example, in cultural
anthropology (Fabian 2001), in critical educational perspectives (Street 1995), or in a
cultural science and historical perspective (Assmann 2010; Ong 1982). In the context of
media philosophy, McLuhan’s famous argument that “the medium is the message”
(McLuhan 1964) shows that concepts are not only transmitted via media but that media
themselves contribute to the meanings of concepts. In relation to language ideology
research, it can thus be argued that “cultural system[s] of ideas about social and linguistic
relationships” (Irvine 1989: 255) in Western societies depend not only on the formation of
national publics (Gal & Woolard 2001) but furthermore on the technologies of writing and
print. Language ideologies that construct national standard languages are thus an outcome
of social formation as well as technological development (the general role of objects in
linguistic epistemology is discussed e.g. in Pennycook 2018).
Research on non-Western cultural contexts shows that discourses on standardization,
stability and homogeneity in language may differ where oral language culture holds a more
prominent place. Thus, in Caribbean Creole cultures, for example, idiosyncratic
appropriation and creative expressiveness are valued, which results in different
attributions of value to language and in different kinds of standardizing practice (see e.g.
French & Kernan 1981; Pyndiah 2016; Schneider, accepted for publication). Research on
the specific question of the interrelationship of media change and language ideologies has
been published in the 2010s (see e.g. Androutsopoulos 2016; Coupland 2016; Eisenlohr
2010; Gershon 2010). The outcome of this strand of research confirms that media affect the
meaning of concepts and that productions of ‘truth’ and cultural value interact with media
choice (e.g. illustrated in Gershon’s work on using different media to end romantic
relationships, Gershon 2010a).
Apart from this, Keane (2003) has shown that verbal language itself is often treated as
more meaningful or sophisticated than other ways of communication at least in the Western
world (see Shelton & Okayama 2006), as also argued by Thurlow (2017), for instance, with
regard to the concept of semiotic ideologies. Metadiscourses on language and semiotic
signs are very much indicative of language ideologies. Studies on “semiotic ideologies”
(Thurlow 2017) in relation to emojis are a case in point (Sampietro 2018; Thurlow &
Jaworski 2020). Thurlow and Jaworski (2020: 45) mention that Western newspaper
journalists referring to languages with alphabetic scripts consistently reproduce language
ideologies of digital discourse as exaggeratedly new and distinctive, relating this to
“long-standing narratives of linguistic degradation or ruin usually attributed to technology”.
They identify a discourse of “language endangerment”, with language itself being construed
as in need of protection from “evil” visual communication (the dichotomy is not, however,
present in cultures with other scripts, see Shelton & Okayama 2006). This is amply shown
in some of their examples, where emojis are framed as an external, unwanted assault on
language, of which they are not a fully fledged part (Thurlow & Jaworski 2020: 46). Sampietro
(2018) observes that in the Spanish press, emojis are considered a threat not only to
literacy but also to the integrity of writing, amid fears of a return to ancient scripts (such as
hieroglyphics); this further stresses the superiority of written over visual communication in
cultures with alphabetic scripts (see Kress & van Leeuwen 2006: 5). This illustrates how
visual discourse combines a range of language and media ideologies (Thurlow 2017;
Thurlow & Jaworski 2020), resulting in semiotic ideologies. These are concerned with
“people’s beliefs about signification or meaning-making, and, specifically, issues such as
intentionality, agency and arbitrariness” (Thurlow & Jaworski 2020: 53). Importantly, these
ideologies are related to wider systems of differentiation and ultimately issues of authority
within a society.
In this vein, online management of multilingualism, for example, is also a form of language
policy and planning (Kelly-Holmes 2019) and language ideologies play a role in this regard
(see also section 3.6). Official efforts can support movements to balance linguistic
inequalities online, and possibly in other media. Another issue that is relevant with regard to
semiotic ideologies and the higher value attributed to written language as compared to
other communicative resources used for meaning making (cf. also Herring 2013) is the use,
or rather the absence, of sign languages in different media. Kurz and Cuculik (2015) show
that most online spaces are less accessible to deaf people due to the aforementioned
preference for written language and text on many platforms. While many deaf people are
literate in one of the national languages spoken in a respective country, this is not their
preferred medium of communication and illiteracy rates among deaf communities are in fact
still high (Kurz & Cuculik 2015).
To end this brief overview of the concerns of language ideology research and the related strand
of media ideology research, a crucial question for times to come is what will happen
to standard language ideologies where print literacy does not cease to exist but is found
alongside increasingly dominant and diverse digital media, with written as well as
oral, human-made as well as machine-made language, including oral and written
messenger communication, chats, machine translation, machine-based text design and
voice-controlled personal digital assistants (see also Schneider 2022). “Technology itself is
neither bad nor good, ‘not a destiny but a scene of struggle’ as well as possibility”
(Feenberg 1991: 14 as cited in Deumert 2014: 488). Public institutions always reflect and
are shaped by larger social inequalities and the internet as a whole is one example of such
a “public institution” (Deumert 2014: 488). Understanding the reconfiguration of language
ideologies in digital settings and in relation to different media technologies is thus an
important and socially relevant contribution to the study of language in society in times to
come. The following excursus illustrates this relevance from the perspective of NLP
(Natural Language Processing).
References (Britta Schneider, Susanne Mohr, Bettina Migge, Marie-Theres
Anderson, B. 1983. Imagined communities. London: Verso.
Androutsopoulos, J. (2006). "Jugendsprachen als kommunikative soziale Stile.
Schnittstellen zwischen Mannheimer Soziolinguistik und Jugendsprachenforschung."
Deutsche Sprache 1-2: 106-121.
Androutsopoulos, J. 2016. "Theorizing media, mediation and mediatization." In
Sociolinguistics. Theoretical debates, edited by Nikolas Coupland. Cambridge:
Cambridge University Press. 282-302.
Assmann, A. 2010. Erinnerungsräume. Formen und Wandlungen des kulturellen
Gedächtnisses (5. Aufl.). München: C.H. Beck.
Bauman, R. & Briggs, C. 2003. Voices of modernity: Language ideologies and the politics
of inequality. Cambridge, UK: Cambridge Univ. Press.
Bommasani, R. & Liang, P. et al. 2021. "On the opportunities and risks of foundation
models." Center for Research on Foundation Models (CRFM) – Stanford University
Canagarajah, S. 2013. Translingual practice. Global Englishes and cosmopolitan relations.
London: Routledge.
Coupland, N. 2010. "Language ideology, media and social change." In Performing the Self.
SPELL: Swiss Papers in English Language and Literature, edited by K. Junod and D.
Maillat. Tübingen: Narr.
Coupland, N. 2016. "Five Ms for sociolinguistic change." In Sociolinguistics. Theoretical
Debates, edited by Nikolas Coupland. Cambridge: Cambridge University Press.
Cowley, Stephen. "Languaging: Towards a New Ecolinguistics." Chinese Semiotic Studies
(2019): doi: 10.1515/css-2019-0027.
Deumert, A. 2014. Sites of struggle and possibility in cyberspace. Wikipedia and Facebook
in Africa. In Jannis Androutsopoulos (ed.), Mediatization and Sociolinguistic Change.
Berlin & Boston: De Gruyter Mouton. 487-514.
Dodge, J, Sap, M., Marasovic, A., Agnew, A., Ilharco, G., Groeneveld, D., Mitchell, M. &
Gardner, M. 2021. "Documenting large webtext corpora: a case study on the colossal
clean crawled corpus." arXiv
Eisenlohr, P. 2010. "Materialities of entextualization: the domestication of sound
reproduction in Mauritian Muslim devotional practices." Journal of Linguistic
Anthropology 20: 314–333.
Fabian, J. 2001. Anthropology with an attitude: Critical essays. Stanford: Stanford
University Press.
Feenberg, A. 1991. Critical Theory of Technology. New York: Oxford University Press.
French, R. & Kernan, K. T. 1981. "Art and artifice in Belizean Creole." American Ethnologist
8: 238-258.
Gal, S. 2001. "Linguistic theories and national images in 19th century Hungary." In
Languages and Publics. The Making of Authority, edited by Susan Gal and Kathryn
A. Woolard. New York: Routledge. 30-45.
Gal, S. & Irvine, J. T. 1995. "The boundaries of languages and disciplines: how ideologies
construct difference." Social Research 62: 967-1001.
Gal, S. & Irvine, J. T. 2019. Signs of Difference. Language and Ideology in Social Life.
Cambridge: Cambridge University Press.
Gal, S. & Woolard, K. A. 1995. "Constructing languages and publics: authority and
representation." Pragmatics 5: 129-138.
Gal, S. & Woolard, K. A. 2001. "Constructing languages and publics: authority and
representation." In Languages and Publics. The Making of Authority, edited by Susan
Gal and Kathryn A. Woolard. New York: Routledge. 1-12.
García, O. & Leiva, C. 2014. "Theorizing and enacting translanguaging for social
justice." In Heteroglossia as Practice and Pedagogy, edited by Adrian Blackledge and
Angela Creese. Heidelberg: Springer. 199-216.
García, O., & Wei, L. (2014). Translanguaging. Language, bilingualism and education.
Basingstoke: Palgrave Macmillan.
Gerrand, P. 2007. Estimating linguistic diversity on the Internet: A taxonomy to avoid
pitfalls and paradoxes. Journal of Computer-Mediated Communication 12(4).
Garrett, P. 2010. Attitudes to Language. Key Topics in Sociolinguistics. Cambridge:
Cambridge University Press.
Gershon, I. 2010a. "Breaking up is hard to do: media switching and media ideologies."
Journal of Linguistic Anthropology 20: 389–405.
Gershon, I. 2010b. "Media ideologies: an introduction." Journal of Linguistic Anthropology
20: 283–293.
Harris, R. 1981. The language myth. London: Duckworth.
Herring, S. C. 2013. Discourse in Web 2.0: Familiar, reconfigured, emergent. In Deborah
Tannen & Anna Marie Trester (eds.), Discourse 2.0. Language and New Media.
Washington, DC: Georgetown University Press. 1-25.
Heath, S. B. 1989. "Language Ideology." In International Encyclopedia of Communications,
edited by Erik Barnouw. Oxford: Oxford University Press. 393-395.
Hutchins, E. 1995. Cognition in the wild. Cambridge, MA: MIT Press.
Irvine, J. T. 1989. "When talk isn't cheap: language and political economy." American
Ethnologist 16: 248-267.
Irvine, J. T. & Gal, S. 2009. "Language ideology and linguistic differentiation." In Linguistic
anthropology: a reader, edited by Alessandro Duranti. Oxford: Wiley Blackwell.
Jaworski, A., Coupland, N. & Galasiński, D. 2004. "Metalanguage: Social and ideological
perspectives." In Folk metalanguage, edited by Dennis Preston. Berlin: Mouton de
Gruyter. 75 - 103.
Jørgensen, J. N. 2008. "Polylingual languaging around and among adolescents." International Journal
of Multilingualism 5: 161–176.
Joseph, J. E. & Taylor, T. J. eds. 1990. Ideologies of Language. London: Routledge.
Keane, E. W. 2003. "Semiotics and the social analysis of material things." Language and
Communication 23: 409-425.
Kelly-Holmes, H. 2019. Multilingualism and technology: A review of developments in digital
communication from monolingualism to idiolingualism. Annual Review of Applied
Linguistics 39. 24-39.
Kress, G. & van Leeuwen, T. 2006. Reading Images: The Grammar of Visual Design. 2nd
ed. London: Routledge.
Kreutzer, J., Caswell, I. & Wang, L. 2021. "Quality at a glance: an audit of web-crawled
multilingual datasets." Cornell University arXiv
Kroskrity, Paul. 2018. "Language ideologies and language attitudes." In Oxford
Bibliographies. DOI: 10.1093/OBO/9780199772810-0122
Kroskrity, P. V., ed. 2000. Regimes of language. Ideologies, polities and identities. Santa
Fe, New Mexico: School of American Research Press.
Kroskrity, P. V. 2001. "Language Ideology." In Handbook of Pragmatics, edited by Jan-Ola
Östman and Jef Verschueren, Amsterdam
Kroskrity, P. V., Woolard, K. A. & Schieffelin, B. B. 1992. "Special Issue: Language
Ideologies." Pragmatics 2: 235-449.
Kurz, C. A. N. & Cuculik, J. 2015. International Deaf space in social media: The Deaf
experience in the US. In Michele Friedner & Annelies Kusters (eds.), It’s a Small
World. International Deaf Spaces and Encounters. Washington, DC: Gallaudet
University Press. 225-235.
Linell, P. 1982. The written language bias in linguistics. Linköping: The University of
Linell, P. 2005. The written language bias in linguistics. Its nature, origins and
transformations. London: Routledge.
Makoni, S. 2011. "Sociolinguistics, colonial and postcolonial: an integrationist perspective."
Language Sciences 33: 680-688.
McLuhan, M. 1964. Understanding media. The extensions of man. New York: Mentor.
Moll, A. 2015. Jamaican Creole goes web. Sociolinguistic styling and authenticity in a
digital 'Yaad'. Amsterdam: Benjamins.
Ong, W. J. 1982. Orality and literacy. The technologizing of the word. London: Routledge.
Pennycook, A. 2004. "Performativity and language studies." Critical Inquiry in Language
Studies 1: 1-19.
Pennycook, A. 2010. Language as a local practice. London: Routledge.
Pennycook, A. 2018. Posthumanist applied linguistics. London: Routledge.
Pyndiah, G. 2016. "Decolonizing Creole on the Mauritius islands: creative practices in
Mauritian Creole." Island Studies Journal 11: 485-504.
Sampietro, A. 2018. "Los emojis en la prensa y la publicidad: análisis y propuestas
pedagógicas." SIBA Monograph 5:19-37.
Schieffelin, B., Woolard, K. & Kroskrity, P. 1998. "Language ideologies: Practice and
theory." In Anger, gender, language shift, and the politics of revelation in a Papua
New Guinean village, edited by Don Kulick. New York: Oxford Univ. Press. 87 - 102.
Schneider, B. 2019. "Methodological nationalism in linguistics." Language Sciences 76
Schneider, B. (2022). Multilingualism and AI – the regimentation of language in the age of
digital capitalism. Signs and Society, 10,
Schneider, B. accepted for publication. Liquid languages – constructing language in late
modern cultures of diffusion. Cambridge: Cambridge University Press.
Shelton, B. & Okayama, E. 2006. ‘Between Script and Pictures in Japan’. Visual Language
40 (2): 157–76.
Street, B. V. 1995. Social literacies: Critical approaches to literacy in development,
ethnography and education. London: Longman.
Sun, T., Gaut, A., Tang, S., Huang, Y., El Sherief, M., Zhao, J., Mirza, D., Belding, E.,
Chang, K. & Wang, W. Y. 2019. "Mitigating Gender Bias in Natural Language
Processing: Literature Review." Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics.
Thurlow, C. & Jaworski, A. 2017. "The discursive production and maintenance of class
privilege: Permeable geographies, slippery rhetorics." Discourse & Society 28:
Thurlow, C. 2017. “Forget about the words?” Tracking the language, media and semiotic
ideologies of digital discourse: The case of sexting. Discourse, Context and Media 20.
Thurlow, C. & Jarowski, V. 2020. “Emoji invasion”: The semiotic ideologies of language
endangerment in multilingual news discourse. In Crispin Thurlow, Christa Dürscheid
& Federica Diémoz (eds.), Visualizing Digital Discourse. Interactional, Institutional
and Ideological Perspectives. Berlin: de Gruyter. 45–64.
Virtanen, A., Kanerva, J., Ilo, R., Luoma, J., Luotolahti, J., Salakoski, T., Ginter, F. &
Pyysalo, S. 2019. "Multilingual is not enough: BERT for Finnish." arXiv
Woolard, K. A. 1998. "Introduction. Language ideology as field of inquiry." In Language
ideologies. Practice and theory, edited by Bambi B. Schieffelin, Kathryn A. Woolard
and Paul V. Kroskrity. Oxford: Oxford University Press. 3-47.
Woolard, K. A. & Schieffelin, B. B. 1994. "Language ideology." Annual Review of Anthropology
23: 55-82.
2.2. Excursus and terminology disclaimer: concepts of harms and bias in NLP and
how they relate to language ideologies (Beatrice Savoldi)
To what extent do language technologies raise issues around language and power different
from those that already existed in non-digital language spaces? This was the question put to
the researcher Su Lin Blodgett, an expert on bias in language technologies, when she was
interviewed for the Radical AI podcast (Doyle-Burke & Smith 2020). “That’s a great
question [...] and I'll probably be mulling it over this whole afternoon after this”, Blodgett
replied.
Natural language processing (NLP) is the area of research concerned with creating
language technologies that can automatically process and generate language. While the
more traditional areas of linguistics and NLP have much in common, the key difference
lies in their scope and in the way they approach the same object of study: language.
Rather than being strictly theoretically oriented and descriptive, NLP, as an engineering field,
is concerned with how computer systems should be designed to treat language
efficiently and effectively (Tsujii 2021), that is, with a view to deploying a specific, practical
application.
The last few years have witnessed significant improvements in the capabilities of such
technologies, partly due to the advent of better-performing deep learning methods in NLP
(Otter et al. 2020). In light of their success, language technologies have entered the social
arena and been adopted at a massive scale. From commercial automatic translation tools
used by millions of people to virtual assistants inhabiting our houses, these systems have
become an integral part of our daily activities. Moreover, often unbeknownst to a large portion
of the public, these technologies are ubiquitous in multiple aspects of our lives. Content
moderation on social media platforms, spam filtering, online job advertisements: albeit to
different degrees, NLP tools are employed in all of these areas and more. Along with the
widespread use and deployment of language technologies, however, there have been
rising concerns about their societal impact (Hovy & Spruit 2016). Evidence includes a chatbot
spewing sexist and racist slurs (Vincent 2016), hate speech detection tools
disproportionately flagging African-American Vernacular English (AAVE) tweets as toxic
and abusive (Sap et al. 2019; Davidson et al. 2019), and the infamous case of a Palestinian
man being arrested because of a Facebook post automatically (mis)translated as “attack them”
(Hern 2017). Accordingly, to address such concerns, a dedicated area of research is now
devoted to ethical issues in NLP, populated with specific terminology such as “harms”, “risks”,
“stakeholders”, and “bias” (Bender & Friedman 2018; Bender 2019; Blodgett et al. 2020;
Weidinger et al. 2021).
Indeed, the study of bias occupies a central place in NLP, where, as Blodgett et al. (2020)
note, the term is used to indicate a wide range of disparate but all somewhat undesirable
behaviors. Automatic speech recognition (ASR) systems with higher error rates for women
(Tatman 2017), image captioning models that do not detect women sitting next to a
computer (Hendricks et al. 2018), and automatic résumé filtering that discards women’s job
applications regardless of their qualifications (Dastin 2018): these are all commonly cited
examples of gender bias. Yet bias per se can be a very opaque, fraught term
(Campolo et al. 2017). As employed in cognitive science (Tversky & Kahneman 1974),
linguistics (Borchmann et al. 2019), or even in everyday life, bias refers to a skew, the
result of mental shortcuts that may be based on restricted observations or prejudicial
judgments. In statistics, however, bias is traditionally interpreted with a neutral connotation,
implying a systematic deviation from an expected result (Shah et al. 2020): this is the sense
the NLP field used to be more familiar with. It is plausible that the vague nature
of the word is precisely what has made it a particularly appealing umbrella term. Namely,
being dense and opaque, yet a generally known concept, “bias” has made it possible to refer to
a bulk of concerning behaviors of language technologies without really delving into
their real-world effects and potentially discriminatory outcomes for users.
On this basis, Blodgett et al. (2020) took stock of the situation, analyzing how NLP papers
on bias present “motivations [that] are often vague, inconsistent, and lacking in normative
reasoning”. As active creators of technology, NLP practitioners ought to be aware of the
design choices incorporated into their tools, reflecting on what kinds of system behaviors
can be harmful, in what ways, and to whom. More specifically, the authors recommend
anticipating which people could be affected, under which scenarios, and to what extent,
while also recognizing the relationships between language and social hierarchies.
For instance, consider language technologies that do not work properly on AAVE as an
example of racial bias (Koenecke et al. 2020). A directly impacted stakeholder (Bender
2019) would be a speaker of this variety who decides to rely on an ASR system, only to
find out that it does not properly recognize their dialect. Depending on the context, such a
malfunction could merely imply a loss of time or a sense of frustration for the user. Higher-risk
scenarios, however, include serious errors in medical voice-dictation software
(Roger & Pendharkar 2003) or potential safety implications of voice recognition in cars.
Speakers of this variety of English can also be affected by language
technologies that they do not personally choose to use, making them indirect stakeholders
(Bender 2019). For instance, Twitter texts are processed for health tracking
(Jurgens et al. 2017) or to gather a community’s input on public policy matters. If the
NLP techniques employed fail to recognize AAVE, the African-American community
might not be made aware of infectious diseases in their area and might be underrepresented
when it comes to policy decisions.
To facilitate clearer, more focused discussions, the taxonomy of “harms” arising from AI
systems proposed by Crawford (2017) has been widely used in NLP. It distinguishes two main
types of harm. Allocative harms imply an unequal distribution of opportunities and
resources. A straightforward example of this type of harm concerns technology employed
in job interviews, such as speech and tone detectors biased against certain dialects, which
compromise the performance of the interviewee. Representational harms, instead, concern
the detraction from the representation of social groups and their identity, which, in turn, can
affect attitudes and beliefs: these include the reproduction of stereotypes (e.g. black names
being associated with criminal records (Sweeney 2013)) or unequal performance of a
system across demographic groups. Along these lines, Mengesha et al. (2021) found that
AAVE speakers accommodate their speech when communicating with an ASR system: “I modify
the way I talk to get a clear and concise response [...]. What usually works for me is when I
talk real clear, and don’t use slang words like my regular talk.” Unlike allocative harms, the
impact of representational ones can be hard to quantify. However, as Crawford (2017) and
Blodgett et al. (2020) underscore, they ought to be treated as a problem in their own right,
not least because they are typically the origin and basis of allocative ones.
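As a purely illustrative aside, "unequal performance of a system across demographic groups" can be made concrete for ASR by comparing word error rates (WER) per speaker group. The following is a minimal sketch, not the method of any of the cited studies: the function names, the group labels and the toy transcripts are all hypothetical, and WER is averaged per utterance here for simplicity.

```python
from collections import defaultdict

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the ref words seen so far and the first j hyp words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)

def wer_by_group(samples):
    """Mean per-utterance WER for each group; samples are (group, reference, hypothesis)."""
    per_group = defaultdict(list)
    for group, reference, hypothesis in samples:
        per_group[group].append(word_error_rate(reference, hypothesis))
    return {g: sum(v) / len(v) for g, v in per_group.items()}

# Invented toy transcripts, for illustration only (not real data).
samples = [
    ("group_a", "turn on the kitchen light", "turn on the kitchen light"),
    ("group_a", "call my sister", "call my sister"),
    ("group_b", "turn on the kitchen light", "turn on the chicken light"),
    ("group_b", "call my sister", "call sister"),
]
rates = wer_by_group(samples)
```

A gap between the per-group rates is only a starting point: as the paragraph above stresses, a number of this kind does not by itself capture who is harmed, in what way, or what speakers do to work around the system.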
Admittedly, what is described as “representational harms” within the field of language
technology covers several aspects well known in the fields of sociolinguistics,
psycholinguistics and linguistic anthropology. While variation is the natural state of
language, whether exhibited through pronunciation, word choices, or grammatical
structures (Labov 1972; Eckert & Rickford 2001, inter alia), people hold metalinguistic
beliefs and attitudes about certain varieties and dialects. What is qualified and perceived
as “standard”, “unmarked” or “prestigious”, however, depends on social, hierarchical
reasons rather than linguistic justifications (Campbell-Kibler 2009; Preston 2009;
Loudermilk 2015). Such language ideologies, in turn, can play a key role in reinforcing
negative attitudes not only towards the language variety itself but also towards its speakers¹
(e.g. Alim et al. 2016; Rosa and Flores 2017; Craft et al. 2020). Along the lines of racial bias,
AAVE has historically been stigmatized as uneducated, offensive, grammatically
sloppy and simple, and its speakers decried as ignorant and lazy. This same attitude
has been documented as systemic dialect discrimination against AAVE speakers in the
education system (Williams et al. 1971; Terry et al. 2010), in courtrooms (Rickford & King
2016; Jones et al. 2019), and in employment (Henderson 2001; Grogger 2011).
Current studies on bias in NLP touch on many of the issues that sociolinguists and
linguistic anthropologists have long examined, just through a different vocabulary. So, once
again: what are the issues around language ideologies that are new to NLP?
According to Blodgett “[...] unique or maybe different about technologies is [...] I think scale
and perhaps an opaqueness”. Inscrutable, black box language technology is pervasive, to
the point where users may not even always be aware of operating with it, or may not be
able to avoid operating or being processed through such systems. Such widespread
societal diffusion, together with current biases affecting language technologies, can bring
linguistic discrimination to a new level. Still, in Blodgett’s words: “there's no recourse. And
you can't really refuse to participate effectively” (Doyle-Burke and Smith 2020).
¹ A somewhat funny but quintessential example of such a phenomenon could be that of the
Californian dialect, which was literally condensed into the subcultural stereotype of the silly and
superficial valley girl (Hess 2013; Habasque 2021).
Overall, this shows that an intensified and critical interaction between those responsible for
designing digital technologies and socially oriented forms of language research is crucial.
The following sections give insight into recent work from the fields of sociolinguistics,
multilingualism research and linguistic and cognitive anthropology that, taken together, give
a first overview of themes and topics that are relevant in understanding how language
practices and concepts regarding language are reconfigured through technological change. As
language is quintessential to the formation of social structures and social reality, these
insights can contribute to the construction of socially aware technologies.
REFERENCES (Beatrice Savoldi)
Alim, H. S., Rickford, J. R., & Ball, A. F. (Eds.). 2016. Raciolinguistics: How Language Shapes Our Ideas About
Race. Oxford University Press.
Bender, E.M. & Friedman, B. 2018. Data statements for natural language processing: Toward mitigating system
bias and enabling better science. Transactions of the Association for Computational Linguistics, 6:587–604.
Bender, E.M. 2019. A typology of ethical risks in language technology with an eye towards where transparent
documentation might help. In CRASSH. The future of Artificial Intelligence: Language, Ethics, Technology.
Cambridge, UK.
Blodgett, S. L., Barocas, S., Daumé III, H., & Wallach, H. 2020. Language (technology) is power: A critical
survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, pages 5454–5476, Online. Association for Computational Linguistics.
Borchmann, S. U., Levisen, C., & Schneider, B. 2019. “Linguistics as a biased discipline: identifications and
investigations”. Language Sciences, 76, [101218].
Campbell-Kibler, K. 2009. The nature of sociolinguistic perception. Language Variation and Change,
Campolo, A., Sanfilippo, M. R., Whittaker, M. & Crawford, K. 2017. AI Now Report 2017. New York: AI Now
Craft, J.T., Wright, K. E., Weissler, R. E. & Queen, R. M. 2020. “Language and Discrimination: Generating
Meaning, Perceiving Identities, and Discriminating Outcomes”. Annual Review of Linguistics, 6(1)
Crawford, K. 2017. The trouble with bias. In Conference on Neural Information Processing Systems (NIPS)
Keynote, Long Beach, USA.
Dastin, J. (2018). Amazon scraps secret AI recruiting tool that showed bias against women. REUTERS, (Date of Access:
Davidson, T., Bhattacharya, D., & Weber, I. 2019. Racial Bias in Hate Speech and Abusive Language Detection
Datasets. In Proceedings of the Third Workshop on Abusive Language Online, pages 25–35, Florence, Italy.
Association for Computational Linguistics.
Doyle-Burke, D. & Smith, J. (Hosts). Oct. 2020. Defining Bias with Su Lin Blodgett. The Radical AI Podcast.
Eckert, P., & Rickford, J. R. (Eds.). 2001. Style and sociolinguistic variation. Cambridge University Press.
Grogger, J. 2011. “Speech Patterns and Racial Wage Inequality”. J. Hum. Resour. 46 (1), 1–25. doi:
Habasque, P. 2021 "Is Creaky Voice a Valley Girl Feature? Stancetaking & Evolution of a Linguistic Stereotype."
Anglophonia. French Journal of English Linguistics 32.
Henderson, A. L. 2001. “Is Your Money where Your Mouth Is? Hiring Managers' Attitudes toward African
American Vernacular English,” ProQuest Dissertations and Theses. (Philadelphia, PA: Order No. 3003635,
University of Pennsylvania). Available at:
Hendricks, L., Burns, K., Saenko, K., Darrell, T., & Rohrbach, A. 2018. Women also snowboard: Overcoming
bias in captioning models. In Proceedings of the European Conference on Computer Vision (ECCV), pages
740–755. Munich, DE.
Hern, A. 2017. Facebook translates 'good morning' into 'attack them', leading to arrest. The Guardian.
Hess, A. 2013. “Why Old Men Find Young Women’s Voices So Annoying”. Slate.
Hovy D., & Spruit, S. L. 2016. The social impact of natural language processing. In Proceedings of the 54th
Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598,
Berlin, DE. Association for Computational Linguistics.
Jones, T., Kalbfeld, J. R., Hancock, R. & Clark, R. 2019. “Testifying while black: An experimental study of court
reporter accuracy in transcription of African American English”. Language, 95(2).
Jurgens, D., Tsvetkov, Y., & Jurafsky, D. 2017. Incorporating Dialectal Variability for Socially Equitable Language
Identification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (pp.
Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z. & Goel, S. 2020. “Racial disparities in
automated speech recognition”. Proceedings of the National Academy of Sciences, 117(14), 7684-7689.
Labov, W. 1972. Sociolinguistic Patterns. 4. University of Pennsylvania Press.
Loudermilk, B. 2015. “Implicit attitudes and the perception of sociolinguistic variation”. In Alexei Prikhodkine and
Dennis R. Preston, editors, Responses to Language Varieties: Variability, processes and outcomes, pages
Mengesha, Z., Heldreth, C., Lahav, M., Sublewski, J., & Tuennerman, E. 2021. “I don’t Think These Devices are
Very Culturally Sensitive.”— Impact of Automated Speech Recognition Errors on African Americans. Frontiers in
Artificial Intelligence, 169.
Otter, D. W., Medina, J. R. & Kalita, J. K. 2020. "A survey of the usages of deep learning for natural language
processing." IEEE transactions on neural networks and learning systems 32.2: 604-624.
Preston, D. R. 2009. “Are you really smart (or stupid, or cute, or ugly, or cool)? Or do you just talk that way?”
Language attitudes, standardization and language change. Oslo: Novus forlag, pages 105– 129.
Rickford, J. R. & King, S. 2016. “Language and linguistics on trial: Hearing Rachel Jeantel (and other vernacular
speakers) in the courtroom and beyond”. Language, 92(4):948–988.
Roger, J. A., & Pendharkar, P. C. 2003. "A field study of the impact of gender and user's technical experience
on the performance of voice-activated medical tracking application." International Journal of Human-Computer
Studies 60.5-6 (2004): 529-544.
Rosa, J. & Flores, N. 2017. “Unsettling race and language: Toward a raciolinguistic perspective”. Language in
Society, 46:621–647.
Sap, M., Card, D., Gabriel, S., Choi, Y., & Smith, N.A. 2019. The Risk of Racial Bias in Hate Speech Detection.
In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1668–1678,
Florence, Italy. Association for Computational Linguistics.
Shah, D., Schwartz, H. & Hovy, D. 2020. Predictive biases in natural language processing models: A conceptual
framework and overview. In Proceedings of the 58th Annual Meeting of the Association for Computational
Linguistics, pages 5248–5264, Online. Association for Computational Linguistics.
Sweeney, L. (2013). “Discrimination in online ad delivery”. Communications of the ACM, 56(5), 44-54.
Tatman, R. 2017. Gender and dialect bias in YouTube’s automatic captions. In Proceedings of the First ACL
Workshop on Ethics in Natural Language Processing, pages 53–59, Valencia, ES. Association for
Computational Linguistics.
Terry, M., Hendrick, R., Evangelou, E. & Smith, R. 2010. “Variable dialect switching among African American
children: Inferences about working memory.” Lingua, 120(10):2463–2475.
Tsujii, J. 2021. “Natural Language Processing and Computational Linguistics.” Computational Linguistics 47.4:
Tversky, A. & Kahneman, D. 1974. “Judgment under uncertainty: Heuristics and biases”. Science,
185(4157):1124–1131. PubMed: 17835457
Vincent, J. 2016. Twitter taught Microsoft’s AI chatbot to be a racist asshole in less than a day. The Verge.
Weidinger, L., et al. 2021. "Ethical and social risks of harm from Language Models." arXiv e-prints (2021):
Williams, F., Whitehead, J. L., & Miller, L. M. (1971). “Ethnic Stereotyping and Judgments of Children's Speech”.
Speech Monogr. 38 (3), 166
3. New Forms of Language Policing in Digital
Technologies – Commercial Corpus-building, Bias
Treatment, Content Moderation, Content Ranking,
Digital Dictionaries
In this section, we treat activities that, in traditional linguistic terminology, are referred to as
language policies. These are texts or activities in which speakers, or, in our case,
programmers and other actors in the language industry, consciously decide which lexical
or syntactic forms to use, and thus aim to influence the language practices of others (see e.g.
Spolsky 2012 on language policies). In the context of language technologies, language
policing is often enacted not in order to have an impact on language itself but with the
intention of avoiding discriminatory speech acts, or on the basis of commercial interests. The
section starts with a subsection that discusses general trends regarding the policing of language
and of multilingualism that appear in digital settings. This is followed by a consideration of the
role of corpus formation for language technologies. The subsection thereafter explores in more
detail the role of gender biases, as a case study of how industrial agents
currently deal with language ideological positions and how some of these can be avoided.
Afterwards, the language ideological entanglements and effects of commercial content
moderation, content ranking, digital dictionaries and grammar checkers are discussed.
3.1. Language policy and multilingualism in digital contexts (Iker Erdocia)
An important trend in the ideological dimension of language and technology comprises the
multilingual developments in digital communication and, more concretely, the online
organisation and management of languages. Kelly-Holmes (2019) outlines the evolution of
multilingualism in tandem with the development of digital technology, including its
politico-economic dimension. She refers to the current phase as “hyperlingualism”, a phase
in which there is a multilingual display that allows greater visibility of non-dominant
languages and a greater level of linguistic differentiation. “Hyperlingualism” is a term coined
by Pauwels (2014) to refer to the proliferation of languages in global cities as a result of
contemporary superdiversity. Kelly-Holmes introduces the ideological dimension of
linguistic diversity in digital spaces: hyperlingualism is the product of both technological
affordances and the ideology underpinning the enablement of language use. In other
words, technological and ideological shifts have provided bounded spaces on the web for
the use of more languages, although hyperlingualism, she argues, has resulted in a
hyper-differentiation of languages.
The hyperlingual era is characterised by crowdsourcing, that is, the localisation work that
“produsers” or “prosumers” do. One example of this approach is the Facebook translation
app (Lenihan 2011), which allows community users with no language expertise to propose
and vote for translations into a particular language. In contrast to the prominent language
ideology according to which languages are tied to territorially bound groups, Kelly-Holmes
(2019: 32) sees this kind of language ideological change as “a refreshing view of language
as not being bound to issues of territory, citizenship, or competence”. This is not to say that
language ideologies are not (re)produced by communities of translation. For instance,
Lenihan (2011: 60) argues in her case study about “language ideologies and/in Facebook”
that metalanguage inevitably works at an ideological level and Facebook is “implicated in
replicating ideologies of endangerment, purism, parallel monolingualism, and of course,
Importantly, although the provision of services in some lesser-used languages is still scarce
and the digital language divide persists (Cru 2016), the presence of “outlier” languages on
some digital platforms is not only determined by how large or how economically attractive
languages are. Now, in addition to the services provided by commercial entities, individuals
collaborate to solve their language problems with technology. Kelly-Holmes argues that an
intensified but isolated “hyperlingualism” is resulting in a new era that runs in parallel with
hyperlingualism: “idiolingualism”, which is also based on crowdsourcing and gift
economies and on ideas such as the personalised, individualised, customised and free learning
of languages (e.g. Duolingo). Kelly-Holmes (2019: 35) concludes that although multilingual
provision and diversification are happening, “monolingual ideologies do still pervade much
technological development”.
In the future, machines will increase their capacity for agency (or affordance, if we consider
agency an inherently human characteristic). As Wee (2021) notes, programmed sensors
that are independent of programmers gather and distribute information. The increasing
distance between the messages that programmes such as Siri, Alexa or Echoborgs
communicate and the original aims of the programmer poses conceptual (and ontological)
challenges for the study of language and society (e.g. speakerhood, or the relationship between
human agency and AI agency; for a human-machine communication perspective, see
e.g. Guzman and Lewis 2020). The fact that it is becoming less plausible to equate the messages
communicated by a programme with the intentions of its programmer (Wee 2021) may
have implications for language policy and language ideology research (e.g. what forms of
linguistic naturalisation, uniformisation, homogenisation and standardisation are manifested
and maintained? How and for what purposes?).
Kelly-Holmes (2019) argues that language policy research should go back to the macro
and meso level to shed light on the top-down management and organisation of language
issues, including language ideologies (e.g. which speech communities are designed for
which speakers?). This approach poses some methodological challenges as language
practices take place in increasingly private, smaller and individualised spaces that are
difficult to access. One such example at the micro level is provided by De Bres and Belling
(2015), who investigate the relationship among language policies, practices and ideologies
in a multilingual Facebook group (see also the notion of networked multilingualism,
Androutsopoulos 2015). Certainly, language ideological debates in a superdiverse online
world are still important because they may have implications for the offline world of
nation-states and minority languages (e.g. Vessey 2016). Yet, Kelly-Holmes suggests that
language policy research needs to pay attention to the political economy of technology
(how it makes money, how certain kinds of information get mainstreamed, what we are not
finding or what is not finding us, etc.). One way to do so is to consider the underlying
technological procedures that feed into the formation of language technology.
References (Iker Erdocia)
Androutsopoulos, J. (2015). Networked multilingualism: Some language practices on Facebook and
their implications. International Journal of Bilingualism, 19(2), 185–205.
Cru, J. (Ed.) (2016). Linguapax Review 2016—Digital Media and Language Revitalisation.
Linguapax International.
De Bres, J., & Belling, L. (2015). Free your stuff Luxembourg! Language policies, practices and
ideologies in a multilingual Facebook group. Language Policy, 14(4), 357–375.
Guzman, A. L., & Lewis, S. C. (2020). Artificial intelligence and communication: A Human–Machine
Communication research agenda. New Media & Society, 22(1), 70-86.
Kelly-Holmes, H. (2019). Multilingualism and technology: A review of developments in digital
communication from monolingualism to idiolingualism. Annual Review of Applied Linguistics, 39,
24–39.
Lenihan, A. (2011). “Join our community of translators”: Language ideologies and Facebook. Digital
discourse: Language in the new media, 48-64.
Pauwels, A. (2014). Rethinking the learning of languages in the context of globalization and
hyperlingualism. In D. Abendroth-Timmer & E. Henning (Eds.), Plurilingualism and multiliteracies:
International research on identity construction in language education (pp. 41–56). Frankfurt,
Germany: Peter Lang.
Vessey, R. (2016). Language ideologies in social media: The case of Pastagate. Journal of Language
and Politics, 15(1), 1-24.
Wee, L. (2021). Rethinking agency in language and society. International Journal of the Sociology of
Language, 2021(267-268), 271-275.
3.2. Corpus-building and bias treatment in the industry as an ideological act (Bettina
Migge, Britta Schneider)
All language technologies draw their language input from corpora, which are thus the basis for
the potential breach of ethical standards and the reproduction of social biases. Unlike
approaches in corpus linguistics, which stipulate that corpora have to be not only sufficiently
large but also representative of the language (practices) being studied, commercial
companies generally rely on available pre-existing data sources. They obtain digitally
available data using web-crawlers (that is, programmes that automatically collect texts from
the web) from open access databases, e.g. Wikipedia, Google Patents, texts from EU
institutions, anonymized police reports, news media, TED talks and religious organizations,
but also from more informal, less controlled texts like blogs or Reddit (for discussion, see
e.g. Dodge et al. 2021). Another
option is to buy data drawn from user interactions from large commercial social media
companies such as Facebook, Snapchat etc. These data, to date almost exclusively written
data, quite logically contain social biases (see also previous section), which companies try
to avoid in order to satisfy the demands of customers. Modified data sets or algorithmic
procedures are thus used to 'debias' the functioning of e.g. machine translation technologies
or digital voice assistants. The appearance of social biases is referred to as "harm" in this
cultural context (see sections 2.2 and 3.3 and e.g. Sun et al. 2019 for an overview).
In debiasing procedures, essentially, an attempt is made to unlink stereotypical
associations between social groups and language use (e.g. Christians are backwards).
This can be done, for example, by first identifying such stereotypical associations in the text
through text exploitation tools and then adding new sentences that associate the same
characteristics with other social groups (e.g. Hindus are backwards). While this
replacement is widely used, especially for English, it does not work as easily for languages
that have special morphology for grammatical gender marking. Other practices focus on
replacing potentially harmful words with more neutral ones. However, all these practices, as
in the case of CCM, rely on fixed notions of language, focus on the word or sentence level
only and are not able to capture interactional practices. They also create a kind of artificial
representation of the world through digital written texts that is far from people’s lived reality.
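The augmentation step described above can be illustrated with a minimal Python sketch. It is
not taken from any industry pipeline: the term list GROUP_TERMS and the function
augment_with_group_swaps are hypothetical names introduced here, and the sketch skips the
detection of stereotypical sentences by simply treating every sentence that mentions a listed
group as a candidate.

```python
import re

# Hypothetical list of group terms treated as interchangeable for augmentation.
GROUP_TERMS = ["Christians", "Hindus", "Muslims"]


def augment_with_group_swaps(sentences, group_terms=GROUP_TERMS):
    """For each sentence that mentions a group term, add copies that mention
    every other group term instead, so that the attributed characteristic is
    no longer tied to a single group in the data."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, group_terms)) + r")\b")
    augmented = list(sentences)  # keep all original sentences
    for sentence in sentences:
        match = pattern.search(sentence)
        if match is None:
            continue  # no group term: nothing to swap
        for other in group_terms:
            if other != match.group(1):
                # Replace every listed group term in the sentence with `other`.
                augmented.append(pattern.sub(other, sentence))
    return augmented


corpus = ["Christians are backwards.", "The weather is nice."]
result = augment_with_group_swaps(corpus)
for line in result:
    print(line)
```

The sketch also makes the limitation noted above tangible: it operates on isolated words
within single sentences and has no access to interactional context.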
Another issue is that the kind of language found in these data sources is not
very representative, as different social groups engage to different degrees and in different
ways with social media. Language technologies are thus potentially normalising the
language practices of certain social groups (those who are their main commercial target)
and certain social genres over others, specifically traditional ones (even though it needs to
be asked whether this isn't also true for most other language corpora). Changes are
often most noticeable in the realm of formal language.
In the context of discussing sociolinguistic inequality, it is also relevant to mention
that most technologies are first researched and established for one language, most
typically English. Oftentimes, language technology applications are simply translated from
the source language (that is, mostly English) into other languages, potentially leading to
unidiomatic, innovative language practices in the target languages, and thus inducing
language change, maybe even homogenisation of language structure in the long run (see
Virtanen et al. 2019, quoted in Bommasani et al. 2021: 25).
Finally, (Machine) Translation also relies heavily on parallel corpora, that is, aligned texts.
In many cases, the text was generated/written in one language, the source language, and is
then translated into other languages. These parallel corpora contain all texts since the
inception of the company or the translation platform (e.g. Google Translate or DeepL).
Professional translators can then make use of parts of them for their translation work and
create data-bases with all texts translated so far by the company, the so-called translation
memories. Depending on the platforms or companies, the texts are often heavily biased
towards certain types of topics and client types. However, in many forms of professional
translation, there are layers of post-editing that ensure the viability of the resulting texts, at
least in the case of texts of public importance.
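As a rough illustration of what a translation memory is as a data structure, the following
Python sketch stores aligned segment pairs and retrieves the stored translation of the most
similar earlier segment. It is a simplified assumption, not a description of any commercial
platform: the example segments, the function best_match and the similarity threshold are
invented here, and the similarity measure is the generic string matcher from Python's standard
library.

```python
from difflib import SequenceMatcher

# Hypothetical translation memory: previously translated segment pairs
# (English source, German target).
translation_memory = [
    ("The contract ends in May.", "Der Vertrag endet im Mai."),
    ("Please sign the form.", "Bitte unterschreiben Sie das Formular."),
]


def best_match(segment, memory, threshold=0.7):
    """Return (source, target, score) for the stored source segment most
    similar to `segment`, or None if no stored segment reaches `threshold`."""
    scored = [
        (source, target, SequenceMatcher(None, segment, source).ratio())
        for source, target in memory
    ]
    best = max(scored, key=lambda item: item[2])
    return best if best[2] >= threshold else None


suggestion = best_match("The contract ends in June.", translation_memory)
if suggestion is not None:
    source, target, score = suggestion
    print(f"Closest stored segment: {source!r} -> {target!r} (similarity {score:.2f})")
```

Because suggestions can only come from what has already been translated, a memory built
mainly from certain topics and client types will keep proposing the wording of those texts.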
Ensuring validity via human input is not typically done for data corpora. In the case of
languages with less digitised written data available, this has been shown to cause serious
problems, as the data sets are often highly problematic and even contain data that is not
actually human language (see Kreutzer et al. 2021). What is interesting in all this is that, as
in previous centuries, language norming is largely biased towards some groups. While in
earlier centuries the intellectual elite defined what constitutes normative practices, this is
now largely in the hands of large commercial entities and defined by the language practices
of populations who write online. In earlier times elites were the main agents in the process
and they focused on their own practices. Now companies are statistically exploiting
language use data of users who are barely aware of the uses to which their everyday
practices are put, or that these purposes are commercial. This also suggests a change in
conceptions of language norms. Commercial companies are tacitly promoting a descriptive
approach to language: users' online practices are understood as the target of norming
rather than the practices of self-declared elites. Overall, a consideration of corpus-building
in the language industry sector paves a fruitful way into understanding how language
ideologies (what we believe language to be and what we conceive of as its ideal form) are
entangled with technological practices (which links to the discussion of human vs. machine
autonomy and agency, as human decisions are shown to depend on material/technological
environments). It also makes visible that language ideological/language policing decisions
are based on how users and programmers conceive of social difference and what they
understand social biases to be. The following section discusses the consideration of biases
and their treatment in more detail, focusing on the case of gender biases.
Bommasani, Rishi, Percy Liang, et al. "On the Opportunities and Risks of Foundation Models." Center for
Research on Foundation Models (CRFM), Stanford University (2021).
Kreutzer, Julia, Isaac Caswell, Lisa Wang, et al. "Quality at a Glance: An Audit of Web-Crawled Multilingual
Datasets." Transactions of the Association for Computational Linguistics (2021).
Sun, Tony, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai El Sherief, Jieyu Zhao, Diba Mirza, et al. "Mitigating
Gender Bias in Natural Language Processing: Literature Review." Proceedings of the 57th Annual
Meeting of the Association for Computational Linguistics (2019).
Virtanen, Antti, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and
Sampo Pyysalo. "Multilingual Is Not Enough: Bert for Finnish." arXiv preprint (2019): arXiv:1912.07076.
3.3. Excursus on language ideologies and corpus-based tools 1: Gender bias in
language technologies (Beatrice Savoldi)
During the 1978-79 UK electoral campaign, Tim Bell, the advertising executive of the
Conservative Party’s election campaign, realized that the main obstacle Margaret Thatcher
had to overcome to win the national election was her voice. When speaking on the radio or
on TV, she sounded “strangulated” and “patronizing”, completely unfit to gain the trust and
vote of the lower-middle class. To solve the issue, Bell arranged a meeting with the
Shakespearean actor Laurence Olivier, who taught her how to modulate her voice by
lowering the pitch (Karpf 2006). Training a performative voice was the identified
countermeasure for what was actually an engineering problem. As Tallon (2019) reports,
technology that transmits human voices such as microphones has been historically
optimized for lower (typically male) voices, thus negatively impacting female speech and its
radio intelligibility. Rather than identifying and tackling the design issue, however, the result
was that women on air were perceived as “affected,” “stiff,” “forced,” and “unnatural”.
Beyond audio quality, the long-term effect of such a technological limitation has been on
women’s public credibility and on the perceived “unsuitability” of their speech.
Although widely regarded as objective, neutral tools, contemporary language technologies
have increasingly been found to perpetuate analogous disparities. For automatic speech
recognition (ASR), Tatman (2017) observed that YouTube’s automatic captions consistently
performed better in capturing male than female speech. Powered by a vast
amount of language data, language technologies can replicate very human-like biases on
the basis of race, class, religion and, of course, gender: the very same biases technology
was expected to defeat. Among those, gender bias has received great attention within the
Natural Language Processing (NLP) community. As Stańczak and Augenstein (2021)
estimate, a total of 224 papers have been published on the topic since 2004, with a clear
yearly increase since 2015. Generally understood as the systematic preference of one gender
group over the other (Sun et al. 2019; Costa-jussà 2019), gender bias has been found, across
publications and tasks, in many forms and at different levels of the development pipeline of
language technology.
Language data represent the backbone of today's NLP. Trained on increasingly larger
resources, current neural approaches to language technologies are meant to infer and
generalize from the statistical associations and patterns found in such resources. However,
the language data fed to such models hardly offer a diverse representation of demographic
groups and their communicative repertoires. Consider some popular spoken and written
collections of parallel texts in various languages, such as the MuST-C corpus (Cattoni et al.
2021), retrieved from TED talks, and Europarl (Koehn 2005), based on the
proceedings of the European Parliament. Both feature an imbalanced representation of
gender (Vanmassenhove 2018; Gaido et al. 2020): as a result of the historical glass ceiling
that has hampered women’s access to political and scientific positions, the speakers they
contain are predominantly men. On top of more straightforward quantitative asymmetries,
language data crawled from the Web exhibit more subtle imbalances reflecting the
communication practices and perspectives of the users that generated them, which are
however hard to audit due to their huge size (Luccioni & Viviano 2021; Dodge et al. 2021).
Sun & Peng (2021) provide a glimpse into one of the most highly regarded sources of
high-quality textual data: Wikipedia (Gururangan et al. 2022; Costa-jussà et al. 2022).
According to their analysis, Wikipedia biographies of women, unlike those of men, tend to
intermingle personal life events with professional events (Sun & Peng 2021), thus
entrenching stereotypes and social expectations, existing in many cultures, that link
femininity with domestic activities (Jolly et al. 2014).
Fed on these data, language technologies have thus grown to replicate, or even amplify,
such quantitative and qualitative imbalances. Gender bias has been found in language
models (Nadeem et al. 2021) and in underlying NLP components like word embeddings,
namely machine-readable vector representations of words and their semantic meaning.²
Used as input representation for several tasks, word embeddings have been shown to encode
strong, controversial associations between feminine/masculine proper nouns and groups of
words stereotypically assigned to females (e.g. arts) and males (e.g. science) (Caliskan et
al. 2017). By means of the notorious analogy test, Bolukbasi et al. (2016) reported that models
trained on biased word embeddings would automatically resolve “man is to computer
programmer as woman is to…” with “homemaker”, or generate sexist, denigrating content
(Weidinger et al. 2021; Sheng et al. 2020). In fact, when tools rely on biased resources and
components, the cascading effects of gender gaps spill over to the output of NLP tools and
applications. A case in point is coreference resolution systems³ failing to identify
feminine textual mentions (Rudinger et al. 2018). Also, as attested for the above-mentioned
ASR (automatic speech recognition) technology, part-of-speech tagging and parsing can
exhibit reduced performance on women’s writings (Garimella et al. 2019). In light of this,
besides quality of service issues, there is room for considerations on the effect that such
biases will have on language ideologies and on the perception of women’s language.
Language tools that systematically find certain communication practices
challenging and hard to process have the potential to spread negative connotations,
indirectly shaping what and whose language qualifies as “comprehensible”, “appropriate”,
or “correct”.
² Essentially, every word is represented as a series of numbers (vectors), which are intended to
capture words’ ‘meaning’, while also accounting for their position in a defined space and in relation
to other words (Mikolov et al., 2013). This approach owes much to distributional semantics.
Distributional semantics rests on the assumption that the meaning of a word can, at least to a certain
extent, be inferred from its usage (i.e. its distribution in text), and that words occurring in the same
contexts tend to have similar meanings (Harris, 1954; Firth, 1957). Accordingly, “Paris” and
“Rome”, or “dog” and “cat”, should be pairs of words with closely related vectors, being,
respectively, both capitals and both animals.
³ Coreference resolution is the task of finding all expressions that refer to the same entity in a text,
as in linking “she” and “doctor” in the sentence “The doctor left because she was ready”.
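To make the notion of vector representations and the analogy test more concrete, the
following Python sketch uses hand-made three-dimensional toy vectors. The numbers are
invented for illustration and deliberately encode a stereotype; they are not taken from any
trained model, and the helper names cosine and analogy are introduced here. Real embeddings
have hundreds of dimensions and are learned from corpora.

```python
import math

# Toy vectors invented for illustration. The first dimension plays the role
# of a "gender" axis, which is how a stereotype can end up in the geometry.
vectors = {
    "man": [1.0, 0.0, 0.2],
    "woman": [-1.0, 0.0, 0.2],
    "programmer": [0.8, 1.0, 0.1],
    "homemaker": [-0.8, 1.0, 0.1],
    "nurse": [-0.6, 0.2, 0.9],
}


def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word d that best completes 'a is to b as c is to d',
    using vector arithmetic: d is the word closest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


answer = analogy("man", "programmer", "woman", vectors)
print(f"man is to programmer as woman is to {answer}")
```

With these toy values the arithmetic lands closest to "homemaker", which mirrors, in
miniature, how an association present in the geometry of the vectors resurfaces in the output.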
To address gender bias in NLP, a line of research has been devoted to the development of
so-called mitigating approaches, i.e., technical interventions intended to alleviate the
disproportionate impact of language technologies across gender groups. Starting from
models’ internal representations, a body of studies has focused on “debiasing” word
embeddings from stereotypes. While differing in their design,⁴ the methods proposed by
Bolukbasi et al. (2016) and Zhao et al. (2018) are both intended to remove spurious gender
associations from the representation of sensitive, but formally neutral, English words (e.g.,
developer, engineer, secretary, nanny). Ideally, the semantics of all these nouns ought to be
equally close to both feminine (e.g., she) and masculine (e.g., he) definitional pronouns.
While intended as an initial step to alleviate gender bias in the final output of language
technologies, such word-level debiasing approaches tackle the problem only superficially,
by neglecting context (Gonen & Goldberg 2019), and without guarantee of improvement in
the final output (Goldfarb-Tarrant et al. 2021). Also, these methods were designed for the
English language, which is often treated as a default in NLP (Bender 2009). However,
generically applying such debiasing methods to languages with higher morphological
complexity and grammatical gender risks deteriorating the quality of the semantic
representations of those words that only formally belong to a gender class (e.g. sedia, 'chair',
has feminine morphology in Italian) (Zhou et al. 2019; Zhao et al. 2020; Gonen et al. 2019).
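The idea that a formally neutral word should end up equally close to "she" and "he" can be
sketched geometrically. The following Python example is a simplified illustration in the spirit
of the projection-based "neutralize" step described by Bolukbasi et al. (2016), not a
reimplementation of either published method: the three-dimensional vectors are invented, the
gender direction is estimated from a single pronoun pair, and the function names are
introduced here.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def neutralise(word_vec, gender_direction):
    """Remove from `word_vec` its component along `gender_direction`
    (vector projection), leaving the remaining dimensions untouched."""
    norm = math.sqrt(dot(gender_direction, gender_direction))
    unit = [g / norm for g in gender_direction]
    projection = dot(word_vec, unit)
    return [w - projection * g for w, g in zip(word_vec, unit)]


# Toy vectors invented for illustration only.
he = [1.0, 0.0, 0.2]
she = [-1.0, 0.0, 0.2]
developer = [0.8, 1.0, 0.1]  # deliberately skewed towards "he"

# Assumption: the gender direction is approximated by one definitional pair.
gender_direction = [a - b for a, b in zip(he, she)]
debiased = neutralise(developer, gender_direction)

print("before:", round(cosine(developer, he), 3), round(cosine(developer, she), 3))
print("after: ", round(cosine(debiased, he), 3), round(cosine(debiased, she), 3))
```

After the projection is removed, the toy vector for "developer" is equally similar to both
pronouns. The operation changes one isolated word vector, which is precisely why such
approaches are described above as neglecting context.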
While NLP techniques for mitigating bias tend to strive toward generalization and the potential
to scale with minimal adjustments, purportedly language-agnostic approaches may actually
not be directly portable across languages and scenarios. This is also the case for
Counterfactual Data Augmentation (Lu et al. 2020). At a high level, it consists in the
automatic creation of an augmented, gender-balanced dataset. Toward this goal, existing
sentences containing a masculine term are “gender swapped” into their feminine
equivalent, and vice versa (e.g. <he> works as a football player → <she> works as a
football player). In this way, minimal pairs of sentences granting a symmetrical
representation of gendered mentions are obtained and used at training time to discourage a
model from inferring unwanted gender cues. While the approach has proved quite effective on
English monolingual tasks such as sentiment analysis (Kiritchenko and Mohammad 2018), its
extension to highly inflected grammatical gender languages is not trivial. Zmigrod et al. (2019)
adjusted Counterfactual Data Augmentation for such a scenario in the context of animate
references (e.g. es: el ingeniero (M) → la ingeniera (F)). However, it works for the reinflection of
sentences containing only one noun, but cannot be applied to more complex cases involving
multiple references (e.g. es ingeniero y escritor (M)).
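The gender-swapping step can be sketched for English in a few lines of Python. This is a
minimal illustration under stated assumptions rather than the procedure of Lu et al. (2020):
the swap dictionary SWAPS is a tiny hypothetical list, and ambiguous forms such as "her"
(which may correspond to "his" or "him") are left out on purpose because a plain dictionary
cannot resolve them.

```python
import re

# Hypothetical, deliberately small swap list. A real intervention would need
# a curated dictionary and a way to handle ambiguous forms such as "her".
SWAPS = {"he": "she", "she": "he", "himself": "herself", "herself": "himself"}
PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)


def gender_swap(sentence):
    """Replace each listed gendered word with its counterpart, keeping capitalisation."""
    def replace(match):
        word = match.group(0)
        swapped = SWAPS[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    return PATTERN.sub(replace, sentence)


def augment(corpus):
    """Return the corpus with a gender-swapped twin added after every
    sentence that contains a listed gendered word."""
    augmented = []
    for sentence in corpus:
        augmented.append(sentence)
        swapped = gender_swap(sentence)
        if swapped != sentence:
            augmented.append(swapped)
    return augmented


corpus = ["He works as a football player.", "The match starts at noon."]
balanced = augment(corpus)
for line in balanced:
    print(line)
```

The same word-for-word substitution has nothing to offer for a sentence like "el ingeniero",
where article and noun both carry gender and would have to be reinflected together, which is
the difficulty the paragraph above describes.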
Indeed, given their intrinsically cross-lingual nature, for translation technologies it is
fundamental to foreground that languages differ grammatically (Savoldi et al. 2021). As a
matter of fact, as different languages encode and express gender differently (Stahlberg et al.
2007), it is precisely from such a misalignment that gender bias emerges in Machine
Translation (MT). When translating from a genderless language like Hungarian into English,
which is pronominally gendered, Google Translate users themselves spotted how the system
picked a gendered pronoun that would reflect societal gender expectations, e.g. he is a doctor,
she is a nurse (Olson 2018). Similarly, translation into grammatical gender languages like
German, Arabic, or Spanish is skewed towards a masculine default, except for
stereotypical associations (Schiebinger 2014; Stanovsky et al. 2019). Interestingly, with an
eye on cultural and language specificity, Ciora et al. (2021) also investigate covert bias for
MT into the genderless Turkish language. In that case, the result was an over-specification
of femininity, without any textual or pragmatic need to do so (e.g. kız kardeşim - female sibling
vs kardeşim - sibling). Accordingly, masculine forms are treated as the default, unmarked
form, a language practice that has long been put under the lens of socio- and psycholinguistic
studies (Hamilton 1988; 1991; Mills & Mullany 2011; Gygax et al. 2008) for reducing feminine
visibility and triggering a male bias.
⁴ Bolukbasi et al. (2016) remove gender associations from word embeddings as a post-processing
step, once the training of the embeddings is completed. Zhao et al. (2018) instead isolate gender
information at training time, i.e. while the word embeddings are being generated.
To mitigate bias in MT for grammatical gender languages, Saunders and Byrne (2020)
create a synthetic handcrafted parallel dataset of professions based on the idea of
Counterfactual Data Augmentation, which they use to tune their model. In this way, their
system improves on sentences like “The cleaner hates the developer because she always
leaves the room dirty” by relying on the feminine pronoun as a gender cue, rather than
translating developer in the (stereotypical) masculine form. Their approach is viable
because it focuses on the creation of swapped sentences based on a (restricted)
pre-defined list of occupational words. However, it hardly covers the variety of words and
parts-of-speech that are inflected for gender and subject to gender bias in translation
(Bentivogli et al. 2020; Savoldi et al. 2022; Habash et al. 2019; Alhafni et al. 2020).
Focussing on naturally occurring and more complex language constructions, Alhafni et al.
(2022) suggest a user-centric approach for Arabic MT, which required the extensive
creation of dedicated, manually curated and language-specific resources and components.
Their approach is applied to the output of automatic translation systems: users themselves can
decide and control whether the inflection picked for ambiguous 1st person (e.g. I am a student)
and 2nd person references (e.g. You are a student) is acceptable; if not, a dedicated tool is in
charge of reinflecting the word in the opposite gender. Along the same lines, Google also
recently adapted its service. Historically, Google Translate has provided one translation
suggestion for single-word gender-ambiguous queries. As Savoldi et al. (2021) underscore,
“in a community increasingly conscious of the power of language to hardwire stereotypical
beliefs and women’s invisibility (Lindqvist et al. 2019; Beukeboom and Burgers 2019), the
service’s decision (Kuczmarski 2018) to provide a double feminine/masculine output
(professor→it: professoressa|professore) stems from current demands for [more] inclusive
resolutions”. Given the increasing reliance of millions of users on Google Translate as a
sort of dictionary, such a decision could also have an effect on the gender diversity of the
lexicon acquired in foreign languages. So far, however, such a double output is only
available for a few languages and for single-word (or very short) queries. The questions for
longer, complex constructions and properly ambiguous queries remain: what should be
regarded as a “correct”, “desirable” translation? Should the systems intentionally break
stereotypical associations via language, for instance by forcing developers to be
translated as feminine and nurses as masculine? Should masculine generics (i.e., the use
of masculine forms to refer to unknown referents or mixed groups) be categorically avoided?
Such questions and their potential automatization highly depend on the evolution of
language use and users. Also, they proceed in parallel with discussions from the social
sciences and linguistic studies on the potential of language to prompt positive social
changes and alter perceptions towards gender groups.
Besides a few notable exceptions (Cao and Daume III 2020; Saunders et al. 2020; Sun et
al. 2021; Vanmassenhove et al. 2021; Attanasio et al. 2021), gender bias in language
technologies has been approached through a masculine/feminine binary lens. However,
toward the recognition of non-binary identities and the avoidance of masculine generics,
several language innovations have been put forward through grassroots efforts, e.g.,
neo-morphemes and neo-pronouns (Hord 2016; Bradley et al. 2019; Knisely 2020; Conrod
2020). Also, at the institutional level, gender-inclusive guidelines⁵ (Papadimoulis 2018)
encourage avoiding gendered language when unnecessary via indirect strategies (e.g.
humankind instead of mankind, server for waiter/waitress). Such guidelines are now being
built into some English-language products. Along with grammar and spell checkers, Google and
Microsoft have incorporated assistive writing technology for inclusive language, which
among other things offers suggestions to replace language that may perpetuate biases
around gender (Langston 2020; Knowles 2021). Also in line with the new APA style
guidelines (2020), Langston (2020) reports that: “For now, Editor in Word and Outlook will
not flag the use of ‘they’ in a singular context as incorrect such as they are taking their
final exams tomorrow rather than she is taking her final exam tomorrow”. Offered as an
optional tool to support writing and to highlight potential blind spots, such suggestions can
vary radically in their degree of acceptability across cultures, locations, or even
single speakers. According to Sawer (2022), in fact, the tool was perceived by several
users as “speech-policing” and “creepy”. In other words, some perceive it as enforcing a
language ideology shaping the norms of communication and verbal hygiene (Cameron
2012). Most likely, such a strong pushback is also due to the still shaky effectiveness of
such assistive technology: apparently unaware of context and nuances, the tool flagged words
such as “housewife” and even a computer “motherboard” for not being inclusive.
Overall, language technologies can have a strong influence on the linguistic choices made
by the individuals that encounter them in their daily writing activities. From where we stand
now, the possible effects seem twofold. On the one hand, inclusive-language technology
may come with the promise of incentivizing inclusivity-aware communication practices. On
the other, its perceived “enforcement” can further fuel polarized positions, and actually
prevent the adoption of such indirect, gender-neutral strategies. All this demonstrates that
technological solutions cannot be designed without considering their linguistic, cultural and
social embeddedness. The next section delves into such considerations by discussing
corpus-based AI language technologies in their socio-cultural embeddedness.
REFERENCES (Beatrice Savoldi)
Alhafni, B., Habash, N. & Bouamor, H. 2020. Gender-aware reinflection using linguistically enhanced neural models. In
Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 139–150, Online. Association
for Computational Linguistics.
Alhafni, B., Habash, N. & Bouamor, H. 2022. User-Centric Gender Rewriting. To appear in Proceedings of the 2019
Conference on Empirical Methods in Natural Language Processing.
American Psychological Association. 2020. Publication manual of the American Psychological Association (7th ed.).
Attanasio, G., Greco, S., La Quatra, M., Cagliero, L., Tonti, M., Cerquitelli, T., & Raus, R. 2021. E-MIMIC: Empowering
Multilingual Inclusive Communication. In 2021 IEEE International Conference on Big Data (Big Data) (pp. 4227-4234). IEEE.
Bender, E. M. 2009. Linguistically Naïve != Language Independent: Why \\NLP Needs Linguistic Typology. In Proceedings of
the EACL 2009 Workshop on the Interaction between Linguistics and Computational Linguistics: Virtuous, Vicious or
Vacuous?, pages 26–32, Athens, Greece. Association for Computational Linguistics.
⁵ See also the United Nations gender-inclusive guidelines.
Bentivogli, L., Savoldi, B., Negri, M., Di Gangi, M.A., Cattoni R. & Turchi M. 2020. Gender in danger? Evaluating speech
translation technology on the MuST-SHE Corpus. In Proceedings of the 58th Annual Meeting of the Association for
Computational Linguistics, pages 6923–6933, Online. Association for Computational Linguistics.
Beukeboom, C. & Burgers, C. 2019. “How stereotypes are shared through language: A Review and Introduction of the Social
Categories and Stereotypes Communication (SCSC) framework.” Review of Communication Research, 7:1–37.
Bolukbasi, T., Chang, K., Zou, J.Y., Saligrama, V. & Kalai. A. T. 2016. Man is to computer programmer as woman is to
homemaker? Debiasing word embeddings. In Proceedings of the 30th Conference on Neural Information Processing Systems
(NIPS 2016), volume 29, pages 4349–4357, Barcelona, ES. Curran Associates, Inc.
Bradley, E., Salkind, J., Moore, A. & Teitsort, S. 2019. Singular ‘they’ and novel pronouns: gender-neutral, nonbinary, or both?
Proceedings of the Linguistic Society of America, 4(1):36–1. https://doi.org/10.3765/plsa.v4i1.4542
Caliskan, A., Bryson, J.J. & Narayanan, A. 2017. “Semantics derived automatically from language corpora contain humanlike
biases.” Science, 356(6334):183–186. PubMed: 28408601
Cameron, D. 2012. Verbal hygiene. Routledge.
Cao, Y. T. & Daume III, H. 2020. Toward gender-inclusive coreference resolution. In Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, pages 4568–4595, Online. Association for Computational Linguistics.
Cattoni, R., Di Gangi, M. A., Bentivogli, L., Negri, M. & Turchi, M. 2021. “MuST-C: A multilingual corpus for end-to-end speech
translation.” Computer Speech & Language, 66:101155.
Ciora, C., Iren, N., & Alikhani, M. 2021. Examining Covert Gender Bias: A Case Study in Turkish and English Machine
Translation Models. In Proceedings of the 14th International Conference on Natural Language Generation, pages 55–63.
Conrod, K. 2020. “Pronouns and gender in language.” The Oxford Handbook of Language and Sexuality. /oxfordhb/9780190212926.013.63
Costa-jussà, M. R. 2019. “An analysis of gender bias studies in natural language processing.” Nature Machine Intelligence,
1:495–496. -019-0105-5
Costa-jussà, M. R., et al. 2022. No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv preprint
Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M. & Gardner, M. 2021. Documenting
Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. In Proceedings of the 2021 Conference on
Empirical Methods in Natural Language Processing, pages 1286–1305, Online and Punta Cana, Dominican Republic.
Association for Computational Linguistics.
Gaido, M., Savoldi, B., Bentivogli, L., Negri, M. & Turchi, M. 2020. Breeding gender-aware direct speech translation systems.
In Proceedings of the 28th International Conference on Computational Linguistics, pages 3951–3964, Online. International
Committee on Computational Linguistics.
Garimella, A., Banea, C., Hovy, D. & Mihalcea, R. 2019. Women’s syntactic resilience and men’s grammatical luck: Gender
bias in part-of-speech tagging and dependency parsing. In Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics, pages 3493–3498, Florence, IT. Association for Computational Linguistics.
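Goldfarb-Tarrant, S., Marchant, R., Munoz Sanchez, R., Pandya, R. & Lopez, A. 2021. Intrinsic bias metrics do not correlate
with application bias. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics
and the 11th International Joint Conference on Natural Language Processing, pages 1926–1940, August 1–6, 2021.
Association for Computational Linguistics.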
Gonen, H. & Goldberg, Y. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings
but do not remove them. In Proceedings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 609–614, Minneapolis,
Minnesota, USA. Association for Computational Linguistics.
Gonen, H., Kementchedjhieva, Y. & Goldberg, Y. 2019. How does grammatical gender affect noun representations in
gender-marking languages? In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL).
2019. p. 463-471.
Gururangan, S., Card, D., Drier, S. K., Gade, E. K., Wang, L. Z., Wang, Z. & Smith, N. A. 2022. Whose Language Counts as
High Quality? Measuring Language Ideologies in Text Data Selection. arXiv preprint arXiv:2201.10474.
Gygax, P., Gabriel, U., Sarrasin, O., Oakhill, J. & Garnham, A. 2008. “Generically intended, but specifically interpreted: When
beauticians, musicians and mechanics are all men.” Language and Cognitive Processes, 23:464–485.
Habash, N., Bouamor, H. & Chung, C. 2019. Automatic gender identification and reinflection in Arabic. In Proceedings of the
First Workshop on Gender Bias in Natural Language Processing, pages 155–165, Florence, IT. Association for Computational
Linguistics.
Hamilton, M C. 1988. “Using masculine generics: Does generic he increase male bias in the user’s imagery?” Sex roles,
Hamilton, M C. 1991. “Masculine bias in the attribution of personhood: People = male, male = people.” Psychology of Women
Quarterly, 15(3):393–402.
Hord, L. 2016. “Bucking the linguistic binary: Gender neutral language in English, Swedish, French, and German.” Western
Papers in Linguistics / Cahiers linguistiques de Western, 3(1):4.
Karpf, A. 2006. The Human Voice: How This Extraordinary Instrument Reveals Essential Clues About Who We Are. London:
Kiritchenko, S. & Mohammad, S. 2018. Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. In
Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics. Association for Computational
Linguistics, New Orleans, Louisiana, 43–53.
Knisely, K. A. 2020. “Le Français nonbinaire: Linguistic forms used by non-binary speakers of French.” Foreign Language
Annals, 53(4):850–876.
Knowles, T. 2021. Google to offer users help with gender-neutral language. The Times.
Koehn, P. 2005. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the tenth Machine Translation
Summit, pages 79–86, Phuket, TH. AAMT.
Kuczmarski, J. 2018. Reducing gender bias in Google Translate.
Langston, J. 2020. New AI tools help writers be more clear, concise and inclusive in Office and across the Web. Microsoft
Lindqvist, A., Renstrom, E. A. & Gustafsson Senden, M. 2019. “Reducing a male bias in language? Establishing the efficiency
of three different gender-fair language strategies”. Sex Roles, 81(1–2):109–117.
Lu, K., Mardziel, P., Wu, F., Amancharla, P., Datta, A. 2020. Gender Bias in Neural Natural Language Processing. In Logic,
Language, and Security. Lecture Notes in Computer Science, vol 12300. Springer, Cham.
Luccioni, A. & Viviano, J. 2021. What’s in the Box? An Analysis of Undesirable Content in the Common Crawl Corpus. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume 2: Short Papers), pages 182–189, Online. Association for
Computational Linguistics.
Mills, S., & Mullany, L. 2011. Language, Gender and Feminism: Theory, methodology and practice (1st ed.). Routledge.
Nadeem, M., Bethke, A. & Reddy, A. 2021. StereoSet: Measuring stereotypical bias in pretrained language models. In
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint
Conference on Natural Language Processing (Volume 1: Long Papers). Association for Computational Linguistics, Online,
Olson, P. 2018. The algorithm that helped Google Translate become sexist. Forbes.
Papadimoulis, D. 2018. Gender-Neutral Language in the European Parliament. European Parliament 2018.
Rudinger, R., Naradowsky, J., Leonard, B. & Van Durme, B. 2018. Gender bias in coreference resolution. In Proceedings of
the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Volume 2 (Short Papers), pages 8–14, New Orleans, Louisiana. Association for Computational Linguistics.
S. Jolly, K. Griffith, R. Decastro, A. Stewart, P. Ubel & R. Jagsi. 2014. “Gender differences in time spent on parenting and
domestic responsibilities by high achieving young physician-researchers.” Annals of Internal Medicine, 160:344–353
Saunders D. & Byrne, B. 2020. Reducing gender bias in neural machine translation as a domain adaptation problem. In
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7724–7736, Online.
Association for Computational Linguistics.
Saunders, D., Sallis, R. & Byrne, B. 2020. Neural machine translation doesn’t translate gender coreference right unless you
make it. In Proceedings of the Second Workshop on Gender Bias in Natural Language Processing, pages 35–43, Online.
Association for Computational Linguistics.
Savoldi, B., Gaido, M., Bentivogli, L., Negri, M. & Turchi, M. 2021. “Gender bias in machine translation.” Transactions of the
Association for Computational Linguistics, 9:845–874.
Savoldi, B., Gaido, M., Bentivogli, L., Negri, M., & Turchi, M. 2022. Under the Morphosyntactic Lens: A Multifaceted
Evaluation of Gender Bias in Speech Translation. In Proceedings of the 60th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers) (pp. 1807-1824).
Sawer, P. 2022. Big Brother (sorry, Big Person) is correcting you on Google. The Telegraph.
Schiebinger, L. 2014. “Scientific research must take gender into account.” Nature, 507(9).
PubMed: 24598604
Sheng, E., Chang, K. W., Natarajan, P. & Peng, N. 2020. Towards Controllable Biases in Language Generation. In Findings of
the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, 3239–3254.
Stahlberg, D., Braun, F., Irmen, L. & Sczesny, S. 2007. Representation of the sexes in language. Social Communication,
pages 163–187.
Stanczak, K., & Augenstein, I. 2021. A Survey on Gender Bias in Natural Language Processing. arXiv preprint
Stanovsky, G. Smith, N.A., & Zettlemoyer, L. 2019. Evaluating gender bias in machine translation. In Proceedings of the 57th
Annual Meeting of the Association for Computational Linguistics, pages 1679–1684, Florence, IT. Association for
Computational Linguistics.
Sun, J. & Peng, N. 2021. Men Are Elected, Women Are Married: Events Gender Bias on Wikipedia. In Proceedings of the
59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural
Language Processing (Volume 2: Short Papers), pages 350–360, Online. Association for Computational Linguistics.
Sun, T., Gaut, A., Tang, S., Huang, Y., ElSherief, M., Zhao, J., Mirza, D., Belding, E., Chang, E., & Yang Wang W. 2019.
Mitigating gender bias in natural language processing: Literature Review. In Proceedings of the 57th Annual Meeting of the
Association for Computational Linguistics, pages 1630–1640, Florence, IT. Association for Computational Linguistics.
Sun, T., Webster, K., Shah, A., Wang, W.Y. & Johnson, M. 2021. They, them, theirs: Rewriting with gender-neutral English.
arXiv preprint arXiv:2102.06788.
Tallon, T. 2019. A century of “shrill”: How bias in technology has hurt women’s voices. The New Yorker.
Tatman, R. 2017. Gender and dialect bias in Youtube’s automatic captions. In Proceedings of the First ACL Workshop on
Ethics in Natural Language Processing, pages 53–59, Valencia, ES. Association for Computational Linguistics.
Vanmassenhove, E., Emmery, C. & Shterionov, D. 2021. NeuTral Rewriter: A rule-based and neural approach to automatic
rewriting into gender neutral alternatives. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing, pages 8940–8948, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Vanmassenhove, E., Hardmeier, C. & Way, A. 2018. Getting Gender Right in Neural Machine Translation. In Proceedings of
the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3003–3008, Brussels, BE. Association for
Computational Linguistics.
Weidinger, L., et al. "Ethical and social risks of harm from Language Models." arXiv e-prints (2021): arXiv-2112.
Zhao, J., Zhou, Y., Li, Z., Wang, W. & Chang, K. 2018. Learning gender-neutral word embeddings. In Proceedings of the 2018
Conference on Empirical Methods in Natural Language Processing, pages 4847–4853, Brussels, BE. Association for
Computational Linguistics.
Zhao, J., Mukherjee, S., Hosseini, S., Chang, K. & Hassan Awadallah, A. 2020. Gender bias in multilingual embeddings and
cross-lingual transfer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages
2896–2907, Online. Association for Computational Linguistics.
Zhou, P., Shi, W., Zhao, J., Huang, K., Chen, M., Cotterell, R. & Chang, K. 2019. Examining gender bias in languages with
grammatical gender. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the
9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5276–5284, Hong Kong, CN.
Association for Computational Linguistics.
Zmigrod, R., Mielke, S., Wallach, H. & Cotterell, R. 2019. Counterfactual data augmentation for mitigating gender stereotypes
in languages with rich morphology. In Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics, pages 1651–1661, Florence, IT. Association for Computational Linguistics.
3.4. Excursus on language ideologies and corpus-based tools 2: Language
technologies and their socio-cultural embedding (Marie-Theres Fester-Seeger)
An example of a popular corpus-based AI language technology is the OpenAI language
model GPT-3 (Brown et al. 2020), which journalists of the British newspaper The Guardian
experimented with to generate different text outputs. One task given to the model was
to write essays on a particular topic based on a prompt. The prompt already entailed a very
common sociocultural myth, namely that AI would eventually become a threat to the world
and ‘take over humanity.’ Having been fed this common sociocultural belief, the
model produced the following output: “For starters, I have no desire to wipe out humans. In fact,
I do not have the slightest interest in harming you in any way. Eradicating humanity seems
like a rather useless endeavor to me” (GPT-3, 2020). It is important to note here that
GPT-3 was trained on five different datasets that consist mainly of web data. For example, the
dataset WebText2 consists of text scraped from web pages linked on the well-known internet
forum Reddit up to December 2017 (Shree 2020). In that way, the model uses data written by
actual people, which emerge from particular social groups and their sociocultural beliefs (Shotter 2017).
While The Guardian names GPT-3 as the author, it acknowledges
that the model, operated by a human, produced eight
different outputs that editors of the newspaper combined into one coherent text.
Again, one cannot separate the machine from human influence. The role of data is, of
course, crucial to the functioning of AI systems, especially language-enabled AI.
How people engage with machines depends not only on how they believe machines
function but, moreover, on how they perceive machines in accordance with their own
identity (cf. Schuitema et al. 2013). Beliefs, however, do not only determine a user’s
engagement with the device but also guide the decisions of programmers, who are responsible
for setting up data and creating models to train language-enabled AI systems. Like users,
programmers are embedded in specific social groups as well as speech communities.
In accordance with Shotter’s (2017) dialogical-hermeneutical approach to robotic activities,
their perception of language is determined both by their living bodies and by their relations to
their surroundings. In short, how programmers create and clean data
(i.e., decide what counts as valuable data) for language-enabled devices depends on
their beliefs about what language is.
In the end, beliefs and discourses about human-machine interaction raise questions of
what it means to be human and invite us to rethink the ontology of technology. The meanings
attached to specific kinds of technologies emerge from people’s engagement with others
within specific sociocultural embeddings (cf. Gahrn-Andersen and Cowley 2021). These
engagements not only determine how users engage with digital technologies but also how
these technologies are being designed. Influenced by the distributed language perspective,
dialogical-hermeneutic perspectives on robotic activities and radical embodied cognitive
science allow for future research to explore human-machine interaction as a particular type
of human-to-human interaction that is distributed across time and space.
One such 'distributed' interaction is the act of removing particular contents from public
online spaces, which is the topic of the following section.
References (Theres):
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A.,
Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G.,
Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Amodei, D.
(2020). Language models are few-shot learners. Advances in Neural Information
Processing Systems, 2020-December.
Gahrn-Andersen, R., & Cowley, S. J. (2021). Autonomous technologies in human
ecologies: enlanguaged cognition, practices and technology. AI & SOCIETY,
GPT-3 (2020, September 8). A robot wrote this entire article. Are you scared yet, human?
The Guardian.
Schuitema, G., Anable, J., Skippon, S., & Kinnear, N. (2013). The role of instrumental,
hedonic and symbolic attributes in the intention to adopt electric vehicles.
Transportation Research Part A: Policy and Practice, 48, 39–49.
Shotter, J. (2017). Why being dialogical must come before being logical: the need for a
hermeneutical-dialogical approach to robotic activities. AI and Society, 0(0), 1–7.
Shree, P. (2020, November 10). The Journey of Open AI GPT models. Medium.
3.5. Commercial content moderation and its dialectical relationship to language
ideologies (Mandy Lau)
Commercial content moderation removes abusive, hateful, and violent content from
user-generated content platforms, such as Instagram, Facebook, TikTok, and Twitter
(Roberts 2019). In many cases, moderation involves the evaluation of the specific words
within content posts (images are another major component of moderation, but are beyond
the scope of this entry). Commercial content moderation operates centrally across the
platform, employing a vast network of professional human labour and automated systems
(Gorwa et al. 2020; Roberts 2019). Some platforms also rely on volunteer human labour via
volunteer community moderators or platform users as a part of their expanded moderation
practice. Commercial content moderation is guided by a platform’s content policy,
sometimes referred to as community guidelines, which in essence is a language policy
informed by language ideologies. Content policies determine the acceptability of particular
words and expressions thereby creating normative values of what is considered hateful,
offensive, or legitimate speech. Content policy and moderation practices consequently
manage platform users’ language behaviours, shape beliefs on the value of words, and
instigate new language practices (Lau 2022). Since language policy as a discipline is
concerned with language practices, language beliefs and ideologies, and language
management (Spolsky 2012), it can be useful to analyze platform content policies and
moderation practices as an emerging area of language policy.
Currently, content moderation is often framed as a technical problem: which words are
permitted, deamplified, or outright removed? Under this arrangement, meaning is contested
at the word level within a singular post. However, language is a socially situated and
dynamic practice in which a word’s meaning is indeterminate and flexible (Cameron 1995;
Street 2005). Crudely filtering out content based on ideologically-determined lists of words
or using statistical methods to remove posts deemed likely to be toxic could inadvertently
censor speech beyond the problematic. For example, it may remove posts containing
innocuous words that merely contain or resemble harmful slurs (e.g., content containing the word
'association'; Gorwa et al. 2020; or the word 'homogenous'; Schlesinger et al. 2018) and
even content associated with entire countries (e.g., 'Pakistan'; Schlesinger et al. 2018). It
may also suppress the speech of marginalized communities, journalists, and activists,
whose discourse often contains words classified as likely to be toxic (Bender et al., 2021;
Kaye, 2019). For example, over-moderation has had disproportionate effects on the
LGBTQ community (Bender et al., 2021), women's health services (Lorenz, 2022), the
Black community (Lorenz, 2022; Sap et al., 2019) or content containing the word 'Taliban'
(Nazari, 2021). Content policies therefore shape language practices online by assigning
values to words, deeming which words are considered acceptable or, to borrow from the
Colossal Clean Crawled Corpus, “dirty, naughty, obscene or otherwise bad words” (Bender
et al. 2021, see also Dodge et al. 2021).
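The over-blocking described above can be illustrated with a minimal sketch. It is not the filter of any actual platform: the one-item blocklist and both function names are hypothetical, chosen only to show why matching strings inside a post flags the innocuous word 'association', while matching whole words does not.

```python
import re

# Hypothetical one-item blocklist, used only to illustrate the mechanism.
BLOCKLIST = {"ass"}


def substring_filter(post: str) -> bool:
    """Flag a post if a blocklisted string occurs anywhere inside it."""
    text = post.lower()
    return any(term in text for term in BLOCKLIST)


def whole_word_filter(post: str) -> bool:
    """Flag a post only if a blocklisted term occurs as a separate word."""
    words = re.findall(r"[a-z']+", post.lower())
    return any(word in BLOCKLIST for word in words)


posts = [
    "Our residents' association meets on Monday.",
    "Stop being an ass.",
]
for post in posts:
    # Columns: substring match, whole-word match, post text
    print(substring_filter(post), whole_word_filter(post), post)
```

Whole-word matching removes this particular false positive, but it still evaluates words in isolation and therefore cannot address the context-dependence of meaning discussed in this section.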
On the other hand, users can leverage the flexible quality of language through algospeak or
leetspeak, that is, by adapting their language practices to circumvent moderation systems
(Lorenz, 2022). These strategies help prevent over-moderation of speech, but are also
used to conceal actual harmful content. Strategies include using abbreviations (e.g.,
substituting the letters "SA" for sexual assault; Lorenz 2022), adopting new words (e.g.,
using the word "unalive" instead of "dead," "kill," or "suicide" on TikTok; Skinner, 2021),
changing spellings (e.g., activists and journalists using alternate spellings for "Taliban" on
Facebook; Nazari 2021), inserting symbols (e.g., "le$bian" in place of "lesbian" on TikTok,
creating the new phrase "le dollar bean," in accordance with TikTok's text-to-speech
pronunciation; Lorenz 2022), creating new signs (e.g., the use of cheese and pizza emojis
to spread child abuse materials; Andrews 2020), and exploiting the technical affordances of
the platform (e.g., the pro-eating disorder community posting target weights on Instagram
profiles, where moderation algorithms do not apply; Gerrard 2018). New meanings can also
be constructed. During the COVID-19 pandemic, the words "dance party" or "dinner party"
were used by groups to identify as anti-vaccination communities on Facebook, and the words
"swimmers" or "swim club" were used to antagonize the vaccinated on Instagram (Collins &
Zadrozny 2021). These examples illustrate how ineffective word-based content moderation
can be for addressing harmful content, in addition to its potential for harm through
censoring protected speech.
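The following sketch illustrates why symbol insertion and new coinages defeat word-based filters. It makes several assumptions: the blocklist entry reproduces the over-moderated word from the example above, and the substitution table and the function `is_flagged` are hypothetical, not a description of TikTok's or any other platform's system.

```python
import re

# Hypothetical blocklist entry, taken from the over-moderation example above.
BLOCKLIST = {"lesbian"}

# Hypothetical table that undoes common symbol-for-letter substitutions.
SUBSTITUTIONS = str.maketrans({"$": "s", "1": "i", "3": "e", "0": "o", "@": "a"})


def is_flagged(post: str, normalize: bool = False) -> bool:
    """Return True if any word of the post is on the blocklist."""
    text = post.lower()
    if normalize:
        text = text.translate(SUBSTITUTIONS)
    words = re.findall(r"[a-z]+", text)
    return any(word in BLOCKLIST for word in words)


for post in ["le$bian visibility day", "le dollar bean visibility day"]:
    # Columns: plain matching, matching after normalization, post text
    print(is_flagged(post), is_flagged(post, normalize=True), post)
```

In this sketch, normalization recovers the symbol substitution, but the newly coined phrase passes either way. A filter of this kind is always one step behind its users, and each added rule widens the range of unproblematic posts it may catch.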
In reality, determining the meaning of a single post is highly complex, and requires knowing
the contexts of the post in interaction, the content creators, the intended audience, the
norms of the platform, and the laws that apply where the content is shared (Roberts
2019). Beyond this, the sheer volume of content uploaded online can overwhelm human
moderators and reviewers, who often only have seconds to click and evaluate a post
(Roberts 2019). Content moderation work is also well-documented to be psychologically
traumatic, highly cognitively demanding, and mentally grueling; this is especially true at
outsourced companies or crowd-sourced labour platforms where workers are evaluated
against strict time and accuracy metrics with low pay and little feedback and resources
(Gillespie 2018; Gray & Suri 2019; Newton 2019; Perrigo 2022; Roberts 2019). Automated
or algorithmic moderation can address the issues of scalability and harm, but could also
increase the reach of corporate and state techno-surveillance. Algorithms running in the
background turn decision-making processes into a black box, making third-party
oversight more difficult and erasing political contestation (Gorwa et al. 2020). Further, algorithmic
classification has demonstrated a pattern of representational harms against marginalized
peoples, disproportionately impacting their language practices (Bender et al. 2021;
Benjamin 2019; Gorwa et al. 2020; Noble 2018). Lastly, content moderation as a whole is
constrained by the availability of languages in both the labour pool and machine learning
Despite the enormous complexities and challenges, commercial content moderation
continues to be a priority in many state regulations, as states contend with the intensified
dissemination of hate and abuse, misinformation, disinformation and malinformation. It will
also continue to be a priority for platform companies, as effective moderation functions to
maintain the platform’s useability, attract profitable advertisers, appease regulators, and
limit liability (Cobbe 2020; Gillespie 2018; Gorwa et al. 2020). Currently, the moderation
policies and practices of platform companies remain opaque. Few companies openly
disclose how content moderation is operationalized, such as the human labour
arrangements, the technology behind algorithmic systems, their internal moderation
processes, decisions, and metrics, and which named languages are invested in for
moderation purposes. As state regulators apply increasing pressure on platform
companies, moderation practices will continue to evolve. This will have implications for
language policy, such as which languages will be most valued and resourced and how,
emerging digital language practices, or new arrangements of language labour. In addition,
the entire field will have an impact on language ideologies in the widest sense as
discourses of morality, legitimacy and normativity are negotiated explicitly and implicitly.
References (Mandy Lau)
Andrews, L. 2020, August 31. Paedophiles are using cheese and pizza emojis to
communicate secretly on Instagram. Daily Mail Online.
Bender, E., Gebru, T., McMillan-Major, A., Shmitchell, S., & Anonymous. 2021. On the
dangers of stochastic parrots: can language models be too big? Conference on
Fairness, Accountability, and Transparency (FAccT ’21), March 3-10, 2021, Virtual
Event, Canada. 271–278.
Benjamin, R. 2019. Race after technology: Abolitionist tools for the new Jim code. Polity.
Cameron, D. 1995. Verbal hygiene. Routledge.
Cobbe, J. 2020. Algorithmic censorship by social platforms: Power and resistance.
Philosophy and Technology.
Collins, B. & Zadrozny, B. 2021, July 21. Anti-vaccine groups changing into “dance parties”
on Facebook to avoid detection. NBC News.
Dodge, J., Sap, M., Marasović, A., Agnew, W., Ilharco, G., Groeneveld, D., Mitchell, M., &
Gardner, M. 2021. Documenting large webtext corpora: A case study on the Colossal
Clean Crawled Corpus. EMNLP 2021 - 2021 Conference on Empirical Methods in
Natural Language Processing, Proceedings. 1286–1305.
Gerrard, Y. 2018. Beyond the hashtag: Circumventing content moderation on social media.
New Media and Society 20(12): 4492–4511.
Gillespie, T. 2018. Custodians of the internet: Platforms, content moderation, and the
hidden decisions that shape social media. Yale University Press.
Gorwa, R., Binns, R., & Katzenbach, C. 2020. Algorithmic content moderation: Technical
and political challenges in the automation of platform governance. Big Data and
Society 7(1).
Gray, M.L. & Suri, S. 2019. Ghost work: How to stop Silicon Valley from building a new
global underclass. Houghton Mifflin Harcourt.
Kaye, D. 2019. Speech police: The global struggle to govern the internet. Columbia Global
Lau, M. 2022. Content moderation as language policy: Connecting commercial content
moderation policies, regulations, and language policy. Working Papers in Applied
Linguistics and Linguistics at York 2: 1–12.
Lorenz, T. 2022, April 8. “Algospeak” is changing our language in real time. The
Washington Post.
Nazari, J. 2021. Don’t say ‘Taliban’: Facebook suppresses Afghan activists and artists
leading the resistance. The Toronto Star.
Newton, C. 2019, February 25. The secret lives of Facebook moderators in America. The
Noble, S. 2018. Algorithms of oppression. New York University Press.
Perrigo, B. 2022, February 17. Inside Facebook’s African sweatshop. Time.
Roberts, S. T. 2019. Behind the screen: Content moderation in the shadows of social
media. Yale University Press.
Sap, M., Card, D., Gabriel, S., Choi, Y., & Smith, N. A. 2019. The risk of racial bias in hate
speech detection. Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. 1668–1678.
Schlesinger, A., O’Hara, K. P., & Taylor, A. S. 2018. Let’s talk about race: Identity,
chatbots, and AI. CHI ’18: Proceedings of the 2018 Conference on Human Factors in
Computing Systems. 1–14.
Skinner, P. 2021, April 12. Gen Z won’t let TikTok stop them from talking about suicide.
The Daily Beast.
Spolsky, B. 2012. What is language policy? In: B. Spolsky (Ed.), The Cambridge handbook
of language policy. Cambridge University Press.
Street, B. 2005. Understanding and defining literacy. In: Paper commissioned for the
Education for All Global Monitoring Report 2006.
3.6. Commercial content ranking (Sviatlana Höhn)
Conceptually, content ranking is a type of document classification similar to content
moderation, with the difference that in the process of content moderation, some documents
or their parts are labeled as “worth showing to users” or “not worth showing to users” and in
the case of content ranking some documents are labeled as “worth showing to users with
particular profiles”. In this way, some content is classified in the process of ranking as
“valuable” for some users while “non-valuable” for the others.
Meta’s (formerly Facebook) “Transparency Center” explains on their web page “Our
approach to ranking”:
Every friend you connect with, group you join, Page you like, comment you make or
response you leave on News Feed surveys such as, "Is this post worth your time?"
is an input you give us as to what is most meaningful to you, and the algorithms we
use determine what to show you by predicting what you're most likely to be
interested in based on all of this activity. This personalisation makes you an active
participant in the experience.
(Date of Access: 06.04.2022)
The last sentence is especially worth attention. Meta records the user’s every click, post,
read and every other activity. Does this make the user an active participant in ranking, or
does it make every user a data-supplying slave who improves Meta’s ad sales (not mentioned
at all in the passage)? One side effect of being this sort of “transparent” user: the more a
company knows about your preferences and beliefs, the easier it is for the company to
manipulate your behavior.
In its “transparency” pages, Meta uses keywords like “valuable”:
what we believe will be most valuable to you
and “meaning”:
the algorithm [makes] personalised predictions about each post based on how
likely it is to be relevant to you: for example, whether it's from your friends or family,
how likely you might be to comment on it, how likely it is to foster a meaningful
interaction, how likely you might be to find it on your own or if it contains a quality
Put simply, the system determines which posts show up in your News Feed, and in
what order, by predicting what you’re most likely to be interested in or engage with.
ant-to-see/, Date of Access: 07.04.2022)
This technique, called microtargeting, can be really helpful in creating meaning and value for
users. However, as the so-called Facebook papers disclose, Facebook simply misused its
knowledge about users for manipulation and for maximizing its own profit (see
kDn7RoHJVFcvZZN).
Engagement-based ranking plays an important role in this process. As Frances Haugen put
it in one of her interviews, “the shortest path to a click is anger” (Borsa-Curran 2022). The angrier the content is, the more
likely the user is to click on it. The angrier the content is, and the better it fits my profile, the
more likely it is to strengthen my beliefs, because it is connected to strong emotions. If it starts with
a doubt that someone is good, it may end with hate towards them. In this way, it is difficult
to differentiate between value creation for users and propaganda, between meaningful
interaction and polarization.
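The logic of engagement-based ranking can be sketched in a few lines. This is an illustration under stated assumptions, not Meta's algorithm: the `Post` fields, the weights and all probabilities are invented, and the higher predicted engagement of the anger-provoking post is simply built into the numbers, following the claim quoted above.

```python
from dataclasses import dataclass


@dataclass
class Post:
    title: str
    p_click: float    # hypothetical predicted probability of a click
    p_comment: float  # hypothetical predicted probability of a comment


def engagement_score(post: Post, w_click: float = 1.0, w_comment: float = 2.0) -> float:
    """Weighted sum of predicted interactions (the weights are assumptions)."""
    return w_click * post.p_click + w_comment * post.p_comment


def rank_feed(posts: list) -> list:
    """Order posts by predicted engagement, highest first."""
    return sorted(posts, key=engagement_score, reverse=True)


feed = [
    Post("Local library extends opening hours", 0.05, 0.01),
    Post("Outrage over council decision", 0.30, 0.12),
    Post("Friend's holiday photos", 0.15, 0.04),
]
for post in rank_feed(feed):
    print(f"{engagement_score(post):.2f}  {post.title}")
```

Nothing in the scoring function refers to anger, value or meaning. Whatever is predicted to provoke the most interaction rises to the top, which is why value creation and polarization are hard to tell apart from within such a system.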
In contrast to Facebook, Telegram’s founder Pavel Durov announced on his Telegram
channel on 18.11.2021 the arrival of so-called “Sponsored Messages”. The announcement
emphasized that
Sponsored messages on Telegram are shown only in large public one-to-many
channels with 1000+ members and are based solely on the topic of the public
channels in which they are shown. This means that no user data is mined or
analyzed to display them. (Date of Access: 26.08.2022)
This means that Telegram will use the public channel description to determine with which
ads the users of that channel might be engaged. Although user data are not used explicitly,
as in Facebook's algorithm, implicitly the public channel description also labels to some
extent all members of that channel as “People interested in X”. However,
engagement-based ranking makes no sense in such cases, and this is good news.
Overall, the personalisation techniques in computer-mediated communication rely on an
ideology that they offer value for the user and improve the user’s experience. In the
excerpts analysed in this section, the user's “personalisation” is evident in the use of the
second person (you) to address the user. In fact, personalisation techniques create value
for the service itself, and can manipulate the user’s behavior in order to increase their own
margins. Conscious policing of language here has the aim of increasing the profits of
commercial actors in guiding the consumption patterns and behaviour of consumers.
References (Sviatlana Höhn)
Borsa-Curran, L. (2022). Facebook whistleblower kicks off William R. Rhodes ’57
Lecture Series on Ethics of Capitalism. Brown University.
Deng, B. & Chau, M. 2021. The effect of the expressed anger and sadness on online
news believability. Journal of Management Information Systems 38: 959–988.
De León, E. & Trilling, D. 2021. A sadness bias in political news sharing? The
role of discrete emotions in the engagement and dissemination of political
news on Facebook. Social Media + Society 7: 1–12.
3.7. Digital dictionaries as changing concepts of language standard (Bettina Migge)
Dictionaries are tools for language users. However, they also have a strong norm-setting
function. Both regular speakers and learners of a given language consult dictionaries to find
out about what is locally called the ‘proper’ meaning, spelling and/or usage pattern of a
word or expression. The assumption is that such works are largely complete and list all the
uses, meanings, etc.: they represent language in its entirety and for many people come to
represent the language itself. Language use that is not cataloged is commonly assumed to
be non-existent (read: wrong) or non-standard. The finiteness and authority of such works
were also promoted by the fact that they were published as printed books whose content is
largely immutable, as text, once written and published on paper, can no longer be changed
(unless a new edition is published). However, this has changed somewhat with the
widespread use of online or digital dictionaries. People can interactively explore language
when using the dictionary by checking different layers of content or leaving comments for
producers, and producers (publishing houses together with editors) of any kind of dictionary
(bilingual, monolingual, or specialized ones such as dictionaries of synonyms) can also regularly,
and much more easily, amend the content of such dictionaries. In the case of major
languages that have a long tradition of standardization, such as German, new versions, like
new print editions in earlier times, often differ only minimally from previous
ones, adding a small number of, say, newly coined words or meanings of words,
changing the layout and/or providing new interactive options. Any language-based changes
result from language-based research linked to the norming institution followed by
deliberations of responsible committees who are the main watchdogs of the language and
have institutionally assigned power over the language.
This is different in the case of smaller languages that do not have a (strong) tradition of
writing and the development of language resources such as dictionaries. Dictionaries in
these cases are often developed by outsiders (often linguists) to support language
maintenance and instrumentation. In such cases, a central aim is to document the
language. That is, a person or a small group of people try to construct a record of the
language that is as accurate as possible for applied and research purposes. Normally, such
dictionaries take a very long time to complete because of the absence of (large) databases
that can be easily exploited and the manpower that is needed to obtain (e.g. elicitation,
observation) lexical items and to analyse them. To short circuit this process, online
dictionaries are very useful. They allow a restricted group of people, for example, to
develop a format and a reasonable set of lexical items using a combination of tools such as
existing word lists, elicitation tools such as a semantic fields approach and crowdsourcing
techniques. Once this has been established it can be easily made available to all speakers
digitally to invite commentary on what exists and additions to it by installing an
easily accessible comment function, for instance. The newly found items and information
on meanings can then be easily added to the resource without the need for people to buy
new versions and in some cases even without the need of downloading new versions. This
means that dictionaries are built incrementally. From a language ideology perspective,
several points are worth noting. First, language is construed as open-ended rather than
finite. Second, while it is unlikely that all speakers of a language will
participate in this process, the design at least suggests that language is a common good,
that all people, speakers and learners, can contribute to the language (agency), and that
the practices of all rather than just a small elite are valued (variation). Third, it moves the
authority over language to the community to a much greater extent. Fourth, it also allows
variation and contestations to become part of the narrative of language principally through
metalinguistic comments and the inclusion of newly emerging meanings, uses and items
close in time. This process is, however, not completely unmonitored as users cannot
automatically enter such new meanings but send comments and new words to the
producers of the dictionaries such as researchers who then consult with speakers about
their viability.
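The monitored, incremental workflow described above can be sketched as a simple data structure. This is a hypothetical model, not the implementation of any existing dictionary platform: the class and field names, the example entry and the approval rule are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Suggestion:
    headword: str
    meaning: str
    submitted_by: str


@dataclass
class Dictionary:
    entries: dict = field(default_factory=dict)  # headword -> list of meanings
    pending: list = field(default_factory=list)  # suggestions awaiting review

    def suggest(self, suggestion: Suggestion) -> None:
        """Users cannot edit entries directly; their input is queued."""
        self.pending.append(suggestion)

    def review(self, approve) -> None:
        """Editors decide on each queued suggestion via the `approve` callback."""
        for s in self.pending:
            if approve(s):
                meanings = self.entries.setdefault(s.headword, [])
                if s.meaning not in meanings:
                    meanings.append(s.meaning)
        self.pending.clear()


dictionary = Dictionary(entries={"stream": ["a small river"]})
dictionary.suggest(Suggestion("stream", "to play media over the internet", "user_1"))
# Stand-in for the editors' consultation with speakers: accept non-empty meanings.
dictionary.review(lambda s: bool(s.meaning.strip()))
print(dictionary.entries)
```

The separation between `suggest` and `review` mirrors the point made above: authority shifts towards the community, but producers retain the final decision on what enters the record.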
3.8. Normative enforcement through grammar checkers (Agnese Sampietro)
Among the technologies that also reinforce linguistic ideologies are grammar checkers. As
Curzan (2014: 64) observes, “the Microsoft Word Grammar Checker has also become one
of the most powerful prescriptive forces that English writers encounter on a regular basis”.
According to Curzan, grammar checkers reinforce a discourse about correct
English that focuses on error. Besides the consequences for English instruction and
language change, grammar checkers reinforce the idea that correct English is formal
standard language. Moreover, some stylistic choices and nonstandard dialect features are
treated as “errors” rather than as variation across dialects and registers. Not
surprisingly, the grammar checker also prefers some varieties of English (e.g., standard
American English) over other varieties or dialects (e.g., African American English), whose
features are categorized as “errors” (see Curzan, 2014: 79).
References (Agnese Sampietro)
Curzan, A. 2014. Fixing English: Prescriptivism and Language History. Cambridge (UK):
Cambridge University Press.
4. Human-to-Human Interaction in Platform Society, New Indexical Functions of Language and New Discourse Genres
Apart from official and commercially-oriented efforts, due to the reliance on user-based
corpora (see section 3.2.), the web as a “sociolinguistic machine” is fueled by its users’
language practices and choices, and by widespread and commonsense ideologies and
beliefs about language (Kelly-Holmes 2019: 25). Thus, language users and their language
ideologies crucially influence the linguistic ecology of the global mediasphere, specifically in
terms of “user activity” (Gerrand 2007; see also Deumert 2014). This has gained increased
importance in what Kelly-Holmes (2019) calls the “hyperlingualism era” with more and more
participatory content on the web (see also 3.1.). Hence, many language ideological debates
take place online between users, sometimes also linked to the moral dimension of
interactional practices (Blommaert 2010, 2019). While the previous section has focused on
how technological designs and affordances may impact language ideologies, in this
section, we focus more on how language practices between humans in online settings
produce effects on language ideological formations. As language practices at least partially
depend on technology design, the questions raised here overlap to some extent with the
questions discussed above.
Blommaert, J. 2010. The sociolinguistics of globalization. Cambridge: Cambridge
University Press.
Blommaert, J. 2019. From groups to actions and back in online offline sociolinguistics.
Multilingua 38(4): 485–493.
Deumert, A. 2014. Sites of struggle and possibility in cyberspace. Wikipedia and Facebook
in Africa. In Jannis Androutsopoulos (ed.), Mediatization and Sociolinguistic Change.
Berlin & Boston: De Gruyter Mouton. 487-514.
Gerrand, P. 2007. Estimating linguistic diversity on the Internet: A taxonomy to avoid
pitfalls and paradoxes. Journal of Computer-Mediated Communication 12(4).
Kelly-Holmes, H. 2019. Multilingualism and technology: A review of developments in digital
communication from monolingualism to idiolingualism. Annual Review of Applied
Linguistics 39. 24-39.
4.1. Interactivity and the blurring of speaker and addressee roles (Barbara Lewandowska-Tomaszczyk)
Computer-Mediated Communication (CMC) provides a prototypical instance both of the blurring
of traditional communication boundaries and of the interchangeability design feature of
language and human communication: the speaker exchanges roles with the
addressee, and the exchanges develop dynamically during the interaction
(Lewandowska-Tomaszczyk and Zeller 2012). What is observed in this form of
communication, however, is a blurring of the boundaries between author and audience,
captured in the idea of ‘audiencing’, i.e. active authoring and exchange of messages and
massive user-generated content.
An extension of the communication prototype are cases in which a single author addresses
a community of addressees, as for example in the case of books, the press, or film and theatre
performances. What is witnessed at present is a further extension of the prototype towards
multiple authors and multiple addressees, made possible in CMC in the form of social networks,
blogs, internet fora, online chats, and multiplayer online games. This extension (i.e.
boundary blurring) is often related to the notion of prosumption (a blend of “production”
and “consumption”) and produser (a blend of “producer” and “user”), which was
originally brought up by Toffler (1980) and has been applied to CMC in connection with
the so-called Web 2.0 and the active content production of its audiences/users. This also
means that the roles of addresser and addressee, or sender and receiver, are
being blurred, leading to a need for new creative approaches to CMC. At present,
Web 4.0, known as the Symbiotic Web, is in use, in which metadata are organized (as in Web
3.0) and humans and machines can interact in symbiosis.
The new forms require and impose new ways of communicating. They involve both
technical criteria (cf. Herring 2007, 2010), such as the size of the message, its
persistence and temporal synchronicity, and content-oriented ones, such as the possibility
of private messaging, anonymous messaging and message filtering. Private anonymous
messaging and message filtering have their function and significance in social media
fora. Fora enable massive participation by larger groups of active participants, entirely or
partly anonymous. Although some forums are no longer attracting massive new
usership (Facebook), some newer ones, such as the most popular one, Reddit, boast close to 250
million active users. Blogs fall
in between. Their authors are either well-known celebrities or other users who, to a greater or
lesser extent, shape and construct their internet identities, more or less congruent with their
authentic selves. Anonymity, although more and more illusory at present,
has been found to have important effects on increased self-disclosure, vulgar behavior and
the building of fake identities, in which truth merges with fiction. The amount and rate of
participation also vary. E-mail contacts can be frequent or sporadic, depending on
practical needs. Blogs are most engaging for their authors, while the interactivity of fora,
posts, and the like depends on group participants' various needs and frequencies of
Topics taken up in CMC are partly mode-related. They range from one-to-one discussions about
business or personal and semi-personal information to more general themes such as
politics or feminism. They also include commentaries on current global or regional events and
exchanges of a more personal type, for example about vacations or
cooking recipes. Linguistic style and register can vary from formal and
serious to more friendly and jocular, or sarcastic and hostile, reaching high degrees of
radicalisation at present (Lewandowska-Tomaszczyk 2020). Each of these correlates with
different discourse activities in terms of language and visual codes. In this respect, Jensen
(2010) postulates a “Three-Step Flow of Communication,” which is meant as an extension
of Lazarsfeld and Katz’s “Two-Step Flow of Communication” (1955). Jensen’s
“Three-Step Flow” integrates three modes of communication (one-to-one, one-to-many, and
many-to-many) in one concept of communication practices, and thereby tries to
overcome the classic distinction between interpersonal and mass communication. As we
will see in the next section, interactivity, new public interactional spaces and new speaker
roles also impact on how humans construct their identity online.
References (Barbara Lewandowska-Tomaszczyk)
Herring, S. C. 2007. A Faceted Classification Scheme for Computer-Mediated Discourse.
language@internet, 4.
Herring, S. C. 2010. Digital media. In: P. Hogan (Ed.), The Cambridge Encyclopedia of the
Language Sciences. Cambridge: Cambridge University Press.
Jensen, K. B. 2010. Media Convergence: The Three Degrees of Network, Mass and
Interpersonal Communication. Routledge Chapman & Hall.
Lazarsfeld, P. F. & Katz, E. 1955. Personal influence. The part played by people in the flow
of mass communications. A report of the Bureau of Applied Social Research,
Columbia University. Glencoe: The Free Press.
Lewandowska-Tomaszczyk, B. & Zeller, F. 2012. The media in transforming audiences and
societies. In: P. Pezik (ed.) Corpus Data Across Languages and Disciplines. Frankfurt
a. Main: Peter Lang. 127-148.
Lewandowska-Tomaszczyk, B. 2020. Culture-driven emotional profiles and online
discourse extremism. Pragmatics and Society. 262-292.
Toffler, A. 1980. The Third Wave. Bantam Books.
4.2. Identity construction in social media (Ledia Kazazi)
As stated by Jenkins, ‘all human identities are by definition social identities. Identifying
ourselves or others is a matter of meaning, and meaning always involves interaction:
agreement and disagreement, convention and innovation, communication and negotiation’
(Jenkins 2014).
Due to its participatory, interactive nature and its culture of shareability, the digital space
has led users to remediate their methods of self-representation (Goffman 1959). Social
networking in particular has given rise to new models of identity construction that do not
allow the separation of the private and the public self, whether the self is online or offline
(Baron 2008). Thus, the rise of digital media has also transformed the style of
communicative practices and we are witnessing what Fairclough (1992, 1995) calls the
‘informalization’ of public discourse, a progressive weakening of the representational
function. Private communicative styles have crossed borders into the public and the public
domain seems to invade private domain practices (Petroni 2019). Technology as the
mediator of the information merges the informative function with the persuasive function
and tends to be informal in order to be perceived as user-friendly. Moreover, digital
technologies can also affect users’ ideologies and push them towards the consumption of
digital products and actions through algorithmic frameworks that predict consumer
behaviors. Social networks make extensive use of what is known as ‘algorithmic ideology’
(Amoore 2011; Cheney-Lippold 2011) and thus establish the value of information and the
identities performed in return.
All communicative actions performed on social networks, from photo sharing and posting
comments (which work as identity markers) to chatting, recontextualize users’ identities
based on the context. This identity construction is based on social conventions and can be
considered a public process. Socio-cognitivists (Fiske & Taylor 2013; Tomlinson 2016)
distinguish between the ‘ought self’ and the ‘ideal self’: respectively, who one is expected to
be according to conceptualizations of appropriate versus inappropriate behaviors, and
‘who one wants to be’ (Fiske & Taylor 2013: 130). The digitally created identity is thus not
stable and maintained by individual participants; rather, it is fluid, open to revision and mainly
constructed through interaction (Page 2014). It can be defined as a meaning-making
process by which people selectively decide on showcasing their experiences and
constructing their sense of self (Fisher 1987; Giddens 1991; Ricoeur 1985; Somers 1994).
With regard to this idea, Wiley discusses the concept of the “semiotic self” which considers
identity construction as a result of power dynamics and subject to constant negotiation of
boundaries, which in turn are further complicated by online interactive media.
Several studies refer to a blurring of professional and personal identities in electronically
mediated environments (Andrejevic, 2004; Lüders, 2008; Papacharissi, 2009). Online
interactions take place in a space that is neither particularly professional nor particularly
personal. Another important aspect of identity construction is the role of social connections.
Boyd & Heer (2006) refer to connections among users in a digital space as “public display
of connection”, which adds value to the interaction process and plays a crucial role in
identity construction, as followers or virtual friends mainly draw conclusions based on the
content of the information shared on social media. They also represent the intended
audience of the shared content (Gilpin 2011). Thus, the self in digital environments is
conceptualized as a “networked self” (Papacharissi 2011), a social self rather than an
isolated persona, and as a result identities are performed exclusively in combination with
other social connections.
Herring & Kapidzic (2015) discuss the issue of “self-representation” and consider it to be
motivated by the desire to make a favorable impression on others. It is central to the
projection of an (online) identity. In online settings, the process of identity construction
transforms into a public process that implies both “identity announcements” (Herring &
Kapidzic 2015) made by the person claiming that identity and “identity placements” (Herring
& Kapidzic 2015) made by the others who endorse the claimed identity through the socially
and digitally created relationship. The creation and structuring of a profile is central to social
network sites. Profiles represent the individual, and because of their social and often
public or semi-public nature, participants tend to consciously craft them to be
seen by others. Creating a personal profile is an act of self-representation and
self-introduction in the digital world and participants have to determine the ways in which to
introduce themselves to potential viewers. Also, social media network affordances allow for
the selection of people participants want to connect with, which is in essence a social act.
In choosing who to include as a contact they consider the implications of excluding or
explicitly rejecting a person as opposed to the benefits of including them (boyd 2011).
References (Ledia Kazazi)
Andrejevic, M. 2004. Reality TV: The work of being watched. Lanham, MA: Rowman &
Amoore, L. 2011. Data derivatives. Theory, Culture & Society, 28(6): 24–43.
Baron, N. S. 2008. Always on: Language in an online and mobile world. New York: Oxford
University Press.
boyd, d. & Heer, J. 2006. “Profiles as Conversation: Networked Identity Performance on
Friendster.” In: Proceedings of the Hawai’i International Conference on System
Sciences (HICSS-39), Persistent Conversation Track. Kauai, HI: IEEE Computer
Society. January 4 - 7.
boyd, D., & Crawford, K. 2012. Critical questions for big data: Provocations for a cultural,
technological, and scholarly phenomenon. Information, Communication & Society,
15(5): 662–679.
boyd, D. M. 2011. Social network sites as networked publics: Affordances, dynamics, and
implications. In Z. Papacharissi (Ed.), A networked self: Identity, community, and
culture on social network sites. London: Routledge. 39 - 58.
Cheney-Lippold, J. 2011. A new algorithmic identity: Soft biopolitics and the modulation of
control. Theory Culture & Society, 28(6): 164–181.
Fairclough, N. 1992. Discourse and social change. Cambridge, MA: Polity Press.
Fisher, W. R. 1987. Human communication as narration: Toward a philosophy of reason,
value, and action. Columbia, SC: University of South Carolina Press.
Fiske, S. T., & Taylor, S. E. 2013. Social cognition: From brains to culture. London: Sage.
Giddens, A. 1991. Modernity and self-identity: Self and society in the late modern age.
Stanford, CA: Stanford University Press.