Chapter

Evaluation of Gender Bias in Amharic Word Embedding Model


Abstract

Bias in natural language processing systems can perpetuate and exacerbate societal inequalities, reflecting and potentially amplifying existing biases in human language and culture. Amharic, as the official language of Ethiopia, holds cultural and linguistic significance, making it imperative to assess potential biases within its computational representations. This paper investigates the presence and extent of gender bias in Amharic text corpora. The research uses gendered word pairs to capture gender representation in word embeddings and quantifies the degree of gender bias present in profession words. We found that profession words carry stereotypical implicit biases, with most occupations leaning male. Professions such as "nurse" and "housemaid" align with societal gender dynamics, displaying significant female associations. Professions in the arts and athletics likewise show a robust female-leaning bias, while physically demanding and highly educated professional roles tend to exhibit male-leaning biases. The study contributes insights into the gender dynamics encoded within the Amharic language, informing strategies to reduce bias and foster fair, unbiased representations for improved societal and technological outcomes.
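The pair-based bias measure the abstract describes can be sketched as follows. This is a minimal illustration with toy 2-d vectors: the function name, the vectors, and the anchor lists are hypothetical stand-ins for embeddings of Amharic gendered word pairs and profession words, not the paper's actual data or code.

```python
import numpy as np

def gender_bias(word_vec, female_vecs, male_vecs):
    """Bias of a word: mean cosine similarity to female anchor words
    minus mean cosine similarity to male anchor words.
    Positive -> female-leaning, negative -> male-leaning."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    f = np.mean([cos(word_vec, v) for v in female_vecs])
    m = np.mean([cos(word_vec, v) for v in male_vecs])
    return f - m

# Toy 2-d vectors standing in for embeddings of Amharic gendered pairs.
female = [np.array([1.0, 0.2]), np.array([0.9, 0.1])]
male   = [np.array([0.2, 1.0]), np.array([0.1, 0.9])]
nurse  = np.array([0.8, 0.3])   # hypothetical "nurse" vector

print(gender_bias(nurse, female, male) > 0)  # True: female-leaning
```

Scoring each profession word this way and sorting by the signed score is one straightforward way to obtain the male/female leanings the abstract reports.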

Article
Full-text available
Over the past few years, word embeddings and bidirectional encoder representations from transformers (BERT) models have brought better solutions to learning text representations for natural language processing (NLP) and other tasks. Many NLP applications rely on pre-trained text representations, leading to the development of a number of neural network language models for various languages. However, this is not the case for Amharic, which is a morphologically complex and under-resourced language: usable pre-trained models for automatic Amharic text processing are not available. This paper investigates learned text representations for information retrieval and NLP tasks using word embedding and BERT language models. We explored the most commonly used word embedding methods, including word2vec, GloVe, and fastText, as well as the BERT model. We investigated the performance of query expansion using word embeddings, and analyzed the use of a pre-trained Amharic BERT model for masked language modeling, next sentence prediction, and text classification tasks. Amharic ad hoc information retrieval test collections containing word-based, stem-based, and root-based text representations were used for evaluation. We conducted a detailed empirical analysis of the usability of word embeddings and BERT models on word-based, stem-based, and root-based corpora. Experimental results show that word-based query expansion and language modeling perform better than stem-based and root-based text representations, and that fastText outperforms the other word embeddings on the word-based corpus.
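The query-expansion idea evaluated in that work can be sketched as a nearest-neighbor lookup over an embedding table. The vocabulary, vectors, and function name below are hypothetical placeholders; an actual run would use vectors from a word2vec/GloVe/fastText model trained on the Amharic collections.

```python
import numpy as np

# Hypothetical embedding table; in practice these come from a trained
# word2vec/GloVe/fastText model over the Amharic corpus.
vocab = {
    "doctor":    np.array([0.90, 0.10, 0.10]),
    "physician": np.array([0.85, 0.15, 0.10]),
    "nurse":     np.array([0.70, 0.40, 0.10]),
    "river":     np.array([0.00, 0.10, 0.95]),
}

def expand_query(term, k=2):
    """Return the k vocabulary words nearest to `term` by cosine similarity."""
    q = vocab[term]
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    ranked = sorted((w for w in vocab if w != term),
                    key=lambda w: cos(q, vocab[w]), reverse=True)
    return ranked[:k]

print(expand_query("doctor"))  # ['physician', 'nurse']
```

The expansion terms are then appended to the original query before retrieval; the cited experiments compare how well this works over word-, stem-, and root-based indexing.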
Article
Full-text available
In this paper, we describe the preparation of a usable Amharic text corpus for different Natural Language Processing (NLP) applications. Natural language applications such as document classification, topic modeling, machine translation, speech recognition, and others suffer greatly from a lack of digital resources. This is especially true for Amharic, a resource-constrained, morphologically rich, and complex language. In response, a total of 67,739 Amharic news documents covering 8 different categories were collected from online sources. The collected corpus passed through a number of pre-processing steps, including data cleaning, text normalization, and punctuation correction. To validate the usability of the collected corpora from different domains, a baseline document classification experiment was conducted. Experimental results show that 84.53% accuracy is achieved using deep learning in the absence of linguistic information. Findings indicate that it is possible to use the prepared corpora for different natural language applications in the absence of linguistic resources such as a stemmer and dictionary, despite the complexity of the Amharic language. We are further working towards Amharic news document classification by incorporating language-independent stop-word detection, stemming, and unsupervised morphological segmentation of Amharic documents.
Article
Full-text available
It has become trivial to point out that algorithmic systems increasingly pervade the social sphere. Improved efficiency—the hallmark of these systems—drives their mass integration into day-to-day life. However, as a robust body of research in the area of algorithmic injustice shows, algorithmic systems, especially when used to sort and predict social outcomes, are not only inadequate but also perpetuate harm. In particular, a persistent and recurrent trend within the literature indicates that society's most vulnerable are disproportionally impacted. When algorithmic injustice and harm are brought to the fore, most of the solutions on offer (1) revolve around technical solutions and (2) do not center disproportionally impacted communities. This paper proposes a fundamental shift—from rational to relational—in thinking about personhood, data, justice, and everything in between, and places ethics as something that goes above and beyond technical solutions. Outlining the idea of ethics built on the foundations of relationality, this paper calls for a rethinking of justice and ethics as a set of broad, contingent, and fluid concepts and down-to-earth practices that are best viewed as a habit and not a mere methodology for data science. As such, this paper mainly offers critical examinations and reflection and not “solutions.”
Article
Full-text available
We introduce the Contemporary Amharic Corpus, which is automatically tagged for morpho-syntactic information. Texts were collected from 25,199 documents from different domains, and about 24 million orthographic words were tokenized. Since it is partly a web corpus, we applied some automatic spelling error correction. We also modified the existing morphological analyzer, HornMorpho, to use it for the automatic tagging.
Conference Paper
Full-text available
The world's indigenous languages and related cultural knowledge are under considerable threat of diminishing given the increasing expansion of the use of standard languages, particularly through the wide-ranging pervasion of digital media and machine readable editions of electronic resources. There is thus a pressing need to preserve and breathe life into traditional data resources containing both valuable linguistic and cultural knowledge. In this paper we demonstrate on the example of an Austrian non-standard language resource (DBÖ/dbo@ema), how the combined application of semantic modelling of cultural concepts and visual exploration tools are key in unlocking the indigenous knowledge system, traditional world views and valuable cultural content contained within this rich resource. The original data collection questionnaires serve as a pilot case study and initial access point to the entire collection. Set within a Digital Humanities context, the collaborative methodological approach described here acts as a demonstrator for opening up traditional/non-standard language resources for cultural content exploration through computing, ultimately giving access to, re-circulating and preserving otherwise lost immaterial cultural heritage.
Article
Full-text available
The blind application of machine learning runs the risk of amplifying biases present in data. Such a danger faces us with word embedding, a popular framework for representing text data as vectors that has been used in many machine learning and natural language processing tasks. We show that even word embeddings trained on Google News articles exhibit female/male gender stereotypes to a disturbing extent. This raises concerns because their widespread use, as we describe, often tends to amplify these biases. Geometrically, gender bias is first shown to be captured by a direction in the word embedding. Second, gender-neutral words are shown to be linearly separable from gender-definition words in the word embedding. Using these properties, we provide a methodology for modifying an embedding to remove gender stereotypes, such as the association between the words receptionist and female, while maintaining desired associations such as between the words queen and female. We define metrics to quantify both direct and indirect gender biases in embeddings, and develop algorithms to "debias" the embedding. Using crowd-worker evaluation as well as standard benchmarks, we empirically demonstrate that our algorithms significantly reduce gender bias in embeddings while preserving their useful properties, such as the ability to cluster related concepts and to solve analogy tasks. The resulting embeddings can be used in applications without amplifying gender bias.
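The geometric construction described there (a gender direction, plus projection-based direct bias) can be sketched as follows. The 2-d vectors are toy stand-ins, and averaging normalized pair differences is a simplification of the paper's PCA over several definitional pairs.

```python
import numpy as np

# Toy vectors for definitional gender pairs and a target word (all hypothetical).
she, he = np.array([0.2, 0.9]), np.array([0.9, 0.2])
woman, man = np.array([0.25, 0.85]), np.array([0.85, 0.25])
receptionist = np.array([0.3, 0.8])

# Gender direction: mean of normalized pair differences (a simplification of
# the paper's PCA over a set of definitional pairs such as she-he, woman-man).
diffs = [she - he, woman - man]
g = np.mean([d / np.linalg.norm(d) for d in diffs], axis=0)
g /= np.linalg.norm(g)

# Direct bias: scalar projection of the normalized word vector onto g.
bias = float(np.dot(receptionist / np.linalg.norm(receptionist), g))
print(bias > 0)  # True: leans toward the "she" end of the axis
```

Debiasing then amounts to subtracting this projection from gender-neutral words, which is the core of the hard-debiasing algorithm the abstract mentions.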
Article
Full-text available
Stemming is an important analysis step in a number of areas such as natural language processing (NLP), information retrieval (IR), machine translation (MT), and text classification. In this paper we present the development of a stemmer for Amharic that reduces words to their citation forms. Amharic is a Semitic language with rich and complex morphology. One application of such a stemmer is in dictionary-based cross-language IR, where the translation step requires looking up terms in a machine-readable dictionary (MRD). We apply a rule-based approach supplemented by occurrence statistics of words in an MRD and in a 3.1M-word news corpus. The main purpose of the statistical supplements is to resolve ambiguity between alternative segmentations. The stemmer is evaluated on Amharic text from two domains, news articles and a classic fiction text. It is shown to have an accuracy of 60% for the old-fashioned fiction text and 75% for the news articles.
Article
Continuous word representations, trained on large unlabeled corpora, are useful for many natural language processing tasks. Many popular models that learn such representations ignore the morphology of words by assigning a distinct vector to each word. This is a limitation, especially for morphologically rich languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skip-gram model in which each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations. Our method is fast, allowing models to be trained on large corpora quickly. We evaluate the obtained word representations on five different languages, on word similarity and analogy tasks.
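The bag-of-character-n-grams representation can be sketched as follows; `char_ngrams` is a hypothetical helper illustrating how subword units are extracted with boundary markers (the model then sums a learned vector per n-gram to form the word vector, which helps with morphologically rich languages like Amharic).

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers '<' and '>',
    in the style of the fastText subword model."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

# The word vector would be the sum of the learned vectors of these units.
print(char_ngrams("where", 3, 4))
```

Because rare or unseen words still share n-grams with frequent ones, this scheme yields usable vectors even for out-of-vocabulary word forms.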
Article
This study examines how linguistic sexism manifests through the lexicons of Afan Oromo, Amharic, and Gamo in light of the social and cultural lives of their speakers. The data for this study were collected from native speakers through elicitation and analyzed using a Critical Discourse Analysis approach. The study shows that, across the three languages, semantically asymmetric terms, metaphors of terms that denote human beings, the use of man/he as a generic, and administrative titles exhibit sexism. This has resulted from male dominance in the socio-cultural lives of the societies. The forms of linguistic sexism observed in this study are now conventions of the languages. Research shows that language conventions shape the way speakers think. Hence, it is believed that these sorts of linguistic sexism maintain the socio-culturally created gender bias ideologies of the societies. This scenario poses a challenge to the current gender mainstreaming endeavors of Ethiopia. Therefore, a thorough study should be carried out on these languages and the rest of the country's languages to help combat the broader gender inequality in Ethiopia. Keywords: linguistic sexism, Afan Oromo, Amharic, Gamo, male dominance.
References

Rehurek, R., Sojka, P.: Gensim - statistical semantics in Python
Buolamwini, J., Gebru, T.: Gender shades: intersectional accuracy disparities in commercial gender classification. In: Friedler, S.A., Wilson, C. (eds.) Proceedings of the 1st Conference on Fairness, Accountability and Transparency. Proceedings of Machine Learning Research, vol. 81, pp. 77-91. PMLR (2018)
Wairagala, E.P., et al.: Gender bias evaluation in Luganda-English machine translation. In: Duh, K., Guzmán, F. (eds.) Proceedings of the 15th Biennial Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), pp. 274-286. Association for Machine Translation in the Americas, Orlando (2022)
Abgaz, Y.: Using OntoLex-lemon for representing and interlinking lexicographic collections of Bavarian dialects. In: Ionov, M., McCrae, J.P., Chiarcos, C., Declerck, T., Bosque-Gil, J., Gracia, J. (eds.) Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020), pp. 61-69. European Language Resources Association, Marseille (2020)
Douglas, L.: AI is not just learning our biases; it is amplifying them (2017)
Hendricks, L.A., Burns, K., Saenko, K., Darrell, T., Rohrbach, A.: Women also snowboard: overcoming bias in captioning models. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) Computer Vision - ECCV 2018, pp. 793-811. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01219-9_47
Nair, S.: How biases in language get perpetuated by technology
Sakai, S., Suzuki, Y.: Evaluation of Gender Bias