Preprint

Wiki-Gendersort: Automatic gender detection using first names in Wikipedia


Abstract

Gender information is often absent from databases available to scholars, which hinders the proper framing, investigation, and answering of various gender-related research questions. To improve both data completeness and accuracy, various gender detection algorithms have been developed that infer gender from data already provided. Name-based algorithms are the simplest, yet effective, gender detection methods in use: they generate first-name-to-gender mapping tables from user records in a given dataset and then apply those tables in reverse to other databases for completion or validation purposes. The present research aims to develop a gender detection algorithm that focuses on detecting the gender of eponymous Wikipedia pages and to compare its performance with that of other well-known gender detection databases, using the author names indexed in the Web of Science.
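The name-based approach described in the abstract can be illustrated with a minimal sketch. This is not the Wiki-Gendersort implementation: the function names, the labels ("M", "F", "UNI", "UNK"), and the 0.9 agreement threshold are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def build_name_table(records, threshold=0.9):
    """Build a first-name-to-gender table from labeled (name, gender) records.

    A name is labeled "UNI" (unisex) when its majority gender covers less than
    `threshold` of its records. The threshold value is an assumption.
    """
    counts = defaultdict(Counter)
    for name, gender in records:
        counts[name.strip().lower()][gender] += 1

    table = {}
    for name, gender_counts in counts.items():
        gender, n = gender_counts.most_common(1)[0]
        share = n / sum(gender_counts.values())
        table[name] = gender if share >= threshold else "UNI"
    return table


def infer_gender(full_name, table):
    """Apply the table "in reverse": look up the first token of a full name."""
    tokens = full_name.strip().split()
    if not tokens:
        return "UNK"  # nothing to look up
    return table.get(tokens[0].lower(), "UNK")


# Hypothetical labeled records standing in for a source dataset.
records = [("Anna", "F"), ("Anna", "F"), ("John", "M"), ("Alex", "M"), ("Alex", "F")]
table = build_name_table(records)
```

With these toy records, `infer_gender("John Smith", table)` looks up `john` in the table, while a name absent from the source records falls back to "UNK".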


... The substantial proliferation of open datasets from public and private scientific endeavors offers new opportunities to empirically explore a myriad of gender-related research questions (Bérubé et al., 2020; VanHelene et al., 2024). However, to fully understand the potential effect of gender on different human perceptions, behavior or attitudes, research needs to first provide and then test the performance of clear-cut gender detection tools ... to the systematic testing of the stability of two versions of ChatGPT, the evaluation of performance under different prompting conditions, and the comparison of results with existing GDTs (Namsor and Gender-API). ...
... Finally, thus far, prior literature has mainly provided results on classifications, misclassifications, nonclassifications, errors, and gender bias of gender inferences (Bérubé et al., 2020; Sebo, 2022; VanHelene et al., 2024; Alexopoulos et al., 2023). However, beyond these relevant statistics, which are indeed very common in computational sciences and gender detection, prior literature has typically neglected the computation of inter-coder reliability measures to address the reliability of the gender predictions. ...
... conditions that test the wording, input and output of the inference. Also new relative to previous research on automatic gender inference (Bérubé et al., 2020; Sebo, 2021a, 2021b, 2022; VanHelene et al., 2024), our study implements a more stringent comparison (including inter-coder reliability statistics with Cohen's kappa and Krippendorff's alpha) to offer more robust empirical evidence and research suggestions that may resonate with future scholars interested in gender inference from names. Although Alexopoulos et al. (2023) compared ChatGPT to GDTs on a significantly larger dataset, they reported exclusively machine learning metrics which, in our opinion, are not appropriate for gender inference since for this classification problem there are no false positives or false negatives. ...
Article
The gender classification from names is crucial for uncovering a myriad of gender-related research questions. Traditionally, this has been automatically computed by gender detection tools (GDTs), which now face new industry players in the form of conversational bots like ChatGPT. This paper statistically tests the stability and performance of ChatGPT 3.5 Turbo and ChatGPT 4o for gender detection. It also compares two of the most used GDTs (Namsor and Gender-API) with ChatGPT using a dataset of 5,779 records compiled from previous studies for the most challenging variant, which is the gender inference from full name without providing any additional information. Results statistically show that ChatGPT is very stable presenting low standard deviation and tight confidence intervals for the same input, while it presents small differences in performance when prompt changes. ChatGPT slightly outperforms the other tools with an overall accuracy over 96%, although the difference is around 3% with both GDTs. When the probability returned by GDTs is factored in, differences get narrower and comparable in terms of inter-coder reliability and error coded. ChatGPT stands out in the reduced number of non-classifications (0% in most tests), which in combination with the other metrics analyzed, results in a solid alternative for gender inference. This paper contributes to current literature on gender detection classification from names by testing the stability and performance of the most used state-of-the-art AI tool, suggesting that the generative language model of ChatGPT provides a robust alternative to traditional gender application programming interfaces (APIs), yet GDTs (especially Namsor) should be considered for research-oriented purposes.
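Inter-coder reliability between two gender coders (for example, a tool and a human-coded reference) is commonly summarized with Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected by chance. The following is a minimal sketch of that standard formula, not the procedure used in the paper; the function name and the toy labels are assumptions.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must label the same, non-empty set of items")

    n = len(coder_a)
    # Observed agreement: share of items on which both coders agree.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        return 1.0  # both coders used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels from a reference coding and a gender detection tool.
reference = ["F", "F", "M", "M"]
tool = ["F", "M", "M", "M"]
kappa = cohens_kappa(reference, tool)
```

Here the coders agree on 3 of 4 items ($p_o = 0.75$) and chance agreement is $p_e = 0.5$, so $\kappa = 0.5$.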
... As a potential solution, computational sciences have provided robust tools to substantially address this challenge (Das & Paik, 2021; Fourkioti et al., 2019; Goyanes et al., 2022). Specifically, one of the most important procedures is the automatic computation of gender from names through gender detection application programming interfaces (APIs) (Bérubé et al., 2020; Sebo, 2021, 2022a). These algorithmic classifiers enable the computational coding of gender from names in big datasets (Bérubé et al., 2020), significantly reducing the time and human resources required for manual coding, while maintaining or even improving the quality and reliability compared to human coders. ...
... Specifically, one of the most important procedures is the automatic computation of gender from names through gender detection application programming interfaces (APIs) (Bérubé et al., 2020; Sebo, 2021, 2022a). These algorithmic classifiers enable the computational coding of gender from names in big datasets (Bérubé et al., 2020), significantly reducing the time and human resources required for manual coding, while maintaining or even improving the quality and reliability compared to human coders. Studies of computational coding of gender using tools like Namsor or Gender-API report a high degree of accuracy, particularly for Western names (Santamaría & Mihaljević, 2018; Sebo, 2021, 2022b). ...
Article
Computational social scientists and scientometric scholars interested in gender-related research questions need to classify the gender of observations. However, in most public and private databases, this information is typically unavailable, making it difficult to design studies aimed at understanding the role of gender in influencing citizens’ perceptions, attitudes, and behaviors. Against this backdrop, it is essential to design methodological procedures to infer gender automatically and computationally from data already provided, thus facilitating the exploration and examination of gender-related research questions or hypotheses. Researchers can use automatic gender detection tools like Namsor or Gender-API, which are already on the market. However, recent developments in conversational bots offer a new, still relatively underexplored, alternative. This study offers a step-by-step research guide, with relevant examples and detailed clarifications, to automatically classify gender from names through ChatGPT and two partially free gender detection tools (Namsor and Gender-API). In addition, the study provides methodological suggestions and recommendations on how to gather, interpret, and report results coming from both platforms. The study methodologically contributes to the scientometric literature by describing an easy-to-execute methodological procedure that enables the computational codification of gender from names. This procedure could be implemented by scholars without advanced computing skills.
... A total of 10,898 distinct authors/contributors are identified to have contributed to the 17 projects listed in Table I. To detect the gender of an author, we use Wiki-Gendersort [13]. Among the 10,898 contributors, 10,255 are identified as males, 477 as females, and the remaining 166 are classified as unisex or unknown. ...
... This algorithm is reported to have 97.07% accuracy in identifying genders in NamSor names [13]. The exclusion of 1.57% of contributors (Section III-A) classified as unisex or unknown should not have a significant impact on our results. ...
... Three tools fulfilled these conditions: Gender API (free up to 500 queries per month, inaccuracy rate 1.8%) [17], NamSor (free up to 5,000 queries per month, inaccuracy rate 2.0%) [18], and Wiki-Gendersort (completely free, inaccuracy rate 6.6%) [19]. For the three tools selected, the query response options were female, male, or unknown (gender could not be determined). ...
... Finally, Wiki-Gendersort [19] requires the installation of the module on the computer and then the use of a specific function (file_assign) to assign a gender to a list of names in TXT format. We used Spyder, an open-source integrated development environment for programming in Python. ...
Article
Objective: We recently showed that the gender detection tools NamSor, Gender API, and Wiki-Gendersort accurately predicted the gender of individuals with Western given names. Here, we aimed to evaluate the performance of these tools with Chinese given names in Pinyin format. Methods: We constructed two datasets for the purpose of the study. File #1 was created by randomly drawing 20,000 names from a gender-labeled database of 52,414 Chinese given names in Pinyin format. File #2, which contained 9,077 names, was created by removing from File #1 all unisex names that we were able to identify (i.e., those that were listed in the database as both male and female names). We recorded for both files the number of correct classifications (correct gender assigned to a name), misclassifications (wrong gender assigned to a name), and nonclassifications (no gender assigned). We then calculated the proportion of misclassifications and nonclassifications (errorCoded). Results: For File #1, errorCoded was 53% for NamSor, 65% for Gender API, and 90% for Wiki-Gendersort. For File #2, errorCoded was 43% for NamSor, 66% for Gender API, and 94% for Wiki-Gendersort. Conclusion: We found that all three gender detection tools inaccurately predicted the gender of individuals with Chinese given names in Pinyin format and therefore should not be used in this population.
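The errorCoded measure defined above, the proportion of misclassifications and nonclassifications among all names, can be computed as in the following minimal sketch. The function name, the label strings, and the toy data are assumptions for illustration; the "unknown" label stands for a nonclassification.

```python
def error_coded(truth, predicted, unknown="unknown"):
    """Return counts and errorCoded = (misclassified + nonclassified) / total."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("truth and predicted must be non-empty and equally long")

    # Nonclassifications: the tool assigned no gender.
    nonclassified = sum(p == unknown for p in predicted)
    # Misclassifications: the tool assigned a gender, but the wrong one.
    misclassified = sum(
        p != unknown and p != t for t, p in zip(truth, predicted)
    )
    return {
        "misclassified": misclassified,
        "nonclassified": nonclassified,
        "error_coded": (misclassified + nonclassified) / len(truth),
    }


# Hypothetical gold labels and tool output for four names.
truth = ["female", "male", "male", "female"]
predicted = ["female", "female", "unknown", "female"]
result = error_coded(truth, predicted)
```

In this toy case one name is misclassified and one is left unclassified, so errorCoded is 2/4 = 0.5.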
... We proposed various approaches for automatic gender prediction from a person's name, described in detail in Sections 5 and 6. At present, many academic and commercial investigators have been studying and exploring the automated identification of human names [9][10][11][12]. The literature review shows that many researchers at academic institutions and in industry have been working on NLP using AI, ML, and DL-based algorithms, but, to our knowledge, few research works address gender inference from names. ...
Article
The names of individuals have specific meaning and great significance. Names generally differ substantially by gender, and Bengali names in particular usually carry a strong gender identity. We can determine whether a stranger is a man or a woman based on their name with remarkably good precision. In this research, we conducted a thorough investigation into gender prediction based on a person's name using DL-based methods. While various techniques have been explored for the English language, there has been little progress in the Bengali language. We address this gap by presenting a large-scale experiment with 2,030 unique Bangladeshi names. We used both convolutional neural network (CNN)- and recurrent neural network (RNN)-based deep learning methods to infer gender from the Bangladeshi names in the Bengali language. We presented the one-dimensional CNN (Conv1D), simple long short-term memory (LSTM), bidirectional LSTM, stacked LSTM, and combined Conv1D and stacked bidirectional LSTM-based models and evaluated the performance of each scheme using our own dataset. Experimental results are analyzed on the basis of accuracy, precision, recall, F1-score, ROC AUC score, and loss performance metrics. The evaluation results show that Conv1D performs best, with 91.18% accuracy, which is likely to improve as the size of the training data grows.
... Several methods exist to achieve this, including the census and national research databases such as the US Social Security Administration [46], the US baby names website [40], Wikipedia name lists, scraping or manual retrieval from websites, and applying face recognition software to web images. Software packages to undertake gender name disambiguation, including ropensci/gender [47] and the Wiki-Gendersort algorithm [48], do not successfully analyse unisex or non-binary gender names and non-Western names, thus often excluding research by authors with these names, leading to incomplete analysis. For example, longitudinal research by Huang, Gates, Sinatra and Barabási [49] uses a commercial software package Genderize.io ...
Article
In this article, we ask whether dominant narratives of gender and performance within academic institutions are masking stories that may be both more complex and potentially more hopeful than those which are often told using publication-related data. Influenced by world university rankings, institutions emphasise so-called ‘excellent’ research practices: publish in ‘high impact’, elite subscription journals indexed by the commercial bibliographic databases that inform the various ranking systems. In particular, we ask whether data relating to institutional demographics and open access publications could support a different story about the roles that women are playing as pioneers and practitioners of open scholarship. We review gender bias in scholarly publications and discuss examples of open access research publications that highlight a positive advantage for women. Using analysis of workforce demographics and open research data from our Open Knowledge Initiative project, we explore relationships and correlations between academic gender and open access research output from universities in Australia and the United Kingdom. This opens a conversation about different possibilities and models for exploring research output by gender and changing the dominant narrative of deficit in academic publishing.
Article
This paper is the first to analyse the role of women authors in fostering justice-relevant topics in climate adaptation research. As representation, citation and payment patterns remain gender-biased across scientific disciplines, we explore the case of climate science, particularly adaptation, as its most human-oriented facet. In climate research and policy, there has been a recent surge of interest in climate justice topics: mentions of justice have increased almost tenfold in Intergovernmental Panel on Climate Change Working Group 2 reports between the latest assessment cycles (AR5 and AR6). We conduct a systematic examination of the topic space in the adaptation policy scholarship. As it is a vast and rapidly growing field, we use topic modelling, an unsupervised machine learning method, to identify the literature on climate justice and related fields, as well as to examine the relationship between topic prevalence and the gender of the authors. We find climate change adaptation policy research to be male dominated, with women holding 38.8% of first and 28.8% of last authorships. However, we observe topic-specific variability, whereby the share of female authors is higher among publications on justice-relevant topics. Female authorship is highly linked to topics such as Community, Local Knowledge, and Governance, but less to Food Security and Climate Finance. Our findings corroborate the evidence that female authors play a significant role in advancing the research and dialogue on the relationship between climate change and areas that have meaningful impact on lives of women and other marginalised groups.
Article
The hierarchical influences model attributes significant influence to journalists over their news content. A quantitative content analysis was conducted to examine the impact of journalists’ gender and reporting experience on the media delegitimization of the trans community in Spain. The results emphasize their minimal impact. Instead, media attention and delegitimization of the trans community appear to be shaped by the prevailing value system cultivated in newsrooms, personal attitudes, values, and beliefs and journalistic role performance. These findings underscore the significance of theoretical knowledge over practical experience in the news coverage of specific news topics.
Article
Objectives: Although diversity has been demonstrated to benefit research groups, women remain underrepresented in most scientific disciplines, including Laboratory Medicine and Clinical Chemistry. In order to promote diversity and equality in scientific communities, understanding the gender distribution of authorship is crucial. Methods: This study included a total of 30,268 Web of Science-listed Clinical Chemistry and Laboratory Medicine publications from the United States of America, Canada, and the member countries of the European Federation of Clinical Chemistry and Laboratory Medicine from 2005 to 2022. In addition to the publication productivity of female and male authors over time, gender-specific publication characteristics and country-specific gender distributions of authorships were examined. Results: Overall, publications with female first authors increased by 49 % between 2005 and 2022, averaging 42 % female first authors. Eastern Europe (60 %) and Southern Europe (51 %) had particularly high proportions of female first authors. While female last authorship was the most predictive of female first authorship, with an odds ratio of 2.01 (95 % CI: 1.91–2.12, p < 0.001), only 27 % of last authors were female. Moreover, citation rate was not predictive of female first or last authorship. Conclusion: Authorship in Clinical Chemistry and Laboratory Medicine is moving towards gender parity. This trend is more pronounced for first authors than for last authors. Further research into the citations of female authors in this discipline could be a starting point for increasing the visibility of women researchers in science. Moreover, geographical differences may provide opportunities for future research on gender parity across disciplines.
Article
Film festivals are a key component in the global film industry in terms of trendsetting, publicity, trade, and collaboration. We present an unprecedented analysis of the international film festival circuit, which has so far remained relatively understudied quantitatively, partly due to the limited availability of suitable data sets. We use large-scale data from the Cinando platform of the Cannes Film Market, widely used by industry professionals. We explicitly model festival events as a global network connected by shared films and quantify festivals as aggregates of the metadata of their showcased films. Importantly, we argue against using simple count distributions for discrete labels such as language or production country, as such categories are typically not equidistant. Rather, we propose embedding them in continuous latent vector spaces. We demonstrate how these “festival embeddings” provide insight into changes in programmed content over time, predict festival connections, and can be used to measure diversity in film festival programming across various cultural, social, and geographical variables—which all constitute an aspect of public value creation by film festivals. Our results provide a novel mapping of the film festival circuit between 2009–2021 (616 festivals, 31,989 unique films), highlighting festival types that occupy specific niches, diverse series, and those that evolve over time. We also discuss how these quantitative findings fit into media studies and research on public value creation by cultural industries. With festivals occupying a central position in the film industry, investigations into the data they generate hold opportunities for researchers to better understand industry dynamics and cultural impact, and for organizers, policymakers, and industry actors to make more informed, data-driven decisions. 
We hope our proposed methodological approach to festival data paves the way for more comprehensive film festival studies and large-scale quantitative cultural event analytics in general.
Article
PURPOSE Clinical trials provide valuable evidence for managing urologic malignancies. Early termination of clinical trials is associated with a waste of resources and may substantially affect patient care. We sought to study the termination rate of urologic cancer clinical trials and identify factors associated with trial termination. METHODS A cross-sectional search of ClinicalTrials.gov identified completed and terminated kidney, prostate, and bladder cancer clinical trials started. Trials were assessed for reasons for termination. Multivariable analyses were conducted to determine the significant factors associated with the termination. RESULTS Between 2000 and 2020, 9,145 oncology clinical trials were conducted, of which 11.30% (n = 1,033) were urologic cancer clinical trials. Of the urologic cancer clinical trials, 25.38% (n = 265) were terminated, with low patient accrual being the most common reason for termination, 52.9% (n = 127). Multivariable analysis showed that only a university funding source (odds ratio [OR], 2.20; 95% CI, 1.45 to 3.32), a single-center design (OR, 2.11; 95% CI, 1.59 to 2.81), and a sample size of <50 (OR, 5.26; 95% CI, 3.85 to 7.69) were significant predictors of clinical trial termination; all P values were <.001. CONCLUSION The termination rate of urologic cancer clinical trials was 25%, with low accrual being the most frequently reported reason. Trials funded by a university, single-center trials, and small trials (sample size <50) were associated with early termination. A better understanding of these factors might help researchers, funding agencies, and other stakeholders prioritize resource allocations for multicenter trials that aim to recruit a sufficient number of patients.
Article
The current trends and challenges in the field of bibliometrics are reviewed. To do so, we take the reader along a bibliometric route with six stations: the explosion of databases, the inflation of metrics, its relationship to Data Science, searching for meaning, evaluative bibliometrics, and diversity and profession. This evaluation encompasses three dimensions of the bibliometrics field regarding research evaluation: the technological, the theoretical, and the social. Finally, we advocate for the principles of an evaluative bibliometrics, balancing the power of metrics with expert judgment and science policy.
Article
The rising usage of social media has motivated the invention of different methods of anonymous writing, which has led to an increase in malicious and suspicious activities. This anonymity has made it difficult to find the suspect. Author profiling deals with the characterization of an author through some key attributes such as gender, age, language, dialect region variety, personality, and so on. Identifying the gender of the author of a suspect document is a salient task of author profiling. The linguistic profile of a user can help in determining his/her demographics. Different social media platforms, such as Twitter, Facebook, and Instagram, are used regularly by users for sharing their daily life activities. Moreover, users often post images along with text on different social media platforms; thus, the usage of multimodal information is very common nowadays. In this article, the task of automatic gender prediction from multimodal Twitter data is posed as a classification problem and an efficient multimodal neural framework is proposed for solving it. The popular BERT_base is utilized for learning the encoded representation of the text part of the tweet, and the recently introduced EfficientNet is used for extracting the features from images. Finally, a direct product-based fusion strategy is applied for fusing the text and image representations, followed by a fully connected layer for predicting the gender of a Twitter user. Plagiarism detection authorship analysis near end duplicate detection (PAN)-2018 author profiling data are used for evaluating the performance of our proposed approach. Our proposed model achieved accuracies of 82.05%, 86.22%, and 89.53% for the pure-image, pure-text, and multimodal settings, respectively, outperforming the previous state-of-the-art works in all cases. 
Moreover, a deep analysis is carried out to interpret the produced results; different words that serve as clues for gender classification are identified characterizing different gender classes. The supplementary file and the source codes for the proposed approach are available at https://github.com/chanchalIITP/GenderTCSS .