Inferring gender from names in any region, language, or alphabet
Elian Carsenat <elian.carsenat@namsor.com>
NamSor, 2019
This paper describes the NamSor API v2 method for name gender classification.
NamSor API Version 2.0.7
Contents
Introduction
Our case for algorithmic transparency
Getting started
Naïve Bayes Algorithm for Name Gender Inference
  Personal Name Features
  Training Data Set Selection & Reinforced Learning
  Score and Calibrated Probabilities based on a Validation Data Set
  General purpose Calibration Set
Software Changes
Known Issues, future enhancements
Introduction
Figure 1 - Identifying the gender of PCT inventors (WIPO, 2016)
Inferring the gender of names has numerous uses in research, business, and public services. Among the different methods that are commonly used for gender attribution, name inference presents several advantages:
- it can be applied retroactively to any database that contains personal names;
- it doesn't rely on any secondary sources;
- it is fast;
- it is cost efficient.
NamSor’s goal is to provide a reliable name gender prediction model with global coverage (for all
countries, all languages, all alphabets, etc.).
To overcome the challenges of traditional methods (name gender semantics, name gender dictionaries) and specifically reduce the impact of migration trends on gender inference, NamSor can interpret the full name (e.g. Mary Smith; John W. Smith; Smith, Mary; 晓明) or both first and last names supplied separately (Mary|Smith; John|Smith; 晓明|).
NamSor also includes API endpoints to infer the origin or ethnicity of individuals based on their names, and these developments reinforce the quality of our gender estimation. A given name like Andrea may be male or female depending on the cultural context. The cultural context is inferred from the surname combined with the first name, and there is also an optional geographic context (country ISO2 code). In some cultures, a surname ending may change depending on gender, so the API automatically recognizes whether gender can be inferred from the first name (e.g., Carl) or the last name (e.g., Sokolova).
NamSor uses extensive machine learning techniques (supervised and unsupervised), and we also collaborate with linguists, geographers, anthropologists, and historians to increase our models' accuracy in various cultural contexts.
Our case for algorithmic transparency
NamSor has been used by many universities and research organizations worldwide, and we are periodically asked to be more transparent about the algorithms we use and our model's accuracy.
Since Artificial Intelligence became a hype topic, there have been several controversies about AI's gender and racial biases. It is, in fact, very difficult for an A.I. not to learn gender, ethnic, or racial biases.
NamSor is designed to maximize gender, ethnic, or racial biases for applications where such information is desired (data mining, gender equality studies, human geography, analysis of urban segregation, etc.), and we see a business opportunity for NamSor to assess other A.I.s' gender, ethnic, or racial biases for applications where such biases are neither desired nor legal (human resources / recruitment, credit rating, etc.).
We believe that being transparent about our algorithms helps build trust in our capability to assess other A.I.s' gender, ethnic, or racial biases. Consequently, in 2019 we launched a new version, NamSor API v2, with the objective of being more transparent about the algorithms.
As a company, we also separately maintain two distinct groups of algorithms:
- NamSor v1 Core, our historical software (2012-2018), which still has some interesting features for unsupervised machine learning (clustering names by ethnic / linguistic groups);
- NamSor ML, a deep learning framework for processing international names using word embeddings (FastText or Word2Vec pre-trained models).
Getting started
You can get an API Key and get started appending gender to a few names, using the online API documentation and its interactive 'Try it out' feature:
https://v2.namsor.com/NamSorAPIv2/apidoc.html#/personal/genderGeo
For processing large volumes of names, different options exist:
- integrating the REST API directly, based on its OpenAPI v3 documentation here,
https://v2.namsor.com/NamSorAPIv2/api2/openapi.json
- using one of the SDKs for Python, Java, R or PHP (all generated by OpenAPI Generator, an open-source project sponsored by NamSor):
https://github.com/namsor/namsor-python-sdk2
https://github.com/namsor/namsor-java-sdk2
https://github.com/namsor/namsor-r-sdk2
https://github.com/namsor/namsor-php-sdk2
- using one of the Python or Java command-line tools:
https://github.com/namsor/namsor-python-tools-v2
https://github.com/namsor/namsor-java-sdk2
For example, to use the Python command-line tool to append gender to a list of records containing an id, first name, last name, and geographic context (e.g., id12|John|Smith|US):
python namsor_tools.py -apiKey <yourAPIKey> -w -header -uid -f fnlngeo -i samples/some_idfnlngeo.txt -service gender
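For a direct programmatic call, a minimal Python sketch might look as follows (assumptions: the JSON path /api2/json/genderGeo/{firstName}/{lastName}/{countryIso2} and the X-API-KEY header are inferred from the OpenAPI documentation linked above and should be verified against it; the requests library is used for brevity):

import requests

API_KEY = "<yourAPIKey>"  # placeholder, replace with your own key
BASE_URL = "https://v2.namsor.com/NamSorAPIv2/api2/json"

def gender_geo(first_name, last_name, country_iso2):
    # Query the genderGeo endpoint for one name in a geographic context.
    url = "{}/genderGeo/{}/{}/{}".format(BASE_URL, first_name, last_name, country_iso2)
    response = requests.get(url, headers={"X-API-KEY": API_KEY})
    response.raise_for_status()
    return response.json()

result = gender_geo("John", "Smith", "US")
print(result.get("likelyGender"), result.get("probabilityCalibrated"))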
Naïve Bayes Algorithm for Name Gender Inference
NamSor API v2 uses Naïve Bayes Classifiers, a class of algorithms which is excellent at classification.
Personal Name Features
The Naïve Bayes Classifier works by associating the name's features with a likely gender. Our training set will have the following information:
Elena|Sokolova|female
Andrey|Sokolov|male
For example, let’s look at some of the features we extract from the name Elena Sokolova:
Example of Name Feature | Description | Empirical knowledge
#FN_elena | The first name is Elena. | Elena is usually female.
#FN_*ena$ | The first name ends with ena. | In many cultures, a name ending with ena, na, or a is generally female.
#LN_sokolova | The last name is Sokolova. | Sokolova sounds Eastern European / Slavic.
#LNSt_Orig_RU | The last name is most frequent in Russia, according to our global surnames dictionary. | Sokolova is more like a Russian name than any other name.
#LNSt_Orig_L10_8_RU | The last name is frequent in Russia relative to other names, according to our global surnames dictionary. | Sokolova looks like a common Russian name.
#LN_*ova$ | The last name ends with ova. | Russian names ending with ova are female names.
#FnSt_Orig_Gen_female | The first name Elena in Russia is more likely a female name, according to our Russian given-names dictionary. | Elena is a female Russian name.
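To make the feature table concrete, here is an illustrative Python sketch of how such features could be derived from a first and last name (the real NamSor feature set is richer and its dictionaries are internal; the surname-origin lookup below is a toy assumption):

# Illustrative sketch of name-feature extraction, loosely following the
# feature names in the table above. The dictionary below is a toy assumption.
TOY_SURNAME_ORIGIN = {"sokolova": "RU", "sokolov": "RU", "smith": "GB"}

def extract_features(first_name, last_name):
    fn = first_name.lower()
    ln = last_name.lower()
    features = []
    if fn:
        features.append("#FN_" + fn)               # full given name
        features.append("#FN_*" + fn[-3:] + "$")   # given-name ending (last 3 letters)
    if ln:
        features.append("#LN_" + ln)               # full surname
        features.append("#LN_*" + ln[-3:] + "$")   # surname ending (last 3 letters)
        origin = TOY_SURNAME_ORIGIN.get(ln)
        if origin:
            features.append("#LNSt_Orig_" + origin)  # most likely surname origin
    return features

print(extract_features("Elena", "Sokolova"))
# ['#FN_elena', '#FN_*ena$', '#LN_sokolova', '#LN_*ova$', '#LNSt_Orig_RU']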
The computer doesn’t have any of our empirical knowledge about the name Elena Sokolova, but given
enough training data, it will learn how to classify names:
- sometimes trusting the ‘baby name’ dictionary of the given location,
- sometimes trusting the ‘baby name’ dictionary of the country of origin,
- sometimes using given names morphology instead,
- sometimes using surnames morphology instead.
Part of our model’s accuracy comes from carefully selecting name features.
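As an illustration of how a Naïve Bayes classifier combines such features, here is a toy sketch (not the production algorithm; the per-feature likelihoods are invented for the example) that accumulates evidence in log space:

# Toy Naïve Bayes gender scoring over name features (illustrative only).
# P(feature | gender) would normally be estimated from the training data;
# here the values are invented to show the mechanics.
import math

TOY_LIKELIHOODS = {
    # feature: (P(feature | female), P(feature | male))
    "#FN_elena":     (0.0020, 0.0001),
    "#FN_*ena$":     (0.0150, 0.0020),
    "#LN_*ova$":     (0.0300, 0.0030),
    "#LNSt_Orig_RU": (0.0100, 0.0100),  # origin alone says nothing about gender
}

def naive_bayes_gender(features, prior_female=0.5):
    # Return (likely gender, absolute log-likelihood margin) for a feature list.
    log_female = math.log(prior_female)
    log_male = math.log(1.0 - prior_female)
    for f in features:
        p_f, p_m = TOY_LIKELIHOODS.get(f, (0.5, 0.5))  # unknown features are neutral
        log_female += math.log(p_f)
        log_male += math.log(p_m)
    gender = "female" if log_female > log_male else "male"
    return gender, abs(log_female - log_male)

print(naive_bayes_gender(["#FN_elena", "#FN_*ena$", "#LN_*ova$", "#LNSt_Orig_RU"]))
# ('female', ...) with a large margin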
Training Data Set Selection & Reinforced Learning
We use open data with full names and gender information to pre-train our model. Once the model is pre-trained, the API learns on its own by assessing the quality of new data submitted to it.
As an example, we might use the European Union’s Directory as a training set
(https://publications.europa.eu/fr/web/who-is-who), after validating that the title and gender
information is correct. From that document, the model will learn that Jean-Claude JUNCKER is
associated with title M. and is likely a male name. Parsing names is often a critical step in gender
inference, so it will also learn from the typographic convention that JUNCKER is the surname and Jean-
Claude is the given name.
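That particular convention can be sketched as follows (an illustrative simplification, assuming the surname is the token written entirely in upper case, as in the EU Directory; the real parser handles many more conventions and scripts):

# Sketch: split a full name using the convention that the surname is the
# token(s) written entirely in upper case (as in "Jean-Claude JUNCKER").
def split_by_uppercase_surname(full_name):
    tokens = full_name.split()
    surname = [t for t in tokens if t.isupper()]
    given = [t for t in tokens if not t.isupper()]
    if surname and given:
        return " ".join(given), " ".join(surname)
    return None  # convention does not apply; fall back to a trained parser

print(split_by_uppercase_surname("Jean-Claude JUNCKER"))
# ('Jean-Claude', 'JUNCKER')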
Part of our model’s accuracy comes from carefully selecting quality open data sources and validating
them based on our prior knowledge.
Once the model is trained, the API learns on its own. For example:
- User A infers gender from gender/Jean-Claude/JUNCKER;
- User B infers gender from gender/Elena/Sokolova;
- User C needs to parse a full name into first/last name parts: parse/elena juncker or parse/juncker elena.
Information from User A and User B will further train the name parser, making it more likely for User C to obtain a correct parse of a new name such as Elena JUNCKER.
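A toy sketch of this feedback loop (illustrative only; the production system uses much richer statistics) is to count, from gender queries, which tokens are used as given names and which as surnames, and to reuse those counts when parsing an ambiguous full name:

# Toy sketch of the feedback loop between gender queries and the name parser.
from collections import Counter

first_name_counts = Counter()
last_name_counts = Counter()

def observe_gender_query(first_name, last_name):
    # Each gender/{first}/{last} call tells us which token plays which role.
    first_name_counts[first_name.lower()] += 1
    last_name_counts[last_name.lower()] += 1

def parse_full_name(token_a, token_b):
    # Order two tokens as (given name, surname) using the observed counts.
    score_ab = first_name_counts[token_a.lower()] + last_name_counts[token_b.lower()]
    score_ba = first_name_counts[token_b.lower()] + last_name_counts[token_a.lower()]
    return (token_a, token_b) if score_ab >= score_ba else (token_b, token_a)

observe_gender_query("Jean-Claude", "JUNCKER")   # User A
observe_gender_query("Elena", "Sokolova")        # User B
print(parse_full_name("juncker", "elena"))       # User C -> ('elena', 'juncker')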
Score and Calibrated Probabilities based on a Validation Data Set
NamSor API v2 uses Naïve Bayes Classifiers, a class of algorithms that is excellent at classification but poor at estimating probabilities. The raw probabilities are good for ranking, and the algorithm performs very well as a classifier, but in real-world applications they are not good estimates of the actual probabilities (the article 'On the Optimality of the Simple Bayesian Classifier under Zero-One Loss', https://link.springer.com/article/10.1023%2FA%3A1007413511361?LI=true, gives a rigorous mathematical explanation of this).
So instead of probabilities, each classifier will output a SCORE, which is based on the relative probability
between the predicted value and the second-best alternative.
SCORE = Log(P(best) / P(second))
All names having a similar score have a similar probability of being in the given class.
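For example, with illustrative probabilities P(best) = 0.98 and P(second) = 0.02, the score is log(0.98 / 0.02) ≈ 3.9 (using the natural logarithm; the base only changes the scale, not the ranking). As a one-line sketch:

import math

def score(p_best, p_second):
    # Relative-probability score between the best and second-best class.
    return math.log(p_best / p_second)

print(round(score(0.98, 0.02), 2))  # 3.89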
We take one additional validation step to calibrate the classifier and normalize the score to a [0-1]
range, which can be interpreted directly as a probability. We take this definition of a calibrated classifier
from the article 'Transforming Classifier Scores into Accurate Multiclass Probability Estimates' :
"This classifier is said to be well calibrated if the empirical class membership probability P converges to
the score value s." "Intuitively, if we consider all the examples to which a classifier assigns a score s(x) =
0.8, then 80% of these examples should be members of the class in question. Calibration is important if
we want the scores to be directly interpretable as the chances of membership in the class."
https://www.dropbox.com/s/n8em807cacbey7f/10.1.1.13.7457.pdf
For the calibration, we classify a validation data set of 10,000 names with a known gender and look at the scores. This allows grouping the names by decile into 10 buckets of 1,000 names having similar scores.
How many names in the first decile (scores around 9.67) have been correctly classified as male or female? In this example, 889 names are correctly classified and 111 names are incorrectly classified:
percentile | score | ok (correct) | ko (incorrect) | precision
10 | 9.67 | 889 | 111 | 0.89
20 | 13.89 | 999 | 1 | 1.00
30 | 16.81 | 1000 | 0 | 1.00
40 | 19.22 | 1000 | 0 | 1.00
50 | 21.36 | 1000 | 0 | 1.00
60 | 23.54 | 1000 | 0 | 1.00
70 | 26.14 | 1000 | 0 | 1.00
80 | 29.08 | 1000 | 0 | 1.00
90 | 33.77 | 1000 | 0 | 1.00
100 | 52.92 | 1000 | 0 | 1.00
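A sketch of how such a table can be produced from a labeled validation set (assuming a list of (score, correct) pairs and simple decile buckets; illustrative only):

def decile_precision_table(validation):
    # validation: list of (score, was_correctly_classified) pairs.
    data = sorted(validation)              # sort by ascending score
    bucket_size = len(data) // 10
    rows = []
    for i in range(10):
        bucket = data[i * bucket_size:(i + 1) * bucket_size]
        ok = sum(1 for _, correct in bucket if correct)
        ko = len(bucket) - ok
        top_score = bucket[-1][0]          # highest score in the decile
        rows.append(((i + 1) * 10, round(top_score, 2), ok, ko, round(ok / len(bucket), 2)))
    return rows                            # (percentile, score, ok, ko, precision)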
To complete our score-to-probability function, we add the theoretical values:
- for the minimum score 0, we associate a probability of 0.5, corresponding to a random choice between male and female;
- for the maximum score 100, we associate a probability of 1, corresponding to perfect precision.
So, our calibration table is:
Score | Probability
0.00 | 0.50
9.67 | 0.89
13.89 | 1.00
16.81 | 1.00
19.22 | 1.00
21.36 | 1.00
23.54 | 1.00
26.14 | 1.00
29.08 | 1.00
33.77 | 1.00
Based on this table, the classifier can interpolate from the score and return a calibrated probability:
gender/John/Smith
{
"firstName": "John",
"lastName": "Smith",
"likelyGender": "male",
"genderScale": -1,
"score": 30.792681810673518,
"probabilityCalibrated": 1
}
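A sketch of that interpolation step, using the calibration table above plus the theoretical end point (assuming plain linear interpolation between calibration points; the production calibration may differ in detail):

# Sketch: map a raw score to a calibrated probability by linear interpolation
# over the calibration table above (plus the theoretical maximum).
CALIBRATION = [
    (0.00, 0.50), (9.67, 0.89), (13.89, 1.00), (16.81, 1.00), (19.22, 1.00),
    (21.36, 1.00), (23.54, 1.00), (26.14, 1.00), (29.08, 1.00), (33.77, 1.00),
    (100.0, 1.00),  # theoretical maximum
]

def probability_calibrated(score):
    if score <= CALIBRATION[0][0]:
        return CALIBRATION[0][1]
    for (s0, p0), (s1, p1) in zip(CALIBRATION, CALIBRATION[1:]):
        if score <= s1:
            return p0 + (p1 - p0) * (score - s0) / (s1 - s0)
    return CALIBRATION[-1][1]

print(round(probability_calibrated(30.79), 3))  # 1.0 for the John Smith example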
Based on the examples above, we could have the following values:
First name | Last name | Likely Gender | Score | Probability | Empirical knowledge
John | Smith | Male | 30.79 | 1 | John is male.
Mary | Smith | Female | 25.54 | 1 | Mary is female.
(empty) | Smith | Female | 3.45 | 0.53 | Gender can't be guessed without a given name.
(empty) | Sokolova | Female | 6.64 | 0.71 | Gender can be guessed from the surname.
(empty) | Sokolov | Male | 13.65 | 0.99 | Gender can be guessed from the surname.
Dominique | Martin | Female | 4.94 | 0.61 | Dominique is quite ambiguous as a French name.
Jing | Cao | Female | 3.42 | 0.53 | Romanized Chinese names can be quite gender ambiguous.
General purpose Calibration Set
NamSor API v2 initially comes with a single calibration set, 'US_Boston', a large sample of names from the US city of Boston with a known gender. This sample is general purpose because it is quite diverse in terms of name origins and ethnicities. In the future, we want to add more specific calibration sets (Olympic athletes for international names; the EU Directory for European names; etc.).
This is what the validation looks like when the 'US_Boston' calibration is applied to a larger validation set:
For 10% of the sample, the score is below 12.08. Gender is correctly predicted in 27,487 occurrences and incorrectly predicted in 10,417 occurrences. The precision for this decile is 0.73.
For the 10% of the sample with a score between 12.08 and 18.52, the precision is 0.91 (and so on).
The value interpolated from the score is returned as a calibrated probability (probabilityCalibrated).
In order to estimate the overall precision of the API for a given probability cut-off value, we can use the same data:
For the 10% of the sample with a score higher than 79.49, gender is correctly predicted in 37,905 occurrences and incorrectly predicted in 169 occurrences. The precision for this decile is 0.996, but we have excluded 90% of the original data, so the recall is only 10%.
For the 20% of the sample with a score higher than 40.71, gender is correctly predicted in 75,395 occurrences and incorrectly predicted in 414 occurrences. The precision for those 20% is 0.995, but we have excluded 80% of the original data, so the recall is only 20%.
An optimal value for most research is to exclude the 10% of the original data that has the lowest score: if we only consider results where the calibrated probability is higher than 0.73, we should keep about 90% of the original data with a precision higher than 0.976.
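In practice this means filtering the API results on probabilityCalibrated and accepting the corresponding precision/recall trade-off; a minimal sketch (assuming a list of result dictionaries shaped like the JSON response shown earlier):

# Sketch: keep only predictions above a calibrated-probability cut-off and
# report the resulting recall (fraction of names kept).
def filter_by_probability(results, cutoff=0.73):
    kept = [r for r in results if r.get("probabilityCalibrated", 0.0) >= cutoff]
    recall = len(kept) / len(results) if results else 0.0
    return kept, recall

# Example: with cutoff=0.73 the text above expects roughly 90% of the data
# to be kept, with precision above 0.976 on the US_Boston validation set.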
Software Changes
NamSor API v2.0.7 | The probability calibration is no longer based on the score, but on the probability estimates.
NamSor API v2.0.6 | Fixed an issue where the probability could fall between 0.33 and 0.5; with a low score, the probability should be 0.5 (corresponding to randomly choosing male/female).
Known Issues, future enhancements
The calibration dataset should be configurable.
https://github.com/namsor/namsor-tools-v2/issues/7