All content in this area was uploaded by Elian Carsenat on Nov 26, 2019
Inferring gender from names in any region, language, or alphabet
Elian Carsenat <elian.carsenat@namsor.com>
NamSor, 2019
This paper describes the NamSor API v2 method for name gender classification.
NamSor API Version 2.0.7
Contents
- Inferring gender from names in any region, language, or alphabet
- Introduction
- Our case for algorithmic transparency
- Getting started
- Naïve Bayes Algorithm for Name Gender Inference
- Personal Name Features
- Training Data Set Selection & Reinforced Learning
- Score and Calibrated Probabilities based on a Validation Data Set
- General purpose Calibration Set
- Software Changes
- Known Issues, future enhancements
Introduction
Figure 1 - Identifying the gender of PCT inventors (WIPO, 2016)
Inferring the gender of names has numerous uses in research, business, and public services. Among the
different methods that are commonly used for gender attribution, name inference presents several
advantages:
• It can be applied retroactively to any database that contains personal names;
• It doesn’t rely on any secondary sources;
• It is fast, and;
• It is cost efficient.
NamSor’s goal is to provide a reliable name gender prediction model with global coverage (for all
countries, all languages, all alphabets, etc.).
To overcome the challenges of traditional methods (name gender semantics, name gender dictionaries) and specifically reduce the impact of migration trends on gender inference, NamSor can interpret the full name (e.g. Mary Smith, John W. Smith, Smith; Mary, 王晓明) or both first & last names (Mary|Smith; John|Smith, 晓明|王).
NamSor also includes API endpoints to infer the origin or ethnicity of individuals based on their names,
and these developments reinforce the quality of our gender estimation. A given name like Andrea may
be male or female depending on the cultural context. The cultural context is inferred using the surname
combined with the first name and there is also an optional geographic context (country ISO2 code). In
some cultures, a surname ending may change depending on gender, so the API automatically recognizes
if gender can be inferred from the first name (e.g., Carl) or the last name (e.g., Sokolova).
NamSor uses extensive machine learning techniques (supervised and unsupervised), and we also collaborate with linguists, geographers, anthropologists, and historians to increase our model’s accuracy in various cultural contexts.
Our case for algorithmic transparency
NamSor has been used by many Universities and Research Organizations worldwide and we are
periodically asked to be more transparent on the algorithms we use and our model’s accuracy.
Since Artificial Intelligence became hyped, there have been several controversies about AI’s gender and racial biases. It is, in fact, very difficult for an A.I. not to learn gender, ethnic, or racial biases.
NamSor is deliberately designed to capture gender, ethnic, or racial signals for applications where such information
is desired (data mining, gender equality studies, human geography, analysis of urban segregation, etc.),
and we see a business opportunity for NamSor to assess other A.I.’s gender, ethnic, or racial biases for
applications where such biases are neither desired nor legal (human resources / recruitment, credit
rating, …).
We believe being transparent on our algorithms can be useful to build trust in our capability to assess
other A.I.’s gender, ethnic, or racial biases. Consequently, in 2019, we have launched a new version of
NamSor API v2 with the objective of being more transparent on the algorithms.
As a company, we also separately maintain two distinct groups of algorithms:
- NamSor v1 Core, our historical software (2012-2018), which still has some interesting features for unsupervised machine learning (clustering names by ethnic / linguistic groups);
- NamSor ML, a deep learning framework for processing international names using word embedding
(FastText or Word2Vec pre-trained models).
Getting started
You can get an API key and start appending gender to a few names using the online API documentation and its interactive ‘Try it out’ feature:
https://v2.namsor.com/NamSorAPIv2/apidoc.html#/personal/genderGeo
For processing large volumes of names, different options exist:
- integrating the REST API directly, based on its OpenAPI v3 documentation here,
https://v2.namsor.com/NamSorAPIv2/api2/openapi.json
- using one of the SDKs for Python, Java, R or PHP (all are generated by Openapi-generator, an
opensource project sponsored by NamSor),
https://github.com/namsor/namsor-python-sdk2
https://github.com/namsor/namsor-java-sdk2
https://github.com/namsor/namsor-r-sdk2
https://github.com/namsor/namsor-php-sdk2
- using one of the Python or Java Command Line tools,
https://github.com/namsor/namsor-python-tools-v2
https://github.com/namsor/namsor-java-sdk2
For example, to use the Python Command Line tool to append gender to a list of records with an id, first name, last name, and geographic context (e.g. id12|John|Smith|US):

python namsor_tools.py -apiKey <yourAPIKey> -w -header -uid -f fnlngeo -i samples/some_idfnlngeo.txt -service gender
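As a sketch of direct REST integration, the following minimal Python client calls the genderGeo endpoint. The function and variable names are ours, and the endpoint path is assumed to follow the OpenAPI documentation linked above:

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://v2.namsor.com/NamSorAPIv2/api2/json"

def gender_geo_url(first_name: str, last_name: str, country_iso2: str) -> str:
    """Build the genderGeo request URL, percent-encoding each name part."""
    quote = urllib.parse.quote
    return f"{API_BASE}/genderGeo/{quote(first_name)}/{quote(last_name)}/{quote(country_iso2)}"

def infer_gender(api_key: str, first_name: str, last_name: str, country_iso2: str) -> dict:
    """Call the API (authenticated via the X-API-KEY header) and parse the JSON reply."""
    request = urllib.request.Request(
        gender_geo_url(first_name, last_name, country_iso2),
        headers={"X-API-KEY": api_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, infer_gender(my_key, "Elena", "Sokolova", "RU") would return a JSON object carrying fields such as likelyGender, score, and probabilityCalibrated.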
Naïve Bayes Algorithm for Name Gender Inference
NamSor API v2 uses Naïve Bayes Classifiers, a class of algorithms which is excellent at classification.
Personal Name Features
The Naïve Bayes Classifier works by associating the name’s features to a likely gender. Our training set
will have the following information:
Elena|Sokolova|female
Andrey|Sokolov|male
For example, let’s look at some of the features we extract from the name Elena Sokolova:

Name feature | Description | Empirical knowledge
#FN_elena | The first name is Elena. | Elena is usually female.
#FN_*ena$ | The first name ends with ena. | In many cultures, a name ending with ena, na, or a is generally female.
#LN_sokolova | The last name is Sokolova. | Sokolova sounds Eastern European / Slavic.
#LNSt_Orig_RU | The last name is most frequent in Russia, according to our global surnames dictionary. | Sokolova is more like a Russian name than any other name.
#LNSt_Orig_L10_8_RU | The last name is frequent in Russia relative to other names, according to our global surnames dictionary. | Sokolova looks like a common Russian name.
#LN_*ova$ | The last name ends with ova. | Russian names ending with ova are female names.
#FnSt_Orig_Gen_female | The first name Elena in Russia is more likely a female name, according to our Russian given-names dictionary. | Elena is a female Russian name.
The computer doesn’t have any of our empirical knowledge about the name Elena Sokolova, but given
enough training data, it will learn how to classify names:
- sometimes trusting the ‘baby name’ dictionary of the given location,
- sometimes trusting the ‘baby name’ dictionary of the country of origin,
- sometimes using given names’ morphology instead,
- sometimes using surnames’ morphology instead.
Part of our model’s accuracy comes from carefully selecting name features.
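As an illustration of this feature extraction, the sketch below derives dictionary and ending-morphology features from a first/last name pair. It is our simplified reconstruction, not NamSor’s actual feature extractor, which also draws on origin dictionaries and geographic context:

```python
def name_features(first_name: str, last_name: str) -> set:
    """Extract dictionary features (#FN_..., #LN_...) and ending-morphology
    features (#FN_*...$, #LN_*...$) from a first/last name pair."""
    fn, ln = first_name.lower(), last_name.lower()
    features = {f"#FN_{fn}", f"#LN_{ln}"}
    for n in (1, 2, 3):  # name endings of length 1 to 3
        if len(fn) > n:
            features.add(f"#FN_*{fn[-n:]}$")
        if len(ln) > n:
            features.add(f"#LN_*{ln[-n:]}$")
    return features
```

For Elena Sokolova this yields, among others, #FN_elena, #FN_*ena$, #LN_sokolova, and #LN_*ova$ — the kinds of features listed in the table above.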
Training Data Set Selection & Reinforced Learning
We use open data with full names and gender information to pre-train our model. Once the model is
pre-trained, the API learns on its own – assessing the quality of new data submitted to it.
As an example, we might use the European Union’s Directory as a training set (https://publications.europa.eu/fr/web/who-is-who), after validating that the title and gender information are correct. From that directory, the model will learn that Jean-Claude JUNCKER is associated with the title M. and is likely a male name. Parsing names is often a critical step in gender inference, so it will also learn from the typographic convention that JUNCKER is the surname and Jean-Claude is the given name.
Part of our model’s accuracy comes from carefully selecting quality open data sources and validating
them based on our prior knowledge.
Once the model is trained, the API learns on its own. For example:
- User A infers gender from gender/Jean-Claude/JUNCKER
- User B infers gender from gender/Elena/Sokolova
- User C needs to parse a full name into first/last name parts: parse/elena juncker or parse/juncker elena
Information from User A and User B will further train the name parser, making it more likely for User C
to obtain a correct parsing of a new name Elena JUNCKER.
Score and Calibrated Probabilities based on a Validation Data Set
NamSor API v2 uses Naïve Bayes Classifiers, a class of algorithms that is excellent at classification but poor at estimating probabilities. The raw probabilities are good for ranking, and the algorithm performs very well at classification, but in real-world applications they are not good estimates of the actual probabilities. [1]
So instead of probabilities, each classifier outputs a SCORE, based on the relative probability between the predicted value and the second-best alternative:

SCORE = log(P(best) / P(second))
All names having a similar score have a similar probability of being in the given class.
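The score definition above can be sketched as follows (our helper, assuming the classifier supplies per-class probability estimates):

```python
import math

def score(class_probabilities: dict) -> tuple:
    """Return (most likely class, log-ratio score against the runner-up)."""
    ranked = sorted(class_probabilities.items(), key=lambda kv: kv[1], reverse=True)
    (best, p_best), (_, p_second) = ranked[0], ranked[1]
    return best, math.log(p_best / p_second)
```

By construction, two names with the same score have the same ratio between the top two class probabilities, which is why names with similar scores can later share a calibrated probability.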
We take one additional validation step to calibrate the classifier and normalize the score to a [0-1]
range, which can be interpreted directly as a probability. We take this definition of a calibrated classifier
from the article 'Transforming Classifier Scores into Accurate Multiclass Probability Estimates' :
"This classifier is said to be well calibrated if the empirical class membership probability P converges to
the score value s." "Intuitively, if we consider all the examples to which a classifier assigns a score s(x) =
0.8, then 80% of these examples should be members of the class in question. Calibration is important if
we want the scores to be directly interpretable as the chances of membership in the class."
https://www.dropbox.com/s/n8em807cacbey7f/10.1.1.13.7457.pdf
For the calibration, we classify a validation data set of 10,000 names with known gender and look at the scores. This lets us group the names by score decile into 10 buckets of 1,000 names with similar scores:

percentile | score
10 | 9.67
20 | 13.89
30 | 16.81
40 | 19.22
50 | 21.36
60 | 23.54
70 | 26.14
80 | 29.08
90 | 33.77
100 | 52.92
[1] The article 'On the Optimality of the Simple Bayesian Classifier under Zero-One Loss' provides a rigorous mathematical explanation for this:
https://link.springer.com/article/10.1023%2FA%3A1007413511361?LI=true
How many names in the lowest decile (scores up to 9.67) have been correctly classified as male/female? In this example, 889 names are correctly classified and 111 are incorrectly classified:

percentile | score | ok (correct) | ko (incorrect) | precision
10 | 9.67 | 889 | 111 | 0.89
20 | 13.89 | 999 | 1 | 1.00
30 | 16.81 | 1000 | 0 | 1.00
40 | 19.22 | 1000 | 0 | 1.00
50 | 21.36 | 1000 | 0 | 1.00
60 | 23.54 | 1000 | 0 | 1.00
70 | 26.14 | 1000 | 0 | 1.00
80 | 29.08 | 1000 | 0 | 1.00
90 | 33.77 | 1000 | 0 | 1.00
100 | 52.92 | 1000 | 0 | 1.00
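The per-decile precision in that table is simply ok / (ok + ko); a quick sketch with the first three buckets (values copied from the table):

```python
# (score decile boundary, correctly classified, incorrectly classified)
buckets = [(9.67, 889, 111), (13.89, 999, 1), (16.81, 1000, 0)]

# Precision per bucket: correct / total, keyed by the decile's score boundary.
precision_by_score = {score: ok / (ok + ko) for score, ok, ko in buckets}
# e.g. the lowest decile: 889 / 1000 = 0.889, reported as 0.89
```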
To complete our score-to-probability function, we add two theoretical values:
- for the minimum score 0, we associate a probability of 0.5, corresponding to a random choice between male and female;
- for the maximum score 100, we associate a probability of 1, corresponding to perfect precision.
So, our calibration table is:

Score | Probability
0.00 | 0.50
9.67 | 0.89
13.89 | 1.00
16.81 | 1.00
19.22 | 1.00
21.36 | 1.00
23.54 | 1.00
26.14 | 1.00
29.08 | 1.00
33.77 | 1.00
Based on this table, the classifier can interpolate from the score and return a calibrated probability:

gender/John/Smith

{
  "firstName": "John",
  "lastName": "Smith",
  "likelyGender": "male",
  "genderScale": -1,
  "score": 30.792681810673518,
  "probabilityCalibrated": 1
}
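A piecewise-linear interpolation over that calibration table could look like the sketch below (our reconstruction; NamSor’s exact interpolation scheme may differ):

```python
import bisect

# (score, probability) knots taken from the calibration table above;
# the flat region between 13.89 and 33.77 is collapsed to its endpoints.
CALIBRATION = [(0.00, 0.50), (9.67, 0.89), (13.89, 1.00), (33.77, 1.00)]

def probability_calibrated(score: float) -> float:
    """Linearly interpolate a probability from the calibration knots."""
    xs = [s for s, _ in CALIBRATION]
    if score >= xs[-1]:
        return CALIBRATION[-1][1]
    i = bisect.bisect_right(xs, score)
    (x0, y0), (x1, y1) = CALIBRATION[i - 1], CALIBRATION[i]
    return y0 + (y1 - y0) * (score - x0) / (x1 - x0)
```

With these knots, probability_calibrated(30.79) returns 1.0, consistent with the John Smith response above.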
Based on the examples above, we could have the following values:

First name | Last name | Likely gender | Score | Probability | Empirical knowledge
John | Smith | Male | 30.79 | 1 | John is male.
Mary | Smith | Female | 25.54 | 1 | Mary is female.
(empty) | Smith | Female | 3.45 | 0.53 | Gender can’t be guessed without a given name.
(empty) | Sokolova | Female | 6.64 | 0.71 | Gender can be guessed from the surname.
(empty) | Sokolov | Male | 13.65 | 0.99 | Gender can be guessed from the surname.
Dominique | Martin | Female | 4.94 | 0.61 | Dominique is quite ambiguous as a French name.
Jing | Cao | Female | 3.42 | 0.53 | Romanized Chinese names can be quite gender-ambiguous.
General purpose Calibration Set
NamSor API v2 initially comes with a single calibration set, ‘US_Boston’, a large sample of names from the US City of Boston with known gender. This sample is general purpose because it is quite diverse in terms of name origins and ethnicities. In the future, we want to add more specific calibration sets (Olympic Athletes for international names; the EU Directory for European names; etc.)
This is what the validation table looks like based on ‘US_Boston’ on a larger validation set: for 10% of the sample, the score is below 12.08; gender is correctly predicted in 27,487 occurrences and incorrectly predicted in 10,417 occurrences, so the precision for this decile is 0.73. For the 10% of the sample with a score between 12.08 and 18.52, the precision is 0.91 (and so on).
The value interpolated from the score is returned as a calibrated probability
(probabilityCalibrated).
In order to estimate the overall precision of the API based on a probability cut-off value, we can
use the following table:
For the 10% of the sample with a score higher than 79.49, gender is correctly predicted in 37,905 occurrences and incorrectly predicted in 169 occurrences. The precision for this decile is 0.996, but we have excluded 90% of the original data – so the recall is only 10%.
For the 20% of the sample with a score higher than 40.71, gender is correctly predicted in 75,395 occurrences and incorrectly predicted in 414 occurrences. The precision for those 20% is 0.995, but we have excluded 80% of the original data – so the recall is only 20%.
An optimal trade-off for most research is to exclude the 10% of the original data with the lowest scores. If we only consider the results where the calibrated probability is higher than 0.73, we keep about 90% of our original data with a precision higher than 0.976.
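Applying such a probability cut-off to API output is straightforward; the sketch below filters hypothetical responses (the field names mirror the JSON example earlier, but the records and values here are illustrative):

```python
# Hypothetical API responses; only probabilityCalibrated matters for filtering.
results = [
    {"firstName": "John", "lastName": "Smith", "likelyGender": "male", "probabilityCalibrated": 1.00},
    {"firstName": "Jing", "lastName": "Cao", "likelyGender": "female", "probabilityCalibrated": 0.53},
]

CUTOFF = 0.73  # keeps roughly 90% of the data at a precision above 0.976 (see above)

# Keep only the predictions whose calibrated probability clears the cut-off.
confident = [r for r in results if r["probabilityCalibrated"] >= CUTOFF]
discarded = len(results) - len(confident)
```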
Software Changes
- NamSorAPIv2.0.7: The probability calibration is no longer based on the Score, but on the probability estimates.
- NamSorAPIv2.0.6: Fixed an issue where the probability could be between 0.33 and 0.5; with a low score, the probability should be 0.5 (corresponding to randomly choosing Male/Female).
Known Issues, future enhancements
The calibration dataset should be configurable.
https://github.com/namsor/namsor-tools-v2/issues/7