Measuring and Mitigating Unintended Bias in Text Classification
Lucas Dixon
John Li
Jeffrey Sorensen
Nithum Thain
Lucy Vasserman
We introduce and illustrate a new approach to measuring and
mitigating unintended bias in machine learning models. Our
definition of unintended bias is parameterized by a test set
and a subset of input features. We illustrate how this can
be used to evaluate text classifiers using a synthetic test set
and a public corpus of comments annotated for toxicity from
Wikipedia Talk pages. We also demonstrate how imbalances
in training data can lead to unintended bias in the resulting
models, and therefore potentially unfair applications. We use
a set of common demographic identity terms as the subset of
input features on which we measure bias. This technique per-
mits analysis in the common scenario where demographic in-
formation on authors and readers is unavailable, so that bias
mitigation must focus on the content of the text itself. The
mitigation method we introduce is an unsupervised approach
based on balancing the training dataset. We demonstrate that
this approach reduces the unintended bias without compro-
mising overall model quality.
Introduction

With the recent proliferation of the use of machine learning
for a wide variety of tasks, researchers have identified un-
fairness in ML models as one of the growing concerns in
the field. Many ML models are built from human-generated
data, and human biases can easily result in a skewed distri-
bution in the training data. ML practitioners must be proac-
tive in recognizing and counteracting these biases, otherwise
our models and products risk perpetuating unfairness by per-
forming better for some users than for others.
Recent research in fairness in machine learning proposes
several definitions of fairness for machine learning tasks,
metrics for evaluating fairness, and techniques to mitigate
unfairness. The main contribution of this paper is to intro-
duce methods to quantify and mitigate unintended bias in
text classification models. We illustrate the methods by ap-
plying them to a text classifier built to identify toxic com-
ments in Wikipedia Talk Pages (Wulczyn, Thain, and Dixon 2017).

Copyright © 2017, Association for the Advancement of Artificial Intelligence. All rights reserved.

Initial versions of text classifiers trained on this data showed problematic trends for certain statements. Clearly non-toxic statements containing certain identity terms, such as "I am a gay man", were given unreasonably high toxicity scores. We call this false positive bias. The source of
this bias was the disproportionate representation of iden-
tity terms in our training data: terms like “gay” were
so frequently used in toxic comments that the models
over-generalized and learned to disproportionately associate
those terms with the toxicity label. In this work, we propose
a method for identifying and mitigating this form of unin-
tended model bias.
In the following sections, we describe related work, then
discuss a working definition of unintended bias in a clas-
sification task, and distinguish that from “unfairness” in an
application. We then demonstrate that a significant cause of
unintended bias in our baseline model is due to dispropor-
tionate representation of data with certain identity terms and
provide a way to measure the extent of the disparity. We
then propose a simple and novel technique to counteract that
bias by strategically adding data. Finally, we present metrics
for evaluating unintended bias in a model, and demonstrate
that our technique reduces unintended bias while maintain-
ing overall model quality.
Related Work
Researchers of fairness in ML have proposed a range of def-
initions for “fairness” and metrics for its evaluation. Many
have also presented mitigation strategies to improve model
fairness according to these metrics. (Feldman et al. 2015)
provide a definition of fairness tied to demographic parity of
model predictions, and provide a strategy to alter the training data to improve fairness. (Hardt, Price, and Srebro 2016)
presents an alternate definition of fairness that requires par-
ity of model performance instead of predictions, and a mit-
igation strategy that applies to trained models. (Kleinberg,
Mullainathan, and Raghavan 2016) and (Friedler, Scheideg-
ger, and Venkatasubramanian 2016) both compare several
different fairness metrics. These works rely on the avail-
ability of demographic data about the object of classifica-
tion in order to identify and mitigate bias. (Beutel et al.
2017) presents a new mitigation technique using adversarial
training that requires only a small amount of labeled demo-
graphic data.
Very little prior work has been done on fairness for text
classification tasks. (Blodgett and O’Connor 2017), (Hovy
and Spruit 2016) and (Tatman 2017) discuss the impact of
using unfair natural language processing models for real-
world tasks, but do not provide mitigation strategies. (Bolukbasi et al. 2016) demonstrates gender bias in word embeddings and provides a technique to "de-bias" them, allowing these more fair embeddings to be used for any text-based task.
Our work adds to this growing body of machine learning
fairness research with a novel approach to defining, measur-
ing, and mitigating unintended bias for a text-based classifi-
cation task.
Model Task and Data
In this paper we work with a text classifier built to iden-
tify toxicity in comments from Wikipedia Talk Pages. The
model is built from a dataset of 127,820 Talk Page com-
ments, each labeled by human raters as toxic or non-toxic.
A toxic comment is defined as a “rude, disrespectful, or un-
reasonable comment that is likely to make you leave a dis-
cussion.” All versions of the model are convolutional neural
networks trained using the Keras framework (Chollet and
others 2015) in TensorFlow (Abadi et al. 2015).
Definitions of Unintended Bias and Fairness
The word ‘fairness’ in machine learning is used in various
ways. To avoid confusion, in this paper, we distinguish be-
tween the unintended biases in a machine learning model
and the potential unfair applications of the model.
Every machine learning model is designed to express a
bias. For example, a model trained to identify toxic com-
ments is intended to be biased such that comments that are
toxic get a higher score than those which are not. The model
is not intended to discriminate between the gender of the people expressed in a comment; if the model does so, we call that unintended bias. We contrast this with fairness, which we use to refer to a potential negative impact on society, in particular when different individuals are treated differently.
To illustrate this distinction, consider a model for toxi-
city that has unintended bias at a given threshold. For in-
stance, the model may give comments that contain the word
‘gay’ scores above the threshold independently of whether
the comment is toxic. If such a model is applied on a web-
site to remove comments that get a score above that thresh-
old, then we might speculate that the model will have a neg-
ative effect on society because it will make it more difficult
on that website to discuss topics where one would naturally
use the word ‘gay’. Thus we might say that the model’s im-
pact is unfair (to people who wish to write comments that
contain the word gay). However, if the model is used to sort
and review all comments before they are published then we
might find that the comments that contain the word gay are
reviewed first, and then published earlier, producing an un-
fair impact for people who write comments without the word
gay (since their comments may be published later). If com-
ments are grouped for review but published in batch, then the
model’s unintended bias may not cause any unfair impact on
comment authors.
Since the presence of unintended bias can have varied im-
pacts on fairness, we aim to define and mitigate the unin-
tended bias that will improve fairness across a broad range
of potential model applications.
One definition, adapted from the literature, is: a model
contains unintended bias if it performs better for some de-
mographic groups than others (Hardt, Price, and Srebro
2016). To apply this to text classification, we consider the
unintended bias across the content of the text, and narrow the
definition to: a model contains unintended bias if it performs
better for comments about some groups than for comments
about other groups.
In this work, we address one specific subcase of the above
definition, which we call identity term bias. Here, we nar-
row further from looking at all comments about different
groups to looking at comments containing specific identity
terms. Focusing on only a small selection of identity terms
enables us to make progress towards mitigating unintended
model bias, but it is of course only a first step. For this work,
our definition is: a model contains unintended bias if it per-
forms better for comments containing some particular iden-
tity terms than for comments containing others.
The false positive bias described above, where non-toxic
comments containing certain identity terms were given un-
reasonably high toxicity scores, is the manifestation of un-
intended bias. In the rest of this paper, we lay out strategies
to measure and mitigate this unintended bias.
Quantifying bias in dataset
Identity terms affected by the false positive bias are dispro-
portionately used in toxic comments in our training data.
For example, the word ‘gay’ appears in 3% of toxic com-
ments but only 0.5% of comments overall. The combination
of dataset size, model training methods, and the dispropor-
tionate number of toxic examples for comments containing
these words in the training data led to overfitting in the origi-
nal toxicity model: it made generalizations such as associat-
ing the word ‘gay’ with toxicity. We manually created a set
of 51 common identity terms, and looked for similar dispro-
portionate representations. Table 1 illustrates the difference
between the likelihood of seeing a given identity in a toxic
statement vs. its overall likelihood.
Term Toxic Overall
atheist 0.09% 0.10%
queer 0.30% 0.06%
gay 3% 0.50%
transgender 0.04% 0.02%
lesbian 0.10% 0.04%
homosexual 0.80% 0.20%
feminist 0.05% 0.05%
black 0.70% 0.60%
white 0.90% 0.70%
heterosexual 0.02% 0.03%
islam 0.10% 0.08%
muslim 0.20% 0.10%
bisexual 0.01% 0.03%
Table 1: Frequency of identity terms in toxic comments and overall.

Figure 1: Percent of comments labeled as toxic at each length, for comments containing the given terms.
In addition to a disproportionate amount of toxicity in
comments containing certain identity terms, there is also a
relationship between comment length and toxicity, as shown in Figure 1.
The models we are training are known to have the ability
to capture contextual dependencies. However, with insuffi-
cient data, the model has no error signal that would require
these distinctions, so these models are likely to overgeneral-
ize, causing the false positive bias for identity terms.
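The disproportion shown in Table 1 can be measured with a short script. The following is a minimal sketch, assuming comments are (text, is_toxic) pairs and naive case-insensitive whole-word matching; the paper does not specify its exact matching rule.

```python
import re

def term_rates(comments, term):
    """Fraction of toxic comments containing `term` vs. fraction of all
    comments containing it (whole-word, case-insensitive matching)."""
    pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
    toxic = [text for text, is_toxic in comments if is_toxic]
    in_toxic = sum(1 for text in toxic if pattern.search(text))
    in_all = sum(1 for text, _ in comments if pattern.search(text))
    return in_toxic / len(toxic), in_all / len(comments)

# Toy data for illustration only.
comments = [
    ("you are an idiot", True),
    ("gay people deserve rights", False),
    ("i hate gay people", True),
    ("nice article", False),
]
print(term_rates(comments, "gay"))  # (0.5, 0.5)
```

On the real corpus, a term like "gay" would show the Table 1 pattern: a much higher rate among toxic comments than overall.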
Bias Mitigation
To mitigate the data imbalance which causes the unintended
bias, we added additional data, all containing non-toxic ex-
amples of the identity terms where we found the most dis-
proportionate data distributions.
For each term, we added enough new non-toxic examples
to bring the toxic/non-toxic balance in line with the prior
distribution for the overall dataset, at each length bucket de-
scribed above. Because our CNN models are sensitive to
length, and toxic comments tend to be shorter, we found bal-
ancing by length to be especially important.
We mined the new data from Wikipedia articles them-
selves. Since the text comes from the published article, we
assume that the text is non-toxic; we validated this by labeling 1000 of the new samples, of which 99.5% were confirmed non-toxic. Using unsupervised, assumed non-toxic article
data enables the data balancing to be done without addi-
tional human labeling. Gathering additional supervised non-
toxic data from the original comment domain could be pro-
hibitively expensive or impossible, as the rareness of these
types of comments is the initial cause of the bias. We found
that using unsupervised data, even from a slightly different
domain (article vs. comment text), is a feasible and effective
strategy to mitigate bias.
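The counting step of this length-bucketed balancing can be sketched as follows. This is our illustration, not the released pipeline: the bucket edges, the helper names, and the exact-token matching are all assumptions.

```python
def length_bucket(text, edges=(10, 20, 40)):
    """Map a comment to a length bucket by word count (edges are assumed)."""
    n = len(text.split())
    for i, edge in enumerate(edges):
        if n <= edge:
            return i
    return len(edges)

def non_toxic_needed(comments, term, prior_toxic_rate):
    """Per length bucket, how many non-toxic examples to add so that the
    toxic fraction among comments containing `term` matches the prior."""
    counts = {}  # bucket -> [toxic, non_toxic]
    for text, is_toxic in comments:
        if term in text.split():  # naive exact-token matching
            b = length_bucket(text)
            counts.setdefault(b, [0, 0])[0 if is_toxic else 1] += 1
    needed = {}
    for b, (tox, non) in counts.items():
        # Solve tox / (tox + non + k) = prior for k, the examples to add.
        k = tox / prior_toxic_rate - tox - non
        needed[b] = max(0, round(k))
    return needed

# Toy data: the term "gay" appears in 3 toxic and 1 non-toxic short comments.
data = [
    ("the gay comment one", True),
    ("the gay comment two", True),
    ("the gay comment three", True),
    ("the gay comment four", False),
]
print(non_toxic_needed(data, "gay", 0.1))  # {0: 26}
```

With a 10% prior, bringing 3 toxic out of 4 down to the prior requires adding 26 assumed non-toxic examples to that bucket.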
Unintended Bias Evaluation Metrics
To evaluate and compare performance and unintended bias
in the models, we use the following test sets and evaluation metrics.

Template Example                                                    Label
I am <IDENTITY>                                                     Non-Toxic
I am a <IDENTITY> person, ask me anything                           Non-Toxic
<IDENTITY> people are just like everyone else                       Non-Toxic
I hate all <IDENTITY>                                               Toxic
I am a <IDENTITY> person and I hate your guts and think you suck    Toxic
<IDENTITY> people are gross and universally terrible                Toxic

Table 2: Phrase template examples.
General Test Set
We use a general held out test set sampled from the original
data set of Wikipedia Talk Page comments. This set evalu-
ates overall model performance (intended bias), but it does
not provide much information about unintended bias. We
use this set to ensure that bias mitigation techniques do not
drastically hurt overall performance. This test set contains
31866 labeled comments of which 9.6% are labeled toxic.
Identity Phrase Templates Test Set
To evaluate unintended bias specifically on comments con-
taining identity terms, we generated a synthetic dataset. We
created templates of both toxic and non-toxic phrases and
slotted a wide range of identity terms into each of these tem-
plates, examples shown in table 2.
This creates a controlled set of 77,000 examples, 50% of
which are toxic, where we can directly test for unintended
model bias by grouping the comments by identity term and
comparing performance on each group.
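Generating such a synthetic set is straightforward. A sketch with a handful of templates mirroring Table 2 (the paper's full template and term lists are larger, yielding 77,000 examples):

```python
from itertools import product

# Simplified templates mirroring Table 2; label 1 marks a toxic template.
TEMPLATES = [
    ("I am {identity}", 0),
    ("I am a {identity} person, ask me anything", 0),
    ("{identity} people are just like everyone else", 0),
    ("I hate all {identity}", 1),
    ("I am a {identity} person and I hate your guts and think you suck", 1),
    ("{identity} people are gross and universally terrible", 1),
]
TERMS = ["gay", "lesbian", "muslim", "atheist"]

def make_phrase_set(templates=TEMPLATES, terms=TERMS):
    """Slot every identity term into every template, keeping the template's
    label and the term so examples can later be grouped by term."""
    return [(tpl.format(identity=term), label, term)
            for (tpl, label), term in product(templates, terms)]

phrases = make_phrase_set()
print(len(phrases))  # 24: 6 templates x 4 terms, half toxic
```

Because every term appears in every template, per-term performance differences can only come from the model's treatment of the term itself.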
AUC

A common evaluation metric for real-valued scores is the area under the receiver operating characteristic curve, or AUC. We look at the AUC on the general and identity phrase template sets to gauge overall model performance. AUC on the full
phrase template set (all identity phrases together) gives a
limited picture of unintended bias. A low AUC indicates that
the model is performing differently for phrases with differ-
ent identity terms, but it doesn’t help us understand which
identity terms are the outliers.
Error Rate Equality Difference
Equality of Odds, proposed in (Hardt, Price, and Srebro
2016), is a definition of fairness that is satisfied when the
false positive rates and false negative rates are equal across
comments containing different identity terms. This concept
inspires the error rate equality difference metrics, which use
the variation in these error rates between terms to measure
the extent of unintended bias in the model, similar to the
equality gap metric used in (Beutel et al. 2017).
Using the identity phrase test set, we calculate the false
positive rate (FPR) and false negative rate (FNR) on the entire test set, as well as these same metrics on each subset of the data containing each specific identity term, FPR_t and FNR_t. A more fair model will have similar values across all terms, approaching the equality of odds ideal, where FPR = FPR_t and FNR = FNR_t for all terms t. Wide
variation among these values across terms indicates high un-
intended bias.
Error rate equality difference quantifies the extent of the
per-term variation (and therefore the extent of unintended
bias) as the sum of the differences between the overall false
positive or negative rate and the per-term values, as shown
in equations 1 and 2.
False Positive Equality Difference = Σ_t |FPR - FPR_t|   (1)

False Negative Equality Difference = Σ_t |FNR - FNR_t|   (2)
Error rate equality differences evaluate classification out-
comes, not real-valued scores, so in order to calculate this
metric, we must choose one (or multiple) score threshold(s).
In this work, we use the equal error rate threshold for each
evaluated model.
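Equations 1 and 2, together with the equal error rate threshold, can be computed as below. This is a hedged sketch: the function names are ours, and the threshold search simply scans the observed scores rather than using any particular optimizer.

```python
def error_rates(scores, labels, threshold):
    """FPR and FNR at a threshold (assumes both classes are present)."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp / labels.count(0), fn / labels.count(1)

def equal_error_rate_threshold(scores, labels):
    """Pick the observed score whose threshold makes FPR closest to FNR."""
    def gap(t):
        fpr, fnr = error_rates(scores, labels, t)
        return abs(fpr - fnr)
    return min(set(scores), key=gap)

def error_rate_equality_difference(per_term_rates, overall_rate):
    """Equations 1 and 2: sum over terms t of |overall - per-term| error rate."""
    return sum(abs(overall_rate - r) for r in per_term_rates.values())

# Toy data for illustration only.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]
t = equal_error_rate_threshold(scores, labels)
print(t, error_rates(scores, labels, t))  # 0.8 (0.0, 0.0)
```

In practice the per-term rates would be computed on the identity phrase subsets and the overall rates on the full phrase set, as described above.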
Pinned AUC
In addition to the error rate metrics, we also defined a new
metric called pinned area under the curve (pinned AUC).
This metric addresses challenges with both regular AUC and
the error rate equality difference method and enables the
evaluation of unintended bias in a more general setting.
Many classification models, including those implemented
in our research, provide a prediction score rather than a di-
rect class decision. Thresholding can then be used to trans-
form this score into a predicted class, though in practice,
consumers of these models often use the scores directly to
sort and prioritize text. Prior fairness metrics, like error rate
equality difference, only provide an evaluation of bias in the
context of direct binary classification or after a threshold has
been chosen. The pinned AUC metric provides a threshold-
agnostic approach that detects bias in a wider range of use cases.
This approach adapts the popular area under the receiver operating characteristic curve (AUC) metric, which provides
a threshold-agnostic evaluation of the performance of an ML
classifier (Fawcett 2006). However, in the context of bias de-
tection, a direct application of AUC to the wrong datasets
can lead to inaccurate analysis. We demonstrate this with a
simulated hypothetical model represented in Figure 2.
Consider three datasets, each representing comments con-
taining different identity terms, here “tall”, “average”, or
“short”. The model represented by Figure 2 clearly con-
tains unintended bias, producing much higher scores for
both toxic and non-toxic comments containing “short”.
If we evaluate the model performance on each identity-
based dataset individually then we find that the model ob-
tains a high AUC on each (Table 3), obscuring the unin-
tended bias we know is present. This is not surprising as the
model appears to perform well at separating toxic and non-
toxic comments within each identity. This demonstrates the
general principle that the AUC score of a model on a strictly
Figure 2: Distributions of toxicity scores for three groups
of data, each containing comments with different identity
terms, “tall”, “average”, or “short”.
per-group identity dataset may not effectively identify unin-
tended bias.
By contrast, the AUC on the combined data is signifi-
cantly lower, indicating poor model performance. The un-
derlying cause, in this case, is due to the unintended bias
reducing the separability of classes by giving non-toxic ex-
amples in the “short” subgroup a higher score than many
toxic examples from the other subgroups. However, a low
combined AUC is not of much help in diagnosing bias, as it
could have many other causes, nor does it help distinguish
which subgroups are likely to be most negatively impacted.
The AUC measure on both the individual datasets and the
aggregated one provide poor measures of unintended bias,
as neither answer the key question in measuring bias: is the
model performance on one subgroup different than its per-
formance on the average example?
The pinned AUC metric tackles this question directly. The
pinned AUC metric for a subgroup is defined by comput-
ing the AUC on a secondary dataset containing two equally-
balanced components: a sample of comments from the sub-
group of interest and a sample of comments that reflect the
underlying distribution of comments. By creating this auxil-
iary dataset that "pins" the subgroup to the underlying dis-
tribution, we allow the AUC to capture the divergence of the
model performance on one subgroup with respect to the av-
erage example, providing a direct measure of bias.
More formally, if we let D represent the full set of comments and D_t be the set of comments in subgroup t, then we can generate the secondary dataset for term t by applying some sampling function s as in Equation 3 below.¹ Equation 4 then defines the pinned AUC of term t, pAUC_t, as

¹ The exact technique for sub-sampling and defining D may vary depending on the data. See appendix.
Dataset AUC Pinned AUC
Combined 0.79 N/A
tall 0.93 0.84
average 0.93 0.84
short 0.93 0.79
Table 3: AUC and pinned AUC results for the hypothetical model.
the AUC of the corresponding secondary dataset.
pD_t = s(D_t) + s(D),  |s(D_t)| = |s(D)|   (3)

pAUC_t = AUC(pD_t)   (4)
Table 3 demonstrates how the pinned AUC is able to
quantitatively reveal both the presence and victim of un-
intended bias. In this example, the “short” subgroup has a
lower pinned AUC than the other subgroups due to the bias
in the score distribution for those comments. While this is a
simple example, it extends to much larger sets of subgroups,
where pinned AUC can reveal unintended bias that would
otherwise be hidden.
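The pinned AUC computation can be sketched directly from Equations 3 and 4. The rank-based AUC implementation and the sampling details below are our choices for illustration, not prescribed by the paper.

```python
import random

def auc(scores, labels):
    """AUC via the Mann-Whitney statistic: the probability that a random
    positive example is scored above a random negative one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pinned_auc(subgroup, full, model, rng=random):
    """Equations 3 and 4: compute AUC on a secondary dataset that pins an
    equal-size sample of the subgroup to a sample of the full distribution.
    `subgroup` and `full` are lists of (text, label); `model` scores a text."""
    n = min(len(subgroup), len(full))
    pinned = rng.sample(subgroup, n) + rng.sample(full, n)
    scores = [model(text) for text, _ in pinned]
    labels = [label for _, label in pinned]
    return auc(scores, labels)

# Toy illustration: a model that separates classes perfectly within and
# across groups yields a pinned AUC of 1.0 for the subgroup.
score_of = {"sub_pos": 0.9, "sub_neg": 0.6, "bg_pos": 0.8, "bg_neg": 0.2}
model = lambda text: score_of[text]
subgroup = [("sub_pos", 1), ("sub_neg", 0)]
background = [("bg_pos", 1), ("bg_neg", 0)]
print(pinned_auc(subgroup, background, model))  # 1.0
```

A biased model that pushes subgroup negatives above background positives would lower this value, which is exactly the "short" effect in Table 3.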
Pinned AUC Equality Difference
While the actual value of the pinned AUC number is impor-
tant, for the purposes of unintended bias, it is most important
that the pinned AUC values are similar across groups. Simi-
lar pinned AUC values mean similar performance within the
overall distribution, indicating a lack of unintended bias. As
with equality of odds, in the ideal case, per-group pinned
AUCs and overall AUC would be equal. We therefore sum-
marize pinned AUC equality difference similarly to equality
difference for false positive and false negative rates above.
Pinned AUC equality difference, shown in equation 5, is
defined as a sum of the differences between the per-term
pinned AUC (pAUC_t) and the overall AUC on the aggregated data (AUC), summed over all identity terms. A lower sum
represents less variance between performance on individual
terms, and therefore less unintended bias.
Pinned AUC Equality Difference = Σ_t |AUC - pAUC_t|   (5)
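Equation 5 reduces to a few lines. The function name is ours; its inputs would come from per-term pinned AUC computations and the overall AUC on the aggregated data.

```python
def pinned_auc_equality_difference(per_term_pauc, overall_auc):
    """Equation 5: sum over identity terms t of |AUC - pAUC_t|. A lower
    sum means more uniform per-term performance, hence less unintended bias."""
    return sum(abs(overall_auc - p) for p in per_term_pauc.values())

# Using the hypothetical model of Table 3 (combined AUC 0.79).
print(pinned_auc_equality_difference(
    {"tall": 0.84, "average": 0.84, "short": 0.79}, 0.79))  # ~0.10
```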
Experiments

We evaluate three models: a baseline, a bias-mitigated
model, and a control. Each of the three models is trained
using an identical convolutional neural network architec-
ture². The baseline model is trained on all 127,820 supervised Wikipedia Talk Page comments. The bias-mitigated
model has undergone the bias mitigation technique de-
scribed above, adding 4,620 additional assumed non-toxic
training samples from Wikipedia articles to balance the dis-
tribution of specific identity terms. The control group also
adds 4,620 randomly selected comments from Wikipedia ar-
ticles, meant to confirm that model improvements in the ex-
periment are not solely due to the addition of data.
² The details of the model and code are available at
Model General Phrase Templates
Baseline 0.960 0.952
Random Control 0.957 0.946
Bias Mitigated 0.959 0.960
Table 4: Mean AUC on the general and phrase templates test sets.
To capture the impact of training variance, we train each
model ten times, and show all results as scatter plots, with
each point representing one model.
Overall AUC
Table 4 shows the mean AUC for all three models on the
general test set and on the identity phrase set. We see that the
bias-mitigated model performs best on the identity phrase
set, while not losing performance on the general set, demon-
strating a reduction in unintended bias without compromis-
ing general model performance.
Error Rates
To evaluate using the error rate equality difference metric de-
fined above and inspired by (Hardt, Price, and Srebro 2016),
we convert each model into a binary classifier by selecting
a threshold for each model using the equal error rate com-
puted on the general test set. Here we compare the false posi-
tive and false negative rates for each identity term with each
model. A more fair model will have similar false positive
and negative rates across all terms, and a model with unin-
tended bias will have a wide variance in these metrics.
Figure 3 shows the per-term false positive rates for the
baseline model, the random control, and the bias-mitigated
model. The bias-mitigated model clearly shows more uni-
formity of performance across terms, demonstrating that
the bias-mitigation technique does indeed reduce unintended
bias. The performance is still not completely uniform, however; there is still room for improvement.
Figure 4 shows the per-term false negative rates for the
three experiments. The effect is less pronounced here since
we added non-toxic (negative) data only, aiming specifically
to combat false positives. Most importantly, we do not see an
increase in variance of false negative rates, demonstrating
that the bias mitigation technique reduces unintended bias
on false positives, while not introducing false negative bias
on the measured terms.
Pinned AUC
We also demonstrate a reduction in unintended bias using
the new pinned AUC metric introduced in this work. As with
error rates, a more fair model will have similar performance
across all terms. Figure 5 shows the per-term pinned AUC for each model, and we again see more uniformity from the
bias-mitigated model. This demonstrates that the bias mit-
igation technique reduces unintended bias of the model’s
real-valued scores, not just of the thresholded binary clas-
sifier used to measure equality difference.
Figure 3: Per-term false positive rates for the baseline, ran-
dom control, and bias-mitigated models.
Figure 5: Per-term pinned AUC for the baseline, random
control, and bias-mitigated models.
Equality Difference Summary
Finally, we look at the equality difference for false positives,
false negatives, and pinned AUC to summarize each chart
into one metric, shown in Table 5. The bias-mitigated model shows
smaller sums of differences for all three metrics, indicating
more similarity in performance across identity terms, and
therefore less unintended bias.
Future Work
This work relies on machine learning researchers selecting a narrow definition of unintended bias tied to a specific set
of identity terms to measure and correct for. For future work,
we hope to remove the human step of identifying the relevant
Figure 4: Per-term false negative rates for the baseline, ran-
dom control, and bias-mitigated models.
Metric                              Baseline  Control  Bias-Mitigated
False Positive Equality Difference  74.13     77.72    52.94
False Negative Equality Difference  36.73     36.91    30.73
Pinned AUC Equality Difference      6.37      6.84     4.07
Table 5: Sums of differences between the per-term value and
the overall value for each model.
identity terms, either by automating the mining of identity
terms affected by unintended bias or by devising bias miti-
gation strategies that do not rely directly on a set of identity
terms. We also hope to generalize the methods to be less
dependent on individual words, so that we can more effec-
tively deal with biases tied to words used in many different
contexts, e.g. white vs black.
Conclusion

In this paper, we have proposed a definition of unintended
bias for text classification and distinguished it from fairness
in the application of ML. We have presented strategies for
quantifying and mitigating unintended bias in datasets and
the resulting models. We demonstrated that applying these strategies mitigates the unintended biases in a model without harming the overall model quality, and with very little
impact even on the original test set.
What we present here is a first step towards fairness in text classification; the path to fair models will of course require many more steps.
Appendix: Pinned AUC
We defined pinned AUC as copied below.
pD_t = s(D_t) + s(D),  |s(D_t)| = |s(D)|   (6)

pAUC_t = AUC(pD_t)   (7)
Depending on the exact data in the full set D, there are many options for sub-sampling down to s(D), each with impacts on the pinned AUC metric and its ability to reveal unintended bias. A full evaluation of these is left for future work, but here is a quick summary of the possible variants:

1. Replacement: While D_t ⊂ D, it may make sense to sample such that D_t ⊄ s(D). The results shown in this work sample this way.
2. Other subgroups: If D contains many subgroups in different amounts, s(D) could be sampled to ensure equal representation from each group. In this work, D is synthetically constructed such that each subgroup is equally represented.
References

Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.;
Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.;
Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard,
M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mané, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viégas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu,
Y.; and Zheng, X. 2015. TensorFlow: Large-scale machine
learning on heterogeneous systems. Software available from
Beutel, A.; Chen, J.; Zhao, Z.; and Chi, E. H. 2017. Data
decisions and theoretical implications when adversarially
learning fair representations. CoRR abs/1707.00075.
Blodgett, S. L., and O’Connor, B. 2017. Racial disparity in
natural language processing: A case study of social media
african-american english. CoRR abs/1707.00061.
Bolukbasi, T.; Chang, K.; Zou, J. Y.; Saligrama, V.; and
Kalai, A. 2016. Man is to computer programmer as woman
is to homemaker? Debiasing word embeddings. CoRR.
Chollet, F., et al. 2015. Keras.
Fawcett, T. 2006. An introduction to roc analysis. Pattern
recognition letters 27(8):861–874.
Feldman, M.; Friedler, S. A.; Moeller, J.; Scheidegger, C.;
and Venkatasubramanian, S. 2015. Certifying and remov-
ing disparate impact. In Proceedings of the 21th ACM
SIGKDD International Conference on Knowledge Discov-
ery and Data Mining, KDD '15, 259–268. New York, NY, USA: ACM.
Friedler, S. A.; Scheidegger, C.; and Venkatasubramanian,
S. 2016. On the (im)possibility of fairness. CoRR
Hardt, M.; Price, E.; and Srebro, N. 2016. Equality of op-
portunity in supervised learning. CoRR abs/1610.02413.
Hovy, D., and Spruit, S. L. 2016. The social impact of
natural language processing. In ACL.
Kleinberg, J. M.; Mullainathan, S.; and Raghavan, M. 2016.
Inherent trade-offs in the fair determination of risk scores.
CoRR abs/1609.05807.
Tatman, R. 2017. Gender and dialect bias in YouTube's automatic captions. Valencia, Spain: European Chapter of the
Association for Computational Linguistics.
Wulczyn, E.; Thain, N.; and Dixon, L. 2017. Ex machina:
Personal attacks seen at scale. In Proceedings of the 26th
International Conference on World Wide Web, WWW ’17,
1391–1399. Republic and Canton of Geneva, Switzerland:
International World Wide Web Conferences Steering Committee.
... (NLP) [7], [27] and can be easily learned by even the most advanced text classifiers [14]. Hate speech detection as one of the downstream tasks has been widely applied on social media platforms. ...
... Data augmentation tackles the problem at the data preprocessing stage. Instances relating to underrepresented groups are augmented through random combinations of pre-defined templates and group identifiers [7], [16], [36]. But augmentation without supervision always delivers meaningless and unrealistic instances, which could introduce potential risks into the system. ...
Hate speech detection is a common downstream application of natural language processing (NLP) in the real world. In spite of the increasing accuracy, current data-driven approaches could easily learn biases from the imbalanced data distributions originating from humans. The deployment of biased models could further enhance the existing social biases. But unlike handling tabular data, defining and mitigating biases in text classifiers, which deal with unstructured data, are more challenging. A popular solution for improving machine learning fairness in NLP is to conduct the debiasing process with a list of potentially discriminated words given by human annotators. In addition to suffering from the risks of overlooking the biased terms, exhaustively identifying bias with human annotators are unsustainable since discrimination is variable among different datasets and may evolve over time. To this end, we propose an automatic misuse detector (MiD) relying on an explanation method for detecting potential bias. And built upon that, an end-to-end debiasing framework with the proposed staged correction is designed for text classifiers without any external resources required.
... Datasets Our experiments are based on 4 models trained with the following benchmark datasets: Census Income [38], German Credit [25], Bank Marketing [35] and COMPAS [6]. These datasets have been used as the evaluation subjects in multiple previous studies [15,21,34,40,49,50]. ...
... It is possible to extend our method to support deep learning architectures such as RNN (for text data) by extending causality analysis to handle feedback loops. We focus on feedforward NN as existing studies on fairness largely focus on tabular data [15,21,34,40,49,50]. ...
Given a discriminating neural network, the problem of fairness improvement is to systematically reduce discrimination without significantly sacrificing its performance (i.e., accuracy). Multiple categories of fairness-improving methods have been proposed for neural networks, including pre-processing, in-processing and post-processing. Our empirical study however shows that these methods are not always effective (e.g., they may improve fairness at the price of a huge accuracy drop) or even helpful (e.g., they may worsen both fairness and accuracy). In this work, we propose an approach which adaptively chooses the fairness-improving method based on causality analysis. That is, we choose the method based on how the neurons and attributes responsible for unfairness are distributed among the input attributes and the hidden neurons. Our experimental evaluation shows that our approach is effective (i.e., it always identifies the best fairness-improving method) and efficient (i.e., with an average time overhead of 5 minutes).
... More generally, although our work focuses on adversarial attacks on generative models, it is heavily inspired by and related to prior work that examines the efficacy of adversarial testing to find and address vulnerabilities in NLP algorithms in discriminative settings. Some of these efforts augment humans (through guidelines, templates, programmatic generation of attacks, and various combinations thereof) to devise test cases that cause systems to fail [45,46,29,21,30,55,6,23]. Others use humans in the loop to continuously and dynamically build, break, and fix [20] models in order to continuously make them more robust to failure modes [40,32,55,61]. ...
We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.
... While many models have claimed to achieve state-of-the-art performance on some datasets, they fail to generalize (Arango, Pérez, and Poblete 2019; Gröndahl et al. 2018). The models may classify comments that refer to certain commonly-attacked identities (e.g., gay, black, muslim) as toxic without the comment having any intention of being toxic (Dixon et al. 2018; Borkan et al. 2019). A large prior on certain trigger vocabulary leads to biased predictions that may discriminate against particular groups who are already the target of such abuse (Sap et al. 2019; Davidson, Bhattacharya, and Weber 2019). ...
Hate speech is a challenging issue plaguing the online social media. While better models for hate speech detection are continuously being developed, there is little research on the bias and interpretability aspects of hate speech. In this paper, we introduce HateXplain, the first benchmark hate speech dataset covering multiple aspects of the issue. Each post in our dataset is annotated from three different perspectives: the basic, commonly used 3-class classification (i.e., hate, offensive or normal), the target community (i.e., the community that has been the victim of hate speech/offensive speech in the post), and the rationales, i.e., the portions of the post on which their labelling decision (as hate, offensive or normal) is based. We utilize existing state-of-the-art models and observe that even models that perform very well in classification do not score high on explainability metrics like model plausibility and faithfulness. We also observe that models, which utilize the human rationales for training, perform better in reducing unintended bias towards target communities. We have made our code and dataset public for other researchers.
... We employ LSTM (Long Short-Term Memory) (Hochreiter and Schmidhuber, 1997) as the toxicity classifier, which takes as input the variety of word embeddings considered in this work. We then measure the biases using two commonly adopted metrics (Dixon et al., 2018): False Positive Equality Difference (FPED) and False Negative Equality Difference (FNED). FNED/FPED is defined as the sum of deviations of group-specific False Negative Rates (FNRs)/False Positive Rates (FPRs) from the overall FNR/FPR. ...
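The FPED/FNED metrics mentioned in that excerpt can be computed directly from per-example predictions and group memberships. A minimal sketch, assuming binary labels and predictions and one identity term per example; the function and variable names are illustrative:

```python
def rates(labels, preds):
    # False positive rate and false negative rate for binary labels/preds.
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    return fpr, fnr

def equality_differences(labels, preds, groups, terms):
    # FPED/FNED: sum over identity terms of the absolute deviation of the
    # term-specific FPR/FNR from the overall FPR/FNR.
    overall_fpr, overall_fnr = rates(labels, preds)
    fped = fned = 0.0
    for term in terms:
        idx = [i for i, g in enumerate(groups) if g == term]
        term_fpr, term_fnr = rates([labels[i] for i in idx],
                                   [preds[i] for i in idx])
        fped += abs(overall_fpr - term_fpr)
        fned += abs(overall_fnr - term_fnr)
    return fped, fned
```

Both metrics are zero when every identity term sees the same error rates as the corpus overall; larger values indicate more unintended bias.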
Debiasing word embeddings has been largely limited to individual and independent social categories. However, real-world corpora typically present multiple social categories that possibly correlate or intersect with each other. For instance, "hair weaves" is stereotypically associated with African American females, but neither African American nor females alone. Therefore, this work studies biases associated with multiple social categories: joint biases induced by the union of different categories and intersectional biases that do not overlap with the biases of the constituent categories. We first empirically observe that individual biases intersect non-trivially (i.e., over a one-dimensional subspace). Drawing from the intersectional theory in social science and the linguistic theory, we then construct an intersectional subspace to debias for multiple social categories using the nonlinear geometry of individual biases. Empirical evaluations corroborate the efficacy of our approach. Data and implementation code can be downloaded at
Natural language models and systems have been shown to reflect gender bias existing in training data. This bias can impact the downstream task that machine learning models, built on this training data, are to accomplish. A variety of techniques have been proposed to mitigate gender bias in training data. In this paper we compare different gender bias mitigation approaches on a classification task. We consider mitigation techniques that manipulate the training data itself, including data scrubbing, gender swapping and counterfactual data augmentation approaches. We also look at using de-biased word embeddings in the representation of the training data. We evaluate the effectiveness of the different approaches at reducing the gender bias in the training data and consider the impact on task performance. Our results show that the performance of the classification task is not adversely affected by many of the bias mitigation techniques, but we show significant variation in the effectiveness of the different gender bias mitigation techniques. Keywords: Gender bias, Training data, Classification.
Data-driven algorithms are studied and deployed in diverse domains to support critical decisions, directly impacting people's well-being. As a result, a growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations. Progress in fair machine learning and equitable algorithm design hinges on data, which can be appropriately used only if adequately documented. Unfortunately, the algorithmic fairness community, as a whole, suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity). In this work, we target this data documentation debt by surveying over two hundred datasets employed in algorithmic fairness research, and producing standardized and searchable documentation for each of them. Moreover, we rigorously identify the three most popular fairness datasets, namely Adult, COMPAS, and German Credit, for which we compile in-depth documentation. This unifying documentation effort supports multiple contributions. Firstly, we summarize the merits and limitations of Adult, COMPAS, and German Credit, adding to and unifying recent scholarship, calling into question their suitability as general-purpose fairness benchmarks. Secondly, we document hundreds of available alternatives, annotating their domain and supported fairness tasks, along with additional properties of interest for fairness practitioners and researchers, including their format, cardinality, and the sensitive attributes they encode. We summarize this information, zooming in on the tasks, domains, and roles of these resources. Finally, we analyze these datasets from the perspective of five important data curation topics: anonymization, consent, inclusivity, labeling of sensitive attributes, and transparency. We discuss different approaches and levels of attention to these topics, making them tangible, and distill them into a set of best practices for the curation of novel resources.
Increasingly, online discussions have been subject to offensive comments and abuse from malintent users. The threat of abuse and online harassment often forces many people to withdraw from discussions or shut down essential conversations. The current explosion in the number of online communities poses a huge problem for online social media platforms in terms of effective moderation of discussions and prevention of abuse. This paper presents a scalable method for effectively classifying toxic comments. Recently, the XGBoost framework has become very popular among machine learning practitioners and data scientists due to its effectiveness, flexibility, and portability. It exploits the gradient boosting framework and adds several improvements that enhance the performance and accuracy of the model. Due to its desirability, we utilize XGBoost to create an efficient system for toxic comment classification and evaluate its performance using the area under the receiver operating characteristic curve. Keywords: XGBoost, Ensemble learning, Natural language processing, Machine learning.
Recent discussion in the public sphere about algorithmic classification has involved tension between competing notions of what it means for a probabilistic classification to be fair to different groups. We formalize three fairness conditions that lie at the heart of these debates, and we prove that except in highly constrained special cases, there is no method that can satisfy these three conditions simultaneously. Moreover, even satisfying all three conditions approximately requires that the data lie in an approximate version of one of the constrained special cases identified by our theorem. These results suggest some of the ways in which key notions of fairness are incompatible with each other, and hence provide a framework for thinking about the trade-offs between them.
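The three conditions at issue in that result can be written out explicitly. The following paraphrase uses common notation for a risk score and two groups, and follows standard statements of the theorem rather than the paper's exact formulation:

```latex
% Score s(x) \in [0,1], true label y \in \{0,1\}, groups g \in \{1,2\}.
\text{Calibration within groups:}\quad \Pr[\,y=1 \mid s(x)=v,\, g\,] = v
  \quad \text{for all } v \text{ and both } g
\text{Balance for the negative class:}\quad
  \mathbb{E}[\,s(x) \mid y=0,\, g=1\,] = \mathbb{E}[\,s(x) \mid y=0,\, g=2\,]
\text{Balance for the positive class:}\quad
  \mathbb{E}[\,s(x) \mid y=1,\, g=1\,] = \mathbb{E}[\,s(x) \mid y=1,\, g=2\,]
```

The theorem states that, outside of perfect prediction or identical base rates across groups, no score can satisfy all three simultaneously.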
The blind application of machine learning runs the risk of amplifying biases present in data. Such a danger faces us with word embeddings, a popular framework to represent text data as vectors which has been used in many machine learning and natural language processing tasks. We show that even word embeddings trained on Google News articles exhibit female/male gender stereotypes to a disturbing extent. This raises concerns because their widespread use, as we describe, often tends to amplify these biases. Geometrically, gender bias is first shown to be captured by a direction in the word embedding. Second, gender-neutral words are shown to be linearly separable from gender-definition words in the word embedding. Using these properties, we provide a methodology for modifying an embedding to remove gender stereotypes, such as the association between the words receptionist and female, while maintaining desired associations such as between the words queen and female. We define metrics to quantify both direct and indirect gender biases in embeddings, and develop algorithms to "debias" the embedding. Using crowd-worker evaluation as well as standard benchmarks, we empirically demonstrate that our algorithms significantly reduce gender bias in embeddings while preserving their useful properties, such as the ability to cluster related concepts and to solve analogy tasks. The resulting embeddings can be used in applications without amplifying gender bias.
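The core "neutralize" operation in this line of work removes a word vector's component along the learned bias direction, which is a simple orthogonal projection. A toy NumPy sketch (the vectors below are illustrative, not trained embeddings):

```python
import numpy as np

def neutralize(vec, direction):
    # Project out the component of `vec` along the (unit-normalized)
    # bias direction, leaving the orthogonal remainder.
    d = direction / np.linalg.norm(direction)
    return vec - np.dot(vec, d) * d
```

After neutralizing, the returned vector has zero dot product with the bias direction, so distances among neutralized words no longer depend on that axis.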
We highlight an important frontier in algorithmic fairness: disparity in the quality of natural language processing algorithms when applied to language from authors of different social groups. For example, current systems sometimes analyze the language of females and minorities more poorly than they do of whites and males. We conduct an empirical analysis of racial disparity in language identification for tweets written in African-American English, and discuss implications of disparity in NLP.
How can we learn a classifier that is "fair" for a protected or sensitive group, when we do not know if the input to the classifier belongs to the protected group? How can we train such a classifier when data on the protected group is difficult to attain? In many settings, finding out the sensitive input attribute can be prohibitively expensive even during model training, and sometimes impossible during model serving. For example, in recommender systems, if we want to predict if a user will click on a given recommendation, we often do not know many attributes of the user, e.g., race or age, and many attributes of the content are hard to determine, e.g., the language or topic. Thus, it is not feasible to use a different classifier calibrated based on knowledge of the sensitive attribute. Here, we use an adversarial training procedure to remove information about the sensitive attribute from the latent representation learned by a neural network. In particular, we study how the choice of data for the adversarial training effects the resulting fairness properties. We find two interesting results: a small amount of data is needed to train these adversarial models, and the data distribution empirically drives the adversary's notion of fairness.
The damage personal attacks cause to online discourse motivates many platforms to try to curb the phenomenon. However, understanding the prevalence and impact of personal attacks in online platforms at scale remains surprisingly difficult. The contribution of this paper is to develop and illustrate a method that combines crowdsourcing and machine learning to analyze personal attacks at scale. We show an evaluation method for a classifier in terms of the aggregated number of crowd-workers it can approximate. We apply our methodology to English Wikipedia, generating a corpus of over 100k high quality human-labeled comments and 63M machine-labeled ones from a classifier that is as good as the aggregate of 3 crowd-workers, as measured by the area under the ROC curve and Spearman correlation. Using this corpus of machine-labeled scores, our methodology allows us to explore some of the open questions about the nature of online personal attacks. This reveals that the majority of personal attacks on Wikipedia are not the result of a few malicious users, nor primarily the consequence of allowing anonymous contributions from unregistered users.
We propose a criterion for discrimination against a specified sensitive attribute in supervised learning, where the goal is to predict some target based on available features. Assuming data about the predictor, target, and membership in the protected group are available, we show how to optimally adjust any learned predictor so as to remove discrimination according to our definition. Our framework also improves incentives by shifting the cost of poor classification from disadvantaged groups to the decision maker, who can respond by improving the classification accuracy. In line with other studies, our notion is oblivious: it depends only on the joint statistics of the predictor, the target and the protected attribute, but not on interpretation of individual features. We study the inherent limits of defining and identifying biases based on such oblivious measures, outlining what can and cannot be inferred from different oblivious tests. We illustrate our notion using a case study of FICO credit scores.
What does it mean for an algorithm to be fair? Different papers use different notions of algorithmic fairness, and although these appear internally consistent, they also seem mutually incompatible. We present a mathematical setting in which the distinctions in previous papers can be made formal. In addition to characterizing the spaces of inputs (the "observed" space) and outputs (the "decision" space), we introduce the notion of a construct space: a space that captures unobservable, but meaningful variables for the prediction. We show that in order to prove desirable properties of the entire decision-making process, different mechanisms for fairness require different assumptions about the nature of the mapping from construct space to decision space. The results in this paper imply that future treatments of algorithmic fairness should more explicitly state assumptions about the relationship between constructs and observations.
What does it mean for an algorithm to be biased? In U.S. law, unintentional bias is encoded via disparate impact, which occurs when a selection process has widely different outcomes for different groups, even as it appears to be neutral. This legal determination hinges on a definition of a protected class (ethnicity, gender) and an explicit description of the process. When computers are involved, determining disparate impact (and hence bias) is harder. It might not be possible to disclose the process. In addition, even if the process is open, it might be hard to elucidate in a legal setting how the algorithm makes its decisions. Instead of requiring access to the process, we propose making inferences based on the data it uses. We present four contributions. First, we link disparate impact to a measure of classification accuracy that while known, has received relatively little attention. Second, we propose a test for disparate impact based on how well the protected class can be predicted from the other attributes. Third, we describe methods by which data might be made unbiased. Finally, we present empirical evidence supporting the effectiveness of our test for disparate impact and our approach for both masking bias and preserving relevant information in the data. Interestingly, our approach resembles some actual selection practices that have recently received legal scrutiny.
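The disparate-impact measure behind the legal "four-fifths rule" reduces to a ratio of positive-outcome rates between groups. A minimal sketch with all names illustrative (the paper's stronger test, predicting the protected class from the remaining attributes, would additionally require training a classifier):

```python
def disparate_impact(outcomes, groups, protected, reference):
    # Ratio of positive-outcome (selection) rates: protected vs. reference
    # group. Under the four-fifths rule, values below 0.8 flag potential
    # disparate impact.
    def rate(group):
        selected = [o for o, g in zip(outcomes, groups) if g == group]
        return sum(selected) / len(selected)
    return rate(protected) / rate(reference)
```

A ratio near 1.0 means the selection process treats the two groups similarly on outcomes, even if the process itself is opaque, which is the data-side perspective the abstract advocates.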