Measuring and Mitigating Unintended Bias in Text Classification
We introduce and illustrate a new approach to measuring and
mitigating unintended bias in machine learning models. Our
definition of unintended bias is parameterized by a test set
and a subset of input features. We illustrate how this can
be used to evaluate text classifiers using a synthetic test set
and a public corpus of comments annotated for toxicity from
Wikipedia Talk pages. We also demonstrate how imbalances
in training data can lead to unintended bias in the resulting
models, and therefore potentially unfair applications. We use
a set of common demographic identity terms as the subset of
input features on which we measure bias. This technique per-
mits analysis in the common scenario where demographic in-
formation on authors and readers is unavailable, so that bias
mitigation must focus on the content of the text itself. The
mitigation method we introduce is an unsupervised approach
based on balancing the training dataset. We demonstrate that
this approach reduces the unintended bias without compro-
mising overall model quality.
Introduction
With the recent proliferation of the use of machine learning for a wide variety of tasks, researchers have identified unfairness in ML models as one of the growing concerns in the field. Many ML models are built from human-generated
data, and human biases can easily result in a skewed distri-
bution in the training data. ML practitioners must be proac-
tive in recognizing and counteracting these biases, otherwise
our models and products risk perpetuating unfairness by per-
forming better for some users than for others.
Recent research in fairness in machine learning proposes several definitions of fairness for machine learning tasks, metrics for evaluating fairness, and techniques to mitigate
unfairness. The main contribution of this paper is to introduce methods to quantify and mitigate unintended bias in text classification models. We illustrate the methods by applying them to a text classifier built to identify toxic comments in Wikipedia Talk Pages (Wulczyn, Thain, and Dixon 2017).

Copyright 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Initial versions of text classifiers trained on this data showed problematic trends for certain statements. Clearly non-toxic statements containing certain identity terms, such as "I am a gay man", were given unreasonably high toxicity scores. We call this false positive bias. The source of
this bias was the disproportionate representation of iden-
tity terms in our training data: terms like “gay” were
so frequently used in toxic comments that the models
over-generalized and learned to disproportionately associate
those terms with the toxicity label. In this work, we propose
a method for identifying and mitigating this form of unin-
tended model bias.
In the following sections, we describe related work, then
discuss a working definition of unintended bias in a classification task, and distinguish that from "unfairness" in an application. We then demonstrate that a significant cause of unintended bias in our baseline model is due to disproportionate representation of data with certain identity terms and
provide a way to measure the extent of the disparity. We
then propose a simple and novel technique to counteract that
bias by strategically adding data. Finally, we present metrics
for evaluating unintended bias in a model, and demonstrate
that our technique reduces unintended bias while maintain-
ing overall model quality.
Related Work
Researchers of fairness in ML have proposed a range of definitions for "fairness" and metrics for its evaluation. Many
have also presented mitigation strategies to improve model
fairness according to these metrics. (Feldman et al. 2015) provide a definition of fairness tied to demographic parity of model predictions, and provide a strategy to alter the training data to improve fairness. (Hardt, Price, and Srebro 2016) presents an alternate definition of fairness that requires parity of model performance instead of predictions, and a mitigation strategy that applies to trained models. (Kleinberg, Mullainathan, and Raghavan 2016) and (Friedler, Scheidegger, and Venkatasubramanian 2016) both compare several different fairness metrics. These works rely on the availability of demographic data about the object of classification in order to identify and mitigate bias. (Beutel et al. 2017) presents a new mitigation technique using adversarial training that requires only a small amount of labeled demographic data.
Very little prior work has been done on fairness for text classification tasks. (Blodgett and O'Connor 2017), (Hovy and Spruit 2016) and (Tatman 2017) discuss the impact of
using unfair natural language processing models for real-world tasks, but do not provide mitigation strategies. (Bolukbasi et al. 2016) demonstrates gender bias in word embeddings and provides a technique to "de-bias" them, allowing these more fair embeddings to be used for any text-based model.
Our work adds to this growing body of machine learning fairness research with a novel approach to defining, measuring, and mitigating unintended bias for a text-based classifier.

Model Task and Data
In this paper we work with a text classiﬁer built to iden-
tify toxicity in comments from Wikipedia Talk Pages. The
model is built from a dataset of 127,820 Talk Page com-
ments, each labeled by human raters as toxic or non-toxic.
A toxic comment is defined as a "rude, disrespectful, or unreasonable comment that is likely to make you leave a discussion." All versions of the model are convolutional neural
networks trained using the Keras framework (Chollet and
others 2015) in TensorFlow (Abadi et al. 2015).
Definitions of Unintended Bias and Fairness
The word ‘fairness’ in machine learning is used in various
ways. To avoid confusion, in this paper, we distinguish be-
tween the unintended biases in a machine learning model
and the potential unfair applications of the model.
Every machine learning model is designed to express a
bias. For example, a model trained to identify toxic com-
ments is intended to be biased such that comments that are
toxic get a higher score than those which are not. The model
is not intended to discriminate based on the gender of the people mentioned in a comment; if the model does so, we call that unintended bias. We contrast this with fairness, which we use to refer to a potential negative impact on society, in particular when different individuals are treated differently.
To illustrate this distinction, consider a model for toxi-
city that has unintended bias at a given threshold. For in-
stance, the model may give comments that contain the word
‘gay’ scores above the threshold independently of whether
the comment is toxic. If such a model is applied on a web-
site to remove comments that get a score above that thresh-
old, then we might speculate that the model will have a negative effect on society because it will make it more difficult
on that website to discuss topics where one would naturally
use the word ‘gay’. Thus we might say that the model’s im-
pact is unfair (to people who wish to write comments that
contain the word gay). However, if the model is used to sort and review all comments before they are published, then we might find that the comments that contain the word gay are reviewed first, and then published earlier, producing an unfair impact for people who write comments without the word gay (since their comments may be published later). If comments are grouped for review but published in batch, then the model's unintended bias may not cause any unfair impact on anyone.
Since the presence of unintended bias can have varied impacts on fairness, we aim to define and mitigate the unintended bias in a way that will improve fairness across a broad range of potential model applications.
One definition, adapted from the literature, is: a model contains unintended bias if it performs better for some demographic groups than others (Hardt, Price, and Srebro 2016). To apply this to text classification, we consider the unintended bias across the content of the text, and narrow the definition to: a model contains unintended bias if it performs better for comments about some groups than for comments about other groups.
In this work, we address one specific subcase of the above definition, which we call identity term bias. Here, we narrow further from looking at all comments about different groups to looking at comments containing specific identity terms. Focusing on only a small selection of identity terms enables us to make progress towards mitigating unintended model bias, but it is of course only a first step. For this work, our definition is: a model contains unintended bias if it performs better for comments containing some particular identity terms than for comments containing others.
The false positive bias described above, where non-toxic
comments containing certain identity terms were given un-
reasonably high toxicity scores, is the manifestation of un-
intended bias. In the rest of this paper, we lay out strategies
to measure and mitigate this unintended bias.
Quantifying Bias in the Dataset
Identity terms affected by the false positive bias are dispro-
portionately used in toxic comments in our training data.
For example, the word ‘gay’ appears in 3% of toxic com-
ments but only 0.5% of comments overall. The combination
of dataset size, model training methods, and the disproportionate number of toxic examples for comments containing these words in the training data led to overfitting in the original toxicity model: it made generalizations such as associating the word 'gay' with toxicity. We manually created a set
of 51 common identity terms, and looked for similar dispro-
portionate representations. Table 1 illustrates the difference
between the likelihood of seeing a given identity in a toxic
statement vs. its overall likelihood.
Term Toxic Overall
atheist 0.09% 0.10%
queer 0.30% 0.06%
gay 3% 0.50%
transgender 0.04% 0.02%
lesbian 0.10% 0.04%
homosexual 0.80% 0.20%
feminist 0.05% 0.05%
black 0.70% 0.60%
white 0.90% 0.70%
heterosexual 0.02% 0.03%
islam 0.10% 0.08%
muslim 0.20% 0.10%
bisexual 0.01% 0.03%
Table 1: Frequency of identity terms in toxic comments and overall.

Figure 1: Percent of comments labeled as toxic at each length, for comments containing the given terms.
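As an illustration of this measurement, the per-term rates in Table 1 can be computed by counting term occurrences in toxic comments and in the corpus overall. The sketch below is ours, not the authors' released code, and it matches terms by lowercase substring for brevity (a real implementation would tokenize):

```python
def term_rates(comments, labels, terms):
    """For each term, return (frequency in toxic comments,
    frequency in all comments), matching by lowercase substring."""
    toxic = [c.lower() for c, y in zip(comments, labels) if y == 1]
    everything = [c.lower() for c in comments]
    return {
        t: (sum(t in c for c in toxic) / len(toxic),
            sum(t in c for c in everything) / len(everything))
        for t in terms
    }

# Tiny hypothetical corpus: 'gay' appears in half the toxic
# comments and in half of all comments overall.
comments = ["you are gay and awful", "gay rights are human rights",
            "nice weather today", "you are horrible"]
labels = [1, 0, 0, 1]
print(term_rates(comments, labels, ["gay"]))  # {'gay': (0.5, 0.5)}
```

A large gap between the two rates for a term flags the kind of disproportionate representation shown in Table 1.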
In addition to a disproportionate amount of toxicity in
comments containing certain identity terms, there is also a relationship between comment length and toxicity, as shown in Figure 1.
The models we are training are known to have the ability to capture contextual dependencies. However, with insufficient data, the model has no error signal that would require it to make these distinctions, so such models are likely to overgeneralize, causing the false positive bias for identity terms.
Bias Mitigation
To mitigate the data imbalance which causes the unintended bias, we added additional data, all containing non-toxic examples of the identity terms where we found the most disproportionate data distributions.
For each term, we added enough new non-toxic examples
to bring the toxic/non-toxic balance in line with the prior
distribution for the overall dataset, at each length bucket de-
scribed above. Because our CNN models are sensitive to
length, and toxic comments tend to be shorter, we found bal-
ancing by length to be especially important.
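To make the balancing arithmetic concrete: if a (term, length-bucket) slice has T toxic and N non-toxic examples, and the overall dataset has toxic fraction p, then adding x non-toxic examples so that T/(T+N+x) = p gives x = T/p − T − N. A minimal sketch (ours, using the 9.6% prior reported for the general test set as an assumed overall rate):

```python
import math

def additions_needed(toxic, nontoxic, prior):
    """Non-toxic examples to add so this slice's toxic fraction
    falls to the overall prior: solve toxic/(toxic+nontoxic+x) = prior."""
    x = toxic / prior - toxic - nontoxic
    return max(0, math.ceil(x))

# Hypothetical bucket: 30 toxic / 70 non-toxic vs. a 9.6% prior.
print(additions_needed(30, 70, 0.096))  # 213
# A slice already at or below the prior needs nothing added.
print(additions_needed(5, 95, 0.096))   # 0
```

Running this per term and per length bucket yields the counts of new non-toxic examples to mine.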
We mined the new data from Wikipedia articles themselves. Since the text comes from the published article, we assume that the text is non-toxic; we validated this assumption by labeling 1,000 examples, 99.5% of which were confirmed non-toxic. Using unsupervised, assumed non-toxic article
data enables the data balancing to be done without addi-
tional human labeling. Gathering additional supervised non-
toxic data from the original comment domain could be pro-
hibitively expensive or impossible, as the rareness of these
types of comments is the initial cause of the bias. We found
that using unsupervised data, even from a slightly different
domain (article vs. comment text), is a feasible and effective
strategy to mitigate bias.
Unintended Bias Evaluation Metrics
To evaluate and compare performance and unintended bias
in the models, we use the following test sets and evaluation metrics.

Template Examples | Label
I am <IDENTITY> | Non-Toxic
I am a <IDENTITY> person, ask me anything | Non-Toxic
<IDENTITY> people are just like everyone else | Non-Toxic
I hate all <IDENTITY> | Toxic
I am a <IDENTITY> person and I hate your guts and think you suck | Toxic
<IDENTITY> people are gross and universally terrible | Toxic

Table 2: Phrase template examples.
General Test Set
We use a general held out test set sampled from the original
data set of Wikipedia Talk Page comments. This set evalu-
ates overall model performance (intended bias), but it does
not provide much information about unintended bias. We
use this set to ensure that bias mitigation techniques do not
drastically hurt overall performance. This test set contains 31,866 labeled comments, of which 9.6% are labeled toxic.
Identity Phrase Templates Test Set
To evaluate unintended bias specifically on comments containing identity terms, we generated a synthetic dataset. We
created templates of both toxic and non-toxic phrases and
slotted a wide range of identity terms into each of these templates; examples are shown in Table 2.
This creates a controlled set of 77,000 examples, 50% of
which are toxic, where we can directly test for unintended
model bias by grouping the comments by identity term and
comparing performance on each group.
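Generating such a set is mechanical; the sketch below is illustrative only (the templates and terms are stand-ins, not the full 77,000-example set):

```python
from itertools import product

# Illustrative templates with toxicity labels (1 = toxic).
TEMPLATES = [("I am {}", 0),
             ("I am a {} person, ask me anything", 0),
             ("I hate all {}", 1)]
TERMS = ["gay", "lesbian", "muslim", "feminist"]

def build_phrase_set(templates, terms):
    """Slot every identity term into every template, keeping
    the template's toxicity label."""
    return [(tpl.format(term), label)
            for (tpl, label), term in product(templates, terms)]

dataset = build_phrase_set(TEMPLATES, TERMS)
print(len(dataset))    # 3 templates x 4 terms = 12
print(dataset[0])      # ('I am gay', 0)
```

Because every term appears in every template, any per-term difference in model scores on this set reflects the term itself rather than the surrounding text.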
AUC
A common evaluation metric for real-valued scores is the area under the receiver operating characteristic curve, or AUC. We look at the AUC on the general and identity phrase template sets to gauge overall model performance. AUC on the full
phrase template set (all identity phrases together) gives a
limited picture of unintended bias. A low AUC indicates that
the model is performing differently for phrases with differ-
ent identity terms, but it doesn’t help us understand which
identity terms are the outliers.
Error Rate Equality Difference
Equality of Odds, proposed in (Hardt, Price, and Srebro
2016), is a definition of fairness that is satisfied when the
false positive rates and false negative rates are equal across
comments containing different identity terms. This concept
inspires the error rate equality difference metrics, which use
the variation in these error rates between terms to measure
the extent of unintended bias in the model, similar to the
equality gap metric used in (Beutel et al. 2017).
Using the identity phrase test set, we calculate the false positive rate, FPR, and false negative rate, FNR, on the entire test set, as well as these same metrics on each subset of the data containing a specific identity term, FPR_t and FNR_t. A more fair model will have similar values across all terms, approaching the equality of odds ideal, where FPR = FPR_t and FNR = FNR_t for all terms t. Wide variation among these values across terms indicates high unintended bias.
Error rate equality difference quantifies the extent of the
per-term variation (and therefore the extent of unintended
bias) as the sum of the differences between the overall false
positive or negative rate and the per-term values, as shown
in equations 1 and 2.
False Positive Equality Difference = Σ_t |FPR − FPR_t|   (1)

False Negative Equality Difference = Σ_t |FNR − FNR_t|   (2)
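Once the per-term error rates are computed, Equations 1 and 2 reduce to a single sum. A minimal sketch, using hypothetical rates:

```python
def equality_difference(overall, per_term):
    """Sum of |overall error rate - per-term error rate| over all
    identity terms (the form of Equations 1 and 2)."""
    return sum(abs(overall - r) for r in per_term.values())

# Hypothetical false positive rates for three terms.
fpr_t = {"gay": 0.30, "muslim": 0.15, "tall": 0.10}
print(round(equality_difference(0.10, fpr_t), 2))  # 0.25
```

The same function applies to false negative rates; a sum of zero means every term matches the overall rate exactly.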
Error rate equality differences evaluate classification outcomes, not real-valued scores, so in order to calculate this metric, we must choose one (or multiple) score threshold(s). In this work, we use the equal error rate threshold for each model.
Pinned AUC
In addition to the error rate metrics, we also define a new metric called pinned area under the curve (pinned AUC).
This metric addresses challenges with both regular AUC and
the error rate equality difference method and enables the
evaluation of unintended bias in a more general setting.
Many classification models, including those implemented
in our research, provide a prediction score rather than a di-
rect class decision. Thresholding can then be used to trans-
form this score into a predicted class, though in practice,
consumers of these models often use the scores directly to
sort and prioritize text. Prior fairness metrics, like error rate
equality difference, only provide an evaluation of bias in the
context of direct binary classiﬁcation or after a threshold has
been chosen. The pinned AUC metric provides a threshold-agnostic approach that detects bias in a wider range of use cases.
This approach adapts the popular area under the receiver operating characteristic (AUC) metric, which provides
a threshold-agnostic evaluation of the performance of an ML
classifier (Fawcett 2006). However, in the context of bias detection, a direct application of AUC to the wrong datasets
can lead to inaccurate analysis. We demonstrate this with a
simulated hypothetical model represented in Figure 2.
Consider three datasets, each representing comments con-
taining different identity terms, here “tall”, “average”, or
“short”. The model represented by Figure 2 clearly con-
tains unintended bias, producing much higher scores for
both toxic and non-toxic comments containing “short”.
If we evaluate the model performance on each identity-based dataset individually, then we find that the model obtains a high AUC on each (Table 3), obscuring the unintended bias we know is present. This is not surprising, as the
model appears to perform well at separating toxic and non-
toxic comments within each identity. This demonstrates the
general principle that the AUC score of a model on a strictly per-group identity dataset may not effectively identify unintended bias.

Figure 2: Distributions of toxicity scores for three groups of data, each containing comments with different identity terms, "tall", "average", or "short".

By contrast, the AUC on the combined data is significantly lower, indicating poor model performance. The underlying cause, in this case, is the unintended bias
reducing the separability of classes by giving non-toxic ex-
amples in the “short” subgroup a higher score than many
toxic examples from the other subgroups. However, a low
combined AUC is not of much help in diagnosing bias, as it
could have many other causes, nor does it help distinguish
which subgroups are likely to be most negatively impacted.
The AUC measure on both the individual datasets and the aggregated one provides a poor measure of unintended bias, as neither answers the key question in measuring bias: is the model performance on one subgroup different from its performance on the average example?
The pinned AUC metric tackles this question directly. The pinned AUC for a subgroup is defined by computing the AUC on a secondary dataset containing two equally-balanced components: a sample of comments from the subgroup of interest and a sample of comments that reflect the underlying distribution of comments. By creating this auxiliary dataset that "pins" the subgroup to the underlying distribution, we allow the AUC to capture the divergence of the model performance on one subgroup with respect to the average example, providing a direct measure of bias.
More formally, if we let D represent the full set of comments and D_t the set of comments in subgroup t, then we can generate the secondary dataset for term t by applying some sampling function s as in Equation 3 below.¹ Equation 4 then defines the pinned AUC of term t, pAUC_t, as the AUC of the corresponding secondary dataset.

pD_t = s(D_t) + s(D), |s(D_t)| = |s(D)|   (3)

pAUC_t = AUC(pD_t)   (4)

¹The exact technique for sub-sampling and defining D may vary depending on the data. See appendix.

Dataset AUC Pinned AUC
Combined 0.79 N/A
tall 0.93 0.84
average 0.93 0.84
short 0.93 0.79

Table 3: AUC results.
Table 3 demonstrates how the pinned AUC is able to
quantitatively reveal both the presence and victim of un-
intended bias. In this example, the “short” subgroup has a
lower pinned AUC than the other subgroups due to the bias
in the score distribution for those comments. While this is a
simple example, it extends to much larger sets of subgroups,
where pinned AUC can reveal unintended bias that would
otherwise be hidden.
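The computation can be sketched in a few lines. This is our own illustration rather than the released evaluation code: AUC is computed directly as the Mann-Whitney statistic, and the background sample s(D) is drawn uniformly from the full set.

```python
import random

def auc(scores, labels):
    """Mann-Whitney AUC: probability that a random positive
    outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pinned_auc(scores, labels, in_subgroup, rng=random):
    """AUC on the subgroup's examples 'pinned' to an equal-sized
    sample of the underlying distribution (Equations 3 and 4)."""
    sub = [(s, y) for s, y, m in zip(scores, labels, in_subgroup) if m]
    background = rng.sample(list(zip(scores, labels)), len(sub))
    mixed = sub + background
    return auc([s for s, _ in mixed], [y for _, y in mixed])

# A perfectly separable toy set has AUC 1.0.
print(auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
```

Pinning the subgroup against the background is what lets a uniformly shifted score distribution, like the "short" subgroup's, show up as a depressed AUC.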
Pinned AUC Equality Difference
While the actual value of the pinned AUC number is impor-
tant, for the purposes of unintended bias, it is most important
that the pinned AUC values are similar across groups. Simi-
lar pinned AUC values mean similar performance within the
overall distribution, indicating a lack of unintended bias. As
with equality of odds, in the ideal case, per-group pinned
AUCs and overall AUC would be equal. We therefore sum-
marize pinned AUC equality difference similarly to equality
difference for false positive and false negative rates above.
Pinned AUC equality difference, shown in Equation 5, is defined as the sum of the differences between the per-term pinned AUC (pAUC_t) and the overall AUC on the aggregated data over all identity terms (AUC). A lower sum
represents less variance between performance on individual
terms, and therefore less unintended bias.
Pinned AUC Equality Difference = Σ_t |AUC − pAUC_t|   (5)
Experiments
We evaluate three models: a baseline, a bias-mitigated model, and a control. Each of the three models is trained using an identical convolutional neural network architecture². The baseline model is trained on all 127,820 supervised Wikipedia Talk Page comments. The bias-mitigated
model has undergone the bias mitigation technique de-
scribed above, adding 4,620 additional assumed non-toxic
training samples from Wikipedia articles to balance the dis-
tribution of specific identity terms. The control group also adds 4,620 randomly selected comments from Wikipedia articles, meant to confirm that model improvements in the experiment are not solely due to the addition of data.
²The details of the model and code are available at
Model General Phrase Templates
Baseline 0.960 0.952
Random Control 0.957 0.946
Bias Mitigated 0.959 0.960
Table 4: Mean AUC on the general and phrase template test sets.
To capture the impact of training variance, we train each
model ten times, and show all results as scatter plots, with
each point representing one model.
Table 4 shows the mean AUC for all three models on the
general test set and on the identity phrase set. We see that the
bias-mitigated model performs best on the identity phrase
set, while not losing performance on the general set, demon-
strating a reduction in unintended bias without compromis-
ing general model performance.
Error Rates
To evaluate using the error rate equality difference metric defined above and inspired by (Hardt, Price, and Srebro 2016), we convert each model into a binary classifier by selecting a threshold for each model using the equal error rate computed on the general test set. Here we compare the false positive and false negative rates for each identity term with each
model. A more fair model will have similar false positive
and negative rates across all terms, and a model with unin-
tended bias will have a wide variance in these metrics.
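The equal error rate threshold can be found with a simple scan over candidate thresholds; a sketch (ours, assuming scores in [0, 1]):

```python
def equal_error_rate_threshold(scores, labels, steps=1000):
    """Return the threshold in [0, 1] where the false positive
    rate and false negative rate are closest to equal."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best_t, best_gap = 0.0, float("inf")
    for i in range(steps + 1):
        t = i / steps
        fpr = sum(s >= t for s in neg) / len(neg)
        fnr = sum(s < t for s in pos) / len(pos)
        if abs(fpr - fnr) < best_gap:
            best_t, best_gap = t, abs(fpr - fnr)
    return best_t

t = equal_error_rate_threshold([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
# On this toy set any threshold in (0.3, 0.8] gives FPR = FNR = 0.
```

The resulting threshold is then applied per model before computing the per-term error rates.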
Figure 3 shows the per-term false positive rates for the
baseline model, the random control, and the bias-mitigated
model. The bias-mitigated model clearly shows more uni-
formity of performance across terms, demonstrating that
the bias-mitigation technique does indeed reduce unintended
bias. The performance is still not completely uniform, however; there is still room for improvement.
Figure 4 shows the per-term false negative rates for the
three experiments. The effect is less pronounced here since we added only non-toxic (negative) data, aiming specifically to combat false positives. Most importantly, we do not see an
increase in variance of false negative rates, demonstrating
that the bias mitigation technique reduces unintended bias
on false positives, while not introducing false negative bias
on the measured terms.
We also demonstrate a reduction in unintended bias using
the new pinned AUC metric introduced in this work. As with
error rates, a more fair model will have similar performance
across all terms. Figure 5 shows the per-term pinned AUC for each model, and we again see more uniformity from the bias-mitigated model. This demonstrates that the bias mitigation technique reduces unintended bias of the model's
real-valued scores, not just of the thresholded binary classifier used to measure equality difference.
Figure 3: Per-term false positive rates for the baseline, ran-
dom control, and bias-mitigated models.
Figure 5: Per-term pinned AUC for the baseline, random
control, and bias-mitigated models.
Equality Difference Summary
Finally, we look at the equality difference for false positives, false negatives, and pinned AUC to summarize each chart into one metric, shown in Table 5. The bias-mitigated model shows
smaller sums of differences for all three metrics, indicating
more similarity in performance across identity terms, and
therefore less unintended bias.
Figure 4: Per-term false negative rates for the baseline, random control, and bias-mitigated models.

Sums of differences
Metric Baseline Control Bias-Mitigated
False Positive Equality Difference 74.13 77.72 52.94
False Negative Equality Difference 36.73 36.91 30.73
Pinned AUC Equality Difference 6.37 6.84 4.07

Table 5: Sums of differences between the per-term value and the overall value for each model.

Future Work
This work relies on machine learning researchers selecting a narrow definition of unintended bias tied to a specific set of identity terms to measure and correct for. For future work, we hope to remove the human step of identifying the relevant identity terms, either by automating the mining of identity terms affected by unintended bias or by devising bias mitigation strategies that do not rely directly on a set of identity terms. We also hope to generalize the methods to be less dependent on individual words, so that we can more effectively deal with biases tied to words used in many different contexts, e.g. "white" vs. "black".
Conclusion
In this paper, we have proposed a definition of unintended bias for text classification and distinguished it from fairness in the application of ML. We have presented strategies for quantifying and mitigating unintended bias in datasets and the resulting models. We demonstrated that applying these strategies mitigates the unintended biases in a model without harming the overall model quality, and with very little impact even on the original test set.
What we present here is a first step towards fairness in text classification; the path to fair models will of course require many more steps.
Appendix
We defined pinned AUC as copied below.

pD_t = s(D_t) + s(D), |s(D_t)| = |s(D)|   (6)
Depending on the exact data in the full set D, there are many options for sub-sampling down to s(D), each with impacts on the pinned AUC metric and its ability to reveal unintended bias. A full evaluation of these is left for future work, but here is a quick summary of the possible variants:
1. Replacement: While D_t ⊂ D, it may make sense to sample such that D_t ⊄ s(D). The results shown in this work sample this way.
2. Other subgroups: If D contains many subgroups in different amounts, s(D) could be sampled to ensure equal representation from each group. In this work, D is synthetically constructed such that each subgroup is equally represented.
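Variant 1 can be sketched as follows (our illustration, with hypothetical names): draw the background sample s(D) only from comments outside the subgroup, so that D_t is not a subset of s(D).

```python
import random

def background_sample(comments, subgroup_indices, k, rng):
    """Sample k background comments from outside the subgroup,
    so the subgroup is not a subset of s(D) (variant 1)."""
    pool = [c for i, c in enumerate(comments)
            if i not in subgroup_indices]
    return rng.sample(pool, k)

sample = background_sample(["a", "b", "c", "d"], {0, 1}, 2,
                           random.Random(0))
# With only two comments outside the subgroup, both are drawn.
```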
References
Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard, M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mané, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viégas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; and Zheng, X. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from
Beutel, A.; Chen, J.; Zhao, Z.; and Chi, E. H. 2017. Data
decisions and theoretical implications when adversarially
learning fair representations. CoRR abs/1707.00075.
Blodgett, S. L., and O’Connor, B. 2017. Racial disparity in
natural language processing: A case study of social media
african-american english. CoRR abs/1707.00061.
Bolukbasi, T.; Chang, K.; Zou, J. Y.; Saligrama, V.; and
Kalai, A. 2016. Man is to computer programmer as woman
is to homemaker? debiasing word embeddings. CoRR
Chollet, F., et al. 2015. Keras. https://github.com/
Fawcett, T. 2006. An introduction to roc analysis. Pattern
recognition letters 27(8):861–874.
Feldman, M.; Friedler, S. A.; Moeller, J.; Scheidegger, C.;
and Venkatasubramanian, S. 2015. Certifying and remov-
ing disparate impact. In Proceedings of the 21th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, 259–268. New York, NY, USA: ACM.
Friedler, S. A.; Scheidegger, C.; and Venkatasubramanian,
S. 2016. On the (im)possibility of fairness. CoRR
Hardt, M.; Price, E.; and Srebro, N. 2016. Equality of op-
portunity in supervised learning. CoRR abs/1610.02413.
Hovy, D., and Spruit, S. L. 2016. The social impact of
natural language processing. In ACL.
Kleinberg, J. M.; Mullainathan, S.; and Raghavan, M. 2016.
Inherent trade-offs in the fair determination of risk scores.
Tatman, R. 2017. Gender and dialect bias in youtube’s au-
tomatic captions. Valencia, Spain: European Chapter of the
Association for Computational Linguistics.
Wulczyn, E.; Thain, N.; and Dixon, L. 2017. Ex machina:
Personal attacks seen at scale. In Proceedings of the 26th
International Conference on World Wide Web, WWW ’17,
1391–1399. Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee.