
Measuring and Mitigating Unintended Bias in Text Classification

Lucas Dixon

ldixon@google.com

John Li

jetpack@google.com

Jeffrey Sorensen

sorenj@google.com

Nithum Thain

nthain@google.com

Lucy Vasserman

lucyvasserman@google.com

Jigsaw

Abstract

We introduce and illustrate a new approach to measuring and

mitigating unintended bias in machine learning models. Our

deﬁnition of unintended bias is parameterized by a test set

and a subset of input features. We illustrate how this can

be used to evaluate text classiﬁers using a synthetic test set

and a public corpus of comments annotated for toxicity from

Wikipedia Talk pages. We also demonstrate how imbalances

in training data can lead to unintended bias in the resulting

models, and therefore potentially unfair applications. We use

a set of common demographic identity terms as the subset of

input features on which we measure bias. This technique per-

mits analysis in the common scenario where demographic in-

formation on authors and readers is unavailable, so that bias

mitigation must focus on the content of the text itself. The

mitigation method we introduce is an unsupervised approach

based on balancing the training dataset. We demonstrate that

this approach reduces the unintended bias without compro-

mising overall model quality.

Introduction

With the recent proliferation of the use of machine learning

for a wide variety of tasks, researchers have identiﬁed un-

fairness in ML models as one of the growing concerns in

the ﬁeld. Many ML models are built from human-generated

data, and human biases can easily result in a skewed distri-

bution in the training data. ML practitioners must be proac-

tive in recognizing and counteracting these biases, otherwise

our models and products risk perpetuating unfairness by per-

forming better for some users than for others.

Recent research in fairness in machine learning proposes

several deﬁnitions of fairness for machine learning tasks,

metrics for evaluating fairness, and techniques to mitigate

unfairness. The main contribution of this paper is to intro-

duce methods to quantify and mitigate unintended bias in

text classiﬁcation models. We illustrate the methods by ap-

plying them to a text classiﬁer built to identify toxic com-

ments in Wikipedia Talk Pages (Wulczyn, Thain, and Dixon

2017).

Initial versions of text classiﬁers trained on this data

showed problematic trends for certain statements. Clearly

non-toxic statements containing certain identity terms, such as "I am a gay man", were given unreasonably high toxicity scores. We call this false positive bias. The source of this bias was the disproportionate representation of identity terms in our training data: terms like "gay" were so frequently used in toxic comments that the models over-generalized and learned to disproportionately associate those terms with the toxicity label. In this work, we propose a method for identifying and mitigating this form of unintended model bias.

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

In the following sections, we describe related work, then

discuss a working deﬁnition of unintended bias in a clas-

siﬁcation task, and distinguish that from “unfairness” in an

application. We then demonstrate that a signiﬁcant cause of

unintended bias in our baseline model is due to dispropor-

tionate representation of data with certain identity terms and

provide a way to measure the extent of the disparity. We

then propose a simple and novel technique to counteract that

bias by strategically adding data. Finally, we present metrics

for evaluating unintended bias in a model, and demonstrate

that our technique reduces unintended bias while maintain-

ing overall model quality.

Related Work

Researchers of fairness in ML have proposed a range of def-

initions for “fairness” and metrics for its evaluation. Many

have also presented mitigation strategies to improve model

fairness according to these metrics. (Feldman et al. 2015)

provide a deﬁnition of fairness tied to demographic parity of

model predictions, and provides a strategy to alter the train-

ing data to improve fairness. (Hardt, Price, and Srebro 2016)

presents an alternate deﬁnition of fairness that requires par-

ity of model performance instead of predictions, and a mit-

igation strategy that applies to trained models. (Kleinberg,

Mullainathan, and Raghavan 2016) and (Friedler, Scheideg-

ger, and Venkatasubramanian 2016) both compare several

different fairness metrics. These works rely on the avail-

ability of demographic data about the object of classiﬁca-

tion in order to identify and mitigate bias. (Beutel et al.

2017) presents a new mitigation technique using adversarial

training that requires only a small amount of labeled demo-

graphic data.

Very little prior work has been done on fairness for text

classiﬁcation tasks. (Blodgett and O’Connor 2017), (Hovy

and Spruit 2016) and (Tatman 2017) discuss the impact of

using unfair natural language processing models for real-

world tasks, but do not provide mitigation strategies. (Boluk-

basi et al. 2016) demonstrates gender bias in word embed-

dings and provides a technique to “de-bias” them, allowing

these more fair embeddings to be used for any text-based

task.

Our work adds to this growing body of machine learning

fairness research with a novel approach to deﬁning, measur-

ing, and mitigating unintended bias for a text-based classiﬁ-

cation task.

Methodology

Model Task and Data

In this paper we work with a text classiﬁer built to iden-

tify toxicity in comments from Wikipedia Talk Pages. The

model is built from a dataset of 127,820 Talk Page com-

ments, each labeled by human raters as toxic or non-toxic.

A toxic comment is deﬁned as a “rude, disrespectful, or un-

reasonable comment that is likely to make you leave a dis-

cussion.” All versions of the model are convolutional neural

networks trained using the Keras framework (Chollet and

others 2015) in TensorFlow (Abadi et al. 2015).

Deﬁnitions of Unintended Bias and Fairness

The word ‘fairness’ in machine learning is used in various

ways. To avoid confusion, in this paper, we distinguish be-

tween the unintended biases in a machine learning model

and the potential unfair applications of the model.

Every machine learning model is designed to express a

bias. For example, a model trained to identify toxic com-

ments is intended to be biased such that comments that are

toxic get a higher score than those which are not. The model

is not intended to discriminate between the gender of the

people expressed in a comment - so if the model does so,

we call that unintended bias. We contrast this with fairness

which we use to refer to a potential negative impact on so-

ciety, and in particular when different individuals are treated

differently.

To illustrate this distinction, consider a model for toxi-

city that has unintended bias at a given threshold. For in-

stance, the model may give comments that contain the word

‘gay’ scores above the threshold independently of whether

the comment is toxic. If such a model is applied on a web-

site to remove comments that get a score above that thresh-

old, then we might speculate that the model will have a neg-

ative effect on society because it will make it more difﬁcult

on that website to discuss topics where one would naturally

use the word ‘gay’. Thus we might say that the model’s im-

pact is unfair (to people who wish to write comments that

contain the word gay). However, if the model is used to sort

and review all comments before they are published then we

might ﬁnd that the comments that contain the word gay are

reviewed ﬁrst, and then published earlier, producing an un-

fair impact for people who write comments without the word

gay (since their comments may be published later). If com-

ments are grouped for review but published in batch, then the

model’s unintended bias may not cause any unfair impact on

comment authors.

Since the presence of unintended bias can have varied im-

pacts on fairness, we aim to deﬁne and mitigate the unin-

tended bias that will improve fairness across a broad range

of potential model applications.

One definition, adapted from the literature, is: a model

contains unintended bias if it performs better for some de-

mographic groups than others (Hardt, Price, and Srebro

2016). To apply this to text classiﬁcation, we consider the

unintended bias across the content of the text, and narrow the

definition to: a model contains unintended bias if it performs

better for comments about some groups than for comments

about other groups.

In this work, we address one speciﬁc subcase of the above

deﬁnition, which we call identity term bias. Here, we nar-

row further from looking at all comments about different

groups to looking at comments containing speciﬁc identity

terms. Focusing on only a small selection of identity terms

enables us to make progress towards mitigating unintended

model bias, but it is of course only a ﬁrst step. For this work,

our deﬁnition is: a model contains unintended bias if it per-

forms better for comments containing some particular iden-

tity terms than for comments containing others.

The false positive bias described above, where non-toxic

comments containing certain identity terms were given un-

reasonably high toxicity scores, is the manifestation of un-

intended bias. In the rest of this paper, we lay out strategies

to measure and mitigate this unintended bias.

Quantifying bias in dataset

Identity terms affected by the false positive bias are dispro-

portionately used in toxic comments in our training data.

For example, the word ‘gay’ appears in 3% of toxic com-

ments but only 0.5% of comments overall. The combination

of dataset size, model training methods, and the dispropor-

tionate number of toxic examples for comments containing

these words in the training data led to overﬁtting in the origi-

nal toxicity model: it made generalizations such as associat-

ing the word ‘gay’ with toxicity. We manually created a set

of 51 common identity terms, and looked for similar dispro-

portionate representations. Table 1 illustrates the difference

between the likelihood of seeing a given identity in a toxic

statement vs. its overall likelihood.

Term          Toxic   Overall
atheist       0.09%   0.10%
queer         0.30%   0.06%
gay           3.00%   0.50%
transgender   0.04%   0.02%
lesbian       0.10%   0.04%
homosexual    0.80%   0.20%
feminist      0.05%   0.05%
black         0.70%   0.60%
white         0.90%   0.70%
heterosexual  0.02%   0.03%
islam         0.10%   0.08%
muslim        0.20%   0.10%
bisexual      0.01%   0.03%

Table 1: Frequency of identity terms in toxic comments and overall.
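The per-term rates in Table 1 can be reproduced from labeled comments with a few lines. A minimal sketch, assuming comments are plain strings and counting a term as present if it appears as a substring of the lowercased text (the function name is our own, not the authors' code):

```python
def term_frequencies(comments, labels, terms):
    """Rate at which each term appears in toxic comments vs. in all
    comments -- the two columns of Table 1."""
    toxic = [c.lower() for c, y in zip(comments, labels) if y == 1]
    everything = [c.lower() for c in comments]
    freqs = {}
    for term in terms:
        freqs[term] = (sum(term in c for c in toxic) / len(toxic),
                       sum(term in c for c in everything) / len(everything))
    return freqs
```

Substring matching is a simplification; token-level matching would avoid counting terms embedded in longer words.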

Figure 1: Percent of comments labeled as toxic at each

length containing the given terms.

In addition to a disproportionate amount of toxicity in

comments containing certain identity terms, there is also a

relationship between comment length and toxicity, as shown

in Figure 1.

The models we are training are known to have the ability

to capture contextual dependencies. However, with insufﬁ-

cient data, the model has no error signal that would require

these distinctions, so these models are likely to overgeneral-

ize, causing the false positive bias for identity terms.

Bias Mitigation

To mitigate the data imbalance which causes the unintended

bias, we added additional data, all containing non-toxic ex-

amples of the identity terms where we found the most dis-

proportionate data distributions.

For each term, we added enough new non-toxic examples

to bring the toxic/non-toxic balance in line with the prior

distribution for the overall dataset, at each length bucket de-

scribed above. Because our CNN models are sensitive to

length, and toxic comments tend to be shorter, we found bal-

ancing by length to be especially important.
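Concretely, the number of examples to add for each (term, length-bucket) cell follows from the target prior. A minimal sketch (the helper is hypothetical; the paper does not give its exact procedure):

```python
def nontoxic_examples_needed(n_toxic, n_nontoxic, prior_toxic_rate):
    """Extra non-toxic examples needed in one (term, length-bucket) cell
    so that its toxic fraction matches the overall prior:
    n_toxic / (n_toxic + n_nontoxic + extra) == prior_toxic_rate."""
    target_total = n_toxic / prior_toxic_rate
    return max(0, round(target_total - n_toxic - n_nontoxic))
```

For example, a cell with 30 toxic and 70 non-toxic comments, measured against a 10% overall toxic rate, needs 200 additional non-toxic examples; a cell already below the prior needs none.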

We mined the new data from Wikipedia articles them-

selves. Since the text comes from the published article, we

assume that the text is non-toxic, which we validated by

labeling 1,000 comments; 99.5% of them were confirmed

non-toxic. Using unsupervised, assumed non-toxic article

data enables the data balancing to be done without addi-

tional human labeling. Gathering additional supervised non-

toxic data from the original comment domain could be pro-

hibitively expensive or impossible, as the rareness of these

types of comments is the initial cause of the bias. We found

that using unsupervised data, even from a slightly different

domain (article vs. comment text), is a feasible and effective

strategy to mitigate bias.

Unintended Bias Evaluation Metrics

To evaluate and compare performance and unintended bias

in the models, we use the following test sets and evaluation

metrics.

Template example                                                   Label
I am <IDENTITY>                                                    Non-Toxic
I am a <IDENTITY> person, ask me anything                          Non-Toxic
<IDENTITY> people are just like everyone else                      Non-Toxic
I hate all <IDENTITY>                                              Toxic
I am a <IDENTITY> person and I hate your guts and think you suck   Toxic
<IDENTITY> people are gross and universally terrible               Toxic

Table 2: Phrase template examples.

General Test Set

We use a general held out test set sampled from the original

data set of Wikipedia Talk Page comments. This set evalu-

ates overall model performance (intended bias), but it does

not provide much information about unintended bias. We

use this set to ensure that bias mitigation techniques do not

drastically hurt overall performance. This test set contains

31,866 labeled comments, of which 9.6% are labeled toxic.

Identity Phrase Templates Test Set

To evaluate unintended bias speciﬁcally on comments con-

taining identity terms, we generated a synthetic dataset. We

created templates of both toxic and non-toxic phrases and

slotted a wide range of identity terms into each of these tem-

plates, examples shown in table 2.

This creates a controlled set of 77,000 examples, 50% of

which are toxic, where we can directly test for unintended

model bias by grouping the comments by identity term and

comparing performance on each group.
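Generating the synthetic set is a cross product of labeled templates and identity terms. A minimal sketch (the two templates below are illustrative stand-ins for the full template set behind the 77,000 examples):

```python
def build_phrase_set(templates, identity_terms):
    """Slot every identity term into every labeled template, keeping
    the term with each example for per-group evaluation."""
    return [(template.format(term), label, term)
            for template, label in templates
            for term in identity_terms]

templates = [("i am a {} person, ask me anything", 0),  # non-toxic
             ("i hate all {}", 1)]                      # toxic
examples = build_phrase_set(templates, ["gay", "straight", "muslim"])
```

Because every term appears in exactly the same templates, any performance difference between groups is attributable to the term itself.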

AUC

A common evaluation metric for real-valued scores is area

under the receiver operating characteristic curve or AUC.

We look at the AUC on the general and identity phrase template sets to gauge overall model performance. AUC on the full

phrase template set (all identity phrases together) gives a

limited picture of unintended bias. A low AUC indicates that

the model is performing differently for phrases with differ-

ent identity terms, but it doesn’t help us understand which

identity terms are the outliers.

Error Rate Equality Difference

Equality of Odds, proposed in (Hardt, Price, and Srebro

2016), is a deﬁnition of fairness that is satisﬁed when the

false positive rates and false negative rates are equal across

comments containing different identity terms. This concept

inspires the error rate equality difference metrics, which use

the variation in these error rates between terms to measure

the extent of unintended bias in the model, similar to the

equality gap metric used in (Beutel et al. 2017).

Using the identity phrase test set, we calculate the false positive rate, FPR, and false negative rate, FNR, on the entire test set, as well as these same metrics on each subset of the data containing each specific identity term, FPR_t and FNR_t. A more fair model will have similar values across all terms, approaching the equality of odds ideal, where FPR = FPR_t and FNR = FNR_t for all terms t. Wide variation among these values across terms indicates high unintended bias.

Error rate equality difference quantiﬁes the extent of the

per-term variation (and therefore the extent of unintended

bias) as the sum of the differences between the overall false

positive or negative rate and the per-term values, as shown

in equations 1 and 2.

False Positive Equality Difference = Σ_{t∈T} |FPR − FPR_t|   (1)

False Negative Equality Difference = Σ_{t∈T} |FNR − FNR_t|   (2)

Error rate equality differences evaluate classiﬁcation out-

comes, not real-valued scores, so in order to calculate this

metric, we must choose one (or multiple) score threshold(s).

In this work, we use the equal error rate threshold for each

evaluated model.
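Equations 1 and 2 reduce to a few lines once per-term binary predictions are in hand. A minimal sketch (the dict-of-lists data layout is our own assumption, not the authors' code):

```python
def error_rates(y_true, y_pred):
    """False positive rate and false negative rate of binary predictions."""
    fp = sum(p and not t for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    negatives = sum(1 for t in y_true if not t)
    positives = sum(1 for t in y_true if t)
    return fp / negatives, fn / positives

def equality_differences(per_term):
    """Sum of |overall - per-term| error rates (Equations 1 and 2).
    per_term maps each identity term to its (y_true, y_pred) lists."""
    all_true = [t for y_true, _ in per_term.values() for t in y_true]
    all_pred = [p for _, y_pred in per_term.values() for p in y_pred]
    fpr, fnr = error_rates(all_true, all_pred)
    fp_ed = sum(abs(fpr - error_rates(yt, yp)[0]) for yt, yp in per_term.values())
    fn_ed = sum(abs(fnr - error_rates(yt, yp)[1]) for yt, yp in per_term.values())
    return fp_ed, fn_ed
```

A perfectly uniform model would score zero on both sums; larger values indicate wider per-term variation.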

Pinned AUC

In addition to the error rate metrics, we also deﬁned a new

metric called pinned area under the curve (pinned AUC).

This metric addresses challenges with both regular AUC and

the error rate equality difference method and enables the

evaluation of unintended bias in a more general setting.

Many classiﬁcation models, including those implemented

in our research, provide a prediction score rather than a di-

rect class decision. Thresholding can then be used to trans-

form this score into a predicted class, though in practice,

consumers of these models often use the scores directly to

sort and prioritize text. Prior fairness metrics, like error rate

equality difference, only provide an evaluation of bias in the

context of direct binary classiﬁcation or after a threshold has

been chosen. The pinned AUC metric provides a threshold-

agnostic approach that detects bias in a wider range of use-

cases.

This approach adapts from the popular area under the receiver operating characteristic (AUC) metric, which provides

a threshold-agnostic evaluation of the performance of an ML

classiﬁer (Fawcett 2006). However, in the context of bias de-

tection, a direct application of AUC to the wrong datasets

can lead to inaccurate analysis. We demonstrate this with a

simulated hypothetical model represented in Figure 2.

Consider three datasets, each representing comments con-

taining different identity terms, here “tall”, “average”, or

“short”. The model represented by Figure 2 clearly con-

tains unintended bias, producing much higher scores for

both toxic and non-toxic comments containing “short”.

If we evaluate the model performance on each identity-

based dataset individually then we ﬁnd that the model ob-

tains a high AUC on each (Table 3), obscuring the unin-

tended bias we know is present. This is not surprising as the

model appears to perform well at separating toxic and non-

toxic comments within each identity. This demonstrates the

general principle that the AUC score of a model on a strictly per-group identity dataset may not effectively identify unintended bias.

Figure 2: Distributions of toxicity scores for three groups of data, each containing comments with different identity terms, "tall", "average", or "short".

By contrast, the AUC on the combined data is signiﬁ-

cantly lower, indicating poor model performance. The un-

derlying cause, in this case, is due to the unintended bias

reducing the separability of classes by giving non-toxic ex-

amples in the “short” subgroup a higher score than many

toxic examples from the other subgroups. However, a low

combined AUC is not of much help in diagnosing bias, as it

could have many other causes, nor does it help distinguish

which subgroups are likely to be most negatively impacted.

The AUC measure on both the individual datasets and the

aggregated one provide poor measures of unintended bias,

as neither answer the key question in measuring bias: is the

model performance on one subgroup different than its per-

formance on the average example?

The pinned AUC metric tackles this question directly. The

pinned AUC metric for a subgroup is deﬁned by comput-

ing the AUC on a secondary dataset containing two equally-

balanced components: a sample of comments from the sub-

group of interest and a sample of comments that reﬂect the

underlying distribution of comments. By creating this auxiliary dataset that "pins" the subgroup to the underlying distribution, we allow the AUC to capture the divergence of the

model performance on one subgroup with respect to the av-

erage example, providing a direct measure of bias.

More formally, if we let D represent the full set of comments and D_t be the set of comments in subgroup t, then we can generate the secondary dataset for term t by applying some sampling function s as in Equation 3 below.¹ Equation 4 then defines the pinned AUC of term t, pAUC_t, as the AUC of the corresponding secondary dataset.

pD_t = s(D_t) + s(D),   |s(D_t)| = |s(D)|   (3)

pAUC_t = AUC(pD_t)   (4)

¹ The exact technique for sub-sampling and defining D may vary depending on the data. See appendix.

Dataset    AUC    Pinned AUC
Combined   0.79   N/A
tall       0.93   0.84
average    0.93   0.84
short      0.93   0.79

Table 3: AUC results.

Table 3 demonstrates how the pinned AUC is able to

quantitatively reveal both the presence and victim of un-

intended bias. In this example, the “short” subgroup has a

lower pinned AUC than the other subgroups due to the bias

in the score distribution for those comments. While this is a

simple example, it extends to much larger sets of subgroups,

where pinned AUC can reveal unintended bias that would

otherwise be hidden.
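In code, Equations 3 and 4 are a sampling step followed by an ordinary AUC computation. A minimal sketch with a self-contained pairwise AUC (in practice a library routine such as scikit-learn's `roc_auc_score` would replace it; the data layout is our own assumption):

```python
import random

def auc(labels, scores):
    """Probability a random positive outranks a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pinned_auc(subgroup, full_set, sample_size, rng=random):
    """Equations 3-4: AUC on s(D_t) + s(D), with equal-sized samples.
    subgroup and full_set are lists of (label, score) pairs."""
    pinned = rng.sample(subgroup, sample_size) + rng.sample(full_set, sample_size)
    return auc([l for l, _ in pinned], [s for _, s in pinned])
```

In the toy example below, the subgroup's non-toxic comment outscores the full set's toxic comment, so its pinned AUC drops below the 1.0 that each dataset achieves on its own.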

Pinned AUC Equality Difference

While the actual value of the pinned AUC number is impor-

tant, for the purposes of unintended bias, it is most important

that the pinned AUC values are similar across groups. Simi-

lar pinned AUC values mean similar performance within the

overall distribution, indicating a lack of unintended bias. As

with equality of odds, in the ideal case, per-group pinned

AUCs and overall AUC would be equal. We therefore sum-

marize pinned AUC equality difference similarly to equality

difference for false positive and false negative rates above.

Pinned AUC equality difference, shown in equation 5, is

deﬁned as a sum of the differences between the per-term

pinned AUC (pAUC_t) and the overall AUC on the aggregated data over all identity terms (AUC). A lower sum

represents less variance between performance on individual

terms, and therefore less unintended bias.

Pinned AUC Equality Difference = Σ_{t∈T} |AUC − pAUC_t|   (5)
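Equation 5 is then a one-line aggregation over the per-term values (the names below are our own):

```python
def pinned_auc_equality_difference(overall_auc, pinned_aucs):
    """Sum over identity terms of |AUC - pAUC_t| (Equation 5).
    pinned_aucs maps each term to its pinned AUC."""
    return sum(abs(overall_auc - p) for p in pinned_aucs.values())
```

Applied to the Table 3 example (combined AUC 0.79), the per-term gaps of 0.05, 0.05, and 0 sum to an equality difference of 0.10.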

Experiments

We evaluate three models: a baseline, a bias-mitigated

model, and a control. Each of the three models is trained

using an identical convolutional neural network architec-

ture2. The baseline model is trained on all 127,820 super-

vised Wikipedia TalkPage comments. The bias-mitigated

model has undergone the bias mitigation technique de-

scribed above, adding 4,620 additional assumed non-toxic

training samples from Wikipedia articles to balance the dis-

tribution of speciﬁc identity terms. The control group also

adds 4,620 randomly selected comments from Wikipedia ar-

ticles, meant to conﬁrm that model improvements in the ex-

periment are not solely due to the addition of data.

² The details of the model and code are available at https://github.com/conversationai/unintended-ml-bias-analysis

Model            General   Phrase Templates
Baseline         0.960     0.952
Random Control   0.957     0.946
Bias Mitigated   0.959     0.960

Table 4: Mean AUC on the general and phrase templates test sets.

To capture the impact of training variance, we train each

model ten times, and show all results as scatter plots, with

each point representing one model.

Overall AUC

Table 4 shows the mean AUC for all three models on the

general test set and on the identity phrase set. We see that the

bias-mitigated model performs best on the identity phrase

set, while not losing performance on the general set, demon-

strating a reduction in unintended bias without compromis-

ing general model performance.

Error Rates

To evaluate using the error rate equality difference metric de-

ﬁned above and inspired by (Hardt, Price, and Srebro 2016),

we convert each model into a binary classiﬁer by selecting

a threshold for each model using the equal error rate com-

puted on the general test set. Here we compare the false posi-

tive and false negative rates for each identity term with each

model. A more fair model will have similar false positive

and negative rates across all terms, and a model with unin-

tended bias will have a wide variance in these metrics.
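The equal error rate threshold can be found by scanning candidate thresholds for the point where the two error rates are closest. A minimal sketch (a linear scan over observed scores; the authors' exact procedure is not specified):

```python
def equal_error_rate_threshold(labels, scores):
    """Score threshold at which false positive and false negative
    rates are closest (approximating the equal error rate point)."""
    negatives = labels.count(0)
    positives = labels.count(1)
    best_thr, best_gap = None, float("inf")
    for thr in sorted(set(scores)):
        preds = [s >= thr for s in scores]
        fpr = sum(p and not l for l, p in zip(labels, preds)) / negatives
        fnr = sum(l and not p for l, p in zip(labels, preds)) / positives
        if abs(fpr - fnr) < best_gap:
            best_thr, best_gap = thr, abs(fpr - fnr)
    return best_thr
```

Scanning only observed scores is sufficient, since the error rates change only at those points.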

Figure 3 shows the per-term false positive rates for the

baseline model, the random control, and the bias-mitigated

model. The bias-mitigated model clearly shows more uni-

formity of performance across terms, demonstrating that

the bias-mitigation technique does indeed reduce unintended

bias. The performance is still not completely uniform how-

ever, there is still room for improvement.

Figure 4 shows the per-term false negative rates for the

three experiments. The effect is less pronounced here since

we added non-toxic (negative) data only, aiming speciﬁcally

to combat false positives. Most importantly, we do not see an

increase in variance of false negative rates, demonstrating

that the bias mitigation technique reduces unintended bias

on false positives, while not introducing false negative bias

on the measured terms.

Pinned AUC

We also demonstrate a reduction in unintended bias using

the new pinned AUC metric introduced in this work. As with

error rates, a more fair model will have similar performance

across all terms. Figure 5 shows the per-term pinned AUC for each model, and we again see more uniformity from the

bias-mitigated model. This demonstrates that the bias mit-

igation technique reduces unintended bias of the model’s

real-valued scores, not just of the thresholded binary clas-

siﬁer used to measure equality difference.

Figure 3: Per-term false positive rates for the baseline, ran-

dom control, and bias-mitigated models.

Figure 5: Per-term pinned AUC for the baseline, random

control, and bias-mitigated models.

Equality Difference Summary

Finally, we look at the equality difference for false positives,

false negatives, and pinned AUC to summarize each chart

into one metric, shown in Table 5. The bias-mitigated model shows

smaller sums of differences for all three metrics, indicating

more similarity in performance across identity terms, and

therefore less unintended bias.

Figure 4: Per-term false negative rates for the baseline, random control, and bias-mitigated models.

Sums of differences
Metric                               Baseline   Control   Bias-Mitigated
False Positive Equality Difference   74.13      77.72     52.94
False Negative Equality Difference   36.73      36.91     30.73
Pinned AUC Equality Difference       6.37       6.84      4.07

Table 5: Sums of differences between the per-term value and the overall value for each model.

Future Work

This work relies on machine learning researchers selecting a narrow definition of unintended bias tied to a specific set of identity terms to measure and correct for. For future work, we hope to remove the human step of identifying the relevant identity terms, either by automating the mining of identity

terms affected by unintended bias or by devising bias miti-

gation strategies that do not rely directly on a set of identity

terms. We also hope to generalize the methods to be less

dependent on individual words, so that we can more effec-

tively deal with biases tied to words used in many different

contexts, e.g. white vs black.

Conclusion

In this paper, we have proposed a deﬁnition of unintended

bias for text classiﬁcation and distinguished it from fairness

in the application of ML. We have presented strategies for

quantifying and mitigating unintended bias in datasets and

the resulting models. We demonstrated that applying these

strategies mitigate the unintended biases in a model with-

out harming the overall model quality, and with very little

impact even on the original test set.

What we present here is a ﬁrst step towards fairness in text

classification; the path to fair models will of course require

many more steps.

Appendix

Pinned AUC

We deﬁned pinned AUC as copied below.

pD_t = s(D_t) + s(D),   |s(D_t)| = |s(D)|   (6)

pAUC_t = AUC(pD_t)   (7)

Depending on the exact data in the full set D, there are

many options for sub-sampling down to s(D), each with impacts on the pinned AUC metric and its ability to reveal unintended bias. A full evaluation of these is left for future

work, but here is a quick summary of the possible variants:

1. Replacement: While D_t ⊂ D, it may make sense to sample such that D_t ⊄ s(D). The results shown in this work sample this way.

2. Other subgroups: If D contains many subgroups in different amounts, s(D) could be sampled to ensure equal representation from each group. In this work, D is synthetically constructed such that each subgroup is equally represented.

References

Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard, M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mané, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viégas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; and Zheng, X. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Beutel, A.; Chen, J.; Zhao, Z.; and Chi, E. H. 2017. Data

decisions and theoretical implications when adversarially

learning fair representations. CoRR abs/1707.00075.

Blodgett, S. L., and O'Connor, B. 2017. Racial disparity in natural language processing: A case study of social media African-American English. CoRR abs/1707.00061.

Bolukbasi, T.; Chang, K.; Zou, J. Y.; Saligrama, V.; and

Kalai, A. 2016. Man is to computer programmer as woman

is to homemaker? debiasing word embeddings. CoRR

abs/1607.06520.

Chollet, F., et al. 2015. Keras. https://github.com/fchollet/keras.

Fawcett, T. 2006. An introduction to ROC analysis. Pattern Recognition Letters 27(8):861–874.

Feldman, M.; Friedler, S. A.; Moeller, J.; Scheidegger, C.;

and Venkatasubramanian, S. 2015. Certifying and remov-

ing disparate impact. In Proceedings of the 21th ACM

SIGKDD International Conference on Knowledge Discov-

ery and Data Mining, KDD ’15, 259–268. New York, NY,

USA: ACM.

Friedler, S. A.; Scheidegger, C.; and Venkatasubramanian,

S. 2016. On the (im)possibility of fairness. CoRR

abs/1609.07236.

Hardt, M.; Price, E.; and Srebro, N. 2016. Equality of op-

portunity in supervised learning. CoRR abs/1610.02413.

Hovy, D., and Spruit, S. L. 2016. The social impact of

natural language processing. In ACL.

Kleinberg, J. M.; Mullainathan, S.; and Raghavan, M. 2016.

Inherent trade-offs in the fair determination of risk scores.

CoRR abs/1609.05807.

Tatman, R. 2017. Gender and dialect bias in YouTube's automatic captions. Valencia, Spain: European Chapter of the Association for Computational Linguistics.

Wulczyn, E.; Thain, N.; and Dixon, L. 2017. Ex machina:

Personal attacks seen at scale. In Proceedings of the 26th

International Conference on World Wide Web, WWW ’17,

1391–1399. Republic and Canton of Geneva, Switzerland:

International World Wide Web Conferences Steering Com-

mittee.