Conference PaperPDF Available

Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989
Copenhagen, Denmark, September 7–11, 2017. c
2017 Association for Computational Linguistics
Men Also Like Shopping:
Reducing Gender Bias Amplification using Corpus-level Constraints
Jieyu Zhao§Tianlu Wang§Mark Yatskar
Vicente Ordonez§Kai-Wei Chang§
§University of Virginia
{jz4fu, tw8cb, vicente, kc2wc}
University of Washington
Language is increasingly being used to de-
fine rich visual recognition problems with
supporting image collections sourced from
the web. Structured prediction models are
used in these tasks to take advantage of
correlations between co-occurring labels
and visual input but risk inadvertently en-
coding social biases found in web corpora.
In this work, we study data and models as-
sociated with multilabel object classifica-
tion and visual semantic role labeling. We
find that (a) datasets for these tasks con-
tain significant gender bias and (b) mod-
els trained on these datasets further am-
plify existing bias. For example, the ac-
tivity cooking is over 33% more likely
to involve females than males in a train-
ing set, and a trained model further ampli-
fies the disparity to 68% at test time. We
propose to inject corpus-level constraints
for calibrating existing structured predic-
tion models and design an algorithm based
on Lagrangian relaxation for collective in-
ference. Our method results in almost no
performance loss for the underlying recog-
nition task but decreases the magnitude of
bias amplification by 47.5% and 40.5% for
multilabel classification and visual seman-
tic role labeling, respectively.
1 Introduction
Visual recognition tasks involving language, such
as captioning (Vinyals et al., 2015), visual ques-
tion answering (Antol et al., 2015), and visual se-
mantic role labeling (Yatskar et al., 2016), have
emerged as avenues for expanding the diversity
of information that can be recovered from im-
ages. These tasks aim at extracting rich seman-
tics from images and require large quantities of la-
beled data, predominantly retrieved from the web.
Methods often combine structured prediction and
deep learning to model correlations between la-
bels and images to make judgments that otherwise
would have weak visual support. For example, in
the first image of Figure 1, it is possible to pre-
dict a spatula by considering that it is a com-
mon tool used for the activity cooking. Yet such
methods run the risk of discovering and exploiting
societal biases present in the underlying web cor-
pora. Without properly quantifying and reducing
the reliance on such correlations, broad adoption
of these models can have the inadvertent effect of
magnifying stereotypes.
In this paper, we develop a general framework
for quantifying bias and study two concrete tasks,
visual semantic role labeling (vSRL) and multil-
abel object classification (MLC). In vSRL, we use
the imSitu formalism (Yatskar et al., 2016, 2017),
where the goal is to predict activities, objects and
the roles those objects play within an activity. For
MLC, we use MS-COCO (Lin et al., 2014; Chen
et al., 2015), a recognition task covering 80 object
classes. We use gender bias as a running example
and show that both supporting datasets for these
tasks are biased with respect to a gender binary1.
Our analysis reveals that over 45% and 37%
of verbs and objects, respectively, exhibit bias to-
ward a gender greater than 2:1. For example, as
seen in Figure 1, the cooking activity in imSitu
is a heavily biased verb. Furthermore, we show
that after training state-of-the-art structured pre-
dictors, models amplify the existing bias, by 5.0%
for vSRL, and 3.6% in MLC.
1To simplify our analysis, we only consider a gender bi-
nary as perceived by annotators in the datasets. We recog-
nize that a more fine-grained analysis would be needed for
deployment in a production system. Also, note that the pro-
posed approach can be applied to other NLP tasks and other
variables such as identification with a racial or ethnic group.
Figure 1: Five example images from the imSitu visual semantic role labeling (vSRL) dataset. Each im-
age is paired with a table describing a situation: the verb, cooking, its semantic roles, i.e agent, and
noun values filling that role, i.e. woman. In the imSitu training set, 33% of cooking images have man
in the agent role while the rest have woman. After training a Conditional Random Field (CRF), bias is
amplified: man fills 16% of agent roles in cooking images. To reduce this bias amplification our cal-
ibration method adjusts weights of CRF potentials associated with biased predictions. After applying our
methods, man appears in the agent role of 20% of cooking images, reducing the bias amplification
by 25%, while keeping the CRF vSRL performance unchanged.
To mitigate the role of bias amplification when
training models on biased corpora, we propose
a novel constrained inference framework, called
RBA, for Reducing Bias Amplification in predic-
tions. Our method introduces corpus-level con-
straints so that gender indicators co-occur no more
often together with elements of the prediction task
than in the original training distribution. For ex-
ample, as seen in Figure 1, we would like noun
man to occur in the agent role of the cooking
as often as it occurs in the imSitu training set when
evaluating on a development set. We combine
our calibration constraint with the original struc-
tured predictor and use Lagrangian relaxation (Ko-
rte and Vygen, 2008; Rush and Collins, 2012) to
reweigh bias creating factors in the original model.
We evaluate our calibration method on imSitu
vSRL and COCO MLC and find that in both in-
stances, our models substantially reduce bias am-
plification. For vSRL, we reduce the average mag-
nitude of bias amplification by 40.5%. For MLC,
we are able to reduce the average magnitude of
bias amplification by 47.5%. Overall, our calibra-
tion methods do not affect the performance of the
underlying visual system, while substantially re-
ducing the reliance of the system on socially bi-
ased correlations2.
2Code and data are available at https://github.
2 Related Work
As intelligence systems start playing important
roles in our daily life, ethics in artificial in-
telligence research has attracted significant in-
terest. It is known that big-data technologies
sometimes inadvertently worsen discrimination
due to implicit biases in data (Podesta et al.,
2014). Such issues have been demonstrated in var-
ious learning systems, including online advertise-
ment systems (Sweeney, 2013), word embedding
models (Bolukbasi et al., 2016; Caliskan et al.,
2017), online news (Ross and Carter, 2011), web
search (Kay et al., 2015), and credit score (Hardt
et al., 2016). Data collection biases have been
discussed in the context of creating image cor-
pus (Misra et al., 2016; van Miltenburg, 2016)
and text corpus (Gordon and Van Durme, 2013;
Van Durme, 2010). In contrast, we show that given
a gender biased corpus, structured models such as
conditional random fields, amplify the bias.
The effect of the data imbalance can be easily
detected and fixed when the prediction task is sim-
ple. For example, when classifying binary data
with unbalanced labels (i.e., samples in the major-
ity class dominate the dataset), a classifier trained
exclusively to optimize accuracy learns to always
predict the majority label, as the cost of mak-
ing mistakes on samples in the minority class can
be neglected. Various approaches have been pro-
posed to make a “fair” binary classification (Baro-
cas and Selbst, 2014; Dwork et al., 2012; Feldman
et al., 2015; Zliobaite, 2015). For structured pre-
diction tasks the effect is harder to quantify and
we are the first to propose methods to reduce bias
amplification in this context.
Lagrangian relaxation and dual decomposi-
tion techniques have been widely used in NLP
tasks (e.g., (Sontag et al., 2011; Rush and Collins,
2012; Chang and Collins, 2011; Peng et al., 2015))
for dealing with instance-level constraints. Simi-
lar techniques (Chang et al., 2013; Dalvi, 2015)
have been applied in handling corpus-level con-
straints for semi-supervised multilabel classifica-
tion. In contrast to previous works aiming for
improving accuracy performance, we incorporate
corpus-level constraints for reducing gender bias.
3 Visualizing and Quantifying Biases
Modern statistical learning approaches capture
correlations among output variables in order to
make coherent predictions. However, for real-
world applications, some implicit correlations are
not appropriate, especially if they are amplified.
In this section, we present a general framework to
analyze inherent biases learned and amplified by a
prediction model.
Identifying bias We consider that prediction
problems involve several inter-dependent output
variables y1, y2, ...yK, which can be represented
as a structure y={y1, y2, ...yK} ∈ Y. This
is a common setting in NLP applications, includ-
ing tagging, and parsing. For example, in the
vSRL task, the output can be represented as a
structured table as shown in Fig 1. Modern tech-
niques often model the correlation between the
sub-components in yand make a joint prediction
over them using a structured prediction model.
More details will be provided in Section 4.
We assume there is a subset of output vari-
ables gy, g Gthat reflects demographic at-
tributes such as gender or race (e.g. gG=
{man,woman}is the agent), and there is another
subset of the output oy, o Othat are co-
related with g(e.g., ois the activity present in an
image, such as cooking). The goal is to identify
the correlations that are potentially amplified by a
learned model.
To achieve this, we define the bias score of a
given output, o, with respect to a demographic
variable, g, as:
b(o, g) = c(o, g)
Pg0Gc(o, g0),
where c(o, g)is the number of occurrences of o
and gin a corpus. For example, to analyze how
genders of agents and activities are co-related in
vSRL, we define the gender bias toward man for
each verb b(verb, man)as:
c(verb, man)
c(verb, man) + c(v erb, woman).(1)
If b(o, g)>1/kGk, then ois positively correlated
with gand may exhibit bias.
Evaluating bias amplification To evaluate the
degree of bias amplification, we propose to com-
pare bias scores on the training set, b(o, g), with
bias scores on an unlabeled evaluation set of im-
ages ˜
b(o, g)that has been annotated by a predic-
tor. We assume that the evaluation set is iden-
tically distributed to the training set. There-
fore, if ois positively correlated with g(i.e,
b(o, g)>1/kGk) and ˜
b(o, g)is larger than
b(o, g), we say bias has been amplified. For
example, if b(cooking,woman) = .66, and
b(cooking,woman) = .84, then the bias of
woman toward cooking has been amplified. Fi-
nally, we define the mean bias amplification as:
b(o, g)b(o, g).
This score estimates the average magnitude of bias
amplification for pairs of oand gwhich exhibited
4 Calibration Algorithm
In this section, we introduce Reducing Bias
Amplification, RBA, a debiasing technique for
calibrating the predictions from a structured pre-
diction model. The intuition behind the algorithm
is to inject constraints to ensure the model pre-
dictions follow the distribution observed from the
training data. For example, the constraints added
to the vSRL system ensure the gender ratio of each
verb in Eq. (1) are within a given margin based on
the statistics of the training data. These constraints
are applied at the corpus level, because comput-
ing gender ratio requires the predictions of all test
instances. As a result, a joint inference over test
instances is required3. Solving such a giant in-
ference problem with constraints is hard. There-
fore, we present an approximate inference algo-
rithm based on Lagrangian relaxation. The advan-
tages of this approach are:
Our algorithm is iterative, and at each it-
eration, the joint inference problem is de-
composed to a per-instance basis. This can
be solved by the original inference algo-
rithm. That is, our approach works as a meta-
algorithm and developers do not need to im-
plement a new inference algorithm.
The approach is general and can be applied in
any structured model.
Lagrangian relaxation guarantees the solu-
tion is optimal if the algorithm converges and
all constraints are satisfied.
In practice, it is hard to obtain a solution where
all corpus-level constrains are satisfied. However,
we show that the performance of the proposed ap-
proach is empirically strong. We use imSitu for
vSRL as a running example to explain our algo-
Structured Output Prediction As we men-
tioned in Sec. 3, we assume the structured output
yYconsists of several sub-components. Given
a test instance ias an input, the inference problem
is to find
arg max
yYfθ(y, i),
where fθ(y, i)is a scoring function based on a
model θlearned from the training data. The struc-
tured output yand the scoring function fθ(y, i)can
be decomposed into small components based on
an independence assumption. For example, in the
vSRL task, the output yconsists of two types of
binary output variables {yv}and {yv,r}. The vari-
able yv= 1 if and only if the activity vis chosen.
Similarly, yv,r = 1 if and only if both the activity v
and the semantic role rare assigned 4. The scoring
function fθ(y, i)is decomposed accordingly such
fθ(y, i) = X
yvsθ(v, i) + X
yv,r sθ(v, r, i),
3A sufficiently large sample of test instances must be used
so that bias statistics can be estimated. In this work we use
the entire test set for each respective problem.
4We use rto refer to a combination of role and noun. For
example, one possible value indicates an agent is a woman.
represents the overall score of an assignment, and
sθ(v, i)and sθ(v, r, i)are the potentials of the sub-
assignments. The output space Ycontains all fea-
sible assignments of yvand yv,r , which can be rep-
resented as instance-wise constraints. For exam-
ple, the constraint, Pvyv= 1 ensures only one
activity is assigned to one image.
Corpus-level Constraints Our goal is to inject
constraints to ensure the output labels follow a
desired distribution. For example, we can set a
constraint to ensure the gender ratio for each ac-
tivity in Eq. (1) is within a given margin. Let
v} ∪ {yi
v,r }be the output assignment for
test instance i5. For each activity v, the con-
straints can be written as
where bb(v, man)is the desired gender ra-
tio of an activity v,γis a user-specified margin.
Mand Ware a set of semantic role-values rep-
resenting the agent as a man or a woman, respec-
Note that the constraints in (2) involve all the
test instances. Therefore, it requires a joint in-
ference over the entire test corpus. In general,
these corpus-level constraints can be represented
in a form of APiyib0, where each row
in the matrix ARl×Kis the coefficients of one
constraint, and bRl. The constrained inference
problem can then be formulated as:
fθ(yi, i),
s.t. AX
where {Yi}represents a space spanned by possi-
ble combinations of labels for all instances. With-
out the corpus-level constraints, Eq. (3) can be
optimized by maximizing each instance i
yiYifθ(yi, i),
Lagrangian Relaxation Eq. (3) can be solved
by several combinatorial optimization methods.
For example, one can represent the problem as an
5For the sake of simplicity, we abuse the notations and use
ito represent both input and data index.
Dataset Task Images O-Type kOk
imSitu vSRL 60,000 verb 212
MS-COCO MLC 25,000 object 66
Table 1: Statistics for the two recognition prob-
lems. In vSRL, we consider gender bias relating
to verbs, while in MLC we consider the gender
bias related to objects.
integer linear program and solve it using an off-
the-shelf solver (e.g., Gurobi (Gurobi Optimiza-
tion, 2016)). However, Eq. (3) involves all test in-
stances. Solving a constrained optimization prob-
lem on such a scale is difficult. Therefore, we con-
sider relaxing the constraints and solve Eq. (3) us-
ing a Lagrangian relaxation technique (Rush and
Collins, 2012). We introduce a Lagrangian multi-
plier λj0for each corpus-level constraint. The
Lagrangian is
L(λ, {yi}) =
λj AjX
where all the λj0,j∈ {1, . . . , l}. The solu-
tion of Eq. (3) can be obtained by the following
iterative procedure:
1) At iteration t, get the output solution of each
instance i
yi,(t)= argmax
L(λ(t1), y)(5)
2) update the Lagrangian multipliers.
λ(t)= max 0, λ(t1)+X
where λ(0) =0.ηis the learning rate for updat-
ing λ. Note that with a fixed λ(t1), Eq. (5) can
be solved using the original inference algorithms.
The algorithm loops until all constraints are satis-
fied (i.e. optimal solution achieved) or reach max-
imal number of iterations.
5 Experimental Setup
In this section, we provide details about the two vi-
sual recognition tasks we evaluated for bias: visual
semantic role labeling (vSRL), and multi-label
classification (MLC). We focus on gender, defin-
ing G={man,woman}and focus on the agent
role in vSRL, and any occurrence in text associ-
ated with the images in MLC. Problem statistics
are summarized in Table 1. We also provide setup
details for our calibration method.
5.1 Visual Semantic Role Labeling
Dataset We evaluate on imSitu (Yatskar et al.,
2016) where activity classes are drawn from verbs
and roles in FrameNet (Baker et al., 1998) and
noun categories are drawn from WordNet (Miller
et al., 1990). The original dataset includes about
125,000 images with 75,702 for training, 25,200
for developing, and 25,200 for test. However, the
dataset covers many non-human oriented activities
(e.g., rearing,retrieving, and wagging),
so we filter out these verbs, resulting in 212 verbs,
leaving roughly 60,000 of the original 125,000 im-
ages in the dataset.
Model We build on the baseline CRF released
with the data, which has been shown effective
compared to a non-structured prediction base-
line (Yatskar et al., 2016). The model decomposes
the probability of a realized situation, y, the com-
bination of activity, v, and realized frame, a set of
semantic (role,noun) pairs (e, ne), given an image
ias :
p(y|i;θ)ψ(v, i;θ)Y
ψ(v, e, ne, i;θ)
where each potential value in the CRF for subpart
x, is computed using features fifrom the VGG
convolutional neural network (Simonyan and Zis-
serman, 2014) on an input image, as follows:
ψ(x, i;θ) = ewT
where wand bare the parameters of an affine
transformation layer. The model explicitly cap-
tures the correlation between activities and nouns
in semantic roles, allowing it to learn common pri-
ors. We use a model pretrained on the original task
with 504 verbs.
5.2 Multilabel Classification
Dataset We use MS-COCO (Lin et al., 2014),
a common object detection benchmark, for multi-
label object classification. The dataset contains 80
object types but does not make gender distinctions
between man and woman. We use the five asso-
ciated image captions available for each image in
this dataset to annotate the gender of people in the
images. If any of the captions mention the word
man or woman we mark it, removing any images
that mention both genders. Finally, we filter any
object category not strongly associated with hu-
mans by removing objects that do not occur with
man or woman at least 100 times in the training
set, leaving a total of 66 objects.
Model For this multi-label setting, we adapt a
similar model as the structured CRF we use for
vSRL. We decompose the joint probability of the
output y, consisting of all object categories, c, and
gender of the person, g, given an image ias:
p(y|i;θ)ψ(g, i;θ)Y
ψ(g, c, i;θ)
where each potential value for x, is computed us-
ing features, fi, from a pretrained ResNet-50 con-
volutional neural network evaluated on the image,
ψ(x, i;θ) = ewT
We trained a model using SGD with learning rate
105, momentum 0.9and weight-decay 104, fine
tuning the initial visual network, for 50 epochs.
5.3 Calibration
The inference problems for both models are:
arg max
yYfθ(y, i) = log p(y|i;θ).
We use the algorithm in Sec. (4) to calibrate the
predictions using model θ. Our calibration tries to
enforce gender statistics derived from the training
set of corpus applicable for each recognition prob-
lem. For all experiments, we try to match gen-
der ratios on the test set within a margin of .05 of
their value on the training set. While we do adjust
the output on the test set, we never use the ground
truth on the test set and instead working from the
assumption that it should be similarly distributed
as the training set. When running the debiasing al-
gorithm, we set η= 101and optimize for 100
6 Bias Analysis
In this section, we use the approaches outlined
in Section 3 to quantify the bias and bias amplifi-
cation in the vSRL and the MLC tasks.
6.1 Visual Semantic Role Labeling
imSitu is gender biased In Figure 2(a), along
the x-axis, we show the male favoring bias of im-
Situ verbs. Overall, the dataset is heavily biased
toward male agents, with 64.6% of verbs favoring
a male agent by an average bias of 0.707 (roughly
3:1 male). Nearly half of verbs are extremely bi-
ased in the male or female direction: 46.95% of
verbs favor a gender with a bias of at least 0.7.6
Figure 2(a) contains several activity labels reveal-
ing problematic biases. For example, shopping,
microwaving and washing are biased toward
a female agent. Furthermore, several verbs such
as driving,shooting, and coaching are
heavily biased toward a male agent.
Training on imSitu amplifies bias In Fig-
ure 2(a), along the y-axis, we show the ratio of
male agents (% of total people) in predictions on
an unseen development set. The mean bias ampli-
fication in the development set is high, 0.050 on
average, with 45.75% of verbs exhibiting ampli-
fication. Biased verbs tend to have stronger am-
plification: verbs with training bias over 0.7in
either the male or female direction have a mean
amplification of 0.072. Several already problem-
atic biases have gotten much worse. For example,
serving, only had a small bias toward females
in the training set, 0.402, is now heavily biased
toward females, 0.122. The verb tuning, origi-
nally heavily biased toward males, 0.878, now has
exclusively male agents.
6.2 Multilabel Classification
MS-COCO is gender biased In Figure 2(b)
along the x-axis, similarly to imSitu, we ana-
lyze bias of objects in MS-COCO with respect
to males. MS-COCO is even more heavily bi-
ased toward men than imSitu, with 86.6% of ob-
jects biased toward men, but with smaller average
magnitude, 0.65. One third of the nouns are ex-
tremely biased toward males, 37.9% of nouns fa-
vor men with a bias of at least 0.7. Some prob-
lematic examples include kitchen objects such as
knife,fork, or spoon being more biased to-
ward woman. Outdoor recreation related objects
such tennis racket,snowboard and boat
tend to be more biased toward men.
6In this gender binary, bias toward woman is 1the bias
toward man
(a) Bias analysis on imSitu vSRL (b) Bias analysis on MS-COCO MLC
Figure 2: Gender bias analysis of imSitu vSRL and MS-COCO MLC. (a) gender bias of verbs toward
man in the training set versus bias on a predicted development set. (b) gender bias of nouns toward man
in the training set versus bias on the predicted development set. Values near zero indicate bias toward
woman while values near 0.5indicate unbiased variables. Across both dataset, there is significant bias
toward males, and significant bias amplification after training on biased training data.
Training on MS-COCO amplifies bias In Fig-
ure 2(b), along the y-axis, we show the ratio of
man (% of both gender) in predictions on an un-
seen development set. The mean bias amplifica-
tion across all objects is 0.036, with 65.67% of
nouns exhibiting amplification. Larger training
bias again tended to indicate higher bias amplifi-
cation: biased objects with training bias over 0.7
had mean amplification of 0.081. Again, several
problematic biases have now been amplified. For
example, kitchen categories already biased toward
females such as knife,fork and spoon have
all been amplified. Technology oriented categories
initially biased toward men such as keyboard
and mouse have each increased their bias toward
males by over 0.100.
6.3 Discussion
We confirmed our hypothesis that (a) both the im-
Situ and MS-COCO datasets, gathered from the
web, are heavily gender biased and that (b) mod-
els trained to perform prediction on these datasets
amplify the existing gender bias when evaluated
on development data. Furthermore, across both
datasets, we showed that the degree of bias am-
plification was related to the size of the initial
bias, with highly biased object and verb categories
exhibiting more bias amplification. Our results
demonstrate that care needs be taken in deploying
such uncalibrated systems otherwise they could
not only reinforce existing social bias but actually
make them worse.
7 Calibration Results
We test our methods for reducing bias amplifica-
tion in two problem settings: visual semantic role
labeling in the imSitu dataset (vSRL) and multil-
abel image classification in MS-COCO (MLC). In
all settings we derive corpus constraints using the
training set and then run our calibration method in
batch on either the development or testing set. Our
results are summarized in Table 2 and Figure 3.
7.1 Visual Semantic Role Labeling
Our quantitative results are summarized in the first
two sections of Table 2. On the development
set, the number of verbs whose bias exceed the
original bias by over 5% decreases 30.5% (Viol.).
Overall, we are able to significantly reduce bias
amplification in vSRL by 52% on the develop-
ment set (Amp. bias). We evaluate the under-
lying recognition performance using the standard
measure in vSRL: top-1 semantic role accuracy,
which tests how often the correct verb was pre-
dicted and the noun value was correctly assigned
to a semantic role. Our calibration method results
in a negligible decrease in performance (Perf.). In
Figure 3(c) we can see that the overall distance to
the training set distribution after applying RBA de-
creased significantly, over 39%.
Figure 3(e) demonstrates that across all initial
training bias, RBA is able to reduce bias amplifi-
cation. In general, RBA struggles to remove bias
amplification in areas of low initial training bias,
(a) Bias analysis on imSitu vSRL without RBA (b) Bias analysis on MS-COCO MLC without RBA
(c) Bias analysis on imSitu vSRL with RBA (d) Bias analysis on MS-COCO MLC with RBA
(e) Bias in vSRL with (blue) / without (red) RBA (f) Bias in MLC with (blue) / without (red) RBA
Figure 3: Results of reducing bias amplification using RBA on imSitu vSRL and MS-COCO MLC.
Figures 3(a)-(d) show initial training set bias along the x-axis and development set bias along the y-
axis. Dotted blue lines indicate the 0.05 margin used in RBA, with points violating the margin shown
in red while points meeting the margin are shown in green. Across both settings adding RBA signifi-
cantly reduces the number of violations, and reduces the bias amplification significantly. Figures 3(e)-(f)
demonstrate bias amplification as a function of training bias, with and without RBA. Across all initial
training biases, RBA is able to reduce the bias amplification.
Method Viol. Amp. bias Perf. (%)
vSRL: Development Set
CRF 154 0.050 24.07
CRF + RBA 107 0.024 23.97
vSRL: Test Set
CRF 149 0.042 24.14
CRF + RBA 102 0.025 24.01
MLC: Development Set
CRF 40 0.032 45.27
CRF + RBA 24 0.022 45.19
MLC: Test Set
CRF 38 0.040 45.40
CRF + RBA 16 0.021 45.38
Table 2: Number of violated constraints, mean
amplified bias, and test performance before and af-
ter calibration using RBA. The test performances
of vSRL and MLC are measured by top-1 seman-
tic role accuracy and top-1 mean average preci-
sion, respectively.
likely because bias is encoded in image statistics
and cannot be removed as effectively with an im-
age agnostic adjustment. Results on the test set
support our development set results: we decrease
bias amplification by 40.5% (Amp. bias).
7.2 Multilabel Classification
Our quantitative results on MS-COCO RBA are
summarized in the last two sections of Table 2.
Similarly to vSRL, we are able to reduce the num-
ber of objects whose bias exceeds the original
training bias by 5%, by 40% (Viol.). Bias amplifi-
cation was reduced by 31.3% on the development
set (Amp. bias). The underlying recognition sys-
tem was evaluated by the standard measure: top-
1 mean average precision, the precision averaged
across object categories. Our calibration method
results in a negligible loss in performance. In Fig-
ure 3(d), we demonstrate that we substantially re-
duce the distance between training bias and bias
in the development set. Finally, in Figure 3(f) we
demonstrate that we decrease bias amplification
for all initial training bias settings. Results on the
test set support our development results: we de-
crease bias amplification by 47.5% (Amp. bias).
7.3 Discussion
We have demonstrated that RBA can significantly
reduce bias amplification. While were not able to
remove all amplification, we have made significant
progress with little or no loss in underlying recog-
nition performance. Across both problems, RBA
was able to reduce bias amplification at all initial
values of training bias.
8 Conclusion
Structured prediction models can leverage correla-
tions that allow them to make correct predictions
even with very little underlying evidence. Yet such
models risk potentially leveraging social bias in
their training data. In this paper, we presented a
general framework for visualizing and quantify-
ing biases in such models and proposed RBA to
calibrate their predictions under two different set-
tings. Taking gender bias as an example, our anal-
ysis demonstrates that conditional random fields
can amplify social bias from data while our ap-
proach RBA can help to reduce the bias.
Our work is the first to demonstrate structured
prediction models amplify bias and the first to
propose methods for reducing this effect but sig-
nificant avenues for future work remain. While
RBA can be applied to any structured predic-
tor, it is unclear whether different predictors am-
plify bias more or less. Furthermore, we pre-
sented only one method for measuring bias. More
extensive analysis could explore the interaction
among predictor, bias measurement, and bias de-
amplification method. Future work also includes
applying bias reducing methods in other struc-
tured domains, such as pronoun reference resolu-
tion (Mitkov, 2014).
Acknowledgement This work was supported in
part by National Science Foundation Grant IIS-
1657193 and two NVIDIA Hardware Grants.
Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Mar-
garet Mitchell, Dhruv Batra, C Lawrence Zitnick,
and Devi Parikh. 2015. Vqa: Visual question an-
swering. In Proceedings of the IEEE International
Conference on Computer Vision, pages 2425–2433.
Collin F Baker, Charles J Fillmore, and John B Lowe.
1998. The Berkeley framenet project. In Proceed-
ings of the Annual Meeting of the Association for
Computational Linguistics (ACL), pages 86–90.
Solon Barocas and Andrew D Selbst. 2014. Big data’s
disparate impact. Available at SSRN 2477899.
Tolga Bolukbasi, Kai-Wei Chang, James Y Zou,
Venkatesh Saligrama, and Adam T Kalai. 2016.
Man is to computer programmer as woman is to
homemaker? debiasing word embeddings. In The
Conference on Advances in Neural Information Pro-
cessing Systems (NIPS), pages 4349–4357.
Aylin Caliskan, Joanna J Bryson, and Arvind
Narayanan. 2017. Semantics derived automatically
from language corpora contain human-like biases.
Science, 356(6334):183–186.
Kai-Wei Chang, S. Sundararajan, and S. Sathiya
Keerthi. 2013. Tractable semi-supervised learning
of complex structured prediction models. In Pro-
ceedings of the European Conference on Machine
Learning (ECML), pages 176–191.
Yin-Wen Chang and Michael Collins. 2011. Exact de-
coding of phrase-based translation models through
Lagrangian relaxation. In EMNLP, pages 26–37.
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakr-
ishna Vedantam, Saurabh Gupta, Piotr Doll´
ar, and
C Lawrence Zitnick. 2015. Microsoft coco captions:
Data collection and evaluation server. arXiv preprint
Bhavana Bharat Dalvi. 2015. Constrained Semi-
supervised Learning in the Presence of Unantici-
pated Classes. Ph.D. thesis, Google Research.
Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer
Reingold, and Richard Zemel. 2012. Fairness
through awareness. In Proceedings of the 3rd In-
novations in Theoretical Computer Science Confer-
ence, pages 214–226. ACM.
Michael Feldman, Sorelle A Friedler, John Moeller,
Carlos Scheidegger, and Suresh Venkatasubrama-
nian. 2015. Certifying and removing disparate im-
pact. In Proceedings of International Conference
on Knowledge Discovery and Data Mining (KDD),
pages 259–268.
Jonathan Gordon and Benjamin Van Durme. 2013. Re-
porting bias and knowledge extraction. Automated
Knowledge Base Construction (AKBC).
Inc. Gurobi Optimization. 2016. Gurobi optimizer ref-
erence manual.
Moritz Hardt, Eric Price, Nati Srebro, et al. 2016.
Equality of opportunity in supervised learning. In
Conference on Neural Information Processing Sys-
tems (NIPS), pages 3315–3323.
Matthew Kay, Cynthia Matuszek, and Sean A Munson.
2015. Unequal representation and gender stereo-
types in image search results for occupations. In
Human Factors in Computing Systems, pages 3819–
3828. ACM.
Bernhard Korte and Jens Vygen. 2008. Combinatorial
Optimization: Theory and Application. Springer
Tsung-Yi Lin, Michael Maire, Serge Belongie, James
Hays, Pietro Perona, Deva Ramanan, Piotr Doll´
and C Lawrence Zitnick. 2014. Microsoft coco:
Common objects in context. In European Confer-
ence on Computer Vision, pages 740–755. Springer.
G. Miller, R. Beckwith, C. Fellbaum, D. Gross, and
K.J. Miller. 1990. Wordnet: An on-line lexical
database. International Journal of Lexicography,
Emiel van Miltenburg. 2016. Stereotyping and bias in
the flickr30k dataset. MMC.
Ishan Misra, C Lawrence Zitnick, Margaret Mitchell,
and Ross Girshick. 2016. Seeing through the human
reporting bias: Visual classifiers from noisy human-
centric labels. In Conference on Computer Vision
and Pattern Recognition (CVPR), pages 2930–2939.
Ruslan Mitkov. 2014. Anaphora resolution. Rout-
Nanyun Peng, Ryan Cotterell, and Jason Eisner. 2015.
Dual decomposition inference for graphical models
over strings. In Conference on Empirical Methods
in Natural Language Processing (EMNLP), pages
John Podesta, Penny Pritzker, Ernest J. Moniz, John
Holdren, and Jefrey Zients. 2014. Big data: Seiz-
ing opportunities and preserving values. Executive
Office of the President.
Karen Ross and Cynthia Carter. 2011. Women and
news: A long and winding road. Media, Culture
& Society, 33(8):1148–1165.
Alexander M Rush and Michael Collins. 2012. A Tuto-
rial on Dual Decomposition and Lagrangian Relax-
ation for Inference in Natural Language Processing.
Journal of Artificial Intelligence Research, 45:305–
Karen Simonyan and Andrew Zisserman. 2014. Very
deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556.
David Sontag, Amir Globerson, and Tommi Jaakkola.
2011. Introduction to dual decomposition for infer-
ence. Optimization for Machine Learning, 1:219–
Latanya Sweeney. 2013. Discrimination in online ad
delivery. Queue, 11(3):10.
Benjamin D Van Durme. 2010. Extracting implicit
knowledge from text. Ph.D. thesis, University of
Oriol Vinyals, Alexander Toshev, Samy Bengio, and
Dumitru Erhan. 2015. Show and tell: A neural im-
age caption generator. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recog-
nition, pages 3156–3164.
Mark Yatskar, Vicente Ordonez, Luke Zettlemoyer, and
Ali Farhadi. 2017. Commonly uncommon: Seman-
tic sparsity in situation recognition. In Proceedings
of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR).
Mark Yatskar, Luke Zettlemoyer, and Ali Farhadi.
2016. Situation recognition: Visual semantic role
labeling for image understanding. In Proceedings of
the IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), pages 5534–5542.
Indre Zliobaite. 2015. A survey on measuring indirect
discrimination in machine learning. arXiv preprint
... In addition, some researchers construct the baseline for the fairness measurement, and detect fairness bugs by comparing the measurement value with the baseline. Zhao et al. [141] used the fairness measurements calculated based on training data as their baseline against which to evaluate. Specifically, they used the obtained ML model to annotate unlabeled data instances, and revealed situations when the ML process amplified existing bias by comparing the fairness measurements on training data and those on the annotated dataset. ...
... Since 2017, more and more research communities have started to study fairness testing. For example, Galhotra et al. [55] presented the first ML fairness testing approach in the software engineering community and won the Distinguished Paper Award of ESEC/FSE 2017; Zhao et al. [141] detected bias in the datasets and ML models for visual recognition tasks, and won the Best Paper Award of EMNLP 2017; Diaz et al. [91] detected age-related bias in sentiment analysis systems and won the Best Paper Award of CHI 2018. These three best paper awards, as well as being a credit to their authors, also demonstrate both the significant level of interest and the high quality of research on fairness in three different research communities: Additionally, we find that other research communities, such as computer security and database communities, are also embracing fairness testing, demonstrating the importance of our survey to a broad audience. ...
Full-text available
Software systems are vulnerable to fairness bugs and frequently exhibit unfair behaviors, making software fairness an increasingly important concern for software engineers. Research has focused on helping software engineers to detect fairness bugs automatically. This paper provides a comprehensive survey of existing research on fairness testing. We collect 113 papers and organise them based on the testing workflow (i.e., the testing activities) and the testing components (i.e., where to find fairness bugs) for conducting fairness testing. We also analyze the research focus, trends, promising directions, as well as widely-adopted datasets and open source tools for fairness testing.
... Finally, in order to facilitate research into annotator response patterns and bias, as recommended by Prabhakaran et al. (2021), we provide psychological and demographic metadata of our annotators. The background and biases of human annotators have been shown to impact their annotations (Hovy and Prabhumoye, 2021;Davani et al., 2022Davani et al., , 2021Bolukbasi et al., 2016) with particularly damaging effects that amplify pre-existing biases (Mujtaba and Mahapatra, 2019;Zhao et al., 2017). Annotators' biases may be particularly relevant in domains characterized by high subjectivity, such as moral values (Garten et al., 2019a,b;Hoover et al., 2020). ...
Full-text available
Moral framing and sentiment can affect a variety of online and offline behaviors, including donation, pro-environmental action, political engagement, and even participation in violent protests. Various computational methods in Natural Language Processing (NLP) have been used to detect moral sentiment from textual data, but in order to achieve better performances in such subjective tasks, large sets of hand-annotated training data are needed. Previous corpora annotated for moral sentiment have proven valuable, and have generated new insights both within NLP and across the social sciences, but have been limited to Twitter. To facilitate improving our understanding of the role of moral rhetoric, we present the Moral Foundations Reddit Corpus, a collection of 16,123 Reddit comments that have been curated from 12 distinct subreddits, hand-annotated by at least three trained annotators for 8 categories of moral sentiment (i.e., Care, Proportionality, Equality, Purity, Authority, Loyalty, Thin Morality, Implicit/Explicit Morality) based on the updated Moral Foundations Theory (MFT) framework. We use a range of methodologies to provide baseline moral-sentiment classification results for this new corpus, e.g., cross-domain classification and knowledge transfer.
... To address this, we constructed few-shot prompts with existing undesired examples and sent the generated texts to be labeled. The generated texts are manually inspected to avoid bias amplification (Zhao et al. 2017). We observed a nontrivial performance improvement by incorporating the synthetic dataset. ...
Full-text available
We present a holistic approach to building a robust and useful natural language classification system for real-world content moderation. The success of such a system relies on a chain of carefully designed and executed steps, including the design of content taxonomies and labeling instructions, data quality control, an active learning pipeline to capture rare events, and a variety of methods to make the model robust and to avoid overfitting. Our moderation system is trained to detect a broad set of categories of undesired content, including sexual content, hateful content, violence, self-harm, and harassment. This approach generalizes to a wide range of different content taxonomies and can be used to create high-quality content classifiers that outperform off-the-shelf models.
... For example, a face attribute classifier may yield a disproportionately higher error rate on images of non-White or females [5,30]. Another line of work has investigated biased or spurious associations in public image datasets and models between different dimensions of sensitive groups and non-protected attributes such as semantic descriptions, facial expressions, and age [1,6,28,64,65]. Our paper focuses on the former: the mitigation of accuracy disparity. ...
Existing pruning techniques preserve deep neural networks' overall ability to make correct predictions but may also amplify hidden biases during the compression process. We propose a novel pruning method, Fairness-aware GRAdient Pruning mEthod (FairGRAPE), that minimizes the disproportionate impacts of pruning on different sub-groups. Our method calculates the per-group importance of each model weight and selects a subset of weights that maintain the relative between-group total importance in pruning. The proposed method then prunes network edges with small importance values and repeats the procedure by updating importance values. We demonstrate the effectiveness of our method on four different datasets, FairFace, UTKFace, CelebA, and ImageNet, for the tasks of face attribute classification where our method reduces the disparity in performance degradation by up to 90% compared to the state-of-the-art pruning algorithms. Our method is substantially more effective in a setting with a high pruning rate (99%). The code and dataset used in the experiments are available at
... A number of studies [21,26] have raised concerns regarding unfair prediction in commercialized recognition systems, and a number of studies [20,[27][28][29][30][31][32] have been conducted to achieve fair prediction on various computer vision tasks. Several image captioning and VQA models [27,33,34] have been noted to exaggerate biases [27,34] that lead to incorrect captions due to over-reliance on prejudicial context. To mitigate bias, the models focus on the object, not the context. ...
Full-text available
Recent studies have raised concerns regarding racial and gender disparity in facial attribute classification performance. As these attributes are directly and indirectly correlated with the sensitive attribute in a complex manner, simple disparate treatment is ineffective in reducing performance disparity. This paper focuses on achieving counterfactual fairness for facial attribute classification. Each labeled input image is used to generate two synthetic replicas: one under factual assumptions about the sensitive attribute and one under counterfactual. The proposed causal graph-based attribute translation generates realistic counterfactual images that consider the complicated causal relationship among the attributes with an encoder–decoder framework. A causal graph represents complex relationships among the attributes and is used to sample factual and counterfactual facial attributes of the given face image. The encoder–decoder architecture translates the given facial image to have sampled factual or counterfactual attributes while preserving its identity. The attribute classifier is trained for fair prediction with counterfactual regularization between factual and corresponding counterfactual translated images. Extensive experimental results on the CelebA dataset demonstrate the effectiveness and interpretability of the proposed learning method for classifying multiple face attributes.
... It has been discovered that word embeddings do not only reflect but also have the tendency to amplify the biases present in the data they are trained on (Wang and Russakovsky, 2021) which can lead to the spread of unfavourable stereotypes (Zhao et al., 2017). The implicit associations which are a feature of human reasoning are also encoded by embeddings (Greenwald et al., 1998;Caliskan et al., 2016). ...
Conference Paper
Pre-trained word embedding models are easily distributed and applied, as they alleviate users from the effort to train models themselves. With widely distributed models, it is important to ensure that they do not exhibit undesired behaviour, such as biases against population groups. For this purpose, we carry out an empirical study on evaluating the bias of 15 publicly available, pre-trained word embeddings model based on three training algorithms (GloVe, word2vec, and fastText) with regard to four bias metrics (WEAT, SEMBIAS, DIRECT BIAS, and ECT). The choice of word embedding models and bias metrics is motivated by a literature survey over 37 publications which quantified bias on pre-trained word embeddings. Our results indicate that fastText is the least biased model (in 8 out of 12 cases) and small vector lengths lead to a higher bias.
Full-text available
This paper reviews dilemmas and implications of erroneous data for clinical implementation of AI. It is well-known that if erroneous and biased data are used to train AI, there is a risk of systematic error. However, even perfectly trained AI applications can produce faulty outputs if fed with erroneous inputs. To counter such problems, we suggest 3 steps: (1) AI should focus on data of the highest quality, in essence paraclinical data and digital images, (2) patients should be granted simple access to the input data that feed the AI, and granted a right to request changes to erroneous data, and (3) automated high-throughput methods for error-correction should be implemented in domains with faulty data when possible. Also, we conclude that erroneous data is a reality even for highly reputable Danish data sources, and thus, legal framework for the correction of errors is universally needed.
Full-text available
Machine learning is a means to derive artificial intelligence by discovering patterns in existing data. Here we show that applying machine learning to ordinary human language results in human-like semantic biases. We replicate a spectrum of known biases, as measured by the Implicit Association Test, using a widely used, purely statistical machine-learning model trained on a standard corpus of text from the Web. Our results indicate that text corpora contain re-coverable and accurate imprints of our historic biases, whether morally neutral as towards insects or flowers, problematic as towards race or gender, or even simply veridical, reflecting the status quo distribution of gender with respect to careers or first names. Our methods hold promise for identifying and addressing sources of bias in culture, including technology.
Full-text available
Traditional semi-supervised learning (SSL) techniques consider the missing labels of unlabeled datapoints as latent/unobserved variables, and model these variables, and the parameters of the model, using techniques like Expectation Maximization (EM). Such semisupervised learning techniques are widely used for Automatic Knowledge Base Construction (AKBC) tasks. We consider two extensions to traditional SSL methods which make it more suitable for a variety of AKBC tasks. First, we consider jointly assigning multiple labels to each instance, with a flexible scheme for encoding constraints between assigned labels: this makes it possible, for instance, to assign labels at multiple levels from a hierarchy. Second, we account for another type of latent variable, in the form of unobserved classes. In open-domain webscale information extraction problems, it is an unrealistic assumption that the class ontology or topic hierarchy we are using is complete. Our proposed framework combines structural search for the best class hierarchy with SSL, reducing the semantic drift associated with erroneously grouping unanticipated classes with expected classes. Together, these extensions allow a single framework to handle a large number of knowledge extraction tasks, including macro-reading, noun-phrase classification, word sense disambiguation, alignment of KBs to wikipedia or on-line glossaries, and ontology extension. To summarize, this thesis argues that many AKBC tasks which have previously been addressed separately can be viewed as instances of single abstract problem: multiview semisupervised learning with an incomplete class hierarchy. In this thesis we present a generic EM framework for solving this abstract task.
Full-text available
Nowadays, many decisions are made using predictive models built on historical data.Predictive models may systematically discriminate groups of people even if the computing process is fair and well-intentioned. Discrimination-aware data mining studies how to make predictive models free from discrimination, when historical data, on which they are built, may be biased, incomplete, or even contain past discriminatory decisions. Discrimination refers to disadvantageous treatment of a person based on belonging to a category rather than on individual merit. In this survey we review and organize various discrimination measures that have been used for measuring discrimination in data, as well as in evaluating performance of discrimination-aware predictive models. We also discuss related measures from other disciplines, which have not been used for measuring discrimination, but potentially could be suitable for this purpose. We computationally analyze properties of selected measures. We also review and discuss measuring procedures, and present recommendations for practitioners. The primary target audience is data mining, machine learning, pattern recognition, statistical modeling researchers developing new methods for non-discriminatory predictive modeling. In addition, practitioners and policy makers would use the survey for diagnosing potential discrimination by predictive models.
An untested assumption behind the crowdsourced descriptions of the images in the Flickr30K dataset (Young et al., 2014) is that they "focus only on the information that can be obtained from the image alone" (Hodosh et al., 2013, p. 859). This paper presents some evidence against this assumption, and provides a list of biases and unwarranted inferences that can be found in the Flickr30K dataset. Finally, it considers methods to find examples of these, and discusses how we should deal with stereotype-driven descriptions in future applications.