Reviewer bias in single- versus double-blind peer review

Significance

Scientific peer review has been a cornerstone of the scientific method since the 1600s. Debate continues regarding the merits of single-blind review, in which anonymous reviewers know the authors of a paper and their affiliations, compared with double-blind review, in which this information is hidden. We present an experimental study of this question. In computer science, research often appears first or exclusively in peer-reviewed conferences rather than journals. Our study considers full-length submissions to the highly selective 2017 Web Search and Data Mining conference (15.6% acceptance rate). Each submission is simultaneously scored by two single-blind and two double-blind reviewers. Our analysis shows that single-blind reviewing confers a significant advantage to papers with famous authors and authors from high-prestige institutions.
SOCIAL SCIENCES, COMPUTER SCIENCES
Andrew Tomkins (a,1), Min Zhang (b), and William D. Heavlin (a)
(a) Google, Inc., Mountain View, CA 94043; and (b) State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and
Technology, Tsinghua University, Beijing 100084, China
Edited by Susan T. Fiske, Princeton University, Princeton, NJ, and approved October 10, 2017 (received for review May 3, 2017)
Peer review may be “single-blind,” in which reviewers are aware
of the names and afﬁliations of paper authors, or “double-blind,”
in which this information is hidden. Noting that computer science
research often appears ﬁrst or exclusively in peer-reviewed con-
ferences rather than journals, we study these two reviewing mod-
els in the context of the 10th Association for Computing Machin-
ery International Conference on Web Search and Data Mining, a
highly selective venue (15.6% acceptance rate) in which expert
committee members review full-length submissions for accep-
tance. We present a controlled experiment in which four com-
mittee members review each paper. Two of these four review-
ers are drawn from a pool of committee members with access to
author information; the other two are drawn from a disjoint pool
without such access. This information asymmetry persists through
the process of bidding for papers, reviewing papers, and enter-
ing scores. Reviewers in the single-blind condition typically bid
for 22% fewer papers and preferentially bid for papers from top
universities and companies. Once papers are allocated to review-
ers, single-blind reviewers are signiﬁcantly more likely than their
double-blind counterparts to recommend for acceptance papers
from famous authors, top universities, and top companies. The
estimated odds multipliers are tangible, at 1.63, 1.58, and 2.10,
respectively.
peer review | double-blind | scientific method
The scientiﬁc peer-review process dates back to the 1600s
and is generally regarded as a cornerstone of the scientiﬁc
method (2). The details of its implementation have been scruti-
nized and explored across many academic disciplines.
Our focus is on the implications of making author informa-
tion available to reviewers. This question remains an active area
of debate, with many signiﬁcant journals and conferences elect-
ing to make this information available and many others electing
to hide it. Terminology is not completely uniform across the sci-
ences, but following common use in computer science, we refer to
single-blind reviewing as the practice of making reviewers aware
of author identity but not the other way around. In double-blind
reviewing, neither party is aware of the identity of the other.
There is extensive literature on scientiﬁc peer reviewing over-
all and on single-blind vs. double-blind reviewing in particular.
A detailed survey (3) reviews over 600 pieces of literature on
reviewing; a more recent survey (4) focuses speciﬁcally on issues
of peer-reviewer blindness. As the question engenders strong
feelings, there are also numerous editorials on the subject (5–7).
Standard practices for reviewer blindness differ across ﬁelds
(8). Nonetheless, there are numerous examples of journals
switching reviewing model, with various and sometimes contra-
dictory analyses of the outcomes (9–12).
Critics of anonymous review argue that retaining anonymity
may be infeasible, may introduce too much overhead, may make
it difﬁcult to evaluate work in the light of a group’s ongoing
research direction, or may make it difﬁcult to detect conﬂicts
of interest (13, 14). Supporters argue that knowledge of the
authors introduces undesirable biases in the reviewing process
(15–17). We discuss three particular forms of bias in detail. First,
Knobloch-Westerwick et al. (18) proposed the Matilda effect,
in which papers from male ﬁrst authors are evaluated to have
greater scientiﬁc merit than papers from female ﬁrst authors,
particularly in male-dominated ﬁelds. Second, Merton (19) pro-
posed the Matthew effect, in which already-famous researchers
receive the lion’s share of recognition for new work. Third,
the seminal experimental study of Blank (15) spends signiﬁcant
time discussing biases resulting from the fame or quality of the
authors’ institution(s). See Other Studies for studies of double-
blind reviewing.
Materials and Methods
Our study covers submissions to the 10th Association for Computing
Machinery International Conference on Web Search and Data Mining (WSDM
2017). In computer science, research typically appears first and often
exclusively in conferences rather than in journals. Analysis of citation patterns
suggests that computer scientists are in fact rewarded preferentially for
publishing in conferences rather than in journals (20, 21). Conference
reviewing in computer science is typically based on full-length manuscripts
rather than abstracts, and each is reviewed in full by multiple experts
invited to the conference program committee. Selective conferences such
as WSDM typically accept 15–20% of submissions. The present work came
about when two of the authors of this paper were asked to cochair the pro-
gram of WSDM 2017, which historically has preferred single-blind review-
ing. We were asked to consider switching to double-blind reviewing. Upon
a review of the literature, we discovered no within-subject experimental
study of the question and so undertook this study to make an informed
recommendation to the chairs of WSDM 2018 and to offer our findings to
the rest of the community. See Conferences vs. Journals in Computer Science
for process differences between journal and conference reviewing.
An extended abstract of this work has been previously posted as a preprint (1).
Author contributions: A.T. and M.Z. designed research; A.T. and M.Z. performed research;
W.D.H. analyzed data; and A.T., M.Z., and W.D.H. wrote the paper.
Conflict of interest statement: A.T. and W.D.H. are employed and paid by Google, Inc.
Google often provides funding to conferences, including the WSDM conference studied
in this work.
1 To whom correspondence should be addressed. Email: atomkins@gmail.com.
1073/pnas.1707323114/-/DCSupplemental.
www.pnas.org/cgi/doi/10.1073/pnas.1707323114 PNAS Early Edition
The following list describes the experimental design:
i) Program committee (PC) is split randomly into two groups of equal size:
single-blind PC (SBPC) and double-blind PC (DBPC).
ii) During bidding, the SBPC sees author names and afﬁliations, while the
DBPC does not. Both groups see paper titles and abstracts. Otherwise,
the bidding interface is the same.
iii) A separate assignment is computed for the SBPC and the DBPC, using the
standard assignment algorithm provided by the EasyChair conference
management system. The overall assignment allocates four PC members
to each paper with exactly two from the SBPC and two from the DBPC.
iv) The assigned papers are sent for reviewing. The SBPC and the DBPC
again receive the same reviewing form, except that SBPC members see
author names and afﬁliations in the reviewing form. PDF documents do
not include author names or afﬁliations.
v) After reviews are received, the experiment is closed, and the data are set
aside for analysis. The experimental setup is described to PC members,
all of whom are moved to the single-blind condition. Discussions are
managed by the senior program committee (SPC) member assigned to
each paper, with all four reviewers participating.
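The split-and-assign structure of this design can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the fixed random seed, and the uniform random choice of reviewers are hypothetical, and the random choice stands in for the bid-based EasyChair assignment, which is not reproduced here.

```python
import random


def split_committee(members, seed=0):
    """Randomly split PC members into two disjoint, equal-size pools."""
    rng = random.Random(seed)
    shuffled = list(members)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    # First half plays the role of the SBPC, second half the DBPC.
    return shuffled[:half], shuffled[half:2 * half]


def assign_reviewers(papers, sbpc, dbpc, per_pool=2, seed=0):
    """Give each paper `per_pool` reviewers from each pool.

    Hypothetical stand-in: reviewers are drawn uniformly at random,
    whereas the study used EasyChair's bid-based assignment.
    """
    rng = random.Random(seed)
    return {
        paper: {
            "single_blind": rng.sample(sbpc, per_pool),
            "double_blind": rng.sample(dbpc, per_pool),
        }
        for paper in papers
    }


# Illustrative identifiers only.
members = [f"pc{i}" for i in range(20)]
sbpc, dbpc = split_committee(members)
assignment = assign_reviewers([f"paper{i}" for i in range(5)], sbpc, dbpc)
```

Keeping the two pools disjoint is what preserves the information asymmetry through bidding, reviewing, and scoring.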
Under this design, we study single-blind vs. double-blind reviewer
behavior in two settings: reviewing papers and also a preliminary “bid-
ding” stage in which reviewers express interest in papers to review. See
Experimental Design Considerations for considerations behind this experi-
mental design.
During bidding, each reviewer considers the submitted papers and enters
a bid for each. Three bids are possible: yes, maybe, and no. (There is a
fourth value to indicate a conﬂict of interest, but we do not consider these
bids here; we consider them separately in Results.) If a reviewer takes no
action with respect to a paper, the default bid is no. The distribution of
bids per reviewer is shown in Fig. 1. Sixty percent of reviewers have at
least 20 bids, which allows an effective allocation of papers. In Results
we discuss the observation that single-blind reviewers appear to enter
fewer bids.
We allocate exactly two double-blind reviewers and two single-blind
reviewers to each paper, yielding a total of 974 double-blind and 983
single-blind reviews. Due to some midstream withdrawals, the number
of papers in consideration at the end of the experiment is exactly
500. Of these, 453 papers have four reviews and 47 have three reviews
due to reviewer dereliction. Review scores are selected from the values
{-6, -3, -2, 4, 6}, with 6 corresponding to a strong recommendation to
accept the paper and -6 a strong recommendation to reject the paper.
Reviewers also enter a “rank” for the paper relative to other papers scored
by the same reviewer, ranging from 4 (top paper seen by this reviewer)
to 1 (bottom 50% of this reviewer’s batch). Finally, reviewers enter a
textual review.
Fig. 1. Cumulative distribution function of number of bids for single- and
double-blind reviewers.
Covariates for Implicit Bias Analysis. We use the EasyChair conference man-
agement system to manage submissions and reviewing. During submission,
each author’s name, institution, and country are provided to the system.
Based on this information, we generate some additional covariates as part
of our exploration of the behavior of single-blind vs. double-blind review-
ers. To begin, if there is a single most common country among the authors
(even if not the majority country), we associate this country with the paper.
For each (reviewer, paper) pair, we then compute the following seven Boolean
covariates:
i) Academic paper. We hand wrote a set of rules to determine whether an
author’s institution is academic or not (corporate, governmental, non-
proﬁt, and unafﬁliated are all considered nonacademic institutions). If a
strict majority of the authors are from an academic institution, we con-
sider the paper to be an academic paper.
ii) Female author. We attempt to determine whether at least one of the
paper’s authors is female. Earlier work typically considered papers whose
ﬁrst author was female, but submissions to WSDM do not always fol-
low the same conventions for ﬁrst authors, so we did not have a reli-
able way to determine whether one author contributed more than
another. Hence, we consider papers with a female author vs. papers
with no female author. In Results we consider other alternatives to
this approach. To make this determination, we manually annotate the
gender of the 1,491 authors. We ﬁnd 1,197 male authors, 246 female
authors, and 48 authors for whom we could not determine gender from
online searches.
iii) Paper from United States. This feature is true if the paper is from the
United States, as deﬁned above.
iv) Famous author. We deﬁne a famous author to be an author with
at least 3 accepted papers at earlier WSDM conferences (www.wsdm-
conference.org/) and at least 100 papers according to a commonly
used computer science bibliography known as dblp. There are 57 such
authors. This property is true if the paper has at least one famous author.
v) Same country as reviewer. We wish to study whether knowledge of the
authors allows a reviewer from the same country to treat the paper pref-
erentially. This feature is true if the country of the paper as deﬁned
above is the same as the country of the reviewer as provided during the
EasyChair registration process.
vi) Top university. We deﬁne top universities as the top 50 global computer
science universities per www.topuniversities.com. While this choice is
imperfect, the universities align reasonably well with our expectations
for top universities. This property is true if any author is from a top uni-
versity.
vii) Top companies. We deﬁne top companies as Google, Microsoft, Yahoo!,
and Facebook. This property is true if any author is from a top company.
Table 1 gives the distributions for each of these features.
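The covariate definitions above translate directly into code. The sketch below is a minimal illustration, not the authors' implementation: the `Author` record, its field names, and the country codes are assumptions, and the per-author flags (academic, female, famous, top university, top company) are taken as given rather than derived from hand-written rules or annotation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Author:
    # Hypothetical per-author record; all flags are assumed precomputed.
    country: str
    academic: bool
    female: bool
    famous: bool
    top_university: bool
    top_company: bool


def paper_country(authors):
    """Return the single most common author country, or None if there is a tie."""
    counts = Counter(a.country for a in authors).most_common()
    if not counts:
        return None
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def covariates(authors, reviewer_country):
    """Compute the seven Boolean covariates for one (reviewer, paper) pair."""
    country = paper_country(authors)
    n_academic = sum(a.academic for a in authors)
    return {
        "Aca": 2 * n_academic > len(authors),  # strict majority academic
        "Wom": any(a.female for a in authors),
        "United States": country == "US",
        "Fam": any(a.famous for a in authors),
        "Same": country is not None and country == reviewer_country,
        "Uni": any(a.top_university for a in authors),
        "Com": any(a.top_company for a in authors),
    }


authors = [
    Author("US", True, False, False, True, False),
    Author("US", False, True, False, False, True),
    Author("CN", True, False, False, False, False),
]
cov = covariates(authors, reviewer_country="CN")
```

Note that only "Same" depends on the reviewer; the other six covariates are properties of the paper alone.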
Blinded Paper Quality Score. We constructed a proxy measure for the intrin-
sic quality of a paper from the double-blind reviewers by combining lin-
early their scores and ranks, here standardized to have zero mean and
unit variance. Among the double-blind reviewers, the correlation between
these two measures is 0.75, and principal components would combine
these with equal weights. However, we choose to maximize the correla-
tion between the pairs of double-blind reviewers of the same paper. For
a given score s and rank r, this between-reviewer correlation is maximized
by a quality score q = s + 0.111r. The achieved correlation between the
two double-blinded raters is 0.38. See Interreviewer Agreement for discussion
of this.
We take the quality score of a paper to be the average quality score of
the double-blind reviews for that paper, referred to below as “blinded paper
quality score” (bpqs). We scale bpqs to have unit SD.
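As a rough illustration of how bpqs could be computed from the description above, the following sketch standardizes scores and ranks, combines them as q = s + 0.111r, averages per paper, and rescales to unit SD. The function names, the input layout, and the use of the population standard deviation are assumptions; the weight 0.111 is taken from the text rather than re-estimated.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) variance."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


def blinded_paper_quality(reviews, rank_weight=0.111):
    """Compute a bpqs-like score per paper.

    reviews: list of (paper_id, score, rank) tuples from double-blind
    reviewers only (hypothetical input layout).
    """
    s = standardize([score for _, score, _ in reviews])
    r = standardize([rank for _, _, rank in reviews])
    per_paper = defaultdict(list)
    for (paper_id, _, _), s_i, r_i in zip(reviews, s, r):
        per_paper[paper_id].append(s_i + rank_weight * r_i)  # q = s + 0.111 r
    averaged = {p: mean(qs) for p, qs in per_paper.items()}
    sd = pstdev(list(averaged.values()))
    return {p: q / sd for p, q in averaged.items()}  # scaled to unit SD


# Made-up reviews for three papers, two double-blind reviews each.
reviews = [
    ("p1", 4, 3), ("p1", -2, 1),
    ("p2", -6, 1), ("p2", -3, 2),
    ("p3", 6, 4), ("p3", 4, 3),
]
bpqs = blinded_paper_quality(reviews)
```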
Table 1. Summary of features and prevalence
Factor | Feature name | No. of papers | Fraction of papers, %
Paper from United States | United States | 176 | 35
Same country as reviewer | Same | 146 | 29
Female author | Wom | 219 | 44
Famous author | Fam | 81 | 16
Top university | Uni | 135 | 27
Top company | Com | 90 | 18
Bid Attractiveness Scores: Bids by Reviewer and Bids by Paper. By analogy
with bpqs, for modeling bid behavior we develop two direct scores to capture
the likelihood of a reviewer to bid and the likelihood of a paper to
receive bids. To encode the willingness of a particular reviewer to bid, we
calculate the total bids of that reviewer; we refer to this score as the “bids
by reviewer” (bbr). To score the intrinsic bid attractiveness of the paper,
we calculate the total number of bids on this paper by the double-blind
reviewers; we refer to this score as the “bids by paper” (bbp). In modeling
bids, we use both these scores as covariates.
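The two bid scores are simple counts. A minimal sketch follows, assuming bids arrive as (reviewer, paper) pairs and that every listed pair counts as one bid; how "maybe" bids were weighted is not stated, so that is left out. The function and variable names are hypothetical.

```python
from collections import Counter


def bid_scores(bids, double_blind_reviewers):
    """Return (bbr, bbp) counters from a list of (reviewer, paper) bids.

    bbr counts all bids entered by each reviewer.
    bbp counts bids on each paper from double-blind reviewers only.
    """
    bbr = Counter(reviewer for reviewer, _ in bids)
    bbp = Counter(
        paper for reviewer, paper in bids if reviewer in double_blind_reviewers
    )
    return bbr, bbp


# Illustrative data: r1 is single-blind, r2 and r3 are double-blind.
bids = [("r1", "p1"), ("r1", "p2"), ("r2", "p1"), ("r3", "p1")]
bbr, bbp = bid_scores(bids, double_blind_reviewers={"r2", "r3"})
```

Restricting bbp to double-blind bids keeps the attractiveness score free of author information, in the same spirit as bpqs.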
Data Sharing. We describe in Raw Data and Privacy why it is not possible
to release our raw data without risk of abrogating the privacy of the par-
ticipants in our study. We therefore follow the approach taken by Eckles
et al. (22) in a similar situation, providing sufﬁcient statistics for analysis.
See Raw Data and Privacy for full details on the shared data accompanying
this document.
Study Approval. This research has been approved by the Ethics Commit-
tee for Information Sciences at the University of Amsterdam and the Vrije
Universiteit Amsterdam. We obtained informed consent from participants
according to procedures approved by the committee.
Results
Summary of Results. We ﬁnd three signiﬁcant differences be-
tween single-blind and double-blind reviewing. First, we ﬁnd
that single-blind reviewers bid less proliﬁcally, entering about
22% fewer bids on average. Second, we ﬁnd that single-blind
reviewers bid preferentially on papers from top universities and
top companies, compared with their double-blind counterparts.
Third, we ﬁnd that single-blind reviewers are relatively more
likely than double-blind reviewers to submit a positive review for
papers with a famous author and for papers from a top university
or a top company.
Modeling Reviews. Our modeling approach is to predict the odds
that a single-blind reviewer gives a positive (accept) score to a
paper, using the following logistic regression model:
$$\frac{\Pr[\text{score} > 0]}{\Pr[\text{score} \le 0]} = e^{\langle \Theta, v \rangle}.$$
Θ is a set of learned coefficients, and v is a vector of features
consisting of a constant offset feature, the overall paper qual-
ity score bpqs deﬁned in Materials and Methods, and the seven
implicit bias Booleans in Table 1. Hence, the unit of analysis in
this model is a (single-blind reviewer, paper) pair.
We present the results of the logistic regression in Table
2. There are signiﬁcant nonzero coefﬁcients for the Com
(P= 0.002), Fam (P= 0.027) and Uni (P= 0.012) features. The
other features do not show signiﬁcant effects. The corresponding
odds multipliers are 2.10 for Com, 1.63 for Fam, and 1.58 for
Uni. Relative to the underlying quality score bpqs, these values
correspond to increases of 0.92 bpqs, 0.61 bpqs, and 0.57 bpqs
SDs. For Wom, the odds multiplier is 0.78, equivalent to -0.31
bpqs SDs, and is not statistically significant.
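The relationship between a logistic regression coefficient, its odds multiplier, and its bpqs equivalent can be checked with a short calculation. In the sketch below the coefficient magnitudes are the rounded values from Table 2; the negative sign for Wom is inferred from its odds multiplier being below 1, and the function names are hypothetical.

```python
import math

# Rounded coefficients from Table 2 (sign of Wom inferred from odds < 1).
coefficients = {"bpqs": 0.80, "Com": 0.74, "Fam": 0.49, "Uni": 0.46, "Wom": -0.25}


def odds_multiplier(coef):
    """Multiplicative change in the odds of a positive score: e^coef."""
    return math.exp(coef)


def bpqs_equivalent(coef, bpqs_coef=coefficients["bpqs"]):
    """Number of bpqs SDs that would shift the log-odds by the same amount."""
    return coef / bpqs_coef


for name, coef in coefficients.items():
    print(
        f"{name}: odds multiplier {odds_multiplier(coef):.2f}, "
        f"bpqs equivalent {bpqs_equivalent(coef):+.2f}"
    )
```

Because the inputs are rounded to two decimals, the results can differ from the published table in the last digit.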
Our hypothesis in undertaking this work was that it would be
very difﬁcult to see any effects on review behavior given the scale
of the data and the difﬁculty other studies have encountered
in ﬁnding signiﬁcant biases for single-blind reviewing. Thus, we
were surprised to encounter three signiﬁcant effects with sub-
stantial odds multipliers.
Modeling Bids. We take a similar approach to modeling bids, but
some changes are required, as a reviewer may bid for an arbitrary
number of papers.
As Fig. 1 suggests, the ﬁrst question we should reasonably ask is
whether single-blind and double-blind reviewers bid for the same
number of papers. We test this using a Mann–Whitney test and
ﬁnd that single-blind reviewers bid for fewer papers (P= 0.0002).
On average, single-blind reviewers bid for 19.9 papers compared
with 24.9 for double-blind reviewers, a decrease of 22%.
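For readers unfamiliar with the test, a Mann–Whitney comparison of bid counts can be sketched with the standard library alone. This version uses the normal approximation without a tie correction to the variance, so it is an illustrative approximation and not necessarily the procedure the authors ran; the bid counts in the example are made up.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test via the normal approximation.

    Returns (U statistic for x, approximate p-value). Tied values get
    average ranks; no tie correction is applied to the variance.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p


# Hypothetical bids-per-reviewer counts for the two pools.
single_blind_bids = [12, 15, 18, 20, 22, 25]
double_blind_bids = [19, 23, 26, 28, 30, 35]
u_stat, p_value = mann_whitney_u(single_blind_bids, double_blind_bids)
```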
Thus, the difference in behavior between the two reviewer
classes is quite signiﬁcant. We now ask a follow-up question:
Given that single-blind reviewers bid less, do they also bid dif-
ferently for particular types of papers? To answer this question,
we pursue a similar analysis to our regression study of review
scores. However, rather than including an overall paper quality
score (bpqs) into the regression, we instead include covariates
for the bid appetite of the reviewer (bbr) and the bid attractive-
ness of the paper (bbp) as described in Materials and Methods.
We retain the constant offset term.
The results are shown in Table 3. In addition to the differ-
ence in likelihood to bid, we also see that the Com and Uni fea-
tures are signiﬁcant, with P= 0.01 and P= 0.011, respectively,
indicating that the bids entered by single-blind reviewers tend to
favor top companies and universities with modest odds multipli-
ers of 1.17 and 1.13, respectively.
The Matilda Effect. As described above, there is signiﬁcant work
regarding the importance of author gender in reviewing (18).
Some of this work clearly points to lower assessments of scientiﬁc
merit for work purportedly authored by women. For neither bid-
ding nor reviewing are the effects (odds multipliers of 1.05 and
0.78, respectively) statistically signiﬁcant (P= 0.27, P= 0.16).
We reran the same logistic regression analysis from two addi-
tional perspectives: papers whose ﬁrst author is female (16.4% of
papers) and papers written by a strict majority of female authors
(3.8% of papers). In both cases, we do not see a signiﬁcant P
value for the Wom feature.
We also ran our analysis using US census data to identify pre-
dominantly male or female ﬁrst names, in the case that our hand
coding identiﬁed genders that would not be readily apparent to
reviewers. With this alternate coding, we also did not see a sig-
niﬁcant gender effect.
The inﬂuence of author gender on bidding or reviewing behav-
ior is not statistically signiﬁcant. However, the estimated effect
size for Wom is nonnegligible. In an expanded paper describing
this work, we performed a meta-analysis of our findings combined
with other studies reported in the literature on the effect of gender
on reviewing. By the standards of meta-analysis, the overall
effect against female authors can be considered statistically
significant (1).
Table 2. Learned coefficients and significance for review score prediction
Name | Coefficient | SE | Confidence interval | P value | Odds multiplier | bpqs equivalent
Const | -1.83 | 0.24 | [-2.31, -1.36] | 0.000 | 0.16 |
bpqs | 0.80 | 0.08 | [0.64, 0.97] | 0.000 | 2.23 | 1.00
Com | 0.74 | 0.24 | [0.27, 1.21] | 0.002 | 2.10 | 0.92
Fam | 0.49 | 0.22 | [0.05, 0.93] | 0.027 | 1.63 | 0.61
Uni | 0.46 | 0.18 | [0.09, 0.83] | 0.012 | 1.58 | 0.57
Wom | -0.25 | 0.18 | [-0.60, 0.10] | 0.160 | 0.78 | -0.31
Same | 0.14 | 0.24 | [-0.34, 0.62] | 0.564 | 1.15 | 0.17
Aca | 0.06 | 0.22 | [-0.38, 0.51] | 0.775 | 1.07 | 0.08
United States | 0.01 | 0.21 | [-0.42, 0.44] | 0.964 | 1.01 | 0.01
Table 3. Learned coefficients and significance for bid prediction
Name | Coefficient | SE | Confidence interval | P value | Odds multiplier
Const | -4.87 | 0.08 | [-5.04, -4.71] | 0.000 | 0.01
bbr | 0.05 | 0.00 | [0.04, 0.05] | 0.000 | 1.05
bbp | 0.08 | 0.00 | [0.07, 0.09] | 0.000 | 1.09
Com | 0.16 | 0.06 | [0.04, 0.28] | 0.010 | 1.17
Uni | 0.12 | 0.05 | [0.03, 0.22] | 0.011 | 1.13
Fam | 0.07 | 0.06 | [-0.06, 0.19] | 0.287 | 1.07
Wom | 0.05 | 0.04 | [-0.04, 0.14] | 0.268 | 1.05
United States | 0.02 | 0.05 | [-0.07, 0.11] | 0.681 | 1.02
Aca | 0.01 | 0.06 | [-0.10, 0.12] | 0.881 | 1.01
Aggregate Review Statistics. We checked the lengths of reviews
along with the distribution of scores and ranks across the single-
blind and double-blind conditions. The results are shown in
Table 4. Average review length for single-blind reviewers is 2,073
characters vs. 2,061 for double-blind, not signiﬁcantly longer for
either condition by Mann–Whitney test (P= 0.81). Scores and
ranks show a similar pattern, with no signiﬁcant difference in
either score or rank distribution.
Changes During Discussion. One may reasonably ask what hap-
pens after the experiment concludes and the discussion phase
begins. During this phase it is common to see some changes
in review scores. We analyzed these scores and saw 32 changes
to scores entered by single-blind reviewers compared with 41
changes to scores entered by double-blind reviewers. This dif-
ference is not signiﬁcant (Fisher exact, P= 0.28). We com-
pared the changes in scores to determine whether double-blind
reviewers tend to have changes of larger magnitude than single-
blind reviewers. The distributions of score changes are not sig-
niﬁcantly different (Mann–Whitney, P= 0.58). We then checked
whether double-blind reviewers tend to move more in the direc-
tion of the initial mean score than single-blind reviewers after
discovering the authors of the paper. Here also, we ﬁnd no dif-
ference in the magnitude of shifts toward the mean (Mann–
Whitney, P= 0.58). Hence, during the discussion phase, after the
authors have been revealed, we cannot conclude that the initially
double-blind reviewers behave differently relative to single-blind
reviewers.
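The Fisher exact comparisons used in this section operate on a 2x2 table of counts. A minimal standard-library sketch of the two-sided test is given below. The function name is hypothetical, and the denominators in the example (1,000 reviews per condition) are placeholders chosen for illustration, since the section reports only the numbers of changed scores.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        return comb(row1, k) * comb(row2, col1 - k) / denom

    observed = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    return sum(
        p
        for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= observed * (1 + 1e-9)  # tolerance for floating-point ties
    )


# Changed vs. unchanged scores; the totals of 1,000 are hypothetical.
p_changes = fisher_exact_two_sided(32, 1000 - 32, 41, 1000 - 41)
```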
Conﬂicts of Interest. One may hypothesize that in a double-blind
setting there will be fewer declared conﬂicts of interest, as
reviewers will not recognize possible conﬂicts. In WSDM 2017,
the EasyChair tool automatically (but imperfectly) detects con-
ﬂicts based on the email domains of authors and reviewers.
Reviewers may specify additional conﬂicts as they bid for papers.
It is possible to conﬁgure the system to allow authors to specify
conﬂicts with PC members at submission time, but we did not
enable this conﬁguration.
We consider the overall set of conﬂicts generated both auto-
matically by EasyChair and by reviewer speciﬁcation. We ﬁnd
that the total number of reviewers expressing a conﬂict (59/121
in the single-blind setting vs. 47/121 in the double-blind setting) is
not signiﬁcantly different (Fisher exact, P= 0.35). Likewise, the
number of conﬂicts expressed by those reviewers who express a
conflict does not differ significantly (Mann–Whitney, P = 0.63). Hence, in
the settings we adopted, we do not see that double-blind review-
ing introduces a signiﬁcant difference in expression of conﬂicts
of interest.
Discussion
Final decisions regarding acceptance to WSDM 2017 are made
by the program chairs, based on input from the senior program
committee. These decision-making stages took place after our
experiment was closed. Hence, we can conclude that reviewers
in the single-blind condition are more likely to recommend
acceptance for certain types of papers, but we cannot make sta-
tistical statements about the ﬁnal acceptance decision for the
paper. Nonetheless, we have shown stark differences in review-
ing behavior for a key part of the decision process.
Bidding Behavior. Bidding is a common and important part of
conference peer review in computer science (23). Our ﬁnd-
ings show single-blind reviewers entering fewer bids than their
double-blind counterparts. In general, a sparser bid landscape
results in lower-quality assignments of papers to knowledge-
able reviewers. Hence, single-blind reviewing may provide a dis-
advantage in the quality of overall decisions due to bid den-
sity alone.
At the same time, we observe that single-blind reviewers bid
preferentially for papers from top institutions. We do not have
data to argue the mechanisms that lead to this behavior; a nat-
ural hypothesis is that reviewers might use information about
the quality of the paper’s institution to estimate that the paper
is more likely to be of interest and might in response enter
a more positive bid on that paper. Whatever the mechanism,
papers from top institutions may encounter a relatively richer
pool of bids under single-blind reviewing and might therefore be
assigned to more knowledgeable reviewers than papers of equiv-
alent quality from lower-ranked institutions.
Reviewing Behavior. Our ﬁndings with respect to reviewing raise
similar questions. A reviewer who knows that a particular paper
is from a top school or company, or has a famous author, is sig-
niﬁcantly more likely to recommend acceptance than a reviewer
who does not know this information.
It is natural to conclude that two identical reviewers, one
given information about authors, will reach different conclu-
sions on the same paper. However, this is not exactly the sta-
tistical statement we are able to make. The two reviewers are
not identical, as they were produced by a paper allocation algo-
rithm based on different bid landscapes. Reviewers that bid on
a paper are more likely to be assigned to review the paper,
and, as we have already discussed, the bidding dynamics of the
two review models are different. It is possible, for example, that
the single-blind reviewer of a particular paper may have bid
on the paper due to knowledge of the author’s prior work, while
the double-blind reviewer may have bid due to the topic of the
paper implied by the title. The reviewers who wind up reading a
paper in the two conditions are not identically distributed, and
this should be taken into account in interpreting our ﬁndings.
That said, it is nonetheless true that, across the overall bidding,
allocation, and reviewing process, the single-blind reviewers with
knowledge of the authors and afﬁliations are much more positive
regarding papers from famous authors and top institutions than
their double-blind counterparts. We have reasonable basis for
the concern that authors who are not famous and not from a
top institution may see lower likelihood of acceptance for the
same work.
Table 4. Aggregate comparison of review statistics
Measure | Single-blind average | Double-blind average | Mann–Whitney P value
Review length | 2,073 | 2,061 | 0.81
Reviewer score | 2.07 | 1.90 | 0.51
Reviewer rank | 1.89 | 1.87 | 0.52
Conflict of Interest. The literature uses the term conflict of interest
in two distinct senses. First, as in Results, a reviewer might
have a conflict of interest with an author, for instance because
the reviewer was the author’s advisor and cannot in general be
expected to be impartial. In our particular setting, including the
automated conflict detection tools described above, our findings
show that a similar number of conflicts are discovered in single-blind
and double-blind settings. However, we expect this issue to
depend strongly on the particular capabilities of the conference
or journal management software, so our finding may not generalize
to all settings.
The term conﬂict of interest is also used if an author’s results
might inﬂuence his or her personal or professional success. For
example, if an author receives funding from the makers of a
pharmaceutical, the author might expect the funding to be at
risk if he or she publishes ﬁndings attacking the efﬁcacy of the
pharmaceutical. There is a concern that, under double-blind
reviewing, reviewers may be less able to recognize that such con-
ﬂicts exist (14). Our ﬁndings do not address this issue.
Methodological Questions. There are several questions one may
raise with respect to our experiment. First is the issue that we
study the behavior of the PC with respect to bidding and scoring
papers only. After these steps are complete, the SPC member
conducts some discussion among the reviewers, and the program
chairs make a ﬁnal decision. While Results suggests there may not
be signiﬁcant changes speciﬁcally in how reviewers modify their
scores during discussion, it is nonetheless possible that during
these stages, the ﬁnal acceptance decision may show unexpected
behaviors. This is clearly an area for further work. However, we
have observed that the critical inputs to this ﬁnal decision stage
(score and rank of reviewers) are impacted signiﬁcantly by the
reviewing method.
It is possible also that PC members behaved differently in our
setting than they would in a “pure” reviewing situation involving
only a single type of reviewing. First, while single-blind review-
ers in our experiment were shown author names and afﬁliations
in the software tools used for bidding and managing reviews, the
manuscripts themselves were anonymized. The effects might be
stronger if the author names and afﬁliations were visible in the
manuscript throughout the process of reading and reviewing it.
Second, reviewers may have recognized in discussion with colleagues
that they had access to different information or might have been influenced by a brief mention in the
conference call for papers stating that we would experiment with
double-blind reviewing this year (24). Based on such insights or
based on WSDM’s historical preference for single-blind review-
ing, it is possible that double-blind reviewers might have sought
author information on their own, further diminishing the distinc-
tion between the conditions.
Practical Issues with Double-Blind Reviewing. There is a long-
standing question whether it is practical to anonymize a sub-
mission. This question depends on the nature of the ﬁeld (for
instance, it would be impossible to anonymize work in a large
and well-known systems project). Hill and Provost (13) argue
that it is possible to automatically identify authors in many cases
based on the text of the paper alone. However, other studies
have observed that reviewers’ guesses about authorship are often
wrong (3).
A second issue in the practical difﬁculty of retaining anonymity
in double-blind reviewing is the increasingly common practice of
publishing early versions of work on arXiv.org. For example, this
paper appeared on arXiv before being submitted to any peer-
reviewed venue. This practice was a signiﬁcant contributor to
the decision of the American Economic Association to abandon
double-blind reviewing at its journals (14). WSDM 2017 did
not state a policy with regard to publishing preprints on arXiv,
but when asked, we discouraged but did not forbid such publi-
cation. In its 2016 call for papers (25), the Neural Information
Processing Systems (NIPS) machine-learning conference, which
performs double-blind reviewing, informed authors that prior
submissions on arXiv are allowed, but reviewers are asked “not
to actively look for such submissions.” If reviewers happened to
be aware of the work, NIPS nonetheless allows the reviewing
to proceed.
These practical issues appear to be signiﬁcant and unresolved.
Conclusion
The heart of our ﬁndings is that single-blind
reviewers make use of information about authors and institu-
tions. Speciﬁcally, single-blind reviewers are more likely to bid
on papers from top institutions and more likely to recommend
for acceptance papers from famous authors or top institutions,
compared with their double-blind counterparts.
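As a rough illustration of what “more likely to recommend for acceptance” means in quantitative terms, the following minimal sketch computes an unadjusted odds ratio, with an approximate 95% interval, from a 2 × 2 table of recommendations. The counts are hypothetical and invented for this example; they are not data from the experiment. The function names are likewise illustrative, and the calculation is a simplification that ignores the covariates a regression analysis would control for.

```python
from math import exp, log, sqrt


def odds_ratio(accept_a, reject_a, accept_b, reject_b):
    """Odds of a positive recommendation in group A relative to group B."""
    return (accept_a / reject_a) / (accept_b / reject_b)


def wald_interval(accept_a, reject_a, accept_b, reject_b, z=1.96):
    """Approximate interval for the odds ratio, computed on the log scale."""
    log_or = log(odds_ratio(accept_a, reject_a, accept_b, reject_b))
    # Standard error of the log odds ratio for a 2 x 2 table.
    se = sqrt(1 / accept_a + 1 / reject_a + 1 / accept_b + 1 / reject_b)
    return exp(log_or - z * se), exp(log_or + z * se)


# HYPOTHETICAL counts (not from the study): reviews of papers from top
# institutions, split by reviewing condition.
single_blind = {"accept": 30, "reject": 70}
double_blind = {"accept": 20, "reject": 80}

counts = (
    single_blind["accept"], single_blind["reject"],
    double_blind["accept"], double_blind["reject"],
)
ratio = odds_ratio(*counts)
low, high = wald_interval(*counts)
print(f"odds ratio = {ratio:.2f}, approx. 95% interval = ({low:.2f}, {high:.2f})")
```

An odds ratio above 1 in such a table would indicate that single-blind reviewers recommend these papers more readily than double-blind reviewers do; an interval that includes 1 would mean the difference cannot be distinguished from chance at that sample size.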
The primary ethical question is whether this behavior is
acceptable. In one interpretation, single-blind reviewers make
use of prior information that may allow them to make better
overall judgments. As a consequence, however, it may be that
other work is disadvantaged, in the sense that two contributions
of roughly equal merit might be scored differently by single-blind
reviewers, in favor of the one from a top school, while double-
blind reviewers may not show this bias as strongly.
Clearly, our understanding of the implications of reviewing
methodologies remains nascent. Nonetheless, we feel that pro-
gram and general chairs of conferences should seriously consider
the advantages of using double-blind reviewing.
ACKNOWLEDGMENTS. We acknowledge the support of Andrei Voronkov
and the team at easychair.org. Without their technical assistance, the exper-
iment would have been prohibitively difﬁcult. We are also grateful to the
Ethics Committee for Information Sciences at the University of Amsterdam
and the Vrije Universiteit Amsterdam for their valuable feedback on the
ethical structure of the experiment. Finally, we thank the general chairs of
the conference, Milad Shokouhi and Maarten de Rijke, for many detailed
discussions on these topics and the WSDM steering committee for their sup-
funding from the Natural Science Foundation of China (Grants 61672311
and 61532011).
1. Tomkins A, Zhang M, Heavlin WD (2017) Single versus double blind reviewing at
WSDM 2017. arXiv:1702.00502.
2. Lamont M (2010) How Professors Think: Inside the Curious World of Academic Judg-
ment (Harvard Univ Press, Cambridge, MA).
3. Snodgrass R (2006) Single- versus double-blind reviewing: An analysis of the literature.
ACM Sigmod Rec 35:8–21.
4. Largent EA, Snodgrass RT (2016) Blind peer review by academic journals. Blinding as
a Solution to Bias: Strengthening Biomedical Science, Forensic Science, and Law, eds
Robertson C, Kesselheim A (Academic, Cambridge, MA), pp 75–95.
5. Snodgrass RT (2007) Editorial: Single- versus double-blind reviewing. ACM Trans
Database Syst (TODS) 32:1–29.
6. McKinley KS (2008) Improving publication quality by reducing bias with double-blind
reviewing and author response. ACM Sigplan Not 43:5–9.
7. Schulzrinne H (2009) Double-blind reviewing: More placebo than miracle cure? ACM
SIGCOMM Comput Commun Rev 39:56–59.
8. Walker R, Rocha da Silva P (2015) Emerging trends in peer review—a survey. Front
Neurosci 9:169.
9. Budden AE, et al. (2008) Double-blind review favours increased representation of
female authors. Trends Ecol Evol 23:4–6.
10. Webb TJ, O’Hara B, Freckleton RP (2008) Does double-blind review beneﬁt female
authors? Trends Ecol Evol 23:351–353.
11. Madden S, DeWitt D (2006) Impact of double-blind reviewing on SIGMOD publication
rates. SIGMOD Rec 35:29–32.
12. Tung AKH (2006) Impact of double blind reviewing on SIGMOD publication: A more
detail analysis. SIGMOD Rec 35:6–7.
13. Hill S, Provost F (2003) The myth of the double-blind review?: Author identiﬁcation
using only citations. SIGKDD Explor Newsl 5:179–184.
14. Jaschik S (2011) Rejecting double blind. Available at
https://www.insidehighered.com/news/2011/05/31/american_economic_association_abandons_double_blind_journal_reviewing.
Accessed January 29, 2017.
15. Blank RM (1991) The effects of double-blind versus single-blind reviewing: Experi-
mental evidence from the American economic review. Am Econ Rev 81:1041–1067.
16. Roberts SG, Verhoef T (2016) Double-blind reviewing at EvoLang 11 reveals gender
bias. J Lang Evol 1:163–167.
17. Okike K, Hug KT, Kocher MS, Leopold SS (2016) Single-blind vs double-blind peer
review in the setting of author prestige. JAMA 316:1315–1316.
18. Knobloch-Westerwick S, Glynn CJ, Huge M (2013) The Matilda effect in science com-
munication. Sci Commun 35:603–625.
19. Merton RK (1968) The Matthew effect in science. Science 159:56–63.
20. Vrettas G, Sanderson M (2015) Conferences versus journals in computer science. J
Assoc Inf Sci Technol 66:2674–2684.
21. CORPORATE Committee on Academic Careers for Experimental Scientists and
CORPORATE Commission on Physical Sciences, Mathematics & Applications (1994)
Press, Washington, DC).
22. Eckles D, Kizilcec RF, Bakshy E (2016) Estimating peer effects in networks
with peer encouragement designs. Proc Natl Acad Sci USA 113:7316–7322.
23. Price S, Flach PA (2017) Computational support for academic peer review: A perspec-
tive from artiﬁcial intelligence. Commun ACM 60:70–79.
24. Tomkins A, Zhang M (2017) 2017 call for papers. Available at
www.wsdm-conference.org/2017/calls/papers/. Accessed January 29, 2017.
25. Luxburg U, Guyon I (2016) 2016 call for papers. Available at
https://nips.cc/Conferences/2016/CallForPapers. Accessed February 13, 2017.
26. Peters DP, Ceci SJ (1982) Peer-review practices of psychological journals: The fate of
published articles, submitted again. Behav Brain Sci 5:187–195.
27. Peters DP, Ceci SJ (2014) The Peters & Ceci study of journal publications. Available at
https://thewinnower.com/discussions/7-the-peters-ceci-study-of-journal-publications.
Accessed January 29, 2017.
28. Rothwell PM, Martyn CN (2000) Reproducibility of peer review in clinical neuro-
science: Is agreement between reviewers any greater than would be expected by
chance alone? Brain 123:1964–1969.
29. Lawrence N (2015) Get your NIPS reviews in! Available at
inverseprobability.com/2015/07/23/get-your-nips-review-in. Accessed January 29, 2017.
30. Lawrence N (2015) NIPS experiment analysis. Available at
inverseprobability.com/2015/03/30/nips-experiment-analysis. Accessed January 29, 2017.
6 of 6 |www.pnas.org/cgi/doi/10.1073/pnas.1707323114 Tomkins et al.
