Truth Discovery via Proxy Voting
RESHEF MEIR, Technion – Israel Institute of Technology, Israel
OFRA AMIR, Technion – Israel Institute of Technology, Israel
GAL COHENSIUS, Technion – Israel Institute of Technology, Israel
OMER BEN-PORAT, Technion – Israel Institute of Technology, Israel
LIRONG XIA, RPI, USA
Truth discovery is a general name for a broad range of statistical methods aimed at extracting the correct answers to questions from multiple answers coming from noisy sources, such as workers in a crowdsourcing platform. In this paper, we design simple truth discovery methods inspired by proxy voting, which give higher weight to workers whose answers are close to those of other workers.
We prove that under standard statistical assumptions, proxy-based truth discovery (P-TD) allows us to estimate the true competence of each worker, whether workers face questions whose answers are real-valued, categorical, or rankings. We then demonstrate through an extensive empirical study on synthetic and real data that P-TD is substantially better than unweighted aggregation and competes well with other truth discovery methods in all of the above domains.
“All happy families are alike; each unhappy family is unhappy in its own way.”
— Leo Tolstoy, Anna Karenina
Authors’ addresses: Reshef Meir, Technion – Israel Institute of Technology, Bloomfield Building, Technion City, Haifa, 3200003, Israel, reshefm@ie.technion.ac.il; Ofra Amir, Technion – Israel Institute of Technology, Bloomfield Building, Technion City, Haifa, 3200003, Israel, oamir@technion.ac.il; Gal Cohensius, Technion – Israel Institute of Technology, Bloomfield Building, Technion City, Haifa, 3200003, Israel, @ie.technion.ac.il; Omer Ben-Porat, Technion – Israel Institute of Technology, Bloomfield Building, Technion City, Haifa, 3200003, Israel; Lirong Xia, RPI, Troy, NY, USA, xial@cs.rpi.edu.
1 INTRODUCTION
Consider a standard crowdsourcing task such as image labeling [12, 28] or corpus annotation [32].¹
Such tasks are often used to construct large databases that can later be used to train and test machine learning algorithms. Crowdsourcing workers are usually not experts, thus answers obtained this way often contain many mistakes [23, 37, 39]. A simple approach to improve accuracy is to ask the same question to a number of workers and to aggregate their answers by some aggregation rule.
Truth discovery is a general name for a broad range of methods that aim to extract some underlying ground truth from noisy answers. While the mathematics of truth discovery dates back to the early days of statistics, at least to the Condorcet Jury Theorem [10], the rise of crowdsourcing platforms suggests an exciting modern application for truth discovery.
The use of aggregation rules suggests a natural connection between truth discovery and social choice, which deals with the aggregation of voters’ opinions and preferences. Indeed, voting rules have proven useful in the design of truth discovery and crowdsourcing techniques [6, 11, 27]. It is our intention to further explore and exploit these connections in the current paper.
In political and organizational elections, a common practice is to allow voting-by-proxy, where some voters let others vote on their behalf [5, 15]. Thus the aggregation is performed over a subset of “active” voters, who are weighted by the number of “inactive” voters with similar opinions. In a recent paper, Cohensius et al. [9] showed that under some assumptions on the distribution of preferences, proxy voting reduces the variance of the outcome, and thus requires fewer active voters to reach the socially-optimal alternative. Cohensius et al. suggested that an intuitive explanation for the effectiveness of proxy voting lies in the fact that the more ‘representative’ voters tend to also be similar to one another, but did not provide formal justifications of that claim. Further, in their model the designer seeks to approximate the subjective “preferences of the society,” whereas truth discovery is concerned with questions for which there is an objective ground truth.
In this paper, we consider algorithms for truth discovery that are inspired by proxy voting, with
crowdsourcing as our main motivation. Our goal is to develop a simple approach for tackling the
following challenge in a broad range of domains: We are given a set of workers, each answering
multiple questions, and want to: (a) identify the competent workers; and (b) aggregate workers’
answers such that the outcome will be as close as possible to the true answers. These challenges
are tightly related: a good estimate of workers’ competence allows us to use better aggregation methods (e.g., by giving higher weight to good workers); and aggregated answers can be used as an approximation of the ground truth to assess workers’ competence. Indeed, several current approaches in truth discovery tackle these goals jointly (see Related Work below).
Our approach decouples the above goals. For (a), we apply proxy voting to estimate each worker’s competence, where each worker increases the estimated competence of similar workers. For the truth discovery problem (b), we then use a straightforward aggregation function (e.g., Majority or Average), giving a higher weight to more competent workers (i.e., those who are closer to others).
We depart from previous work on proxy voting mentioned above by dropping the assumption that each worker delegates her vote to the (single) nearest proxy. While this requirement makes sense in a political setting so as to keep the voting process fair (one vote per person), it is somewhat arbitrary when our only goal is a good estimation of workers’ competence and the ground truth.
¹Crowdsourcing is also used for a variety of tasks in which there is no “ground truth,” such as studying vocabulary size [24], rating the quality of an image [31], etc. In this paper, we focus only on questions for which there is a well-defined true answer, as in the examples above.
For analysis purposes, we use distances rather than proximity. We assume that each worker has some underlying fault level $f_i$ that is the expected distance between her answers and the ground truth. The question of optimal aggregation when competence/fault levels are known is well studied in the literature, and hence our main challenge is to estimate fault levels. To capture the positive influence of similar workers, we define the proxy distance of each worker as her average distance from all other workers. While the similarity between workers has been considered in the literature (see Related Work), we are unaware of any systematic study of its uses. Our main theoretical result can be written as follows:

Theorem (Anna Karenina principle, informal). The expected proxy distance of each worker is linear in her fault level $f_i$.
Essentially, the theorem says that, as in Tolstoy’s novel, “good workers are all alike” (thereby boosting one another’s proxy distances), whereas “each bad worker is bad in her own way” and thus not particularly close to other workers. The exact linear function depends on the statistical model, and in particular on whether the data is categorical or continuous.
The Anna Karenina principle suggests a natural proxy-based truth-discovery (P-TD) algorithm, that first estimates fault levels based on proxy distances, and then uses standard techniques from the truth discovery literature to assign workers’ weights and aggregate their answers. We emphasize that a good estimate of $f_i$ may be of interest regardless of the aggregation procedure (this is goal (a) above). For example, the operators of the crowdsourcing platform may use it to decide on payments for workers or for terminating the contract with low-quality workers.
1.1 Contribution and paper structure
Our main theoretical contribution is a formal proof of Theorem 1 in the following domains: (i) when answers are continuous with independent normal noise, and where the answers of each worker $i$ have variance $f_i$; (ii) when answers are categorical and each worker $i$ fails each question independently with probability $f_i$; (iii) when answers are rankings of alternatives sampled from the Condorcet noise model with parameter $f_i$. In all three domains, the parameters of the linear function depend on the distribution from which fault levels are sampled. We show conditions under which our estimates of the true fault levels and of the ground truth are statistically consistent. In the continuous domain, we further show that our proxy-based method generalizes another common approach for fault estimation.
We devote one section to each domain (continuous, categorical, rankings). In each section, the theoretical results are followed by an extensive empirical evaluation of truth discovery methods on synthetic and real data. We compare P-TD to standard (unweighted) aggregation and to other approaches suggested in the truth discovery literature. We show that P-TD is substantially better than straightforward (unweighted) aggregation, and often beats other weighted aggregation approaches. We also show how to extend P-TD to an iterative algorithm that competes well with more advanced approaches.
Due to space constraints and to allow continuous reading, most proofs are deferred to appendices. The appendices also contain additional figures showing that the findings brought in the paper apply broadly.
2 PRELIMINARIES
We denote by $\llbracket Z \rrbracket$ the indicator variable of the Boolean condition $Z$. $\Delta(Z)$ is the set of probability distributions over a set $Z$.
A domain is a tuple $\langle X, d\rangle$ where $X$ is a set of possible world states; and $d : X \times X \to \mathbb{R}_+$ is a distance measure.
We assume there is a fixed set of $n$ workers, denoted by $N$. An instance $I = \langle S, z\rangle$ in domain $\langle X, d\rangle$ is a set of reports $S = (s_i)_{i\in N}$ where $s_i \in X$ for all $i \in N$; and a ground truth $z \in X$. To make things concrete, we will consider three domains in particular:
• Continuous domain. Here we have $m$ questions with real-valued answers. Thus $X = \mathbb{R}_+^m$; and we define $d_E$ to be the squared normalized Euclidean distance.
• Categorical domain. Here we have $m$ questions with categorical answers in some finite set $A$. Thus $X = A^m$; and we define $d_H$ as the normalized Hamming distance.
• Rankings domain. Here $X$ is the set of all orders over a set of elements $C$, with the Kendall-tau distance $d_{KT}$.
A noise model in domain $\langle X, d\rangle$ is a function $h$ from $X \times \mathbb{R}_+$ to $\Delta(X)$. We consider noise models that are informative, in the sense that: (I) $h(z, 0)$ returns $z$ w.p. 1 (i.e., without noise); and (II) $\mathbb{E}_{s \sim h(z,f)}[(d(z,s))^t]$ is strictly increasing in $f$, for any moment $t$ (i.e., higher $f$ means more noise).
A population in domain $\langle X, d\rangle$ is a set of $n$ workers, each with a fault level $f_i$. A proto-population is a distribution $\mathcal{F}$ over fault levels. We denote by $\mu(\mathcal{F})$, $\vartheta(\mathcal{F})$ the mean and variance of distribution $\mathcal{F}$, respectively. We omit $\mathcal{F}$ when clear from the context. Note that the higher $\mu$ is, the more erroneous answers we should expect.
Generated instances. Fix a particular domain $\langle X, d\rangle$. Given a population $f = (f_i)_{i\in N}$, a noise model $h$, and a ground truth $z \in X$, we can generate an instance $I = \langle S, z\rangle$ by sampling each answer $s_i$ independently from $h(z, f_i)$. We can similarly generate instances from a proto-population $\mathcal{F}$ by first sampling each $f_i$ from $\mathcal{F}$, and then sampling an instance from the resulting population.
Aggregation. An aggregation function in a domain $\langle X, d\rangle$ is a function $r : X^n \to X$. In this work we consider simple aggregation functions: $r_M$ is the Mean function in the continuous domain; and $r_P$ is the Plurality function in the categorical domain. The functions are applied to each question independently. Formally, $(r_M(S))_j = \frac{1}{n}\sum_{i\in N} s_{ij}$ and $(r_P(S))_j = \mathrm{argmax}_{x\in A}\,|\{i : s_{ij} = x\}|$.
In the ranking domain we apply several popular voting rules,² including Plurality, Borda, Veto, Copeland, and Kemeny. All the aggregation functions we consider have natural weighted analogs, denoted as $r(S, w)$ for a weight vector $w$.
Aggregation errors. Given an instance $I = \langle S, z\rangle$ and an aggregation function or algorithm $r$ in domain $\langle X, d\rangle$, the error of $r$ on $I$ is defined as $d(z, r(S))$. The goal of truth discovery is to find aggregation functions that tend to have low error.
For the convenience of the reader, a list of notation and acronyms is available at the end of the appendix.
2.1 A general Proxy-based Truth-Discovery scheme
Given an instance $I$ and a basic aggregation function $r$, we apply the following workflow, whose specifics depend on the domain.
(1) Collect the answers $S = (s_i)_{i\in N}$ from all workers.
(2) Compute the pairwise distance $d(s_i, s_{i'})$ for each pair of workers.
(3) For each worker $i$:
 (a) Compute the proxy distance $\pi_i$ by averaging over all pairwise distances:
  $$\pi_i := \frac{1}{n-1}\sum_{i' \neq i} d(s_i, s_{i'}). \qquad (1)$$
 (b) Estimate the fault level $\hat f_i$ from $\pi_i$.
 (c) Transform the fault level $\hat f_i$ to a weight $w_i$.
(4) Aggregate and return the estimated true answers $\hat z = r(S, w)$.

²The more accurate term for functions that aggregate several rankings into a single ranking is social welfare functions [4].
By Theorem 1, the (expected) proxy distance of each agent is a linear transformation of her fault level. Thus, we implement step 3b by reversing the linear transformation to get an estimated fault level $\hat f_i$, and then use known results to obtain the (estimated) optimal weight in step 3c. The details are given in the respective sections, where we refer to the algorithms that return $\hat f$ and $\hat z$ as Proxy-based Estimation of Fault Levels (P-EFL) and Proxy-based Truth Discovery (P-TD), respectively.
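To make the scheme concrete, the following is a minimal Python sketch of the four steps above; the arguments distance, estimate_fault, fault_to_weight, and aggregate are hypothetical plug-ins of ours, standing in for the domain-specific choices developed in the following sections.

```python
import numpy as np

def proxy_truth_discovery(S, distance, estimate_fault, fault_to_weight, aggregate):
    """Generic P-TD workflow (steps 1-4); all plug-ins are domain specific."""
    n = len(S)
    # Step 2: pairwise distances between workers' reports.
    D = np.array([[distance(S[i], S[j]) for j in range(n)] for i in range(n)])
    # Step 3a: proxy distance = average distance to all other workers (Eq. (1)).
    pi = D.sum(axis=1) / (n - 1)
    # Steps 3b-3c: estimated fault levels, then weights.
    w = fault_to_weight(estimate_fault(pi))
    # Step 4: weighted aggregation.
    return aggregate(S, w)
```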
3 CONTINUOUS ANSWERS DOMAIN
As specified in the preliminaries, each report is a vector $s \in \mathbb{R}^m$. The distance measure we use is the normalized squared Euclidean distance:³ $d_E(s, s') = \frac{1}{m}\sum_{j\le m}(s_j - s'_j)^2$.
The Independent Normal Noise model. For our theoretical analysis, we assume independent normal noise (INN). Formally, $s_i := z + \epsilon_i$, where $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{im}) \sim \mathcal{N}(0, \Sigma_i)$ is an $m$-dimensional noise. In the simplest case the noise of each worker is i.i.d. across dimensions, whereas workers are also independent but with possibly different variance. We further assume that questions are equally difficult⁴ and not correlated given the fault level, i.e., $\Sigma_i = I \cdot f_i$. Note that $\mathbb{E}[d_E(s_i, z) \mid f_i] = \mathrm{Var}[\epsilon_i \mid f_i] = f_i$.
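As an illustration, here is a small sketch of ours (not from the paper) that generates an INN instance; drawing fault levels from a Normal proto-population truncated at 0.1 mirrors the setup used for Fig. 1, and is an assumption of the example only.

```python
import numpy as np

def sample_inn_instance(z, fault_levels, rng):
    """Worker i reports z plus i.i.d. Gaussian noise with variance f_i
    on every question (Sigma_i = I * f_i)."""
    z = np.asarray(z, dtype=float)
    return np.stack([z + rng.normal(0.0, np.sqrt(f), size=z.shape)
                     for f in fault_levels])

rng = np.random.default_rng(1)
f = np.maximum(rng.normal(1.0, 1.0, size=5), 0.1)  # illustrative proto-population
S = sample_inn_instance(np.zeros(4), f, rng)       # 5 workers, 4 questions
```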
3.1 Estimating Fault Levels
Given an instance $I = \langle S, z\rangle$, our first goal is to get a good estimate of the true fault level.
Fault estimation by distance from the empirical mean. In situations where there is a simple aggregation method $r$, a simple approach, which is a step in many truth discovery algorithms, is to estimate the quality of each worker according to her distance from the aggregated outcome [26]. We name this approach Estimating Fault Levels by Distance from the Outcome (D-EFL). In the continuous domain, we use D-EFL (Alg. 1) where $r$ is the mean function and $d$ is the squared Euclidean distance; however, we leave the notation general, as the algorithm can be used in other domains with appropriate distance and aggregation functions.
We analyze the properties of D-EFL later, in Section 3.2. But first, we describe our own approach, which relies on the proxy distance.
Fault estimation by proxy distance. Applying Eq. (1) to the continuous domain, we get that the proxy distance of each worker is
$$\pi_i(I) = \frac{1}{n-1}\sum_{i'\neq i} d_E(s_i, s_{i'}) = \frac{1}{(n-1)m}\sum_{i'\neq i}\sum_{j\le m}(s_{ij} - s_{i'j})^2.$$
Note that once $f_i$ is fixed, the proxy distance $\pi_i$ is a random variable that depends on two separate randomizations: first, the sampling of the other workers' fault levels from the proto-population $\mathcal{F}$; and second, the realization of a particular instance $I = \langle z, S\rangle$, where $s_i \sim h_{INN}(z, f_i)$.
Theorem 1 (Anna Karenina principle for the INN model). Suppose that instance $I = \langle S, z\rangle$ is sampled from proto-population $\mathcal{F}$ via the INN model. For every worker $i$, $\mathbb{E}[\pi_i(I) \mid f_i] = f_i + \mu$.
³The squared Euclidean distance is not a true metric (it violates the triangle inequality), but this is not required for our needs. The squared Euclidean distance is often used as a dissimilarity measure in various clustering applications [7, 8, 25].
⁴In the INN model the equal-difficulty assumption is a normative decision, since we can always scale the data. Essentially, it means that we measure errors in standard deviations, giving the same importance to all questions.
ALGORITHM 1: Estimate-Fault-Levels-by-Distance-from-Outcome (D-EFL)
Input: Dataset $S$
Output: Fault level estimates $(\hat f_i^0)_{i\in N}$
 Estimate the ground truth as $y^0 \leftarrow r(S)$;
 for each worker $i \in N$ do
  Set $\hat f_i^0 \leftarrow d(s_i, y^0)$;
 end

ALGORITHM 2: Proxy-based-Estimate-Fault-Levels (P-EFL for continuous answers)
Input: Dataset $S$; parameter $u$.
Output: Fault level estimates $(\hat f_i)_{i\in N}$
 Compute $d_{ii'} \leftarrow d_E(s_i, s_{i'})$ for every pair of workers;
 For each worker $i \in N$, set $\pi_i \leftarrow \frac{1}{n-1}\sum_{i'\neq i} d_{ii'}$;
 Set $\hat\mu \leftarrow$ Estimate-Mu$(S)$;
 For each worker $i \in N$, set $\hat f_i \leftarrow \pi_i - \hat\mu$;
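A minimal NumPy rendering of both algorithms (a sketch under the notation above; the function names are ours):

```python
import numpy as np

def d_efl(S):
    """Algorithm 1 (D-EFL): distance of each worker from the unweighted mean."""
    y0 = S.mean(axis=0)                      # estimated ground truth r(S)
    return ((S - y0) ** 2).mean(axis=1)      # normalized squared Euclidean distances

def p_efl(S, u=0.0):
    """Algorithm 2 (P-EFL, continuous): proxy distance minus the estimate of mu."""
    n = S.shape[0]
    D = ((S[:, None, :] - S[None, :, :]) ** 2).mean(axis=2)  # pairwise d_E
    pi = D.sum(axis=1) / (n - 1)             # proxy distances (Eq. (1))
    mu_hat = u * d_efl(S).sum()              # Algorithm 3 with parameter u
    return pi - mu_hat
```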
Proof. Denote $d_{ii'} := d_E(s_i, s_{i'})$, which is a random variable.
$$\mathbb{E}[\pi_i(I) \mid f_i] = \mathbb{E}_{f_{-i}\sim \mathcal{F}^{n-1}}\Big[\mathbb{E}_{s_1\sim \mathcal{N}(z,f_1),\ldots,s_n\sim \mathcal{N}(z,f_n)}\Big[\frac{1}{n-1}\sum_{i'\neq i} d_{ii'} \,\Big|\, f_1,\ldots,f_n\Big] \,\Big|\, f_i\Big]$$
$$= \mathbb{E}_{f_{-i}}\Big[\frac{1}{n-1}\sum_{i'\neq i}\mathbb{E}_{s_i,s_{i'}}[d_{ii'} \mid f_i, f_{i'}]\Big] = \frac{1}{n-1}\sum_{i'\neq i}\mathbb{E}_{f_{i'}\sim\mathcal{F}}\Big[\mathbb{E}_{s_i,s_{i'}}\Big[\frac{1}{m}\sum_{j=1}^m (s_{ij}-s_{i'j})^2 \,\Big|\, f_i,f_{i'}\Big]\Big].$$
We use the fact that the difference of two independent normal variables is also a normal variable whose expectation is the difference of expectations, and whose variance is the sum of the two variances. Denote $x = \epsilon_{i1} - \epsilon_{i'1}$; then $\mathbb{E}[x \mid f_i, f_{i'}] = \mathbb{E}[s_{i1} \mid f_i] - \mathbb{E}[s_{i'1} \mid f_{i'}] = z_1 - z_1 = 0$; since $f_i$ is the variance of $s_{ij}$ for all $j$,
$$\mathrm{Var}[x \mid f_i, f_{i'}] = \mathrm{Var}[s_{i1} \mid f_i] + \mathrm{Var}[s_{i'1} \mid f_{i'}] = f_i + f_{i'}. \qquad (2)$$
We continue by computing the inner expression for $i' \neq i$:
$$\mathbb{E}_{f_{i'}\sim\mathcal{F}}\Big[\mathbb{E}_{s_i,s_{i'}}\Big[\frac{1}{m}\sum_{j=1}^m(s_{ij}-s_{i'j})^2 \,\Big|\, f_i,f_{i'}\Big]\Big] = \mathbb{E}_{f_{i'}\sim\mathcal{F}}\big[\mathbb{E}_{s_i,s_{i'}}[(s_{i1}-s_{i'1})^2 \mid f_i,f_{i'}]\big]$$
$$= \mathbb{E}_{f_{i'}\sim\mathcal{F}}\big[\mathbb{E}[x^2 \mid f_i,f_{i'}]\big] = \mathbb{E}_{f_{i'}\sim\mathcal{F}}\big[\mathrm{Var}[x \mid f_i,f_{i'}] + (\mathbb{E}[x \mid f_i,f_{i'}])^2\big]$$
$$= \mathbb{E}_{f_{i'}\sim\mathcal{F}}\big[\mathrm{Var}[x \mid f_i,f_{i'}]\big] = \mathbb{E}_{f_{i'}\sim\mathcal{F}}[f_i + f_{i'}] = f_i + \mathbb{E}_{f_{i'}\sim\mathcal{F}}[f_{i'}] = f_i + \mu.$$
Finally,
$$\mathbb{E}[\pi_i(I) \mid f_i] = \frac{1}{n-1}\sum_{i'\neq i}(f_i + \mu) = f_i + \mu,$$
as required. □
By Theorem 1, given $\pi_i(I)$ and an estimate of $\mu$, we can extract an estimate of $f_i$, which suggests the P-EFL algorithm (Alg. 2).
Estimating parameters. What should be the value of $\hat\mu$? If we know $\mu(\mathcal{F})$, we can of course use it. Otherwise, we can estimate it from the data. We first argue that lower values of $\hat\mu$ result in a more conservative estimation of $f_i$: Consider two workers with $f_1 > f_2$ and some $\pi_1, \pi_2$. Denote by $\hat f_i^{\hat\mu}$ the estimate we get when using some parameter $\hat\mu$. Then it is easy to see that the ratio between $\hat f_1^{\hat\mu}$ and $\hat f_2^{\hat\mu}$ gets closer to 1 as we pick a smaller $\hat\mu$.
We define Algorithm 3 for estimating $\mu$ as $\hat\mu := u \cdot \sum_{i\in N} d(s_i, y^0)$ (where $y^0 = r(S)$). If we use $u = \frac{1}{n}$, then $\hat\mu$ is the average of the $\hat f_i^0$. By the argument above, lower values of $u$ result in a more conservative estimation; therefore a default conservative value we could use is $u = 0$, in which case the estimated fault level $\hat f_i$ in P-EFL equals $\pi_i$.
We can see the output of the P-EFL algorithm on four different instances in Fig. 1. The blue dots and green dots are the output when using $u = 0$ and $u = \frac{1}{n}$, respectively. Ideally, we would like the estimated fault to be on the solid green line. Note that in real datasets, we do not have access to the “true” $f_i$, and we use the distance from the ground truth instead.
We show in the next subsections that the value $u = \frac{1}{n-1}$ (which is less conservative than both) is also of interest.
3.2 Equivalence and Consistency of P-EFL and D-EFL
Theorem 2 (Equivalence of P-EFL and D-EFL). Denote by $\hat f$, $\hat f^0$ the output of algorithms $\frac{1}{n-1}$-P-EFL and D-EFL, respectively. For any instance $I$, and any worker $i$, $\hat f_i = \frac{n}{n-1}\hat f_i^0$.
Note that $\frac{n}{n-1}$ does not depend on the particular instance or on the identity of the worker. Moreover, since in the continuous domain only relative fault matters (multiplying all $f_i$ by a constant is just a change of scale), we get D-EFL as a special case of the proxy-based algorithm. Note that this equivalence does not depend on any statistical assumptions.
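Since the equivalence is exact and assumption-free, it can be checked numerically on an arbitrary matrix, e.g., using the d_efl and p_efl sketches above:

```python
import numpy as np

rng = np.random.default_rng(2)
S = rng.normal(size=(7, 12))     # any instance; no noise model is needed
n = S.shape[0]
# Theorem 2: 1/(n-1)-P-EFL equals (n/(n-1)) * D-EFL, worker by worker.
assert np.allclose(p_efl(S, u=1.0 / (n - 1)), n / (n - 1) * d_efl(S))
```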
Theorem 1 does not guarantee that the estimated fault levels are good estimates. We want to verify that, at least for large instances from the INN model, they converge to the correct value. More precisely, an algorithm is consistent under a statistical model if, for any ground truth parameter and any $\tau > 0$, the probability for the outcome of the algorithm to be more than $\tau$ away from the ground truth according to some measure goes to 0 as the data size grows.
Theorem 3 (Consistency of D-EFL (continuous)). When $\mathcal{F}$ has bounded support and $z$ is bounded, D-EFL is consistent as $n \to \infty$, $m = \omega(\log n)$ and $n = \omega(\log m)$. That is, $|\hat f_i^0 - f_i| \to 0$ for all $i \in N$, as $n \to \infty$, $m = \omega(\log n)$ and $n = \omega(\log m)$.
3.3 Aggregation
When fault levels are known, the best aggregation method is well understood.
Proposition 4 ([1]). Under the Independent Normal Noise model, $x^* = \frac{\sum_{i\le n} s_i/f_i}{\sum_{i\le n} 1/f_i}$ minimizes $\mathbb{E}[d_E(z, x) \mid s_1, \ldots, s_n]$.
That is, the optimal way to aggregate the data is by taking a weighted mean of the answers to each question, where the weight of each worker is inversely proportional to her variance (i.e., to her fault level).
Prop. 4 suggests the following algorithmic skeleton (Alg. 4) for aggregating continuous answers. A common approach to truth discovery is to combine Algorithm 4 with Algorithm 1 (D-EFL). We refer to this simple algorithm as the Distance-based Truth Discovery (D-TD) algorithm. We define the Proxy Truth Discovery (P-TD) algorithm for the continuous domain by similarly combining Algorithm 4 with $u$-P-EFL (Algorithm 2). When using $u = \frac{1}{n-1}$, the P-TD and D-TD algorithms coincide by Thm. 2.
Note that both algorithms are well defined for any instance, whether the assumptions of the INN model hold or not. Moreover, in the INN model, Theorems 1 and 3 guarantee that with enough workers and questions, $\hat f_i$ is a good estimate of the real fault level $f_i$.
This of course does not yet guarantee that either algorithm returns accurate answers. For this, we need the following two results. The first says that a good approximation of $f$ entails a good approximation of $z$; and the second says that in the limit, D-TD (and thus P-TD) returns the correct answers. Recall that $x^* = \frac{1}{n}\sum_{i\in N} w_i^* s_i$ is the best possible estimation of $z$ by Prop. 4.
Theorem 5. For any instance such that $\hat f_i \in (1 \pm \delta)f_i$ for all $i \in N$, for some $\delta \le 0.25$, it holds that $d(\hat z, z) \le d(x^*, z) + O(\delta \cdot \max_{i,j}(s_{ij})^2)$.
ALGORITHM 3: Estimate-Mu
Input: Dataset $S$; parameter $u$.
Output: Estimate $\hat\mu$
 Set $\hat f^0 \leftarrow$ D-EFL$(S)$;
 Set $\hat\mu \leftarrow u \cdot \sum_{i\in N}\hat f_i^0$;
Fig. 1. In each figure, the x-axis shows the true fault levels of workers in a population of $n = 40$ workers. The populations in the top row are sampled from Normal proto-populations (truncated at 0.1), with $m = 15$ questions. The y-axis shows the proxy distance of each worker (blue circles), and the output of the P-EFL algorithm (green dots). The dashed blue line shows the expected proxy distance by Theorem 1. The green line is the identity function. The bottom figures show the same data for two samples from real datasets.
Theorem 6 (Consistency of D-TD (continuous)). When $\mathcal{F}$ has bounded support and $z$ is bounded, D-TD is consistent as $n \to \infty$, $m = \omega(\log n)$ and $n = \omega(\log m)$. That is, for any $\tau > 0$, $\Pr[d(\hat z, z) > \tau] \to 0$ as $n \to \infty$, $m = \omega(\log n)$ and $n = \omega(\log m)$.
3.4 Empirical Results
We compared the performance of $\frac{1}{n-1}$-P-TD (which coincides with D-TD) to the baseline method UA on synthetic and real data. In addition, we created an “Oracle Aggregation” (OA) baseline, which runs the aggregation skeleton with the true fault level $f_i$ when available, or the empirical fault $d(s_i, z)$ otherwise. We also tried other values of the parameter $u$, which yielded similar results.
We generated instances from the INN model, where $\mathcal{F} = \mathcal{N}(1, 1)$ (additional distributions in Appendix B). Each instance was generated by first sampling a population $f$ from $\mathcal{F}$, and then generating the instance from $f$. The Buildings dataset was collected via Amazon Mechanical Turk. The Triangles dataset is from [18] (see Appendix B.1 for further details). We used each such dataset as a distribution over instances, where for each instance we sampled $m$ questions and $n$ workers uniformly at random with replacement. We then normalized each question so that its answers have mean 0 and variance 1. For every combination of $n$ and $m$ we sampled 500 instances.
We can see in Fig. 2 that the P-TD/D-TD method is substantially better than the unweighted mean almost everywhere.
ALGORITHM 4: Aggregation skeleton (continuous)
Input: Dataset $S$; parameter $u$.
Output: Answers $\hat z$
 $\hat f \leftarrow$ Estimate-Fault-Levels$(S, u)$;
 $\forall i \in N$, set $w_i \leftarrow \frac{1}{\hat f_i}$;
 Set $\hat z \leftarrow r_M(S, w)$;
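Combining Algorithm 4 with the p_efl sketch above gives the following continuous P-TD sketch; the clipping of nonpositive fault estimates is our own guard, not part of the algorithm.

```python
import numpy as np

def p_td_continuous(S, u=0.0):
    """Algorithm 4 + P-EFL: inverse-fault weights, then a weighted mean per question."""
    f_hat = np.maximum(p_efl(S, u), 1e-9)   # guard against nonpositive estimates
    w = 1.0 / f_hat
    return (w[:, None] * S).sum(axis=0) / w.sum()
```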
Fig. 2. A comparison of weighted to unweighted aggregation on a synthetic distribution (left) and two real datasets. In each cell, the number on the color scale indicates the ratio of average errors. Areas where the ratio is < 1 are blue with ▽ and indicate an advantage to P-TD (which is equivalent to D-TD); areas where the ratio is > 1 are red with ▲ and indicate an advantage to UA. Gray means a tie, and ∗ means that both error rates are negligible, or a missing data point.
4 CATEGORICAL ANSWERS DOMAIN
In this setting we have $m$ categorical (multiple-choice) questions. The ground truth is a vector $z \in A^m$, where $|A| = k$. The distance measure we use is the normalized Hamming distance (note that for binary labels it coincides with the squared Euclidean distance):
$$d_H(s, s') := \frac{1}{m}|\{j \le m : s_j \neq s'_j\}| = \frac{1}{m}\sum_{j\le m}\llbracket s_j \neq s'_j\rrbracket. \qquad (3)$$
We follow the same scheme as in Section 3: given an instance $I = \langle S, z\rangle$, we first show how to estimate workers' fault levels under a simple noise model, then transform them to weights and aggregate the multiple answers.
Independent Errors Model. For our theoretical analysis, we assume an independent error (IER) model. Formally, for every question $j$ and every worker $i \in N$, $s_{ij} \neq z_j$ with probability $f_i$; and for any $x \in A \setminus \{z_j\}$, $s_{ij} = x$ w.p. $\frac{f_i}{k-1}$. That is, all wrong answers occur with equal probability. Note that $\mathbb{E}[d_H(s_i, z) \mid f_i] = f_i$.
We denote by $\theta := \frac{1}{k-1}$ the probability that two workers who are wrong select the same answer.
4.1 Estimating Fault Levels
Fault estimation by distance from the Plurality outcome. As in the continuous case, it is common practice to use a simple aggregation method (in this case Plurality) to estimate fault levels. We similarly denote the estimated fault level by $\hat f_i^0 := d(s_i, r_P(S))$, and refer to it as the D-EFL algorithm for the categorical domain.
Fault estimation by proxy distance. Applying the definition of the proxy distance (Eq. (1)) to the categorical domain, we get:
$$\pi_i(I) = \frac{1}{n-1}\sum_{i'\neq i} d_H(s_i, s_{i'}) = \frac{1}{(n-1)m}\sum_{i'\neq i}\sum_{j\le m}\llbracket s_{ij} \neq s_{i'j}\rrbracket.$$
ALGORITHM 5: Proxy-based-Estimate-Fault-Levels (P-EFL, categorical)
Input: Dataset $S$; parameter $u$
Output: Fault level estimates $\hat f$
 Set $d_{ii'} \leftarrow d(s_i, s_{i'})$ for every pair $i, i'$;
 Set $\hat\mu \leftarrow$ Estimate-Mu$(S, u)$;
 for each $i \in N$ do
  Set $\pi_i \leftarrow \frac{1}{n-1}\sum_{i'\neq i} d_{ii'}$;
  Set $\hat f_i \leftarrow \frac{\pi_i - \hat\mu}{1 - (1+\theta)\hat\mu}$;
 end
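A NumPy sketch of Algorithm 5, where answers are integers in {0, ..., k-1} and the helper names are ours:

```python
import numpy as np

def plurality(S, k):
    """Unweighted plurality answer per question."""
    counts = np.stack([(S == a).sum(axis=0) for a in range(k)])
    return counts.argmax(axis=0)

def p_efl_categorical(S, k, u=0.0):
    """Algorithm 5: invert the linear relation of Theorem 7."""
    n = S.shape[0]
    theta = 1.0 / (k - 1)
    D = (S[:, None, :] != S[None, :, :]).mean(axis=2)  # pairwise Hamming distances
    pi = D.sum(axis=1) / (n - 1)                       # proxy distances
    f0 = (S != plurality(S, k)).mean(axis=1)           # D-EFL estimates
    mu_hat = u * f0.sum()                              # Estimate-Mu with parameter u
    return (pi - mu_hat) / (1.0 - (1.0 + theta) * mu_hat)
```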
Fig. 3. The figures present the same information as in Fig. 1, for generated populations of $n = 40$ workers and $m = 50$ yes/no questions. Each figure shows one instance from Normal distributions with different $\mu$ and $\vartheta$. The bottom right figure shows the correlation between $\hat f$ and $f$ (blue line), and between $\hat f^0$ and $f$ (dashed orange line), as the number of workers $n$ grows (every data point is an average over 1000 instances).
Theorem 7 (Anna Karenina principle for the IER model). Suppose that instance $I = \langle S, z\rangle$ is sampled from proto-population $\mathcal{F}$ via the IER model. For every worker $i$,
$$\mathbb{E}[\pi_i(I) \mid f_i] = \mu + (1 - (1+\theta)\mu)f_i. \qquad (4)$$
The proof is somewhat more nuanced than in the continuous case. We first show that for every pair of workers, $\Pr[s_{ij} = s_{i'j} \mid f_i, f_{i'}] = 1 - f_i - f_{i'} + (1+\theta)f_i f_{i'}$. Then, for every population,
$$\mathbb{E}[\pi_i \mid f] = f_i + (1 - (1+\theta)f_i)\frac{1}{n-1}\sum_{i'\neq i} f_{i'}, \qquad (5)$$
and then we take expectation again over populations to prove the claim.
We get that there is a positive relation between $f_i$ and $\pi_i$ exactly when $\mu(\mathcal{F}) < \frac{1}{1+\theta} = 1 - \frac{1}{k}$, i.e., when the average fault levels are below those of completely random answers.
To estimate $f_i$ from the data, the P-EFL algorithm (Alg. 5) reverses the linear relation.
Setting parameter values. As in the continuous domain, setting $u = \frac{1}{n}$ means that $\hat\mu$ is the average of the $\hat f_i^0$. Also as in the continuous domain, we can use $u = 0$ as a conservative default value, in which case $\hat f_i = \pi_i(I)$.
In contrast to the continuous case, it is obvious that the estimates we get from the P-EFL and the D-EFL algorithms are not at all equivalent. To see why, note that a small change in the report of a single worker may completely change the Plurality outcome (and thus the fault estimates of all workers in the D-EFL algorithm), but only has a gradual effect on P-EFL.
We do not know whether D-EFL is consistent, but we can show that P-EFL is.
Theorem 8 (Consistency of P-EFL). Suppose the support of $\mathcal{F}$ is a closed subset of $(0, 1]$ and $\mu_\mathcal{F} < \frac{k-1}{k}$. Then $\frac{1}{n}$-P-EFL is consistent as $n \to \infty$ and $m = \omega(\log n)$. That is, for any $\delta > 0$, $\Pr[|\hat f_i - f_i| > \delta] \to 0$ for all $i \in N$, as $n \to \infty$ and $m = \omega(\log n)$.
Evaluation. How good is the estimation of P-EFL for a given population? Rephrasing Theorem 7, we can write $\mathbb{E}[\pi_i \mid f_i] = f_i + (1 - (1+\theta)f_i)\mu$.
That is, the proxy distance is a sum of two components. The first is the actual fault level $f_i$ (the “signal”). The second one decreases the signal proportionally to $\mu$. Thus the lower $\mu$ is, the better estimation we get on average. We can see this effect in Fig. 3: the top left figure presents an instance with lower $\mu$ than the figure to its right, so the dependency of $\pi_i$ on $f_i$ is stronger and we get a better estimation. The top right figure has the same $\mu$ as the middle one, but higher $\vartheta$, thus fault levels are more spread out and easier to estimate. The two figures on the bottom left demonstrate that a good fit is not necessarily due to the IER model, as the estimated faults for the real dataset in the middle are much more accurate.
The bottom right figure shows that P-EFL is somewhat more accurate than D-EFL on average.
4.2 Aggregation
The vector $x^*$ that minimizes the expected distance to the ground truth $z$ is also the MLE under an equal prior, since we simply try to find the most likely answer for each question. When fault levels are known, this was studied extensively in the binary setting [16, 29, 35]. Specifically, Grofman et al. [16] identified the optimal rule for binary aggregation. Ben-Yashar and Paroush [2] extended these results to questions with multiple answers.⁵
For a worker with fault level $f_i$, we denote $w_i^* := \log\frac{(1-f_i)(k-1)}{f_i}$. We refer to $w^*$ as the Grofman weights of population $f$.
Proposition 9 ([2, 16]). Suppose that $\langle z, S\rangle$ is a random instance from the IER model. Let $x_j^* := \mathrm{argmax}_{x_j\in A}\sum_{i\in N} w_i^* \llbracket s_{ij} = x_j\rrbracket$. Then $x^*$ is the maximum likelihood estimator of $z$ (and thus also minimizes $d(x, z)$ in expectation).
That is, $x^*$ is the result of a weighted plurality rule, where the optimal weight of $i$ is her Grofman weight $w_i^*$ (note that it depends only on $f_i$). Note that workers whose fault level is above random error ($f_i > 1 - \frac{1}{k}$) get a negative weight. Of course, since we have no access to the true fault level, we cannot use Grofman weights directly.
Prop. 9 suggests a simple aggregation skeleton, which is the same as Alg. 4, except that it uses $r_P$ instead of $r_M$, and sets weights to $w_i \leftarrow \log\frac{(1-\hat f_i)(k-1)}{\hat f_i}$.⁶ D-TD and $u$-P-TD are the combinations of this categorical skeleton with D-EFL and with $u$-P-EFL, respectively.
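A sketch of the categorical skeleton with estimated Grofman weights, reusing the p_efl_categorical sketch above (the clipping that keeps the logarithm finite is our own guard):

```python
import numpy as np

def grofman_weights(f_hat, k):
    """w_i = log((1 - f_i)(k - 1) / f_i), computed from estimated fault levels."""
    f = np.clip(f_hat, 1e-6, 1.0 - 1e-6)    # guard: keep the log finite
    return np.log((1.0 - f) * (k - 1) / f)

def weighted_plurality(S, w, k):
    """Per-question weighted plurality: argmax_x sum_i w_i [s_ij = x]."""
    scores = np.stack([((S == a) * w[:, None]).sum(axis=0) for a in range(k)])
    return scores.argmax(axis=0)

def p_td_categorical(S, k, u=0.0):
    """u-P-TD for the categorical domain."""
    return weighted_plurality(S, grofman_weights(p_efl_categorical(S, k, u), k), k)
```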
As in the continuous case, the algorithm is well defined for any categorical dataset, but in the special case of the IER noise model we get that the workers' weights are a reasonable estimate of the Grofman weights, due to Theorems 7 and 8. Lastly for this section, we show that P-TD is consistent.
Theorem 10 (Consistency of P-TD). Suppose the support of $\mathcal{F}$ is a closed subset of $(0, 1)$ and $\mu_\mathcal{F} < \frac{k-1}{k}$. Then $\frac{1}{n}$-P-TD is consistent as $n \to \infty$ and $m = \omega(\log n)$. That is, for any $\tau > 0$, $\Pr[d(\hat z, z) > \tau] \to 0$ as $n \to \infty$ and $m = \omega(\log n)$.
4.3 Empirical results
We compared the performance of P-TD to the competing methods UA (which returns $r_P(S)$) and D-TD on synthetic and real data. In all simulations we used the default parameter $u = 0$ (i.e., 0-P-TD). Averages are over 1000 samples for each $n$ and $m$. The oracle benchmark OA returns $r_P(S, w^*)$. Note that OA and UA are not affected by the number of questions.

⁵Ben-Yashar and Paroush [2] also considered other extensions, including unequal priors, distinct utilities for the decision maker, and general confusion matrices instead of equal-probability errors. In all these cases the optimal decision rule is not necessarily a weighted plurality rule, and generally requires comparing all pairs of answers.
⁶Also appears as Alg. 8 in the appendix for completeness.

Fig. 4. Performance of UA, D-TD, and P-TD. Results are shown for 3 synthetic distributions and one real dataset. IP-TD is discussed later in Sec. 4.4.
Fig. 5. Each heatmap in the top row compares P-TD to UA (as in Fig. 2), and each heatmap in the middle row compares P-TD to D-TD, varying $n$ and $m$ (1000 samples for each). The bottom row is discussed in Sec. 4.4.
For synthetic data we generated instances from the IER model: one distribution with Yes/No questions ($k = 2$), where $\mathcal{F} = \mathcal{N}(0.45, 0.1)$; and another with multiple-choice questions ($k = 4$), where $\mathcal{F} = \mathcal{N}(0.7, 0.1)$ (additional distributions in Appendix D). In addition, we used three datasets from [34] (Flags, GoldenGate, Dogs) and one that we collected (DotsBinary). Their description is in Appendix D.1.
Fig. 4 shows that both P-TD and D-TD have a strong advantage over unweighted aggregation. Fig. 5 directly compares P-TD to UA (top row) and to D-TD (middle row). We can see that P-TD dominates, except in some regions.
4.4 Iterative methods
There are more advanced methods for truth discovery that are based on the following reasoning: a good estimate of workers' fault levels leads to aggregated answers that are close to the ground truth (by appropriately weighing the workers); and a good approximation of the ground truth can get us a good estimate of workers' competence (by measuring their distance from the approximate answers). Thus we can iteratively improve both estimates (see, e.g., [26], Section 2.2.2). The Iterative D-TD algorithm (ID-TD, see Alg. 6) captures this reasoning. Note that the D-TD algorithm is a special case of ID-TD with a single iteration.

ALGORITHM 6: Iterative Distance-from-Outcome Truth Discovery (ID-TD)
Input: number of iterations $T$; dataset $S$
Output: Fault levels $\hat f = (\hat f_i)_{i\in N}$, answers $\hat z$
 Initialize $w^0 \leftarrow \frac{1}{n}$;
 for $t = 0, 1, 2, \ldots, T-1$ do
  $y^t \leftarrow r_P(S, w^t)$;
  $\forall i \in N$, set $\hat f_i^t \leftarrow d_H(s_i, y^t)$;
  $\forall i \in N$, set $w_i^{t+1} \leftarrow \log\frac{(1-\hat f_i^t)(k-1)}{\hat f_i^t}$;
 end
 Set $\hat f \leftarrow \hat f^T$;
 Set $\hat z \leftarrow r_P(S, w^T)$;

ALGORITHM 7: Iterative-Proxy-based-Estimate-Fault-Levels (IP-EFL)
Input: number of iterations $T$; dataset $S$
Output: Fault levels $\hat f = (\hat f_i)_{i\in N}$
 Initialize $w^0 \leftarrow \frac{1}{n}$;
 Compute $d_{ii'} \leftarrow d(s_i, s_{i'})$ for every pair of workers;
 for $t = 0, 1, 2, \ldots, T-1$ do
  for every worker $i \in N$ do
   Set $\hat f_i^t = \pi_i^t \leftarrow \frac{\sum_{i'\neq i} w_{i'}^t d_{ii'}}{\sum_{i''\neq i} w_{i''}^t}$;
   Set $w_i^{t+1} \leftarrow \log\frac{(1-\hat f_i^t)(k-1)}{\hat f_i^t}$;
  end
 end
 Set $\hat f \leftarrow \hat f^T$;
Iterative P-EFL. We can adopt a similar iterative approach for estimating fault levels using proxy voting. Intuitively, in each iteration we compute the proxy distance of a worker according to her weighted average distance from all workers, and then recalculate the estimated fault levels and weights. Note that as in the single-step P-EFL, this estimation does not require any aggregation step. We avoid estimating $\mu$ and instead use the default value of $\hat\mu = 0$, which means that $\hat f_i$ equals the proxy distance. The complete pseudocode is in Alg. 7.
Intuitive analysis of IP-EFL. For ease of presentation, we assume $k = 2$ for the analysis in the remainder of this section. Recall again that for the unweighted proxy distance we have, by Thm. 7:
$$\mathbb{E}[\pi_i \mid f_i] = f_i + (1 - 2f_i)\mu. \qquad (6)$$
Ideally, we would like to make the second part smaller to strengthen the signal. We argue that the weighted proxy distance obtains just that. We do not provide a formal proof, but rather approximate calculations that should provide some intuition. Exact calculations are complicated due to correlations among terms.
We assume that fault levels are already determined, and take expectation only over the realization of workers' answers.
Lemma 11. In step $t$ of the iterative algorithm,
$$\mathbb{E}[\pi_i^t \mid f] = f_i + (1 - 2f_i)\frac{\sum_{i'\neq i} w_{i'}^t f_{i'}}{\sum_{i''\neq i} w_{i''}^t}. \qquad (7)$$
The proof simply repeats the steps of the proof of Theorem 7.
Recall that $w_i^* := \log\frac{1-f_i}{f_i}$ is the actual optimal weight (Grofman weight) of worker $i$. We also denote $w_i^{**} := \frac{1}{2} - f_i$. For values of $f_i$ not too far from 0.5, $w_i^{**}$ is an approximation of $w_i^*$ [16]. Thus
$$\frac{\sum_{i'} w_{i'}^{**} f_{i'}}{\sum_{i''} w_{i''}^{**}} \cong \frac{\sum_{i'} w_{i'}^* f_{i'}}{\sum_{i''} w_{i''}^*}.$$
Recall that $\vartheta$ is the variance of $\mathcal{F}$. Consider the “noisy” part that multiplies $(1 - 2f_i)$ above. In expectation, the numerator satisfies:
$$\mathbb{E}\Big[\sum_{i'\neq i} w_{i'}^{**} f_{i'}\Big] = \sum_{i'\neq i}\mathbb{E}\Big[\Big(\frac{1}{2} - f_{i'}\Big)f_{i'}\Big] = \frac{1}{2}\sum_{i'\neq i}\mathbb{E}[f_{i'}] - \sum_{i'\neq i}\mathbb{E}[(f_{i'})^2] = \Big(\frac{1}{2}\mu - (\vartheta + \mu^2)\Big)(n-1).$$
Similarly, in expectation, the denominator satisfies $\mathbb{E}[\sum_{i'\neq i} w_{i'}^{**}] = (\frac{1}{2} - \mu)(n-1)$.
If we neglect both the correlation between numerator and denominator, and the fact that $w_i^t$ is only an approximation of $w_i^*$, we get that:
$$\mathbb{E}\Big[\frac{\sum_{i'\neq i} w_{i'}^t f_{i'}}{\sum_{i''} w_{i''}^t}\Big] \cong \mathbb{E}\Big[\frac{\sum_{i'\neq i} w_{i'}^* f_{i'}}{\sum_{i''} w_{i''}^*}\Big] \cong \mathbb{E}\Big[\frac{\sum_{i'\neq i} w_{i'}^{**} f_{i'}}{\sum_{i''} w_{i''}^{**}}\Big] \cong \frac{\mathbb{E}[\sum_{i'\neq i} w_{i'}^{**} f_{i'}]}{\mathbb{E}[\sum_{i''} w_{i''}^{**}]} = \frac{\frac{1}{2}\mu - (\vartheta + \mu^2)}{\frac{1}{2} - \mu} = \mu - \frac{\vartheta}{\frac{1}{2} - \mu}.$$
We conclude from Lemma 11 and the above discussion that after enough iterations,
$$\mathbb{E}[\pi_i^t \mid f_i] = \mathbb{E}\big[\mathbb{E}[\pi_i^t \mid f] \mid f_i\big] \cong f_i + (1 - 2f_i)\Big(\mu - \frac{\vartheta}{\frac{1}{2} - \mu}\Big).$$
Since $\mu < \frac{1}{2}$, this noise term is not larger than the noise in the unweighted P-EFL algorithm ($\mu$ in Eq. (6)). We therefore expect that if $w^t$ is already a reasonable estimate of $w^*$, then accuracy will grow with further iterations.
Empirical results for iterative algorithms. The Iterative Proxy-based Truth Discovery (IP-TD) algorithm combines our aggregation skeleton with IP-EFL. We can see how adding iterations affects the performance of P-TD on synthetic data in Fig. 4. We further compare the performance of IP-TD to ID-TD on more distributions and datasets in the third row of Fig. 5 (and in Appendix D). For both algorithms we used $T = 8$ iterations. A higher number of iterations had little effect on the results. Note that as we use $u = 0$, our IP-TD algorithm never explicitly estimates $\mu$ or $\vartheta$, yet it manages to take advantage of the variance among workers. We do see instances, however, where the initial estimation is off, and any additional iteration makes it worse.
5 RANKING DOMAIN
Consider a set of alternatives $C$, where $\mathcal{L} = \mathcal{L}(C)$ is the set of all rankings (permutations) over $C$. Each pairwise relation over $C$ corresponds to a binary vector $x \in \{-1, 1\}^m$ where $m = \binom{|C|}{2}$ (i.e., each dimension is a pair of candidates). In particular, every ranking $L \in \mathcal{L}$ has a corresponding vector $x^L \in \{-1, 1\}^m$. A vector $x \in \{-1, 1\}^m$ is called transitive if it corresponds to some ranking $L^x$ s.t. $x^{L^x} = x$. The ground truth is a transitive vector $z$ (equivalently, a ranking $L^z \in \mathcal{L}$). A natural metric over rankings is the Kendall-tau distance (a.k.a. swap distance): $d_{KT}(L, L') := d_H(x^L, x^{L'})$.
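The encoding and the distance are easy to state in code; a small sketch with helper names of ours, enumerating pairs over the sorted candidate set:

```python
from itertools import combinations

def ranking_to_vector(L):
    """Encode ranking L (best first) as a +/-1 vector: +1 on pair (a, b)
    iff a precedes b in L, over all pairs of the sorted candidate set."""
    pos = {c: t for t, c in enumerate(L)}
    return [1 if pos[a] < pos[b] else -1 for a, b in combinations(sorted(L), 2)]

def kendall_tau(L1, L2):
    """d_KT as the normalized Hamming distance between pairwise vectors."""
    x1, x2 = ranking_to_vector(L1), ranking_to_vector(L2)
    return sum(a != b for a, b in zip(x1, x2)) / len(x1)

# One adjacent swap on 4 candidates flips 1 of the m = 6 pairs:
print(kendall_tau(('a', 'b', 'c', 'd'), ('a', 'b', 'd', 'c')))  # 0.1666...
```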
Independent Condorcet Noise Model. According to the Independent Condorcet noise (ICN) model, an agent with fault level $f_i \in [0, 1]$ observes a vector $s_i \in \{-1, 1\}^m$ where for every pair of candidates $j = (a, b)$, we have $s_{ij} \neq z_j$ with probability $f_i$.⁷ In particular, $s_i$ may not be transitive.
Mallows Models. The downside of the Condorcet noise model is that it may result in nontransitive answers. Mallows model is similar, except it is guaranteed to produce transitive answers (rankings). Formally, given ground truth $L^z \in \mathcal{L}$ and parameter $\phi_i > 0$, the probability of observing order $L_i \in \mathcal{L}$ is proportional to $\phi_i^{d(L^z, L_i)}$. Thus for $\phi_i = 1$ we get a uniform distribution, whereas for low $\phi_i$ we get orders concentrated around $L^z$.
In fact, if we throw away all non-transitive samples, the probability of getting ranking $L_i$ under the Condorcet noise model with parameter $f_i$ (conditional on the outcome being transitive) is exactly the same as the probability of getting $L_i$ under Mallows model with parameter $\phi_i = \frac{f_i}{1-f_i}$.
5.1 Estimating Fault Levels
By definition, the ICN model is a special case of the IER model, where $k = 2$, $m = \binom{|C|}{2}$, and the ground truth is a transitive vector. We thus get the following result as an immediate corollary of Theorem 7, and can therefore use P-EFL directly.
⁷Classically, the Condorcet model assumes all voters have the same parameter [43].
Theorem 12 (Anna Karenina principle for the Condorcet model). Suppose that instance $I = \langle S, z\rangle$ is sampled from population $(f_i)_{i\in N}$ via the ICN model, where all $f_i$ are sampled independently from proto-population $\mathcal{F}$ with expected value $\mu$. For every worker $i$, $\mathbb{E}[\pi_i(I) \mid f_i] = \mu + (1 - 2\mu)f_i$.
5.2 Aggregation
Note that while our results on fault estimation from the binary domain directly apply (at least to the ICN model), aggregation is more tricky: an issue-by-issue aggregation may result in a non-transitive (thus invalid) solution. The voting rules we consider are guaranteed to output a valid ranking.
The problem of retrieving $L^z$ given $n$ votes $L_1, \ldots, L_n$ is a classical problem, and in fact any social welfare function $r : \mathcal{L}^n \to \mathcal{L}$ offers a possible solution [4].
There is a line of work that deals with finding the MLE under various assumptions on the noise model [2, 13]. In general, these estimators may take a complicated form that depends on all parameters of the distribution. Yet some cases are simpler.
The Kemeny rule and optimal aggregation. It is well known that for both the Condorcet noise model and Mallows model, when all voters have the same fault level, the maximum likelihood estimator of $L^z$ is obtained by applying the Kemeny-Young voting rule on $S$ [43].
The Kemeny-Young rule $r_{KY}$ (henceforth, Kemeny) computes the binary vector $y^0$ that corresponds to the majority applied separately on every pair of candidates (that is, $y^0 := \mathrm{sign}\sum_{i\le n} s_i$); then $r_{KY}(S) := \mathrm{argmin}_{L\in\mathcal{L}}\, d_H(x^L, y^0)$. In particular it can be applied when $S$ is composed of transitive vectors.
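For the small candidate sets used in our experiments, Kemeny can be computed by brute force. A sketch of ours (weighted variant included, reusing ranking_to_vector from above; ties in the sign are left unresolved):

```python
import numpy as np
from itertools import permutations

def kemeny_young(S_vectors, candidates, weights=None):
    """(Weighted) Kemeny-Young: sign of the (weighted) sum of pairwise
    vectors, then the closest ranking. Exponential in |C|."""
    S = np.asarray(S_vectors, dtype=float)
    w = np.ones(len(S)) if weights is None else np.asarray(weights, dtype=float)
    y0 = np.sign(w @ S)                     # (weighted) majority per pair
    return min(permutations(candidates),
               key=lambda L: np.sum(np.asarray(ranking_to_vector(L)) != y0))
```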
A natural question is whether there is a weighted version of KY that is an MLE and/or minimizes the expected distance to $L^z$ when fault levels $f_i$ are known. We did not find any explicit reference to this question, or to the case of distinct fault levels in general. There are some other extensions: [13] deals with a different variation of the noise model where it is less likely to swap pairs that are further apart; [42] extends the Kemeny rule to deal with more general noise models and partial orders.
As it turns out, using weighted KY with (binary) Grofman weights $w_i^* = \log\frac{1-f_i}{f_i}$ provides us with (at least) an approximately optimal outcome.
Proposition 13. Suppose that the ground truth $L^z$ is sampled from a uniform prior on $\mathcal{L}$. Suppose that instance $I = \langle S, z\rangle$ is sampled from population $(f_i)_{i\in N}$ via the ICN model. Let $L^* := r_{KY}(S, w^*(f))$. Let $L' \in \mathcal{L}$ be any random variable that may depend on $S$. Then $\mathbb{E}[d_{KT}(L^*, L^z)] \le 2\,\mathbb{E}[d_{KT}(L', L^z)]$, where expectation is over all instances.
Proof. Consider the ground truth binary vector $z$. Let $y := \mathrm{sign}\sum_{i\le n} w_i^* s_i$. Let $x' \in \{-1, 1\}^m$ be an arbitrary random variable that may depend on the input profile. For every $j \le m$ (i.e., every pair of elements), we know from Prop. 9 that $y_j$ is the MLE for $z_j$, and thus $\Pr[x'_j = z_j] \le \Pr[y_j = z_j]$.
Now, recall that by definition of the KY rule, $L^*$ is the closest ranking to $L^y$. Also denote $x^* = x^{L^*}$.
$$\mathbb{E}[d_{KT}(L^*, L^z)] = \mathbb{E}[d_H(x^*, z)] \le \mathbb{E}[d_H(x^*, y) + d_H(y, z)]$$
$$\le \mathbb{E}[2 d_H(y, z)] \quad \text{(since $x^*$ is the closest transitive vector to $y$)}$$
$$= \frac{2}{m}\sum_{j\le m}\Pr[y_j \neq z_j] \le \frac{2}{m}\sum_{j\le m}\Pr[x'_j \neq z_j] = 2\,\mathbb{E}[d_H(x', z)].$$
In particular, this holds for any transitive vector $x' = x^{L'}$ corresponding to a ranking $L'$, thus
$$\mathbb{E}[d_{KT}(L^*, L^z)] \le 2\,\mathbb{E}[d_H(x^{L'}, z)] = 2\,\mathbb{E}[d_{KT}(L', L^z)],$$
as required. □
Fig. 6. Performance comparison of P-TD to D-TD, with all eight voting rules. The proto-population is a Mallows distribution with $\phi_i \sim \mathcal{N}(0.85, 0.15)$.
Prop. 13 provides some justification for using the Kemeny voting rule and Grofman weights for aggregating rankings when the $f_i$'s are known. We can now apply similar reasoning as in the binary case to estimate $f_i$. Given a set of rankings $S$ and any voting rule $r$ (not necessarily Kemeny!), the P-TD$^r$ algorithm is a combination of Alg. 8 with $k = 2$ and $r$ instead of $r_P$, and Alg. 5 with the Kendall-tau distance and $u = 0$. Since the meaning of negative weights is not clearly defined, we replace every negative weight with 0 (the full description appears as Alg. 9 in the appendix for completeness).
5.3 Empirical results
We compared the performance of P-TD$^r$ using 8 different voting rules: Borda, Copeland, Kemeny-Young (with weighted and unweighted majority graph), Plurality, Veto, Random dictator, and Best dictator.⁸
We generated instances from Mallows model, where $\mathcal{F} = \mathcal{N}(0.85, 0.15)$ (additional distributions in Appendix E.3). For every combination of $n$ and $m$ we sampled 1000 instances, each from a different population. We can see the results in Fig. 6 and 7 (left). The advantage of P-TD is particularly visible in the four latter voting rules, which are simpler. This is since the intermediate estimation $y^0 = r(S)$ is poor and misguides the D-TD algorithm.
We also used a dataset collected by Mao et al. [27], where groups of 20-30 subjects were asked to order 4 images according to the number of dots in them (DOTS dataset), or according to the number of steps required to solve the 8-puzzle board (PUZZLES dataset). We compared P-TD to D-TD on all of these groups, with each of the eight voting rules. Fig. 7 (right) shows a clear advantage to P-TD.
6 DISCUSSION
Proxy voting can be used as a general scheme to estimate workers’ competence (or fault level) in a
broad range of truth discovery and crowdsourcing scenarios, as long as there is a natural notion
of distance between answers. For three such domains (real-valued answers, categorical answers,
and rankings) we showed formally a linear version of the “Anna Karenina principle”, i.e. that the
average distance from a worker to other workers is linear in her fault level, under common noise
models.
It is interesting that under several rather different noise models, we get that the proxy distance itself ($\pi_i(I)$) is a reasonable estimate of the fault level $f_i$. This suggests that using the proxy distance may be a good fallback option when estimating workers' fault levels in domains where we do not have an explicit noise model and/or a basic aggregation function to work with. Results in the continuous and ranking domains show substantial improvement, whereas in the categorical domain the picture is more nuanced, especially when there are few questions. This may be due to workers that are worse than random (note that consistency is not guaranteed in this case).
⁸For formal definitions of the voting rules, see [4], or Appendix E.1.
Fig. 7. Left: Performance comparison on the synthetic data (Mallows with $\mathcal{N}(0.65, 0.15)$) and three voting rules. Right: each entry is an average over 40 groups of players from the experiment of [27]. There is no sampling involved. Top row is for $n = 31$ and bottom row for $n = 11$.
6.1 Related Work
Similarity of workers (or “peer consistency”) has been previously considered in the context of crowdsourcing. It has been most commonly used in “Games with a Purpose,” which incorporated an “output agreement” mechanism [36], where two players are given a task and progress in the game if their answers match. Inspired by this approach, Huang et al. [21] used such a mechanism to incentivize workers by telling workers that their answers would be compared to those of a random worker. They showed that this incentive scheme can enhance workers' performance. However, they did not use peer consistency for aggregation, nor did they provide any theoretical analysis.
The iterative algorithm ID-TD that we mentioned belongs to a large class of iterative methods that couple competence estimation and truth discovery (goals (a) and (b) from our introduction) [19, 22, 33, 40, 41]. These approaches typically assume an underlying model of worker competence and estimate the parameters of this model based on workers' responses to assess their reliability. They then utilize this information to aggregate labels (goal (b)) or to determine which workers to assign to new tasks. Many more algorithms for truth discovery have been suggested (see two recent surveys in [26, 38]), and a full comparison with all of them is outside the scope of this work.
In contrast to these iterative techniques, our competence estimation methods (including our iterative IP-EFL algorithm) can be completely independent of any aggregation rule (consider our comment above on using $\hat\mu = 0$). They can thus be applied in scenarios where answers are complicated and no obvious aggregation method is available (e.g., complex labeling of text, images, or video). In fact, we apply it in the ranking domain, where there are many aggregation rules but no consensus on which is the right one to use.
Another common approach to assessing workers' competence (i.e., goal (a)) in the crowdsourcing literature is incorporating “gold questions” for which the ground truth answer is known; these questions are then used to filter low-quality workers [14, 44]. Several works proposed approaches to make the use of gold questions more effective by adaptively choosing whether to ask a worker additional gold questions [3, 30].
We argue that proxy-based truth discovery is not intended to replace the above arsenal of tools for crowdsourcing, but to complement it. Indeed, it is very easy to weight workers according to their proxy distance,⁹ and combine this with any aggregation rule, gold questions, active selection, and/or the various incentive schemes that are being used in crowdsourcing platforms.
⁹Admittedly, calling P-TD an “algorithm” is stretching the term!
This is important, as the tools in use may be subject to various computational and incentive constraints, legacy code, interpretability requirements, and so on. We thus believe it is important to be able to boost the performance of a broad range of aggregation rules in diverse scenarios. Indeed, the only thing we need to apply the Anna Karenina principle is a distance measure. Curiously, the strongest empirical performance we observed, e.g., in the ranking domain, was for the simplest voting rules, which had no theoretical guarantees.
6.2 Conclusion and future work
Recall that our initial inspiration was proxy voting in social choice settings. Having now a deeper understanding of the reasons underlying the success of proxy voting in truth discovery tasks, we would like to apply these insights back to social choice and political settings.
Apart from that, we believe that proxy voting could be very easily integrated into almost any truth discovery or crowdsourcing system, and in many cases provide a significant boost in performance at a very low cost. A unifying “Anna Karenina” theorem that is not domain specific (perhaps using only abstract properties of the distance measure and the statistical model) would be of great help in this direction. Another important direction is to consider the economic incentives in crowdsourcing systems, and to see how proxy voting affects the equilibrium behavior of participants.
On the practical side, we are currently engaged in several collaborations that apply proxy voting to real crowdsourcing problems with complex input, and hope it will help to bring substantial benefit to the world.
REFERENCES
[1] AC Aitken. On least squares and linear combination of observations. Proceedings of the RSE, 55:42–48, 1935.
[2]
Ruth Ben-Yashar and Jacob Paroush. Optimal decision rules for fixed-size committees in polychotomous choice situations.
Social Choice and Welfare, 18(4):737–746, 2001.
[3]
Jonathan Bragg, Daniel S Weld, et al. Optimal testing for crowd workers. In AAMAS’16, pages 966–974. IFAAMAS, 2016.
[4]
Felix Brandt, Vincent Conitzer, Ulle Endriss, Jérôme Lang, and Ariel D Procaccia. Handbook of computational social
choice. Cambridge University Press, 2016.
[5] Markus Brill. Interactive democracy. In AAMAS’18, pages 1183–1187, 2018.
[6]
Ioannis Caragiannis, Ariel D Procaccia, and Nisarg Shah. When do noisy votes reveal the truth? In EC’13, pages 143–160.
ACM, 2013.
[7]
Randy L Carter, Robin Morris, and Roger K Blashfield. On the partitioning of squared Euclidean distance and its
applications in cluster analysis. Psychometrika, 54(1):9–23, 1989.
[8]
Sung-Hyuk Cha. Comprehensive survey on distance/similarity measures between probability density functions.
International Journal of Mathematical Models and Methods in Applied Sciences, 1(2):1, 2007.
[9]
Gal Cohensius, Shie Mannor, Reshef Meir, Eli Meirom, and Ariel Orda. Proxy voting for better outcomes. In AAMAS’16,
pages 858–866. IFAAMAS, 2017.
[10]
Marie J Condorcet et al. Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix,
volume 252. American Mathematical Soc., 1785.
[11]
Vincent Conitzer and Tuomas Sandholm. Common voting rules as maximum likelihood estimators. In Proceedings of
the Twenty-First Conference on Uncertainty in Artificial Intelligence, pages 145–152. AUAI Press, 2005.
[12]
Jia Deng, Olga Russakovsky, Jonathan Krause, Michael S Bernstein, Alex Berg, and Li Fei-Fei. Scalable multi-label
annotation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 3099–3102. ACM, 2014.
[13]
Mohamed Drissi-Bakhkhat and Michel Truchon. Maximum likelihood approach to vote aggregation with variable
probabilities. Social Choice and Welfare, 23(2):161–185, 2004.
[14]
Matthew R Gormley, Adam Gerber, Mary Harper, and Mark Dredze. Non-expert correction of automatically generated
relation annotations. In NAACL HLT Workshop, pages 204–207. ACL, 2010.
[15] James Green-Armytage. Direct voting and proxy voting. Constitutional Political Economy, 26(2):190–220, 2015.
[16] Bernard Grofman, Guillermo Owen, and Scott L Feld. Thirteen theorems in search of the truth. Theory and Decision,
15(3):261–278, 1983.
[17]
Ronald K Hambleton, Hariharan Swaminathan, and H Jane Rogers. Fundamentals of item response theory, volume 2.
Sage, 1991.
[18]
Yuval Hart, Moira R Dillon, Andrew Marantan, Anna L Cardenas, Elizabeth Spelke, and L Mahadevan. The statistical
shape of geometric reasoning. Scientific Reports, 8(1):12906, 2018.
[19]
Chien-Ju Ho, Shahin Jabbari, and Jennifer Wortman Vaughan. Adaptive task assignment for crowdsourced classification.
In ICML, pages 534–542, 2013.
John J Horton. The dot-guessing game: A fruit fly for human computation research. SSRN, 2010.
[21]
Shih-Wen Huang and Wai-Tat Fu. Enhancing reliability using peer consistency evaluation in human computation. In
CSCW, pages 639–648. ACM, 2013.
[22]
David R Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable crowdsourcing systems. In NIPS,
pages 1953–1961, 2011.
[23]
Gabriella Kazai, Jaap Kamps, Marijn Koolen, and Natasa Milic-Frayling. Crowdsourcing for book search evaluation:
impact of hit design on comparative system ranking. In SIGIR’11, pages 205–214. ACM, 2011.
[24]
Emmanuel Keuleers, Michaël Stevens, Paweł Mandera, and Marc Brysbaert. Word knowledge in the crowd: Measuring
vocabulary size and word prevalence in a massive online experiment. Quarterly Journal of Experimental Psychology,
68(8):1665–1692, 2015. PMID: 25715025.
[25]
Esvey Kosman and KJ Leonard. Similarity coefficients for molecular markers in studies of genetic relationships between
individuals for haploid, diploid, and polyploid species. Molecular ecology, 14(2):415–424, 2005.
[26]
Yaliang Li, Jing Gao, Chuishi Meng, Qi Li, Lu Su, Bo Zhao, Wei Fan, and Jiawei Han. A survey on truth discovery.
ACM SIGKDD Explorations Newsletter, 17(2):1–16, 2016.
[27]
Andrew Mao, Ariel D Procaccia, and Yiling Chen. Better human computation through principled voting. In AAAI.
Citeseer, 2013.
Elliot McLaughlin. Image overload: Help us sort it all out, NASA requests. CNN.com. Retrieved 18/9/2014.
[29] Shmuel Nitzan and Jacob Paroush. Collective decision making: an economic outlook. CUP Archive, 1985.
[30]
Chenxi Qiu, Anna Squicciarini, Dev Rishi Khare, Barbara Carminati, and James Caverlee. Crowdeval: A cost-efficient
strategy to evaluate crowdsourced worker’s reliability. In AAMAS’18, pages 1486–1494. IFAAMAS, 2018.
[31]
Flávio Ribeiro, Dinei Florencio, and Vítor Nascimento. Crowdsourcing subjective image quality evaluation. In Image
Processing (ICIP), 2011 18th IEEE International Conference on, pages 3097–3100. IEEE, 2011.
[32]
Marta Sabou, Kalina Bontcheva, Leon Derczynski, and Arno Scharl. Corpus annotation through crowdsourcing:
Towards best practice guidelines. In LREC, pages 859–866, 2014.
[33]
Maximilien Servajean, Alexis Joly, Dennis Shasha, Julien Champ, and Esther Pacitti. Crowdsourcing thousands of
specialized labels: a bayesian active training approach. IEEE Transactions on Multimedia, 19(6):1376–1391, 2017.
[34]
Nihar Bhadresh Shah and Denny Zhou. Double or nothing: Multiplicative incentive mechanisms for crowdsourcing.
In NIPS, pages 1–9, 2015.
[35]
Lloyd Shapley and Bernard Grofman. Optimizing group judgmental accuracy in the presence of interdependencies.
Public Choice, 43(3):329–343, 1984.
[36] Luis Von Ahn and Laura Dabbish. Designing games with a purpose. Communications of the ACM, 51(8):58–67, 2008.
[37]
Jeroen Vuurens, Arjen P de Vries, and Carsten Eickho. How much spam can you take? an analysis of crowdsourcing
results to increase accuracy. In ACM SIGIR Workshop on CIR’ 11, pages 21–26, 2011.
[38]
Dalia Attia Waguih and Laure Berti-Equille. Truth discovery algorithms: An experimental evaluation. arXiv preprint
arXiv:1409.6428, 2014.
[39]
Paul Wais, Shivaram Lingamneni, Duncan Cook, Jason Fennell, Benjamin Goldenberg, Daniel Lubarov, David Marin,
and Hari Simons. Towards building a high-quality workforce with mechanical turk. NIPS workshop, pages 1–5, 2010.
[40]
Peter Welinder and Pietro Perona. Online crowdsourcing: rating annotators and obtaining cost-eective labels. In
Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, pages 25–32.
IEEE, 2010.
[41]
Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R Movellan, and Paul L Ruvolo. Whose vote should count more:
Optimal integration of labels from labelers of unknown expertise. In NIPS, pages 2035–2043, 2009.
[42]
Lirong Xia and Vincent Conitzer. A maximum likelihood approach towards aggregating partial orders. In IJCAI,
volume 22, page 446, 2011.
[43] H Peyton Young. Condorcet’s theory of voting. American Political science review, 82(04):1231–1244, 1988.
[44]
Ce Zhang, Feng Niu, Christopher Ré, and Jude Shavlik. Big data versus the crowd: Looking for relationships in all the
right places. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume
1, pages 825–834. Association for Computational Linguistics, 2012.
A PROOFS FOR CONTINUOUS DOMAIN
Theorem 2. Consider the result of $\frac{1}{n-1}$-P-EFL. For any instance $I$, and any worker $i$, $\hat f_i = \frac{n}{n-1}\hat f^0_i$.
Proof. First note that
\[
\hat f^0_i = d_E\Big(s_i,\ \frac1n\sum_{i'\in N} s_{i'}\Big)
= \frac1m\sum_{j\le m}\Big(s_{ij}-\frac1n\sum_{i'\in N}s_{i'j}\Big)^2
= \frac1m\sum_{j\le m}\Big[s_{ij}^2 - s_{ij}\frac2n\sum_{i'\in N}s_{i'j} + \Big(\frac1n\sum_{i'\in N}s_{i'j}\Big)^2\Big].
\]
Now, the estimated fault level based on the proxy score is
\begin{align*}
\frac{n-1}{n}\hat f_i &= \frac{n-1}{n}\big(\pi_i(I)-\hat\mu\big)
= \frac1n\sum_{i'\in N} d_E(s_i,s_{i'}) - \frac1n\sum_{i'\in N}\hat f^0_{i'}\\
&= \frac1n\sum_{i'\in N}\frac1m\sum_{j\le m}(s_{ij}-s_{i'j})^2
- \frac1n\sum_{i'\in N}\frac1m\sum_{j\le m}\Big[s_{i'j}^2 - s_{i'j}\frac2n\sum_{i''\in N}s_{i''j} + \Big(\frac1n\sum_{i''\in N}s_{i''j}\Big)^2\Big]\\
&= \frac1m\sum_{j\le m}\Big[\frac1n\sum_{i'\in N}(s_{ij}-s_{i'j})^2 - \frac1n\sum_{i'\in N}\Big(s_{i'j}^2 - s_{i'j}\frac2n\sum_{i''\in N}s_{i''j} + \Big(\frac1n\sum_{i''\in N}s_{i''j}\Big)^2\Big)\Big]\\
&= \frac1m\sum_{j\le m}\Big[s_{ij}^2 + \frac1n\sum_{i'\in N}\Big(s_{i'j}^2 - 2s_{ij}s_{i'j} - s_{i'j}^2 + s_{i'j}\frac2n\sum_{i''\in N}s_{i''j} - \Big(\frac1n\sum_{i''\in N}s_{i''j}\Big)^2\Big)\Big]\\
&= \frac1m\sum_{j\le m}\Big[s_{ij}^2 - 2s_{ij}\frac1n\sum_{i'\in N}s_{i'j} + \frac1n\sum_{i'\in N}s_{i'j}\cdot\frac2n\sum_{i''\in N}s_{i''j} - \frac1n\sum_{i'\in N}\Big(\frac1n\sum_{i''\in N}s_{i''j}\Big)^2\Big]\\
&= \frac1m\sum_{j\le m}\Big[s_{ij}^2 - 2s_{ij}\frac1n\sum_{i'\in N}s_{i'j} + 2\Big(\frac1n\sum_{i''\in N}s_{i''j}\Big)^2 - \Big(\frac1n\sum_{i''\in N}s_{i''j}\Big)^2\Big]\\
&= \frac1m\sum_{j\le m}\Big[s_{ij}^2 - 2s_{ij}\frac1n\sum_{i'\in N}s_{i'j} + \Big(\frac1n\sum_{i''\in N}s_{i''j}\Big)^2\Big] = \hat f^0_i,
\end{align*}
as required. □
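The identity of Theorem 2 is easy to check numerically. Below is a minimal NumPy sketch (ours, not the authors' code); following the proof, it assumes that $\hat\mu=\frac{1}{n-1}\sum_{i'\in N}\hat f^0_{i'}$ and that $\hat f_i=\pi_i(I)-\hat\mu$, and verifies that $\hat f_i=\frac{n}{n-1}\hat f^0_i$ holds exactly on an arbitrary instance:

import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 50
S = rng.normal(size=(n, m))               # arbitrary answers matrix (workers x questions)

# D-EFL estimate: squared distance of each worker to the unweighted mean answer
y0 = S.mean(axis=0)                        # y0_j = (1/n) sum_i s_ij
f0 = ((S - y0) ** 2).mean(axis=1)          # hat f0_i = d_E(s_i, y0)

# P-EFL estimate: proxy score minus estimated mean fault
D = ((S[:, None, :] - S[None, :, :]) ** 2).mean(axis=2)  # d_E(s_i, s_i') for all pairs
pi = D.sum(axis=1) / (n - 1)               # diagonal is 0, so this averages over i' != i
mu_hat = f0.sum() / (n - 1)                # assumed form of mu-hat, per the proof
f_hat = pi - mu_hat

assert np.allclose(f_hat, n / (n - 1) * f0)  # Theorem 2, exact up to float error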
Theorem 3. When $F$ has bounded support and $z$ is bounded, D-EFL is consistent as $n\to\infty$, $m=\omega(\log n)$ and $n=\omega(\log m)$. That is, $|\hat f^0_i-f_i|\to 0$ for all $i\in N$, as $n\to\infty$, $m=\omega(\log n)$ and $n=\omega(\log m)$.
Proof. We prove in steps.

First, for each $j\le m$, $y^0_j=\frac1n\sum_{i'\in N}s_{i'j}$ is close to $z_j$ with high probability. It can be seen as the average of $n$ i.i.d. random variables, where the mean of each random variable is $z_j$ and the variance of each random variable is $\mu(F)$. Therefore, by Hoeffding's inequality for sub-Gaussian random variables, we have that for any $T_1>0$:
\[
\Pr\big(|y^0_j-z_j|>T_1\big) < \exp\{-C_1T_1^2 n\},
\]
for a constant $C_1>1$.

Second, for each $j\le m$, given $|y^0_j-z_j|\le T_1$, the quantity $\sum_{j=1}^m(s_{ij}-y^0_j)^2/m$ is close to $f_i$ with high probability:
\begin{align*}
\frac{\sum_{j=1}^m(s_{ij}-y^0_j)^2}{m} &= \frac{\sum_{j=1}^m s_{ij}^2}{m} + \frac{\sum_{j=1}^m (y^0_j)^2}{m} - \frac{2\sum_{j=1}^m y^0_j s_{ij}}{m}\\
&\le \frac{\sum_{j=1}^m s_{ij}^2}{m} + \frac{\sum_{j=1}^m (z_j+T_1)^2}{m} - \frac{2\sum_{j=1}^m z_j s_{ij}}{m} + 2T_1\frac{\sum_{j=1}^m s_{ij}}{m}\\
&= \frac{\sum_{j=1}^m (s_{ij}-z_j)^2}{m} + T_1^2 + 2T_1\frac{\sum_{j=1}^m (s_{ij}+z_j)}{m}.
\end{align*}
Because for each $j\le m$, $s_{ij}-z_j$ is a sample of the Gaussian distribution with variance $f_i$, for any $T_2>0$ we have:
\[
\Pr\Big(\Big|\sum_{j=1}^m (s_{ij}-z_j)/m\Big|>T_2\Big) < \exp\{-C_2T_2^2 m\},
\]
for a constant $C_2>0$, where $C_2$ is related to the minimum variance in $F$.

Because for each $j\le m$, $E[(s_{ij}-z_j)^2]=f_i$ and $\mathrm{Var}[(s_{ij}-z_j)^2]=2f_i^2$, for any $T_3>0$ we have:
\[
\Pr\Big(\Big|\sum_{j=1}^m (s_{ij}-z_j)^2/m - f_i\Big|>T_3\Big) < \exp\{-C_3T_3^2 m\},
\]
for a constant $C_3>0$.

By the union bound, with probability at least $1-\big(m\exp\{-C_1T_1^2n\}+n\exp\{-C_2T_2^2m\}+n\exp\{-C_3T_3^2m\}\big)$, we have:
• For every $j\le m$, $|y^0_j-z_j|\le T_1$.
• For every $i\le n$, $|\sum_{j=1}^m(s_{ij}-z_j)/m|\le T_2$.
• For every $i\le n$, $|\sum_{j=1}^m(s_{ij}-z_j)^2/m-f_i|\le T_3$.

This means that
\begin{align*}
\hat f_i &= \sum_{j=1}^m(s_{ij}-y^0_j)^2/m \le \sum_{j=1}^m(s_{ij}-z_j)^2/m + T_1^2 + 2T_1\sum_{j=1}^m(s_{ij}+z_j)/m\\
&\le f_i + T_3 + T_1^2 + 2T_1\sum_{j=1}^m(s_{ij}-z_j)/m + 4T_1\sum_{j=1}^m z_j/m\\
&\le f_i + T_3 + T_1^2 + 2T_1T_2 + 4T_1\Big(\sum_{j=1}^m z_j/m\Big).
\end{align*}
Similarly, we can prove that with probability at least $1-2\big(m\exp\{-C_1T_1^2n\}+n\exp\{-C_2T_2^2m\}+n\exp\{-C_3T_3^2m\}\big)$, for all $i\le n$ we have:
\[
|\hat f_i - f_i| \le T_3 + T_1^2 + 2T_1T_2 + 4T_1\Big(\sum_{j=1}^m z_j/m\Big).
\]
Consistency follows after setting $T_1=\omega(1)\sqrt{\log m/n}$ and $T_2=T_3=\omega(1)\sqrt{\log n/m}$. □
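As a sanity check of Theorem 3, the following sketch (ours, not the paper's code; the sampling choices are assumptions) draws bounded fault levels, samples $s_{ij}\sim\mathcal N(z_j,f_i)$, and reports the worst-case D-EFL estimation error, which shrinks as $n$ and $m$ grow together:

import numpy as np

rng = np.random.default_rng(1)

def defl_max_error(n, m):
    """Max |hat f0_i - f_i| of D-EFL on one synthetic instance."""
    f = rng.uniform(0.2, 1.0, size=n)                       # bounded fault levels
    z = rng.uniform(-1.0, 1.0, size=m)                      # bounded ground truth
    S = z + rng.normal(size=(n, m)) * np.sqrt(f)[:, None]   # s_ij ~ N(z_j, f_i)
    f0_hat = ((S - S.mean(axis=0)) ** 2).mean(axis=1)       # D-EFL estimate
    return np.abs(f0_hat - f).max()

for n, m in [(10, 10), (100, 100), (1000, 1000)]:
    print(f"n={n:5d} m={m:5d} max error={defl_max_error(n, m):.3f}")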
Lemma 14. Suppose that $\hat f_i\in(1\pm\delta)f_i$ for some $\delta\in[0,\frac12]$. Then $w_i$ is in the range $[w^*_i(1-\delta),\ w^*_i(1+2\delta)]$.
Proof.
\[
w_i=\frac{1}{\hat f_i}\ \ge\ \frac{1}{f_i(1+\delta)}=\frac{1-\delta}{f_i(1+\delta)(1-\delta)}=\frac{1-\delta}{f_i(1-\delta^2)}\ \ge\ \frac{1-\delta}{f_i}=(1-\delta)w^*_i,
\]
and
\[
w_i=\frac{1}{\hat f_i}\ \le\ \frac{1}{f_i(1-\delta)}=\frac{1+2\delta}{f_i(1-\delta)(1+2\delta)}=\frac{1+2\delta}{f_i(1+\delta-2\delta^2)}\ \le\ \frac{1+2\delta}{f_i}=(1+2\delta)w^*_i.
\]
□
Theorem 5. For any instance such that $\hat f_i\in(1\pm\delta)f_i$ for all $i\in N$ and some $\delta\le 0.25$, it holds that $d(\hat z,z)\le d(x^*,z)+O\big(\delta\cdot\max_{i,j}(s_{ij})^2\big)$.
Proof. Note first that by Lemma 14, each $w_i$ is in the range $[w^*_i(1-\delta),\ w^*_i(1+2\delta)]$. W.l.o.g., $\sum_{i\in N}w^*_i=1$. Thus
\[
(1-\delta)\ \le\ \sum_{i'}\hat w_{i'}\ \le\ (1+2\delta),
\]
and $\frac{1}{\sum_{i'}\hat w_{i'}}\in(1-4\delta,\ 1+4\delta)$ as well.

Denote $s=\max_{ij}s_{ij}$. For each $j\le m$,
\begin{align*}
(\hat z_j-z_j)^2 &= \Big(\frac{1}{\sum_{i'}\hat w_{i'}}\sum_{i\in N}\hat w_i s_{ij}-z_j\Big)^2\\
&= \Big(\frac{1}{\sum_{i'}\hat w_{i'}}\sum_{i\in N}w^*_i(1+\tau_i)s_{ij}-z_j\Big)^2 &&(\text{for some }\tau_i\in[-2\delta,2\delta])\\
&= \Big(\sum_{i\in N}w^*_i(1+\tau'_i)s_{ij}-z_j\Big)^2 &&(\text{for some }\tau'_i\in[-8\delta,8\delta])\\
&= \sum_{i\in N}\sum_{i'\in N}w^*_i(1+\tau'_i)s_{ij}\,w^*_{i'}(1+\tau'_{i'})s_{i'j}+z_j^2-2z_j\sum_{i\in N}w^*_i(1+\tau'_i)s_{ij}\\
&= \sum_{i\in N}\sum_{i'\in N}w^*_is_{ij}w^*_{i'}s_{i'j}\big(1+\tau'_i+\tau'_{i'}+\tau'_i\tau'_{i'}\big)+z_j^2-2z_j\sum_{i\in N}w^*_i(1+\tau'_i)s_{ij}\\
&= \sum_{i\in N}\sum_{i'\in N}w^*_is_{ij}w^*_{i'}s_{i'j}\big(1+\tau''_{ii'}\big)+z_j^2-2z_j\sum_{i\in N}w^*_i(1+\tau'_i)s_{ij},
\end{align*}
where $|\tau''_{ii'}|<25\delta$. Similarly,
\[
(x^*_j-z_j)^2=\sum_{i\in N}\sum_{i'\in N}w^*_is_{ij}w^*_{i'}s_{i'j}+z_j^2-2z_j\sum_{i\in N}w^*_is_{ij}.
\]
Next,
\begin{align*}
d(\hat z,z)-d(x^*,z) &= \frac1m\sum_{j\le m}\big[(\hat z_j-z_j)^2-(x^*_j-z_j)^2\big]\\
&= \frac1m\sum_{j\le m}\Big[\sum_{i\in N}\sum_{i'\in N}w^*_is_{ij}w^*_{i'}s_{i'j}\tau''_{ii'}-2z_j\sum_{i\in N}w^*_is_{ij}\tau'_i\Big]\\
&\le 25\delta\,\frac1m\sum_{j\le m}\sum_{i\in N}\sum_{i'\in N}w^*_is_{ij}w^*_{i'}s_{i'j}+16\delta\sum_{i\in N}w^*_i(s)^2\\
&\le 25\delta\,\frac1m\sum_{j\le m}\sum_{i\in N}\sum_{i'\in N}w^*_iw^*_{i'}(s)^2+16\delta\sum_{i\in N}w^*_i(s)^2\\
&\le 50\delta(s)^2,
\end{align*}
as required. □
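The bound of Theorem 5 can likewise be probed empirically. The sketch below (our illustration under assumed Gaussian noise, with the constant 50 taken from the proof; the bound is loose in practice) perturbs the true fault levels multiplicatively by at most $\delta$ and checks that the excess loss of the resulting weighted mean stays below $50\delta s^2$:

import numpy as np

rng = np.random.default_rng(5)
n, m, delta = 50, 100, 0.2
f = rng.uniform(0.2, 1.0, size=n)
z = rng.uniform(0.0, 1.0, size=m)
S = z + rng.normal(size=(n, m)) * np.sqrt(f)[:, None]

w_star = (1 / f) / (1 / f).sum()                     # optimal weights, normalized to 1
f_hat = f * (1 + rng.uniform(-delta, delta, n))      # any estimate in (1 +- delta) f
w = 1 / f_hat                                        # weights actually used

d = lambda a, b: ((a - b) ** 2).mean()
x_star = w_star @ S                                  # optimally weighted answers
z_hat = (w @ S) / w.sum()                            # answers weighted by estimates
s2 = (S ** 2).max()                                  # s^2 with s = max_ij |s_ij|

print(d(z_hat, z) - d(x_star, z), "<=", 50 * delta * s2)  # Theorem 5's bound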
Theorem 6. When $F$ has bounded support and $z$ is bounded, D-TD is consistent as $n\to\infty$, $m=\omega(\log n)$ and $n=\omega(\log m)$. That is, for any $\tau>0$, $\Pr[d(\hat z,z)>\tau]\to 0$ as $n\to\infty$, $m=\omega(\log n)$ and $n=\omega(\log m)$.
Proof. By Theorem 3, we know that as $m\to\infty$, $\hat f_i$ is close to $f_i$ (with small multiplicative error). More specifically, we showed that
\[
|\hat f_i-f_i|\le T_3+T_1^2+2T_1T_2+4T_1\Big(\sum_{j=1}^m z_j/m\Big).
\]
Equivalently,
\[
1-G(T_1,T_2,T_3)\ \le\ \frac{\hat f_i}{f_i}\ \le\ 1+G(T_1,T_2,T_3),
\]
where $G(T_1,T_2,T_3)=(T_3+T_1^2+2T_1T_2+4T_1z_{\max})/f_{\min}$.

Therefore, for any $j\le m$ and any $G(T_1,T_2,T_3)<\frac23$, we have
\[
\frac{\sum_{i=1}^n s_{ij}/\hat f_i}{\sum_{i=1}^n 1/\hat f_i}\ \le\ \frac{\sum_{i=1}^n s_{ij}/f_i}{\sum_{i=1}^n 1/f_i}\times\frac{1+G(T_1,T_2,T_3)}{1-G(T_1,T_2,T_3)}\ \le\ \frac{\sum_{i=1}^n s_{ij}/f_i}{\sum_{i=1}^n 1/f_i}\big(1+3G(T_1,T_2,T_3)\big)
\]
and
\[
\frac{\sum_{i=1}^n s_{ij}/\hat f_i}{\sum_{i=1}^n 1/\hat f_i}\ \ge\ \frac{\sum_{i=1}^n s_{ij}/f_i}{\sum_{i=1}^n 1/f_i}\times\frac{1-G(T_1,T_2,T_3)}{1+G(T_1,T_2,T_3)}\ \ge\ \frac{\sum_{i=1}^n s_{ij}/f_i}{\sum_{i=1}^n 1/f_i}\big(1-3G(T_1,T_2,T_3)\big).
\]
It follows that
\[
\Big|\frac{\sum_{i=1}^n s_{ij}/\hat f_i}{\sum_{i=1}^n 1/\hat f_i}-\frac{\sum_{i=1}^n s_{ij}/f_i}{\sum_{i=1}^n 1/f_i}\Big|\ \le\ 3G(T_1,T_2,T_3)\,\frac{\sum_{i=1}^n s_{ij}/f_i}{\sum_{i=1}^n 1/f_i}.
\]
Therefore, for any $n$, if $\frac{\sum_{i=1}^n s_{ij}/f_i}{\sum_{i=1}^n 1/f_i}$ is no more than $\tau$ away from the ground truth with probability at least $1-\delta$, then $\frac{\sum_{i=1}^n s_{ij}/\hat f_i}{\sum_{i=1}^n 1/\hat f_i}$ is no more than $\tau+3G(T_1,T_2,T_3)(z_{\max}+\tau)$ away from the ground truth with probability at least $1-\big(\delta+m\exp\{-C_1T_1^2n\}+n\exp\{-C_2T_2^2m\}+n\exp\{-C_3T_3^2m\}\big)$.

The consistency of $\frac{\sum_{i=1}^n s_{ij}/\hat f_i}{\sum_{i=1}^n 1/\hat f_i}$ follows from the fact that $\frac{\sum_{i=1}^n s_{ij}/f_i}{\sum_{i=1}^n 1/f_i}$ is consistent, by letting $T_1=\omega(1)\sqrt{\log m/n}$ and $T_2=T_3=\omega(1)\sqrt{\log n/m}$. □
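Putting the two previous results together gives D-TD for the continuous domain: estimate each fault level with D-EFL, then aggregate with inverse-fault weights. The following sketch (our illustration; the data-generating choices are assumptions) typically shows a visible gain over the unweighted mean:

import numpy as np

rng = np.random.default_rng(2)
n, m = 200, 200
f = rng.uniform(0.2, 1.0, size=n)
z = rng.uniform(-1.0, 1.0, size=m)
S = z + rng.normal(size=(n, m)) * np.sqrt(f)[:, None]

# D-TD: D-EFL fault estimates, then inverse-fault (inverse-variance) weighting
f_hat = ((S - S.mean(axis=0)) ** 2).mean(axis=1)
w = 1.0 / f_hat
z_hat = (w @ S) / w.sum()

d = lambda a, b: ((a - b) ** 2).mean()
print("unweighted mean:", d(S.mean(axis=0), z))
print("D-TD           :", d(z_hat, z))               # typically smaller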
Fig. 8. The performance of P-TD (which equals D-TD) on several distributions from the INN model.
B MORE EMPIRICAL RESULTS FOR THE CONTINUOUS DOMAIN
B.1 Datasets
We used datasets from two sources. The Buildings dataset was collected via Amazon Mechanical Turk; see Appendix D.1 for details.
The Triangles dataset is taken from a study of people's geometric reasoning [18], and is available at https://github.com/StatShapeGeometricReasoning/StatisticalShapeGeometricReasoning. Participants in the study were shown the base of a triangle (two vertices and their angles) and were asked to position the third vertex. That is, their answer to each question is the x-coordinate and y-coordinate of a vertex. In our analysis we treat each of the coordinates as a separate question.
Fig. 8 shows that P-TD performs substantially better than the unweighted mean on nearly all datasets we tried.
C PROOFS FOR THE CATEGORICAL DOMAIN
C.1 Fault Estimation
Theorem 7. Suppose that instance $I=\langle S,z\rangle$ is sampled from proto-population $F$ via the IER model. For every worker $i$,
\[
E[\pi_i(I)\mid f_i]=\mu+\big(1-(1+\theta)\mu\big)f_i.
\]
Proof. Note first that for any two agents $i,i'$ with fixed fault levels $f_i,f_{i'}$, and any question $j$:
\begin{align*}
\Pr[s_{ij}=s_{i'j}\mid f_i,f_{i'}] &= \Pr[s_{ij}=z_j,\,s_{i'j}=z_j\mid f_i,f_{i'}]+\Pr[s_{ij}\ne z_j,\,s_{i'j}\ne z_j,\,s_{ij}=s_{i'j}\mid f_i,f_{i'}]\\
&= \Pr[s_{ij}=z_j\mid f_i]\cdot\Pr[s_{i'j}=z_j\mid f_{i'}]\\
&\quad+\Pr[s_{ij}\ne z_j\mid f_i]\cdot\Pr[s_{i'j}\ne z_j\mid f_{i'}]\cdot\Pr[s_{ij}=s_{i'j}\mid s_{ij}\ne z_j,\,s_{i'j}\ne z_j]\\
&= (1-f_i)(1-f_{i'})+f_if_{i'}\theta\ =\ 1-f_i-f_{i'}+(1+\theta)f_if_{i'}.
\end{align*}
Denote by $d_{ii'}$ the random variable $d(s_i,s_{i'})$. Thus,
\[
E[d_{ii'}\mid f_i,f_{i'}]=E\Big[\frac1m\sum_{j\le m}\mathbb{1}[s_{ij}\ne s_{i'j}]\ \Big|\ f_i,f_{i'}\Big]=\frac1m\sum_{j\le m}\big(1-\Pr[s_{ij}=s_{i'j}\mid f_i,f_{i'}]\big)=f_i+f_{i'}-(1+\theta)f_if_{i'}.
\]
For a fixed population (i.e., taking expectation only over realized answers):
\[
E[\pi_i\mid\mathbf f]=E\Big[\frac1{n-1}\sum_{i'\ne i}d_{ii'}\ \Big|\ \mathbf f\Big]=\frac1{n-1}\sum_{i'\ne i}\big(f_i+f_{i'}-(1+\theta)f_if_{i'}\big)=f_i+\big(1-(1+\theta)f_i\big)\frac1{n-1}\sum_{i'\ne i}f_{i'}.\tag{8}
\]
Finally, taking expectation over sampled populations,
\[
E[\pi_i\mid f_i]=E_{\mathbf f_{-i}\sim F^{n-1}}\big[E[\pi_i\mid\mathbf f]\mid f_i\big]=f_i+\big(1-(1+\theta)f_i\big)\,E\Big[\frac1{n-1}\sum_{i'\ne i}f_{i'}\Big]=f_i+\big(1-(1+\theta)f_i\big)\mu=\mu+\big(1-(1+\theta)\mu\big)f_i,
\]
as required. □
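Theorem 7 can be verified by simulation. In the sketch below (ours; it assumes uniform errors, so that $\theta=\frac{1}{k-1}$, and a proto-population with mean $\mu=0.3$), the empirical proxy score of a worker with known fault level closely matches $\mu+(1-(1+\theta)\mu)f_i$:

import numpy as np

rng = np.random.default_rng(3)
n, m, k = 300, 400, 4
theta = 1.0 / (k - 1)              # two wrong answers coincide w.p. theta (uniform errors)
f = rng.uniform(0.1, 0.5, size=n)  # proto-population with mean mu = 0.3
f[0] = 0.25                        # worker of interest
mu = 0.3

z = rng.integers(k, size=m)        # ground truth labels in {0, ..., k-1}
err = rng.random((n, m)) < f[:, None]
wrong = (z + rng.integers(1, k, size=(n, m))) % k   # uniform over the k-1 wrong labels
S = np.where(err, wrong, z)

pi0 = (S[0] != S[1:]).mean()       # proxy score: mean Hamming distance to the others
print("empirical pi_0 :", pi0)
print("Theorem 7 value:", mu + (1 - (1 + theta) * mu) * f[0])   # = 0.45 here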
Theorem 8. Suppose the support of $F$ is a closed subset of $(0,1)$ and $\mu_F<\frac{k-1}{k}$. Then $\frac1n$-P-EFL is consistent. That is, for any $\delta>0$, $\Pr[|\hat f_i-f_i|>\delta]\to 0$ for all $i\in N$, as $n\to\infty$ and $m=\omega(\log n)$.
Proof. The proof proceeds by bounding the probability of the following three events as $n\to\infty$ and $m=\omega(\log n)$. We will show that by choosing the parameters in these events appropriately, the output of P-EFL converges to the ground truth fault levels $\mathbf f$.

Event A: $r_P(s)=z$. That is, the unweighted plurality rule reveals the ground truth.
Event B: Given $T_1>0$, $\hat\mu$ in Algorithm 5 is no more than $T_1$ away from $\mu_F$.
Event C: Given $T_2>0$, for all $i\le n$ and $i'\le n$ with $i\ne i'$, $d_{ii'}$ is no more than $T_2$ away from $f_i+f_{i'}-\frac{k}{k-1}f_if_{i'}$.

We first show that conditioned on events A, B, and C holding simultaneously, the output of P-EFL is close to $\mathbf f$. Because event C holds, for every $i\le n$, $\pi_i$ is no more than $T_2$ away from $f_i+\frac{\sum_{i'\ne i}f_{i'}}{n-1}-\frac{k}{k-1}f_i\frac{\sum_{i'\ne i}f_{i'}}{n-1}$, which is no more than $\big(1+\frac{k}{k-1}\big)\big(T_2+\frac1{n-1}\big)$ away from $f_i+\mu_F-\frac{k}{k-1}f_i\mu_F$ because event B holds. Therefore, as $T_1\to0$ and $T_2\to0$, we have that for every $i\le n$, $\hat f_i\to f_i$.

We now show that as $n\to\infty$ and $m=\omega(\log n)$, we can choose $T_1(n)$ and $T_2(n)$ as functions of $n$ such that (1) $T_1(n)$ and $T_2(n)$ converge to 0, and (2) the probability that events A, B, and C hold simultaneously goes to 1.

First, as $n\to\infty$, the probability of event A goes to 1. For any $j\le m$, the expected score difference between the ground truth and any given different alternative is
\[
\int_0^1\Big((1-f)-\frac{f}{k-1}\Big)\,dF(f)=1-\frac{k}{k-1}\mu_F>0,
\]
where the last inequality follows from the assumption that $\mu_F<\frac{k-1}{k}$. By the union bound, as $n\to\infty$, the probability that the plurality rule reveals the ground truth on all $m$ questions goes to 1.

Second, given any ground truth $z$ and any $\mathbf f$, $\sum_{i=1}^n d_H(s_i,z)/n$ can be seen as the average of $mn$ independent Bernoulli trials: for each $i\le n$, there are $m$ i.i.d. Bernoulli trials whose probability of taking 1 is $f_i$. It follows from Hoeffding's inequality that for any $T_1>0$, we have:
\[
\Pr\Big(\Big|\frac{\sum_{i=1}^n d_H(s_i,z)}{n}-\frac{\sum_{i=1}^n f_i}{n}\Big|>T_1\Big)<2e^{-T_1^2mn/2}.\tag{9}
\]
By applying Hoeffding's inequality again to $\frac{\sum_{i=1}^n f_i}{n}$, we have:
\[
\Pr\Big(\Big|\frac{\sum_{i=1}^n f_i}{n}-\mu_F\Big|>T_1/2\Big)<2e^{-T_1^2n/2}.\tag{10}
\]
We note that both upper bounds go to 0 as $n\to\infty$. By the union bound, as $n\to\infty$, with probability approaching 1, event A holds and the deviations in (9) and (10) do not occur, which together imply that event B holds with probability approaching 1.

Third, given any $i\ne i'$, $d_{ii'}$ can be seen as the average of $m$ i.i.d. random variables, each of which takes 1 with probability $1-(1-f_i)(1-f_{i'})-f_if_{i'}/(k-1)$ and takes 0 otherwise. Therefore, by Hoeffding's inequality, we have:
\[
\Pr\Big(\Big|d_{ii'}-\Big(f_i+f_{i'}-\frac{k}{k-1}f_if_{i'}\Big)\Big|>T_2\Big)\le 2e^{-2T_2^2m}.\tag{11}
\]
By the union bound, the probability that (11) fails for some pair $i\ne i'$ is no more than $2n^2e^{-2T_2^2m}$. Therefore, as $n\to\infty$ and $m=\omega(\log n)$, the probability of event C goes to 1.

Finally, it follows that for any $\delta>0$, we can choose sufficiently small $T_1$ and $T_2$ so that the fault level of any agent estimated by P-EFL is no more than $\delta$ away from the true level $f_i$ when events A, B, and C hold simultaneously, which happens with probability that goes to 1 as $n\to\infty$ and $m=\omega(\log n)$. This proves the theorem. □
C.2 Aggregation
ALGORITHM 8: Aggregation skeleton (categorical)
Output: Estimated true answers $\hat z$
$(\hat f_i)_{i\in N}\leftarrow$ EstimateFaultLevels$(S)$;
$\forall i\in N$, set $w_i\leftarrow\log\frac{(1-\hat f_i)(k-1)}{\hat f_i}$;
Set $\hat z\leftarrow r_P(S,w)$;
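A possible instantiation of this skeleton in NumPy is sketched below. The fault estimator is a parameter, matching the skeleton; the function p_efl_categorical shown here is our hedged reading of proxy-based estimation, obtained by inverting the relation of Theorem 7 with $\theta=\frac{1}{k-1}$ and estimating $\hat\mu$ from distances to the plurality outcome (the exact form of Algorithm 5 may differ):

import numpy as np

def p_efl_categorical(S, k):
    """Sketch of proxy-based fault estimation for categorical answers:
    invert E[pi_i] = mu + (1 - (1+theta) mu) f_i with theta = 1/(k-1)."""
    n, m = S.shape
    theta = 1.0 / (k - 1)
    D = (S[:, None, :] != S[None, :, :]).mean(axis=2)   # pairwise Hamming distances
    pi = D.sum(axis=1) / (n - 1)                        # proxy scores
    plurality = np.array([np.bincount(S[:, j], minlength=k).argmax() for j in range(m)])
    mu_hat = (S != plurality).mean()                    # mean distance to plurality outcome
    return (pi - mu_hat) / (1 - (1 + theta) * mu_hat)

def aggregate_categorical(S, k, estimate_fault_levels=p_efl_categorical):
    """Algorithm 8: estimate faults, convert to Grofman weights, run weighted plurality."""
    f_hat = np.clip(estimate_fault_levels(S, k), 1e-6, 1 - 1e-6)  # keep weights finite
    w = np.log((1 - f_hat) * (k - 1) / f_hat)                     # Grofman weights
    return np.array([np.bincount(S[:, j], weights=w, minlength=k).argmax()
                     for j in range(S.shape[1])])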
Theorem 15 (Consistency of plurality with Grofman weights). Plurality with Grofman weights is consistent for IER with observed fault levels, if and only if $F(\frac{k-1}{k})\ne 1$.
Proof. W.l.o.g. let $m=1$ and $z=0$. Let $W^*$ (respectively, $W$) denote the random variable for the weighted vote to alternative 0 (respectively, alternative 1) under IER with observed fault levels. In other words, for any $0<f<1$, conditioned on the fault level being $f$ (which happens with probability $F(f)$), we have $W^*=\log\frac{(1-f)(k-1)}{f}$ with probability $1-f$, and $W^*=0$ otherwise. For any $0<f<1$, conditioned on the fault level being $f$, we have $W=\log\frac{(1-f)(k-1)}{f}$ with probability $\frac{f}{k-1}$, and $W=0$ otherwise. Therefore,
\[
E(W^*-W)=\int\Big(1-\frac{k}{k-1}f\Big)\log\frac{(1-f)(k-1)}{f}\,dF(f).\tag{12}
\]
We note that $1-\frac{k}{k-1}f$ and $\log\frac{(1-f)(k-1)}{f}$ always have the same sign (positive, negative, or 0), and both of them are 0 if and only if $f=\frac{k-1}{k}$. Therefore, when $F(\frac{k-1}{k})<1$, we have $(12)>0$. Let $\bar W^*=\sum_{i=1}^n W^*_i/n$ and let $\bar W=\sum_{i=1}^n W_i/n$. By the Law of Large Numbers, we have $\lim_{n\to\infty}\Pr(\bar W^*-\bar W>0)=1$. This means that as $n$ increases, the total weighted vote for 0 is strictly larger than the total weighted vote for 1 with probability close to 1. It follows from the union bound that as $n\to\infty$, the probability for 0 to be the winner goes to 1, which proves the consistency of the algorithm.

When $F(\frac{k-1}{k})=1$, with probability 1 the votes are uniformly distributed with the same weights regardless of the ground truth. Therefore the algorithm is not consistent. □
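A quick Monte Carlo check of the key quantity (12) is shown below (ours; the choice $F=U(0.05,0.95)$ is an arbitrary non-degenerate example):

import numpy as np

# E(W* - W) = E_f[(1 - k/(k-1) f) * log((1-f)(k-1)/f)], per (12)
rng = np.random.default_rng(4)
k = 4
f = rng.uniform(0.05, 0.95, size=10**6)    # F = U(0.05, 0.95); F((k-1)/k) < 1
vals = (1 - k / (k - 1) * f) * np.log((1 - f) * (k - 1) / f)
print(vals.mean())   # strictly positive: both factors share a sign for every f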
Theorem 10. Suppose the support of $F$ is a closed subset of $(0,1)$ and $\mu_F<\frac{k-1}{k}$. Then $\frac1n$-P-TD is consistent. That is, for any $\tau>0$, $\Pr[d(\hat z,z)>\tau]\to 0$ as $n\to\infty$ and $m=\omega(\log n)$.
Proof. Because the support of $F$ is a closed subset of $(0,1)$, the estimated Grofman weights are bounded. By Theorem 8, for any $\tau>0$ and $\delta>0$, there exists $n^*$ such that for all $n>n^*$ and $m=\omega(\log n)$,
\[
\Pr\big(|\hat w_i-w^*_i|<\tau\ \text{for all}\ i\le n\big)>1-\delta,
\]
where $\hat w_i$ and $w^*_i$ are, respectively, the Grofman weight for agent $i$ estimated from the output of P-EFL and the true Grofman weight. Let $\hat{\mathbf w}=(\hat w_1,\ldots,\hat w_n)$ and $\mathbf w^*=(w^*_1,\ldots,w^*_n)$. For any alternative $l\in A$, let $\mathrm{Score}(l,\hat{\mathbf w})$ denote the weighted score of alternative $l$ with weights $\hat{\mathbf w}$.

This means that with probability at least $1-\delta$, for each $l\in A$, we have
\[
\Big|\frac{\mathrm{Score}(l,\hat{\mathbf w})}{n}-\frac{\mathrm{Score}(l,\mathbf w^*)}{n}\Big|<\tau.
\]
The consistency of P-TD follows from the fact that $r_P(\cdot,\mathbf w^*)$ is consistent, and that the gap between the average score of the winner and the average score of any other alternative is bounded below by the constant in (12) (which is strictly positive because $\mu_F<\frac{k-1}{k}$) with probability that goes to 1 as $n\to\infty$. □
D MORE EMPIRICAL RESULTS FOR THE CATEGORICAL DOMAIN
D.1 Datasets
In the datasets we collected, participants were given short instructions and then had to answer $m=25$ questions. We recruited participants through Amazon Mechanical Turk, restricting participation to workers who had at least 50 approved assignments. We planted in each survey a simple question that can be easily answered by anyone who understands the instructions of the experiment (known as a gold-standard question). Participants who answered the gold-standard question correctly received a payment of $0.30. Participants did not receive bonuses for accuracy. The study protocol was approved by the Institutional Review Board at the Technion.

BinaryDots Subjects were asked which picture had more dots (binary questions).
Buildings Subjects were asked to mark the height of the building in the picture on a slide bar (continuous answers).

In addition, we used four datasets from [34]; in those experiments, workers had to identify an object in pictures. Workers were paid according to the Double or Nothing payment scheme described in that paper.

GoldenGate Subjects were asked whether the Golden Gate Bridge appears in the picture (binary questions). 35 workers, 21 questions.
Dogs Subjects were asked to identify the breed of the dog in the picture (10 possible answers). 31 workers, 85 questions. We omitted 4 workers due to missing data in their reports.
HeadsOfCountries Subjects were asked to identify heads of countries (4 possible answers). 32 workers, 20 questions. We omitted 3 workers due to missing data in their reports. On this dataset, almost all workers achieved perfect results, and thus we did not use it in our simulations.