Web-Scale Workflow
Editor: Schahram Dustdar • dustdar@dsg.tuwien.ac.at

Quality Control in Crowdsourcing Systems: Issues and Directions

Mohammad Allahbakhsh, Boualem Benatallah, and Aleksandar Ignjatovic • University of New South Wales
Hamid Reza Motahari-Nezhad • Hewlett-Packard Laboratories
Elisa Bertino • Purdue University
Schahram Dustdar • Vienna University of Technology

As a new distributed computing model, crowdsourcing lets people leverage the crowd's intelligence and wisdom toward solving problems. This article proposes a framework for characterizing various dimensions of quality control in crowdsourcing systems, a critical issue. The authors briefly review existing quality-control approaches, identify open issues, and look to future research directions.
Crowdsourcing has emerged as an effective way to perform tasks that are easy for humans but remain difficult for computers.1,2 For instance, Amazon Mechanical Turk (MTurk; www.mturk.com) provides on-demand access to task forces for micro-tasks such as image recognition and language translation. Several organizations, including DARPA and various world health and relief agencies, are using platforms such as MTurk, CrowdFlower (http://crowdflower.com), and Ushahidi (http://ushahidi.com) to crowdsource work through multiple channels, including SMS, email, Twitter, and the World Wide Web. As Internet and mobile technologies continue to advance, crowdsourcing can help organizations increase productivity, leverage an external (skilled) workforce in addition to a core workforce, reduce training costs, and improve core and support processes for both public and private sectors.

On the other hand, the people who contribute to crowdsourcing might have different levels of skills and expertise that are sometimes insufficient for doing certain tasks.3 They might also have various and even biased interests and incentives.1,4 Indeed, in recent years, crowdsourcing systems have been widely subject to malicious activities such as collusive campaigns to support people or products, and fake reviews posted to online markets.5 Additionally, ill-defined crowdsourcing tasks that don't provide workers with enough information about the tasks and their requirements can also lead to low-quality contributions from the crowd.6 Addressing these issues requires fundamentally understanding the factors that impact quality as well as quality-control approaches being used in crowdsourcing systems.
Categorizing Quality Control
To crowdsource a task, its owner, also called the requester, submits the task to a crowdsourcing platform. People who can accomplish the task, called workers, can choose to work on it and devise solutions. Workers then submit these contributions to the requester via the crowdsourcing platform.
The requester assesses the posted contributions' quality and might reward those workers whose contributions have been accepted. This reward can be monetary, material, psychological, and so on.7 A task's outcome can be one or more individual contributions or a combination of accepted ones. The requester should choose contributions that reach his or her accepted level of quality for the outcome.

Quality is a subjective issue in general. Some efforts have proposed models and metrics to quantitatively and objectively assess quality along different dimensions of a software system, such as reliability, accuracy, relevancy, completeness, and consistency.8 In this survey, we adopt Crosby's definition of quality as a guide to identify quality-control attributes, including dimensions and factors.9 This definition emphasizes "conformance to requirements" as a guiding principle to define quality-control models. In other words, we define the quality of outcomes of a crowdsourced task as

"the extent to which the provided outcome fulfills the requirements of the requester."

The overall outcome quality depends on the definition of the task that's being crowdsourced and the contributing workers' attributes.1,2 We characterize quality in crowdsourcing systems along two main dimensions: worker profiles and task design. We propose a taxonomy for quality in crowdsourcing systems, as Figure 1 illustrates.

Figure 1. Taxonomy of quality in crowdsourcing systems. We characterize quality along two main dimensions: (a) worker profiles, comprising reputation and expertise, and (b) task design, comprising task definition, user interface, granularity, and compensation policy.
Worker Profiles

The quality of a crowdsourced task's outcome can be affected by workers' abilities and quality.2 As Figure 1a shows, a worker's quality is characterized by his or her reputation and expertise. Note that these attributes are correlated: a worker with high expertise is expected to have a high reputation as well. We distinguish them because reputation is more general in nature. In addition to workers' expertise (which is reflected in the quality of their contributions), we might compute reputation based on several other parameters, such as the worker's timeliness or the quality of evaluators. Also, reputation is a public, community-wide metric, but expertise is task-dependent. For example, a Java expert with a high reputation score might not be qualified to undertake a SQL-related task.
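To make this distinction concrete, the following sketch (a minimal illustration, not taken from any system cited here; the names and thresholds are hypothetical) models reputation as a single community-wide score and expertise as a per-skill map, so a worker with a high overall reputation can still be screened out of a task whose required skill he or she lacks.

```python
from dataclasses import dataclass, field

@dataclass
class WorkerProfile:
    worker_id: str
    reputation: float                              # community-wide score in [0, 1]
    expertise: dict = field(default_factory=dict)  # skill -> proficiency in [0, 1]

def suitable_for(worker, required_skill, min_reputation=0.7, min_proficiency=0.6):
    """A worker qualifies only if both the general and the task-specific signals are high enough."""
    return (worker.reputation >= min_reputation
            and worker.expertise.get(required_skill, 0.0) >= min_proficiency)

java_expert = WorkerProfile("w1", reputation=0.95, expertise={"java": 0.9, "sql": 0.2})
print(suitable_for(java_expert, "java"))  # True
print(suitable_for(java_expert, "sql"))   # False: highly reputed, but not a SQL expert
```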
Reputation. The trust relationship between a requester and a particular worker reflects the probability that the requester expects to receive a quality contribution from the worker. At the community level, because members might have no experience or direct interactions with other members, they can rely on reputation to indicate the community-wide judgment on a given worker's capabilities.10 Reputation scores are mainly built on community members' feedback about workers' activities in the system.11 Sometimes, this feedback is explicit; that is, community members explicitly cast feedback on a worker's quality or contributions by, for instance, rating or ranking the content the worker has created. In other cases, feedback is cast implicitly, as in Wikipedia, when subsequent editors preserve the changes a particular worker has made.
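As a rough sketch of how explicit feedback can be aggregated into a reputation score, the following example smooths a worker's average rating toward a community-wide prior, so that a worker with little history does not start out with an extreme score. This is a generic illustration of one common aggregation scheme, not the mechanism of any particular platform; the prior mean and weight are assumed parameters.

```python
def reputation_score(ratings, prior_mean=0.5, prior_weight=5):
    """Smoothed mean of explicit ratings, each in [0, 1].

    A worker with few ratings stays close to the community prior;
    as feedback accumulates, the score converges to the raw average.
    """
    total = sum(ratings) + prior_mean * prior_weight
    count = len(ratings) + prior_weight
    return total / count

print(reputation_score([1.0, 1.0]))            # ~0.64: two good ratings, still cautious
print(reputation_score([1.0] * 50))            # ~0.95: long history of good feedback
print(reputation_score([0.2, 0.1, 0.0, 0.1]))  # ~0.32: consistently poor feedback
```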
Expertise. A worker's expertise demonstrates how capable he or she is at doing particular tasks.4 Two types of indicators point to worker expertise: credentials and experience. Credentials are documents or evidence from which the requesters or crowdsourcing platform can assess a worker's capabilities as regards a particular crowdsourced task. Information such as academic certificates or degrees, spoken languages, or geographical regions that a worker is familiar with can be credentials. Experience refers to knowledge and skills a worker has gained while working in the system as well as through support and training. For instance, in systems such as MTurk and Stack Overflow, workers can improve their skills and capabilities over time with shepherding and support.12
Task Design

Task design is the model under which the requester describes his or her task; it consists of several components. When the requester designates a task, he or she provides some information for workers. The requester might put a few criteria in place to ensure that only eligible people can do the task, or specify the evaluation and compensation policies. We identify four important factors that contribute to quality as regards this dimension (see Figure 1b): task definition, user interface, granularity, and compensation policy.
Task denition. The task denition is
the information the requester gives
potential workers regarding the
crowdsourced task. A main element
is a short descr iption of the task
explaining its nature, time limita-
tions, and so on.6 A second element
is the qualication requirements for
performing the task. These spec-
ify the eligibility criteria by which
the requester will evaluate workers
before accepting their participation.
For example, in MTurk, requesters
can specif y that only workers with
a specied percentage of accepted
works (for example, larger than 90
percent) can participate, or that only
those workers living in the US can take
part in a particular survey. Previous
studies show that the qualit y of the
provided denition (such as its clar-
ity or the instructions’ usef ulness)
for a task affects outcome qualit y.6
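In effect, qualification requirements are a predicate evaluated over a worker's record before the worker may accept the task. The sketch below is a platform-neutral illustration mirroring the two MTurk examples above (an acceptance rate above 90 percent and a US-only restriction); the field names and data layout are assumptions, not an actual platform API.

```python
def meets_qualifications(worker, min_approval_rate=0.90, allowed_countries=("US",)):
    """Return True if the worker satisfies the requester's eligibility criteria.

    `worker` is assumed to be a dict holding counts of approved and submitted
    works plus a country code.
    """
    approval = worker.get("approved", 0) / max(worker.get("submitted", 1), 1)
    return approval > min_approval_rate and worker.get("country") in allowed_countries

candidates = [
    {"id": "w1", "approved": 96, "submitted": 100, "country": "US"},
    {"id": "w2", "approved": 70, "submitted": 100, "country": "US"},
    {"id": "w3", "approved": 99, "submitted": 100, "country": "DE"},
]
print([w["id"] for w in candidates if meets_qualifications(w)])  # ['w1']
```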
User interface. The task UI refers to the interface through which the workers access and contribute to the task. This can be a Web UI, an API, or any other kind of UI. A user-friendly interface can attract more workers and increase the chance of a high-quality outcome. A simple interface, such as one with nonverifiable questions, makes it easier for deceptive workers to exploit the system.1 On the other hand, an unnecessarily complicated interface will discourage honest workers and could lead to delays.
Granularity. We can divide tasks into two broad types: simple and complex. Simple tasks are the self-contained, appropriately short tasks that usually need little expertise to be solved, such as tagging or describing.13 Complex tasks usually need to be broken down into simpler subtasks. Solving a complex task (such as writing an article) might require more time, costs, and expertise, so fewer people will be interested or qualified to perform it. Crowds solve the subtasks, and their contributions are consolidated to build the final answer.13 A complex task workflow defines how these simple subtasks are chained together to build the overall task.14 This workflow can be iterative, parallel, or a combination.14,15

Designing workflows for complex tasks greatly affects outcome quality.1,2,5 For instance, one study demonstrated that designing a poor outline for an essay that the crowd will write can result in a low-quality essay.13 Improving the quality of an outline using crowd contributions increases the corresponding written essay's quality.
Incentives and compensation policy. Choosing suitable incentives and a compensation policy can affect the crowd's performance as well as outcome quality.7,12 Knowing about evaluation and compensation policies helps workers align their work based on these criteria and produce contributions with higher quality.12 We broadly categorize incentives into two types: intrinsic incentives, such as personal enthusiasm or altruism, and extrinsic incentives, such as monetary reward. Intrinsic incentives in conjunction with extrinsic ones can motivate honest users to participate in the task. Moreover, in some cases, the intrinsic incentives' positive effect on the outcome's quality is more significant than the impact of the extrinsic incentives.16

Looking at monetary rewards, which are common, the reward amount attracts more workers and affects how fast they accomplish the task, but increasing the amount doesn't necessarily increase outcome quality.16 Some research also shows that the payment method might have a bigger impact on outcome quality than the payment amount itself.6,16 For example, in a requested task that requires finding 10 words in a puzzle, paying per puzzle will lead to more solved puzzles than paying per word.13
Quality-Control Approaches

Researchers and practitioners have proposed several quality-control approaches that fall under the aforementioned quality dimensions and factors. We broadly classify existing approaches into two categories: design-time (see Table 1) and runtime (see Table 2). These two categories aren't mutually exclusive. A task can employ both approaches to maximize the possibility of receiving high-quality outcomes.

At design time, the requesters can leverage techniques for preparing a well-designed task and just allow a suitable crowd to contribute to the task. Although these techniques increase the possibility of receiving high-quality contributions from the crowd, there is still a need to control the quality of contributions at runtime. Even high-quality workers might submit low-quality contributions because of mistakes or misunderstanding. Therefore, requesters must still put in place runtime quality-control approaches when the task is running as well as when the crowd contributions are being collected and probably aggregated to build the final task answer. We discuss both design-time and runtime approaches in more detail in the Web appendix at http://doi.ieeecomputersociety.org/10.1109/MIC.2013.20.

Table 1. Existing quality-control design-time approaches.

Quality-control approach | Subcategories | Description | Sample application
Effective task preparation | Defensive design | Provides an unambiguous description of the task; task design is defensive, that is, cheating isn't easier than doing the task; defines evaluation and compensation criteria | References 1, 3, 6, 12
Worker selection | Open to all | Allows everybody to contribute to the task | ESP Game, Threadless.com
Worker selection | Reputation-based | Lets only workers with prespecified reputation levels contribute to the task | MTurk, Stack Overflow, reference 4
Worker selection | Credential-based | Allows only workers with prespecified credentials to do the task | Wikipedia, Stack Overflow, reference 4

Table 2. Existing quality-control runtime approaches.

Quality-control approach | Description | Sample application
Expert review | Domain experts check contribution quality. | Academic conferences and journals, Wikipedia, reference 3
Output agreement | If workers independently and simultaneously provide the same description for an input, they are deemed correct. | ESP Game
Input agreement | Independent workers receive an input and describe it to each other. If they all decide that it's the same input, it's accepted as a quality answer. | Tag-A-Tune
Ground truth | Compares answers with a gold standard, such as known answers or common-sense facts, to check the quality. | CrowdFlower, MTurk
Majority consensus | The judgment of a majority of reviewers on the contribution's quality is accepted as its real quality. | TurKit, Threadless.com, MTurk
Contributor evaluation | Assesses a contribution based on the contributor's quality. | Wikipedia, Stack Overflow, MTurk
Real-time support | Provides shepherding and support to workers in real time to help them increase contribution quality. | Reference 12
Workflow management | Designs a suitable workflow for a complex task; the workflow is monitored to control quality, cost, and so on, on the fly. | References 13, 14
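As a concrete illustration of two of the runtime mechanisms in Table 2, the following sketch aggregates redundant answers by majority consensus and screens a worker against gold-standard questions with known answers. It is a generic illustration, not code from any of the listed systems, and the thresholds are assumptions.

```python
from collections import Counter

def majority_consensus(answers, min_agreement=0.5):
    """Accept the most common answer if it wins more than `min_agreement` of the votes."""
    winner, votes = Counter(answers).most_common(1)[0]
    return winner if votes / len(answers) > min_agreement else None

def passes_ground_truth(worker_answers, gold, min_accuracy=0.8):
    """Check a worker's answers on questions with known (gold) answers against a threshold."""
    correct = sum(1 for question, answer in gold.items() if worker_answers.get(question) == answer)
    return correct / len(gold) >= min_accuracy

print(majority_consensus(["cat", "cat", "dog"]))              # 'cat'
print(majority_consensus(["cat", "dog"]))                     # None: no clear majority
gold = {"q1": "cat", "q2": "dog"}
print(passes_ground_truth({"q1": "cat", "q2": "dog"}, gold))  # True
print(passes_ground_truth({"q1": "cat", "q2": "cat"}, gold))  # False at a 0.8 threshold
```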
Although researchers have proposed and used several quality-control approaches so far, many open issues and challenges remain for defining, measuring, and managing quality in crowdsourcing systems, and these issues require further research and investigation.

One serious limitation of existing approaches is their reliance on primitive and hard-wired quality-control techniques. These approaches are typically embedded in their host systems, and requesters can't customize them based on their specific requirements. Defining new approaches is another challenge that requesters struggle with. Although some tools that rely on current crowdsourcing systems, such as TurKit, let users define some quality-control processes, using these tools requires programming skills in languages such as Java or C++.

Endowing crowdsourcing services with customizable, rich, and robust quality-control techniques is key to crowdsourcing platforms' wide-ranging success, whether they're supporting micro and commodity tasks or high-value processes (such as business processes or intelligence data gathering). Requesters can achieve this functionality using a generic quality-control framework that lets them define new quality-control approaches and reuse or customize existing ones. Such a framework should also be capable of being seamlessly integrated with existing crowdsourcing platforms to let requesters benefit from both crowdsourcing and quality-control systems simultaneously. Building such a framework can be an interesting future direction for research in the crowdsourcing arena.
Another major limitation of existing quality-control approaches comes from the subjective nature of quality, particularly in crowdsourcing systems. The quality of a task's outcome might depend on several parameters, such as requesters' requirements, task properties, crowd interests and incentives, and costs. Currently, quality-control techniques are domain-specific; that is, a technique that performs well for some tasks might perform poorly on new and different ones. For instance, approaches that are suitable for checking a written essay's quality are different from those used to control quality in an image-processing task. Finding a suitable approach for a particular situation is a challenge that needs more investigation.

One solution to this limitation is a recommender system, which gives requesters a list of adequate quality-control approaches. Such a recommender could use machine learning techniques to provide more precise recommendations. It should offer the requester a list of approaches that best suit the situation based on the requester's profile (social relations, history, interests, and so on), the task's type and attributes, the history of the existing crowd, and the quality requirements of the task, along with many more options. Designing such a system can be a suitable direction for further study.
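To illustrate the kind of matching such a recommender would perform, the sketch below scores candidate quality-control approaches by how well they fit a task's attributes. The attribute vocabulary, the catalog, and the scoring rule are all invented for illustration; a real recommender would, as noted above, also weigh the requester's profile, the crowd's history, and learned models.

```python
def recommend_qc_approaches(task, catalog, top_k=3):
    """Rank candidate quality-control approaches by overlap with the task's attributes."""
    def score(approach):
        return len(task["attributes"] & approach["suits"])
    return sorted(catalog, key=score, reverse=True)[:top_k]

catalog = [
    {"name": "ground truth",       "suits": {"objective-answers", "many-microtasks"}},
    {"name": "expert review",      "suits": {"subjective-quality", "long-form-output"}},
    {"name": "majority consensus", "suits": {"objective-answers", "redundancy-affordable"}},
]
essay_task = {"attributes": {"subjective-quality", "long-form-output"}}
print([a["name"] for a in recommend_qc_approaches(essay_task, catalog, top_k=2)])
# ['expert review', 'ground truth']
```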
Thanks to Web 2.0 technologies and the rapid growth of mobile computing in the form of smartphones, tablets, and so on, a tremendous amount of human computation power is available for accomplishing jobs almost for free. On the other hand, artificial intelligence and machine learning are fast-growing areas in computer science. We envision that, in the near future, combining the crowd and machines to solve problems will be easily feasible.17 This will raise some interesting research challenges. Topics such as machine versus human trustworthiness, workflow design for such tasks, and conflict resolution between human and machine judgments will all need to be addressed.

Moreover, people are at the core of crowdsourcing systems. However, they're also distributed among separated online communities, and a requester can't easily employ crowds from several communities. We foresee that this will be simplified in the near future via service composition middleware. Building such middleware will require addressing several issues, including how to share people-quality indicators such as reputation and expertise between different communities and how to build a public global picture for each individual based on his or her available history of activities in different possible crowd communities. Addressing these issues will be another interesting research direction for crowdsourcing systems.
References

1. A. Kittur, E. Chi, and B. Suh, "Crowdsourcing User Studies with Mechanical Turk," Proc. 26th Ann. SIGCHI Conf. Human Factors in Computing Systems, ACM, 2008, pp. 453–456.
2. R. Khazankin, S. Daniel, and S. Dustdar, "Predicting QoS in Scheduled Crowdsourcing," Advanced Information Systems Eng., vol. 7328, J. Ralyté et al., eds., Springer, 2012, pp. 460–472.
3. A.J. Quinn and B.B. Bederson, "Human Computation: A Survey and Taxonomy of a Growing Field," Proc. 2011 Ann. Conf. Human Factors in Computing Systems, ACM, 2011, pp. 1403–1412.
4. D. Schall, F. Skopik, and S. Dustdar, "Expert Discovery and Interactions in Mixed Service-Oriented Systems," IEEE Trans. Services Computing, vol. 5, no. 2, 2012, pp. 233–245.
5. G. Wang et al., "Serf and Turf: Crowdturfing for Fun and Profit," Proc. 21st Int'l Conf. World Wide Web, ACM, 2012, pp. 679–688.
6. J.J. Chen, N. Menezes, and A. Bradley, "Opportunities for Crowdsourcing Research on Amazon Mechanical Turk," Proc. CHI 2011 Workshop on Crowdsourcing and Human Computation, 2011; http://crowdresearch.org/chi2011-workshop/papers/chen-jenny.pdf.
7. O. Scekic, H. Truong, and S. Dustdar, "Modeling Rewards and Incentive Mechanisms for Social BPM," Business Process Management, vol. 7481, A. Barros et al., eds., Springer, 2012, pp. 150–155.
8. E. Agichtein et al., "Finding High-Quality Content in Social Media," Proc. Int'l Conf. Web Search and Web Data Mining, ACM, 2008, pp. 183–194.
9. P. Crosby, Quality Is Free, McGraw-Hill, 1979.
10. A. Jøsang, R. Ismail, and C. Boyd, "A Survey of Trust and Reputation Systems for Online Service Provision," Decision Support Systems, vol. 43, no. 2, 2007, pp. 618–644.
11. L.D. Alfaro et al., "Reputation Systems for Open Collaboration," Comm. ACM, vol. 54, no. 8, 2011, pp. 81–87.
12. S.P. Dow et al., "Shepherding the Crowd Yields Better Work," Proc. 2012 ACM Conf. Computer Supported Cooperative Work (CSCW 12), ACM, 2012, pp. 1013–1022.
13. A. Kittur et al., "CrowdForge: Crowdsourcing Complex Work," Proc. 24th Ann. ACM Symp. User Interface Software and Technology, ACM, 2011, pp. 43–52.
14. A. Kulkarni, M. Can, and B. Hartmann, "Collaboratively Crowdsourcing Workflows with Turkomatic," Proc. 2012 ACM Conf. Computer Supported Cooperative Work (CSCW 12), ACM, 2012, pp. 1003–1012.
15. G. Little et al., "TurKit: Human Computation Algorithms on Mechanical Turk," Proc. 23rd Ann. ACM Symp. User Interface Software and Technology, ACM, 2010, pp. 57–66.
16. W. Mason and D.J. Watts, "Financial Incentives and the 'Performance of Crowds,'" SIGKDD Explorations Newsletter, vol. 11, 2010, pp. 100–108.
17. H.L. Truong, S. Dustdar, and K. Bhattacharya, "Programming Hybrid Services in the Cloud," Service-Oriented Computing, vol. 7636, C. Liu et al., eds., Springer, 2012, pp. 96–110.
Mohammad Allahbakhsh is a PhD candidate in the School of Computer Science and Engineering at the University of New South Wales, Australia. His research focuses on quality control in crowdsourcing systems. Allahbakhsh has an MS in software engineering from Ferdowsi University of Mashhad. He's a student member of IEEE. Contact him at mallahbakhsh@cse.unsw.edu.au.

Boualem Benatallah is a professor of computer science at the University of New South Wales, Australia. His research interests include system and data integration, process modeling, and service-oriented architectures. Benatallah has a PhD in computer science from Grenoble University, France. He's a member of IEEE. Contact him at boualem@cse.unsw.edu.au.

Aleksandar Ignjatovic is a senior lecturer in the School of Computer Science and Engineering at the University of New South Wales, Australia. His current research interests include applications of mathematical logic to computational complexity theory, sampling theory, and online communities. Ignjatovic has a PhD in mathematical logic from the University of California, Berkeley. Contact him at ignjat@cse.unsw.edu.au.

Hamid Reza Motahari-Nezhad is a research scientist at Hewlett-Packard Laboratories in Palo Alto, California. His research interests include business process management, social computing, and service-oriented computing. Motahari-Nezhad has a PhD in computer science and engineering from the University of New South Wales, Australia. He's a member of the IEEE Computer Society. Contact him at hamid.motahari@hp.com.

Elisa Bertino is a professor of computer science at Purdue University and serves as research director for the Center for Education and Research in Information Assurance and Security (CERIAS). Her main research interests include security, privacy, digital identity management systems, database systems, distributed systems, and multimedia systems. She's a fellow of IEEE and ACM and has been named a Golden Core Member for her service to the IEEE Computer Society. Contact her at bertino@cs.purdue.edu.

Schahram Dustdar is a full professor of computer science and head of the Distributed Systems Group, Institute of Information Systems, at the Vienna University of Technology. His research interests include service-oriented architectures and computing, cloud and elastic computing, complex and adaptive systems, and context-aware computing. Dustdar is an ACM Distinguished Scientist (2009) and an IBM Faculty Award recipient (2012). Contact him at dustdar@dsg.tuwien.ac.at.