
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines

Marta Sabou,Kalina Bontcheva,Leon Derczynski,Arno Scharl
MODUL University Vienna
Am Kahlenberg 1, Vienna, Austria
University of Sheffield
211 Portobello, Sheffield S1 4DP, UK
Crowdsourcing is an emerging collaborative approach that can be used for the acquisition of annotated corpora and a wide range of
other linguistic resources. Although the use of this approach is intensifying in all its key genres (paid-for crowdsourcing, games with
a purpose, volunteering-based approaches), the community still lacks a set of best-practice guidelines similar to the annotation best
practices for traditional, expert-based corpus acquisition. In this paper we focus on the use of crowdsourcing methods for corpus
acquisition and propose a set of best practice guidelines based on our own experiences in this area and an overview of related literature.
We also introduce GATE Crowd, a plugin of the GATE platform that relies on these guidelines and offers tool support for using
crowdsourcing in a more principled and efficient manner.
Keywords: Crowdsourcing, Human Computation, Corpus Annotation, Guidelines, Survey
1. Introduction
Over the past ten years, Natural Language Processing
(NLP) research has been driven forward by a growing vol-
ume of annotated corpora, produced by evaluation initia-
tives such as ACE (ACE, 2004), TAC, SemEval and
Senseval, and large annotation projects such as OntoNotes
(Hovy et al., 2006). These corpora have been essential
for training and domain adaptation of NLP algorithms and
their quantitative evaluation, as well as for enabling algo-
rithm comparison and repeatable experimentation. Thanks
to these efforts, there are now well-understood best prac-
tices in how to create annotations of consistently high qual-
ity, by employing, training, and managing groups of lin-
guistic and/or domain experts. This process is referred to
as “the science of annotation” (Hovy, 2010).
More recently, the emergence of crowdsourcing platforms
(e.g. paid-for marketplaces such as Amazon Mechanical
Turk (AMT) and CrowdFlower (CF); games with a pur-
pose; and volunteer-based platforms such as crowdcraft-
ing), coupled with growth in internet connectivity, moti-
vated NLP researchers to experiment with crowdsourcing
as a novel, collaborative approach for obtaining linguisti-
cally annotated corpora. The advantages of crowdsourcing
over expert-based annotation have already been discussed
elsewhere (Fort et al., 2011; Wang et al., 2012), but in a
nutshell, crowdsourcing tends to be cheaper and faster.
There is now a large and continuously growing number of
papers that have used crowdsourcing in order to create
annotated data for training and testing a wide range of NLP
algorithms, as detailed in Section 2. and listed in Table 1.
As the practice of using crowdsourcing for corpus annota-
tion has become more widespread, so has the need for a best
practice synthesis, spanning all three crowdsourcing genres
and generalising from the specific NLP annotation task re-
ported in individual papers. The meta-review of Wang et
al. (2012) discusses the trade-offs of the three crowdsourc-
ing genres, alongside dimensions such as contributor mo-
tivation, setup effort, and human participants. While this
review answers some key questions in using crowdsourc-
ing, it does not provide a summary of best practice in how
to setup, execute, and manage a complete crowdsourcing
annotation project. In this paper we aim to address this
gap by putting forward a set of best practice guidelines
for crowdsourced corpus acquisition (Section 3.) and in-
troducing GATE Crowd, an extension of the GATE NLP
platform that facilitates the creation of crowdsourced tasks
based on best practices and their integration into larger NLP
processes (Section 4.).
2. Crowdsourcing Approaches
Crowdsourcing paradigms for corpus creation can be
placed into one of three categories: mechanised labour,
where workers are rewarded financially; games with a pur-
pose, where the task is presented as a game; and altruistic
work, relying on goodwill.
Mechanised labour has been used to create corpora that
support a broad range of NLP problems (Table 1). Highly
popular are NLP problems that are inherently subjective
and cannot yet be reliably solved automatically, such as
sentiment and opinion mining (Mellebeek et al., 2010),
word sense disambiguation (Parent and Eskenazi, 2010),
textual entailment (Negri et al., 2011), question answer-
ing (Heilman and Smith, 2010). Others create corpora of
special resource types such as emails (Lawson et al., 2010),
Twitter feeds (Finin et al., 2010), and augmented and alternative
communication texts (Vertanen and Kristensson, 2011).
One advantage of crowdsourcing is “access to foreign mar-
kets with native speakers of many rare languages” (Zaidan
and Callison-Burch, 2011). This feature is particularly use-
ful for those who work on less-resourced languages such as
Arabic (El-Haj et al., 2010) and Urdu (Zaidan and Callison-
Burch, 2011). Irvine and Klementiev (2010) demonstrated
that it is possible to create lexicons between English and 37
out of the 42 low-resource languages they examined.
Games with a purpose (GWAPs) for annotation include
Phratris (annotating sentences with syntactic dependen-
cies) (Attardi, 2010), PhraseDetectives (Poesio et al., 2012)
(anaphora annotations), and Sentiment Quiz (Scharl et al.,
2012) (sentiment). GWAP-based approaches for collecting
speech data include VoiceRace (McGraw et al., 2009), a
GWAP+MTurk approach, where participants see a defini-
tion on a flashcard and need to guess and speak the corre-
sponding word, which is then transcribed automatically by
a speech recognizer; VoiceScatter (Gruenstein et al., 2009),
where players must connect word sets with their definitions;
Freitas et al.’s GWAP (Freitas et al., 2010), where players
speak answers to graded questions in different knowledge
domains; and MarsEscape (Chernova et al., 2010), a two-
player game for collecting large-scale data for human-robot
interaction.
An early example of leveraging volunteer contributions is
Open Mind Word Expert, a Web interface that allows volun-
teers to tag words with their appropriate sense from Word-
Net in order to collect training data for the Senseval cam-
paigns (Chklovski and Mihalcea, 2002). Also, the MNH
(“Translation for all”) platform tries to foster the forma-
tion of a community through functionalities such as social
networking and group definition support (Abekawa et al.,
2010). Lastly, there are community platforms
where NLP-based applications can be deployed.
Notably, some volunteer projects that were not conceived
with a primary NLP interest have nevertheless delivered
results useful for NLP: (i) Wikipedia; (ii) the Open Mind
Common Sense project, which collects general world
knowledge from volunteers in multiple languages and is a
key source for the ConceptNet semantic network, enabling
various text understanding tasks; and (iii) Freebase, a
structured, graph-based knowledge repository offering
information about almost 22 million entities, constructed
both by automatic means and through contributions from
thousands of volunteers.
3. Best Practice Guidelines
Conceptually, the process of crowdsourcing language re-
sources can be broken down into four main stages, outlined
in Figure 1, and discussed in the following subsections.
These stages have been identified based on generalising our
experience with crowdsourced corpus acquisition (Rafels-
berger and Scharl, 2009; Scharl et al., 2012; Sabou et al.,
2013a; Sabou et al., 2013b) and a meta-analysis of other
crowdsourcing projects summarized in Table 1.
3.1. Project Definition
The first step is to choose the appropriate crowdsourcing
genre, by balancing cost, required completion timescales,
and the required annotator skills (Wang et al., 2012). Ta-
ble 1 lists mostly mechanised labour based works (using
either AMT or CF) and one GWAP. Secondly, the chosen
NLP problem (e.g. named entity annotation, sentiment lexicon
acquisition) needs to be decomposed into a set of simple
crowdsourcing tasks, which can be understood and carried
out by non-experts with minimal training and compact
guidelines. Reward and budget should also be determined
as part of the project definition.

Figure 1: Crowdsourcing Stages. I. Project Definition: (1a)
select NLP problem and crowdsourcing genre; (1b) decompose
NLP problem into tasks; (1c) design crowdsourcing task.
II. Data Preparation: (2a) collect and pre-process corpus;
(2b) build or reuse annotator and management interfaces;
(2c) run pilot studies. III. Project Execution: (3a) recruit
and screen contributors; (3b) train, profile and retain
contributors; (3c) manage and monitor crowdsourcing tasks.
IV. Data Evaluation and Aggregation: (4a) evaluate and
aggregate annotations; (4b) evaluate overall corpus
characteristics.
Many NLP problems can be cast as one of a finite range
of common task types. For example, given the pattern of
selection task – where workers are presented with some in-
formation and required to select one of a list of possible
answers – one can implement word sense disambiguation,
sentiment analysis, entity disambiguation, relation typing,
and so on. Similarly, a sequence marking pattern in which
workers highlight items in a sequence can be applied, inter
alia, to named entity labelling, timex extraction, and actor
identification. Determining the common factors in these
tasks and using them as templates improves efficiency and
makes iterative enhancement of task designs possible. We
investigate two such templates in a generic, reusable NLP
crowdsourcing architecture described in Section 4.
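The two task templates above can be sketched as a minimal data model. The class and field names below are illustrative only, not part of any real crowdsourcing platform's API:

```python
from dataclasses import dataclass

@dataclass
class SelectionTask:
    """Selection pattern: show a snippet, ask for one of N choices."""
    instructions: str
    snippet: str
    choices: list

@dataclass
class SequenceMarkingTask:
    """Sequence-marking pattern: workers highlight tokens in a sequence."""
    instructions: str
    tokens: list

# Word sense disambiguation cast as a selection task
wsd = SelectionTask(
    instructions="Pick the sense of 'bank' used in the sentence.",
    snippet="She sat on the bank of the river.",
    choices=["financial institution", "river edge"],
)

# Named entity labelling cast as a sequence-marking task
ner = SequenceMarkingTask(
    instructions="Click every word that is part of a person name.",
    tokens="Yesterday John Smith visited Sheffield".split(),
)
```

Once an NLP problem is mapped onto one of these two shapes, the same rendering and quality-control machinery can be reused across tasks.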
Keeping tasks simple and intuitive is another important
principle, where a simpler design without too much vari-
ance tends to lead to better results. Indeed, a simple, undis-
tracting, clean interface helps even more than, for example,
switching instructions from L2 to a worker’s native lan-
guage (Khanna et al., 2010). With respect to task scope, an-
notating one named entity type at a time, albeit more expen-
sive, places lower cognitive load on workers and makes it
easier to have brief and yet comprehensive instructions (see
e.g., Bontcheva et al. (2014a)). Experience from expert-
based annotation (Hovy, 2010) has shown that annotators
should not be asked to choose from more than 10, ideally
seven, categories. In comparison, crowdsourcing classifi-
cation tasks often present fewer choices – in most cases be-
tween two (binary choice) and five categories, as suggested
by the 5th column of Table 1.
When longer documents are being annotated, one needs to
decide whether to put the entire document into one task, to
split it up into smaller parts – one per crowdsourcing task
(e.g. paragraphs or sentences) – or to avoid including in
the corpus any documents above a certain size. For many
NLP problems a sentence provides sufficient context; however,
this is not always the case. For example, Poesio et
al. (2012) annotated Wikipedia articles and books from the
Gutenberg project. Their game splits larger texts into paragraphs,
each of which becomes a separate unit. This introduced
a problem in cases of long-distance anaphora, where
the antecedent is not present in the current paragraph and
hence cannot be selected by the game player. In general,
given the limited time which contributors spend on each
| Approach | Annotation of | Recruitment | Judg. | Categ. | Pay/task | Pilot | Training | Screening | Gold in task | Retention | Spam control | Profiling | Aggregation | Resource eval. |
| (Finin et al., 2010) | NEs in Tweets | CF, AMT | 2 | 4 | $0.05 | Y | MLP instr. | GS, - | Y | - | GS | Prof. | MV | - |
| (Voyer et al., 2010) | NEs | CF | 5 | 2 | - | - | MLP instr. | - | - | - | GS | - | CF | IAA, Task |
| (Lawson et al., 2010) | NEs in Emails | AMT* | 4, 6, 7 | 3 | $0.01 + bonus | Y | MLP instr. | - | - | Bonus | - | MV | MV | Task |
| (Yetisgen-Yildiz et al., 2010) | Medical NEs | AMT* | 4 | 3 | $0.01-0.05 + bonus | Y | MLP instr. | - | - | Comm., Bonus | - | - | MV | IAA, F-score |
| (Rosenthal et al., 2010) | PrepPhrase Attachment | AMT | 3 | 3 | $0.04 | Y | MLP | LOC | - | - | - | - | MV | Precision |
| (Jha et al., 2010) | PP Attach. | AMT | 5 | varies | $0.04 | Y | MLP instr. | - | - | - | - | - | MV | IAA, Rec. |
| (Snow et al., 2008) | Affect, Wrd Sim. | AMT | 10 | range | - | - | MLP instr. | - | Y | - | - | - | Avg | IAA, Task |
| (Snow et al., 2008) | Event&TE, WSD | AMT | 10 | 2, 3 | - | - | MLP instr. | - | Y | - | - | - | MV | IAA |
| (Yano et al., 2010) | Bias in polit. blogs | AMT | 5 | 3, 6 | $0.02, $0.04 | - | MLP instr. | LOC, AR90 | - | - | - | - | MV | IAA |
| (Mellebeek et al., 2010) | Polarity | AMT | 3 | 3, 11 range | $0.02 | Y | MLP instr. | COMP | - | - | - | - | MV | IAA, Task |
| (Laws et al., 2011) | NEs&Affect | AMT* | 2+n | -/2 | $0.01 | - | MLP | - | Y | - | AL | MV | MV | Task |
| (Sayeed et al., 2011) | Opinion | CF | 3 | 4 | $0.12 | Y | MLP instr. | LOC, GS | Y | - | GS | Prof. | MV | IAA, F-score |
| (Hong and Baker, 2011) | Word Senses | AMT*, CF* | 10 | 4-5 | $0.15 | Y | MLP instr. | COMP, LOC, AR75 | - | - | GS | - | MV | Acc. |
| (Rafelsberger and Scharl, 2009) | Opinions | GWAP | 7+5 | 5 | - | - | SN | - | - | Levels, Boards | - | MV | Avg. | - |
Table 1: Crowdsourcing corpus collection; columns group into the four stages of Project Definition, Data Preparation, Project Execution (Annotation), and Aggregation & Evaluation. Abbreviations: MLP(*) = mechanised labour platform (own interface); ALTR = altruistic crowdsourcing; SN = social network; GS = gold standard; LOC = geo-location based screening; ARn = average reliability based screening; COMP = competency test based screening; MV = majority vote; AL = active learning.
task, the length of text to be annotated/read needs to be kept
reasonably short, without compromising accuracy.
Determining when to reward contributors and what the
reward is worth (in game points or money) influences the
completion time of the task and the quality
of the gathered data. In terms of what is rewarded, most
straightforwardly, in mechanized labour, workers are paid
for each correctly completed task, and can be refused pay-
ment if they are discovered cheating. If some of the an-
swers are known a-priori, then answers that agree with the
Gold Standard are rewarded through comparative scoring.
The CrowdFlower platform (Biewald, 2012) automatically
mixes gold units in each task and the recommended amount
is 20% gold data per task. Providing the crowdworkers
with the option to comment on gold data annotations is also
a useful motivational mechanism. Otherwise, a common
strategy is to award the answers on which most contribu-
tors agree. In the latter case, however, a higher number of
judgements per task needs to be collected (typically between
seven and ten) in order to minimise the effect of cheating.
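The effect of extra judgements can be made concrete with a simple binomial model: assuming annotators answer independently and each is correct with the same probability p (both simplifying assumptions), the chance that a majority vote yields the correct label grows with the number of judgements n:

```python
from math import comb

def majority_accuracy(n, p):
    """Probability that a majority of n independent annotators,
    each correct with probability p, agrees on the correct answer
    (binary task; use odd n so there are no ties)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range((n // 2) + 1, n + 1))

# With noisy annotators (70% individual accuracy), more judgements
# per task push majority-vote accuracy up:
for n in (3, 7, 9):
    print(n, round(majority_accuracy(n, 0.7), 3))
```

Under this idealised model, three 70%-accurate annotators already reach about 78% majority accuracy, and seven to nine push it past 87%; real cheating is not independent noise, so these figures are an upper bound on what redundancy alone can buy.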
Determining how much to award is another challenging is-
sue as award quantity influences critical parameters such
as the task completion time and the quality of the obtained
data. Launching a pilot job helps determine the average
time per task in the case of paid work; this in turn enables
one to be sure that at least the minimum wage is being paid
to workers, many of whom rely on crowd work as a pri-
mary source of income (Fort and Sagot, 2010). Current
approaches listed in Table 1 mostly offer $0.01-0.05 per
task. Some variance has been seen in the relation between
result quality and reward (Poesio et al., 2014). While some
initial reports found that high rewards attracted noise that
was detrimental to quality (Mason and Watts, 2010), more
recent research was unable to repeat this finding, reporting
that increased reward only affected the number of workers
attracted to a job (and thus sped up its overall completion)
but not the quality (Aker et al., 2012).
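For instance, the per-task reward that guarantees a target hourly wage can be derived directly from the pilot's average task time; the function name and the figures in the example are illustrative only:

```python
def reward_per_task(pilot_seconds, hourly_wage):
    """Minimum payment per task (in the wage's currency) so that a
    worker completing tasks at the pilot's average speed earns at
    least `hourly_wage`. Inputs are illustrative, not platform
    defaults."""
    tasks_per_hour = 3600.0 / pilot_seconds
    return hourly_wage / tasks_per_hour

# e.g. a pilot shows 30s per task on average; to guarantee a
# hypothetical $7.25/hour wage, each task should pay at least:
pay = reward_per_task(30, 7.25)   # ~ $0.06 per task
```

In practice the pilot's task-time distribution is skewed, so budgeting on the median or a high percentile, rather than the mean, is the safer choice.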
3.2. Data Preparation
In this stage, user interfaces need to be designed and the
data collected and prepared. Interface design can be a ma-
jor task, especially in the case of games with a purpose.
Data processing may involve preliminary annotation with
an automated tool or filtering objectionable content.
Automatic pre-processing of source data can speed up cor-
pus creation, although it can also introduce bias and anno-
tation errors (Hovy, 2010). NLP infrastructures are often
used for bootstrapping manual annotation and the iterative
development of NLP applications: a prototype is developed
and used to annotate a set of documents, which human
annotators then correct. The corrected annotations generated by the crowd
can then be used to improve the application, and the pro-
cess is repeated. This technique was originally developed
for low-level NLP tasks such as POS tagging, where it is
known to improve annotation speed considerably (Fort and
Sagot, 2010), and it also works well for higher-level anno-
tation (e.g., patent annotation, bio-informatics ontologies,
named entities, events).
We distinguish two kinds of user interface tool. Acquisi-
tion interfaces are designed for and used by the non-expert
human contributors to carry out crowdsourcing tasks. Man-
agement interfaces are required by the person running the
crowdsourcing project, in order to allow them to monitor
progress, assess quality, and manage contributors.
Acquisition interfaces used for solving, primarily, classifi-
cation and content generation problems have been success-
fully created within Mechanical Turk and CrowdFlower, as
listed in the third column of Table 1 (approaches denoted
with “*” build their own interfaces and make use of the
MLPs only as a means to recruit workers). The interface in
such cases is based on the widgets supported by the cho-
sen platform. However, careful consideration must be paid
to designing the tasks in a defensive manner (i.e., defensive
task design), which reduces cheating. "Enter"-type interfaces,
used for solving content generation problems, can be
cheated by submitting random text as input. To reduce the
risk of text being copied from online sources or from one
task to another, Vertanen and Kristensson (2011) prevent
users from pasting text and allow only typing. A technique
that is more specific to translation is to display the source
sentence as an image rather than text, thus preventing con-
tributors from simply copying the text into an automatic
translator, e.g. (Zaidan and Callison-Burch, 2011). Select
type interfaces are often easy to cheat on, as cheaters can
easily provide random selections. Laws et al. (2011)
found that their interface, based on simple radio-button
selection, attracted a high amount of spam, driving the
overall classification accuracy down to only 55%. Kittur et al.
(2008) designed a task where workers had to rate the quality
of a Wikipedia article against a set of criteria and obtained
48% invalid responses. While many techniques exist to fil-
ter out invalid responses after the completion of the task, it
is preferable to prevent cheating in the first place. Interface
design plays a key role here. Both Laws et al. (2011) and
Kittur et al. (2008) have extended their interfaces with ex-
plicitly verifiable questions which force the users to process
the content and also signal to the workers that their answers
are being scrutinized. This seemingly simple technique has
increased classification accuracy to 75% for Laws et al. (2011)
and reduced the proportion of invalid responses to only
2.5% for Kittur et al. (2008).
Management Interfaces support NLP researchers in mon-
itoring the status of their tasks and in fine-tuning the task
details including the selection and screening of contribu-
tors. Game and volunteer based projects must build these
interfaces from scratch, for example, Poesio et al. (2012)
report on the extensive management interfaces they built to
support PhraseDetectives. CrowdFlower and MTurk offer
some of this functionality already. For example, CrowdFlower
supports requesters throughout the life-cycle of the
crowdsourcing process: acquisition interface design (Edit
page); data and gold standard management (Data and Gold
pages); calibration of key parameters, such as the number of
units per page/HIT, judgements required per unit, and pay
per unit, based on the desired completion time and accuracy
(Calibration page); an overview of the job's progress and
overall status during the process itself (Dashboard); and
detailed analysis of the workers that have contributed to the
job, including their trust level, accuracy (in relation to a
supplied gold standard), and accuracy history.
There are proven benefits to performing a small scale pilot
for testing the task definition, for ensuring that the appropri-
ate task granularity and annotator instructions (e.g., (Verta-
nen and Kristensson, 2011; Feng et al., 2009)) are chosen
and for fine-tuning the key parameters of the crowdsourcing
task (e.g., payment, size). Indeed, several of the approaches
listed in Table 1 make use of an initial prototype system
to fine-tune their crowdsourcing process. Note that piloting
requires the complete application to be in place, and it is
therefore performed in the "Data Preparation" rather than the
"Project Definition" stage. If a pilot is unsuccessful, however,
the project definition step needs to be revisited.
For example, Negri and Mehdad (2010) devoted 10 days and
$100 to experimenting with different methodologies for
collecting translations, including gold units, verification
cycles, and worker filtering mechanisms, in order to achieve
the right balance between cost, time, and quality. More
generally, McCreadie et al. (2012)
advocate an iterative methodology where crowdsourcing
tasks are submitted in multiple batches allowing continu-
ous improvements to the task based on worker feedback and
result quality. In our experience, an iterative methodology
also offers protection from data loss and other problems that
may occur during long-running crowdsourcing projects.
3.3. Project Execution
This is the main phase of each crowdsourcing project. For
smaller paid-for projects it can sometimes be completed in
a matter of minutes (Snow et al., 2008), while it can run for
many months or years when games or volunteers are used
for collecting large datasets.
It consists of three kinds of tasks: task workflow and man-
agement, contributor management (including profiling and
retention), and quality control. Choices that need to be
made include whether the entire corpus is to be annotated
multiple times to allow for a reconciliation and verification
step (higher quality, but higher costs) or whether it is suffi-
cient to have only two or three annotators per document as
long as they agree.
The decentralised nature of crowdsourcing, and the relative
lack of reusable workflow definition, task management,
and quality assurance interfaces, can make this project
stage rather challenging. Additional challenges exist in
handling individual contributors, which involves training,
screening, profiling, retaining and dealing with worker dis-
putes; and in quality control during annotation.
Attracting and retaining a large number of contributors is
key to the success of any crowdsourcing system. There-
fore, a core challenge of all crowdsourcing approaches is
how to motivate contributors to participate. This issue has
been analyzed extensively in recent surveys; e.g., Doan et
al. (2011) consider motivation under the twin challenges of
"How to recruit and retain users?".
Contributor recruitment consists of a set of primarily
advertising activities to attract contributors to the crowdsourcing
project. Note that our view on recruitment differs from that
of Doan et al. as we primarily look at mechanisms to attract
contributors and are agnostic to the motivational aspect ad-
ditionally considered by Doan. While these two issues are
inherently related, we choose to examine them in separation
for more clarity.
Most NLP projects recruit their contributors from market-
places that offer a large and varied worker base who mon-
itor newly posted tasks (see Table 1, column "Recruitment").
The idea of a portal bundling multiple crowdsourcing
projects is also used, to a lesser extent, in the GWAP area:
one portal publishes the collection of games built by
von Ahn and his team, while OntoGame bundles together
games for the semantic web area.
Another strategy is the use of multi-channel advertisements
for attracting users. Chamberlain et al. (2009) advertised
their game through a wealth of channels including local and
national press, science websites, blogs, bookmarking web-
sites, gaming forums, and social networking sites.
Filtering workers prior to the task (based on e.g. prior per-
formance, geographic origin, and initial training) is impor-
tant to improve quality. Extensive screening can however
lead to slower task completion, so filtering through task-
design is preferred to filtering through crowd characteris-
tics. Although a worker’s prior acceptance rate is one of
the key filtering mechanisms offered by Mechanical Turk
and CrowdFlower, sometimes this type of screening cannot
be used on its own reliably and needs to be complemented
with other filters such as geographic location.
Munro et al. (2010) describe a set of methods for assess-
ing linguistic attentiveness prior to the actual task. These
involve showing language constructs that are typically ac-
quired late by first-language learners or stacked pronominal
possessive constructions (e.g., John’s sister’s friend’s car)
and asking workers to select a correct paraphrase thereof.
These techniques not only help identify workers that have a
sufficient command of English, but also prompt for higher
attentiveness during the task.
Screening activities are feasible when using crowdsourcing
marketplaces where (at least some) characteristics of the
workers are known - and indeed, some of the approaches
we surveyed in Table 1 employ location (LOC), prior per-
formance (AR) and competence based (COMP) screening
(see column “Screening”). GWAPs and altruistic crowd-
sourcing projects do not usually have this opportunity, since
most often their user community is not known a priori.
A common approach here is to use training mechanisms
in order to make sure that the contributor qualifies. Many
projects embed positive (and/or negative) gold standard ex-
amples within their tasks to determine the general quality of
data provided by each worker. For example, CrowdFlower
offers immediate feedback to workers when they complete
a “gold” unit, thus effectively training them. In general,
training mechanisms using instructions and gold standard
data are widespread (see the "Training" column in Table 1).
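A minimal sketch of this gold-based worker assessment follows; the function name is illustrative, and the 0.7 trust threshold mentioned in the comment is an arbitrary example, not a platform default:

```python
def worker_gold_accuracy(answers, gold):
    """Fraction of a worker's answers that match the embedded gold
    units. `answers` and `gold` map unit id -> label; only units
    that have a gold label count towards the score."""
    scored = [uid for uid in gold if uid in answers]
    if not scored:
        return None  # worker has not yet seen any gold unit
    correct = sum(answers[uid] == gold[uid] for uid in scored)
    return correct / len(scored)

gold = {"u1": "positive", "u7": "negative"}
answers = {"u1": "positive", "u2": "neutral", "u7": "positive"}

acc = worker_gold_accuracy(answers, gold)  # 1 of 2 gold units correct
# Below a chosen trust threshold (say 0.7), this worker's judgements
# could be excluded, down-weighted, or trigger additional training.
```

Combined with immediate feedback on each missed gold unit, the same score doubles as a training signal for the worker.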
3.4. Data Evaluation and Aggregation
In this phase, the challenge lies in evaluating and aggregat-
ing the multiple contributor inputs into a complete linguis-
tic resource, and in assessing the resulting overall quality.
This stage is required in order to make acquisition tasks
reproducible and therefore scalable, and to ensure good
corpus quality. Sub-tasks include tracking worker perfor-
mance over time, worker agreement, and converting con-
tributed judgments into a consistent set of annotations (see
Section 4. on the latter). Some tools are available for judg-
ing worker accuracy to help smooth this process (Hovy et
al., 2013). As shown in Table 1, contributor aggregation
primarily relies on majority voting or average computation
based algorithms, while the evaluation of the resulting
corpus is usually performed by computing inter-annotator
agreement (IAA) among crowd workers and/or against
a baseline resource provided by an expert, by task-centric
evaluation, or by Precision, Recall, and F-measure.
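As a minimal sketch (not the output of any specific tool), majority-vote aggregation and raw observed agreement for a single unit could look like this:

```python
from collections import Counter
from itertools import combinations

def majority_vote(judgements):
    """Aggregate one unit's judgements (a list of labels) by majority
    vote; equal counts fall back to first-seen order."""
    return Counter(judgements).most_common(1)[0][0]

def observed_agreement(judgements):
    """Raw pairwise inter-annotator agreement for one unit: the
    fraction of annotator pairs that gave the same label.
    (Chance-corrected measures such as kappa would additionally
    subtract the agreement expected by chance.)"""
    pairs = list(combinations(judgements, 2))
    return sum(a == b for a, b in pairs) / len(pairs)

unit = ["PER", "PER", "ORG", "PER", "PER"]
label = majority_vote(unit)       # "PER"
agree = observed_agreement(unit)  # 6 of 10 annotator pairs agree
```

Averaging the per-unit agreement over the corpus gives a quick quality indicator, though a chance-corrected coefficient is preferable for reporting.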
3.5. Legal and Ethical Issues
The use of crowdsourcing raises at least three issues of a
legal and ethical nature which have so far not received
sufficient attention: how to properly acknowledge
contributions; how to ensure contributor privacy and
wellbeing; and how to deal with consent and licensing
issues.
No clear guidelines exist for the first issue, how to properly
acknowledge crowd contributions. Some volunteer projects
(e.g., FoldIt, Phylo) already include contributors in the au-
thors’ list (Cooper et al., 2010; Kawrykow et al., 2012).
The second issue is contributor privacy and well-being.
Paid-for marketplaces (e.g. MTurk) go some way towards
addressing worker privacy, although their measures are far
from sufficient and certainly fall short of protecting
workers from exploitation, e.g. through basic payment
protection (Fort et al., 2011). The use of mechanised labour
(MTurk in particular) raises a number of workers’ rights
issues: low wages (below $2 per hour), lack of protec-
tion, and legal implications of using MTurk for longer term
projects. We recommend at the least conducting a pilot task
to see how long jobs take to complete, and ensuring that av-
erage pay exceeds the local minimum wage.
The third issue is licensing and consent, i.e. making it clear
to the human contributors that by carrying out these tasks
they are contributing knowledge for scientific purposes and
that they agree to a well-defined license for sharing and us-
ing their work. Typically, open licenses such as Creative
Commons are used and tend to be prominently stated in
volunteer-based projects/platforms (Abekawa et al., 2010).
In contrast, GWAPs tend to mostly emphasize the scien-
tific purpose of the game, while many fail to state explicitly
the distribution license for the crowdsourced data. In our
view, this lack of explicit consent to licensing could poten-
tially allow the exploitation of crowdsourced resources in a
way which their contributors could find objectionable (e.g.
not share a new, GWAP-annotated corpus freely with the
community). Similarly, almost one third of psychology
studies on MTurk post no informed consent information at
all (Behrend et al., 2011).
4. The GATE Crowdsourcing plugin
We relied on these best practice guidelines during the
development of GATE Crowd, an open-source plugin
for the GATE NLP platform (Cunningham et al., 2013)
which offers crowdsourcing support to the platform’s
users (Bontcheva et al., 2014b). The plugin contains
reusable task definitions and crowdsourcing workflow tem-
plates which can be used by researchers to commission the
crowdsourcing of annotated corpora directly from within
GATE’s graphical user interface, as well as pre-process the
data automatically with relevant GATE linguistic analysers,
prior to crowdsourcing. Once all parameters are configured,
the new GATE crowdsourcing plugin generates the respec-
tive crowdsourcing tasks automatically, which are then de-
ployed on the chosen platform (e.g. CrowdFlower). On
completion, the collected multiple judgements are imported
back into GATE and the original documents are enriched
with the crowdsourced information, modelled as multiple
annotation layers (one per contributor). GATE’s existing
tools for calculating inter-annotator agreement and corpus
analysis can then be used to gain further insights into the
quality of the collected information.
In the first step, task name, instructions, and classification
choices are provided, in a UI configuration dialog. For
some categorisation NLP annotation tasks (e.g. classify-
ing sentiment in tweets into positive, negative, and neu-
tral), fixed categories are sufficient. In others, where the
available category choices depend on the text that is be-
ing classified (e.g. the possible disambiguations of Paris
are different from those of London), choices are defined
through annotations on each of the classification targets. In
this case, the UI generator then takes these annotations as a
parameter and automatically creates the different category
choices, specific to each crowdsourcing unit.
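As an illustration only (this is not the actual GATE Crowd API), the choice-generation logic can be sketched as follows: units that carry candidate annotations get unit-specific choices, while all others fall back to a fixed label set:

```python
# Hypothetical sketch of per-unit category choice generation; the
# dictionary layout and FIXED_CHOICES label set are illustrative.
FIXED_CHOICES = ["positive", "negative", "neutral"]

def choices_for_unit(unit):
    """Return the category choices for one crowdsourcing unit: either
    the unit's own candidate annotations (e.g. possible entity
    disambiguations) or the fixed label set."""
    return unit.get("candidates") or FIXED_CHOICES

tweet = {"text": "Loving this!"}  # fixed categories suffice
mention = {"text": "Paris",
           "candidates": ["Paris, France", "Paris, Texas",
                          "Paris (mythology)"]}

assert choices_for_unit(tweet) == FIXED_CHOICES
assert len(choices_for_unit(mention)) == 3
```

This mirrors the distinction drawn above: fixed categories for tasks like tweet sentiment, and annotation-driven categories for disambiguation, where Paris and London need different option lists.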
In sequential selection, sub-units are defined in the UI con-
figuration – tokens, for this example. The annotators are
instructed to click on all words that constitute the desired
sequence (the annotation guidelines are given as a parame-
ter during the automatic user interface generation).
Since the text may not contain a sequence to be annotated,
we also generate an explicit confirmation checkbox. This
requires annotators to declare either that they have made
their selection or that there is nothing to select in the text.
The GATE Crowdsourcing plugin is available for download
now via the GATE developer versions, and is bundled with
GATE v8, due in 2014.
5. Conclusions
Annotation science and reusable best practice guidelines
have evolved in response to the need for harnessing collec-
tive intelligence for the creation of large, high-quality lan-
guage resources. While crowdsourcing is increasingly re-
garded as a novel collaborative approach to scale up LR ac-
quisition in an affordable manner, researchers have mostly
used this paradigm to acquire small- to medium-sized cor-
pora. The novel contribution of this paper lies in defining a
set of best practice guidelines for crowdsourcing, as the first
step towards enabling repeatable acquisition of large-scale,
high quality LRs, through the implementation of the neces-
sary infrastructural support within the GATE open source
language engineering platform.
A remaining challenge for crowdsourcing projects is that
the cost to define a single annotation project can outweigh
the benefits. Future work should address this by provid-
ing a generic crowdsourcing infrastructure which transpar-
ently combines different crowdsourcing genres (i.e. mar-
ketplaces, GWAPs, and volunteers). Such an infrastruc-
ture should help with sharing meta-information, including
contributor profiles, annotator capabilities, past work, and
history from previously completed projects. Solving this
challenge could help prevent annotator bias and minimise
human oversight required, by implementing more sophis-
ticated crowd-based annotation workflows, coupled with
in-built control mechanisms. Such infrastructure will also
need to implement reusable, automated methods for qual-
ity control and aggregation and make use of the emerging
reusable task definitions and workflow patterns. The reward
is increased scalability and quality from the almost limitless
power of the crowd.
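As a concrete example of the kind of reusable aggregation method such an infrastructure could offer, majority voting with an adjudication fallback can be sketched as follows (hypothetical and illustrative only, not an existing component):

```python
from collections import Counter

def aggregate(judgements, min_margin=2):
    """Majority-vote aggregation: return the winning label, or None to
    flag the unit for expert adjudication when the vote is too close."""
    ranked = Counter(judgements).most_common()
    if len(ranked) == 1:
        return ranked[0][0]
    (top, n1), (_, n2) = ranked[0], ranked[1]
    return top if n1 - n2 >= min_margin else None

print(aggregate(["PER", "PER", "PER", "ORG"]))   # clear majority: 'PER'
print(aggregate(["PER", "ORG", "PER", "ORG"]))   # tie: None -> adjudicate
```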
6. Acknowledgements
This work is part of the uComp project, which receives the funding support of EPSRC EP/K017896/1, FWF 1097-N23, and ANR-12-CHRI-0003-03, in the framework of the CHIST-ERA ERA-NET.
7. References
Abekawa, T., Utiyama, M., Sumita, E., and Kageura, K.
(2010). Community-based Construction of Draft and Fi-
nal Translation Corpus through a Translation Hosting
Site Minna no Hon’yaku (MNH). In Proc. LREC.
ACE. (2004). Annotation Guidelines for Event Detection and Characterization (EDC), February.
Aker, A., El-Haj, M., Albakour, M.-D., and Kruschwitz, U.
(2012). Assessing crowdsourcing quality through objec-
tive tasks. In Proc. LREC, pages 1456–1461.
Attardi, G. (2010). Phratris – A Phrase Annotation Game.
In INSEMTIVES Game Idea Challenge.
Behrend, T., Sharek, D., Meade, A., and Wiebe, E. (2011).
The viability of crowdsourcing for survey research. Be-
hav. Res., 43(3).
Biewald, L. (2012). Massive multiplayer human computa-
tion for fun, money, and survival. In Current Trends in
Web Engineering, pages 171–176. Springer.
Bontcheva, K., Derczynski, L., and Roberts, I. (2014a).
Crowdsourcing named entity recognition and entity link-
ing corpora. In Handbook of Linguistic Annotation.
Bontcheva, K., Roberts, I., and Derczynski, L. (2014b).
The GATE Crowdsourcing Plugin: Crowdsourcing An-
notated Corpora Made Easy. In Proc. EACL.
Callison-Burch, C. and Dredze, M., editors. (2010). Proc.
of the NAACL HLT 2010 Workshop on Creating Speech
and Language Data with Amazon’s Mechanical Turk.
Chamberlain, J., Poesio, M., and Kruschwitz, U. (2009). A new life for a dead parrot: Incentive structures in the Phrase Detectives game. In Proc. of the Webcentives Workshop.
Chernova, S., Orkin, J., and Breazeal, C. (2010). Crowdsourcing HRI through Online Multiplayer Games. In Dialog with Robots: Papers from the AAAI Fall Symposium.
Chklovski, T. and Mihalcea, R. (2002). Building a Sense
Tagged Corpus with Open Mind Word Expert. In Proc.
of the ACL Workshop on Word Sense Disambiguation:
Recent Successes and Future Directions.
Cooper, S., Khatib, F., Treuille, A., Barbero, J., Lee, J., Beenen, M., Leaver-Fay, A., Baker, D., Popovic, Z., and Foldit Players. (2010). Predicting protein structures with a multiplayer online game. Nature, 466(7307).
Cunningham, H., Tablan, V., Roberts, A., and Bontcheva,
K. (2013). Getting more out of biomedical documents
with GATE’s full lifecycle open source text analytics.
PLoS computational biology, 9(2):e1002854.
Doan, A., Ramakrishnan, R., and Halevy, A. Y. (2011).
Crowdsourcing Systems on the World-Wide Web. Com-
mun. ACM, 54(4), April.
El-Haj, M., Kruschwitz, U., and Fox, C. (2010). Using
Mechanical Turk to Create a Corpus of Arabic Sum-
maries. In Proc. LREC.
Feng, D., Besana, S., and Zajac, R. (2009). Acquiring
High Quality Non-Expert Knowledge from On-Demand
Workforce. In Proc. of The People’s Web Meets NLP:
Collaboratively Constructed Semantic Resources.
Finin, T., Murnane, W., Karandikar, A., Keller, N., Martineau, J., and Dredze, M. (2010). Annotating Named Entities in Twitter Data with Crowdsourcing. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
Fort, K. and Sagot, B. (2010). Influence of Pre-annotation
on POS-tagged Corpus Development. In Proc. of the
Fourth Linguistic Annotation Workshop.
Fort, K., Adda, G., and Cohen, K. (2011). Amazon Me-
chanical Turk: Gold Mine or Coal Mine? Computa-
tional Linguistics, 37(2):413 –420.
Freitas, J., Calado, A., Braga, D., Silva, P., and Dias, M.
(2010). Crowdsourcing platform for large-scale speech
data collection. Proceedings of FALA, Vigo.
Gruenstein, E., Mcgraw, I., and Sutherl, A. (2009). A
Self-Transcribing Speech Corpus: Collecting Continu-
ous Speech with an Online Educational Game. In Proc.
of The Speech and Language Technology in Education
(SLaTE) Workshop.
Heilman, M. and Smith, N. A. (2010). Rating Computer-
Generated Questions with Mechanical Turk. In Callison-
Burch and Dredze (Callison-Burch and Dredze, 2010).
Hong, J. and Baker, C. F. (2011). How Good is the Crowd at "real" WSD? In Proc. of the 5th Linguistic Annotation Workshop.
Hovy, E., Marcus, M. P., Palmer, M., Ramshaw, L. A., and
Weischedel, R. M. (2006). OntoNotes: The 90% Solu-
tion. In Proc. NAACL.
Hovy, D., Berg-Kirkpatrick, T., Vaswani, A., and Hovy, E. (2013). Learning Whom to Trust with MACE. In Proc. of NAACL-HLT, pages 1120–1130.
Hovy, E. (2010). Annotation. In Tutorial Abstracts of ACL.
Irvine, A. and Klementiev, A. (2010). Using Mechanical
Turk to Annotate Lexicons for Less Commonly Used
Languages. In Callison-Burch and Dredze (Callison-
Burch and Dredze, 2010).
Jha, M., Andreas, J., Thadani, K., Rosenthal, S., and McK-
eown, K. (2010). Corpus Creation for New Genres: A
Crowdsourced Approach to PP Attachment. In Callison-
Burch and Dredze (Callison-Burch and Dredze, 2010).
Kawrykow, A., Roumanis, G., Kam, A., Kwak, D., Leung, C., Wu, C., Zarour, E., and Phylo Players. (2012). Phylo: A Citizen Science Approach for Improving Multiple Sequence Alignment. PLoS ONE, 7(3):e31362.
Khanna, S., Ratan, A., Davis, J., and Thies, W. (2010).
Evaluating and improving the usability of Mechanical
Turk for low-income workers in India. In Proceedings
of the first ACM symposium on Computing for Develop-
ment. ACM.
Kittur, A., Chi, E. H., and Suh, B. (2008). Crowdsourcing
User Studies with Mechanical Turk. In Proc. of the 26th
Conference on Human Factors in Computing Systems.
Laws, F., Scheible, C., and Schütze, H. (2011). Active Learning with Amazon Mechanical Turk. In Proc. EMNLP.
Lawson, N., Eustice, K., Perkowitz, M., and Yetisgen-Yildiz, M. (2010). Annotating Large Email Datasets for Named Entity Recognition with Mechanical Turk. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
Mason, W. and Watts, D. J. (2010). Financial incentives
and the performance of crowds. ACM SigKDD Explo-
rations Newsletter, 11(2):100–108.
McCreadie, R., Macdonald, C., and Ounis, I. (2012). Identifying Top News Using Crowdsourcing. Information Retrieval. doi:10.1007/s10791-012-9186-z.
McGraw, I., Gruenstein, A., and Sutherland, A. (2009). A Self-Labeling Speech Corpus: Collecting Spoken Words with an Online Educational Game. In Proc. of INTERSPEECH.
Mellebeek, B., Benavent, F., Grivolla, J., Codina, J., Costa-jussà, M. R., and Banchs, R. (2010). Opinion Mining of Spanish Customer Comments with Non-Expert Annotations on Mechanical Turk. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
Munro, R., Bethard, S., Kuperman, V., Lai, V. T., Mel-
nick, R., Potts, C., Schnoebelen, T., and Tily, H. (2010).
Crowdsourcing and Language Studies: The New Gener-
ation of Linguistic Data. In Callison-Burch and Dredze
(Callison-Burch and Dredze, 2010).
Negri, M. and Mehdad, Y. (2010). Creating a Bi-lingual Entailment Corpus through Translations with Mechanical Turk: $100 for a 10-day Rush. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
Negri, M., Bentivogli, L., Mehdad, Y., Giampiccolo, D.,
and Marchetti, A. (2011). Divide and Conquer: Crowd-
sourcing the Creation of Cross-Lingual Textual Entail-
ment Corpora. In Proc. EMNLP.
Parent, G. and Eskenazi, M. (2010). Clustering Dictionary Definitions Using Amazon Mechanical Turk. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
Poesio, M., Kruschwitz, U., Chamberlain, J., Robaldo, L., and Ducceschi, L. (2012). Phrase Detectives: Utilizing Collective Intelligence for Internet-Scale Language Resource Creation. Transactions on Interactive Intelligent Systems.
Poesio, M., Chamberlain, J., and Kruschwitz, U. (2014).
Crowdsourcing. In Handbook of Linguistic Annotation.
Rafelsberger, W. and Scharl, A. (2009). Games with a Pur-
pose for Social Networking Platforms. In Proc. ACM
conference on Hypertext and Hypermedia.
Rosenthal, S., Lipovsky, W., McKeown, K., Thadani, K., and Andreas, J. (2010). Towards Semi-Automated Annotation for Prepositional Phrase Attachment. In Proc. LREC.
Sabou, M., Bontcheva, K., Scharl, A., and Föls, M. (2013a). Games with a Purpose or Mechanised Labour? A Comparative Study. In Proc. International Conference on Knowledge Management and Knowledge Technologies (i-KNOW).
Sabou, M., Scharl, A., and Föls, M. (2013b). Crowdsourced Knowledge Acquisition: Towards Hybrid-Genre Workflows. International Journal on Semantic Web and Information Systems, 9(3).
Sayeed, A. B., Rusk, B., Petrov, M., Nguyen, H. C., Meyer,
T. J., and Weinberg, A. (2011). Crowdsourcing syntactic
relatedness judgements for opinion mining in the study
of information technology adoption. In Proc. of the 5th
ACL-HLT Workshop on Language Technology for Cul-
tural Heritage, Social Sciences, and Humanities (LaT-
eCH ’11).
Scharl, A., Sabou, M., Gindl, S., Rafelsberger, W., and We-
ichselbraun, A. (2012). Leveraging the wisdom of the
crowds for the acquisition of multilingual language re-
sources. In Proc. LREC.
Snow, R., O’Connor, B., Jurafsky, D., and Ng, A. Y.
(2008). Cheap and Fast—but is it Good?: Evaluating
Non-Expert Annotations for Natural Language Tasks. In
Proc. EMNLP.
Vertanen, K. and Kristensson, P. O. (2011). The Imagi-
nation of Crowds: Conversational AAC Language Mod-
eling using Crowdsourcing and Large Data Sources. In
Proc. EMNLP.
Voyer, R., Nygaard, V., Fitzgerald, W., and Copperman, H.
(2010). A Hybrid Model for Annotating Named Entity
Training Corpora. In Proc. of the Fourth Linguistic An-
notation Workshop (LAW IV ’10).
Wang, A., Hoang, C., and Kan, M. Y. (2012). Perspec-
tives on Crowdsourcing Annotations for Natural Lan-
guage Processing. Language Resources and Evaluation.
Yano, T., Resnik, P., and Smith, N. A. (2010). Shedding (a Thousand Points of) Light on Biased Language. In Callison-Burch and Dredze (Callison-Burch and Dredze, 2010).
Yetisgen-Yildiz, M., Solti, I., Xia, F., and Halgrim, S. R.
(2010). Preliminary Experience with Amazon’s Me-
chanical Turk for Annotating Medical Named Enti-
ties. In Callison-Burch and Dredze (Callison-Burch and
Dredze, 2010).
Zaidan, O. F. and Callison-Burch, C. (2011). Crowd-
sourcing Translation: Professional Quality from Non-
Professionals. In Proc. ACL.