2016 Collective Intelligence Conference, NYU Stern School of Business, June 1–3, 2016, New York, NY 10012
Is Crowdsourced Attribute Data Useful in Citizen Science? A Study of Experts and Machines
ROMAN LUKYANENKO, Florida International University
YOLANDA F. WIERSMA and JEFFREY PARSONS, Memorial University of Newfoundland
Online citizen science – a major type of crowdsourcing – is on the rise. However, concerns about data
quality remain a major obstacle to wider adoption of citizen science, particularly in online projects in
which participation is open and anonymous. For example, in a natural history context, members of the
general public may lack the species-identification expertise needed for a particular research project
and, when asked to classify observed organisms at the species level, may resort to guessing or avoid
participating out of concern about providing incorrect classifications. Typical solutions to increase
data quality include training volunteers, providing clear data collection instructions, exploiting
redundancy in the crowd when multiple observers report on the same phenomena, and, more recently,
employing artificial intelligence [Lewandowski and Specht 2015; Wiggins et al. 2011; He and Wiggins
2015].
In contrast, some have argued that part of the challenge of data quality may be rooted in approaching
data collection from the point of view of scientists, rather than of citizens; trying to hold citizens to
high scientific standards may dissuade many from participating. That view has led to the suggestion
that data about phenomena of interest should be collected in terms of attributes of instances, rather
than requiring classification that relies on a high level of domain knowledge. Such an approach has
been referred to as an instance-based model (of citizen science and other similar types of
crowdsourcing) [Lukyanenko et al. 2014b].
Under the instance-based model, representation is based on describing individual objects (termed
instances, e.g., organisms), rather than pre-specifying classes of interest into which individuals are
grouped. As each instance can be classified by the observer in different ways (depending on the
context, expected purpose, domain expertise, and familiarity with classification structures), it is
unrealistic to expect complete agreement between citizen scientists and domain experts on ways to
classify individuals. By eliciting free-form attributes and classes of the observed individual organisms,
non-experts are able to provide information at the level at which they are confident. When data
contributors are given this flexibility, the impetus to guess, or to opt out of participating, is also
reduced. However, a natural concern arising from collecting data in terms of free-form attributes and
classes, instead of predetermined classes and attributes, is the extent to which such attributes are
useful to scientists in classifying organisms at the desired level. We present a study with domain
experts in biology and machines to address this question.
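To make the contrast concrete, the following minimal sketch (in Python; the field names and example values are hypothetical and not taken from the cited projects) shows how an instance-based record might be stored: the contributor supplies whatever free-form attributes and generic classes they are comfortable with, and no species-level label is required.

from dataclasses import dataclass, field
from typing import List

# A sketch of an instance-based observation record: store the free-form
# attributes and classes a contributor chooses to provide, rather than
# forcing selection from a predefined species list.
@dataclass
class Observation:
    observation_id: int
    attributes: List[str] = field(default_factory=list)  # e.g., "red chest"
    classes: List[str] = field(default_factory=list)     # e.g., "bird"

# A non-expert can stop at the level they are confident about:
obs = Observation(
    observation_id=1,
    attributes=["red chest", "black head", "seen at feeder"],
    classes=["bird"],  # generic class only; no species-level guess required
)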
To provide evidence of the utility of the instance-based model, we used the dataset generated in the
original paper that proposed the instance-based approach for crowdsourcing [Lukyanenko et al.
2014b]. We conducted an experiment with 390 non-experts in natural history – undergraduate
business and engineering students – and collected free-form attributes and classes consistent with the
instance-based model. We replicated these results in a real context using a live online citizen science
project [Lukyanenko et al. 2014a]. These studies demonstrated the advantages of the instance-based
approach for data quality and contributor participation, but offered no evidence as to whether generic
classes and highly idiosyncratic attributes can be used for inferring scientifically useful species-level
classes.
To answer this second question, we conducted a study with experts in biology (e.g., biology professors,
members of natural history societies) to test whether the classes and attributes provided by non-
experts in our previous study [Lukyanenko et al. 2014b] could be used by experts to reliably identify
the organisms at the species level (the level typically sought in citizen science crowdsourcing). In one-
hour interviews, we asked the experts to provide a “best guess” as to what the organism was after
being given (one at a time) each of the most common attributes provided by the non-experts. We also
asked the experts to think aloud and recorded their utterances after each attribute was provided.
We performed a similar procedure using machine learning. We converted the attribute data set used
with the experts into a binary matrix, in which each attribute provided in [Lukyanenko et al. 2014b]
was coded 1 if a particular participant used it to describe the organism of interest and 0 otherwise.
We then trained classifiers using a variety of common data mining algorithms (e.g., neural networks)
to classify the organisms in the study at the species level.
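The sketch below illustrates this step, assuming scikit-learn and purely illustrative attribute reports and species labels; the actual vocabulary, data set, and algorithms used in the study may differ.

# Build a 0/1 attribute matrix from free-form attribute reports and train a
# simple neural-network classifier (a sketch; data and labels are hypothetical).
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Each row lists the free-form attributes one participant gave for one organism.
reports = [
    ["red chest", "black head", "small"],
    ["red chest", "hops on ground", "small"],
    ["long neck", "white feathers", "large"],
    ["long neck", "swims", "large"],
]
species = ["robin", "robin", "swan", "swan"]  # target species-level labels

# Column j of X is 1 if the participant used attribute j, 0 otherwise.
binarizer = MultiLabelBinarizer()
X = binarizer.fit_transform(reports)

# Train and evaluate a small neural network; the accuracy here is only illustrative.
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
scores = cross_val_score(clf, X, species, cv=2)
print("mean species-level accuracy:", scores.mean())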
The natural history experts were able to identify an average of 40.7% (± 10.3% s.d.) of the organisms
at the species level based on the attributes provided – significantly higher than the percentage of
correct species-level classifications by non-experts in our previous study [Lukyanenko et al. 2014b].
There was also a strong correlation between the confidence the experts reported in their final guess
and the percentage of times the guess was correct (Spearman’s rho = 0.68, p < 0.01).
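As an illustration only (the values below are invented, not the study data), a confidence-accuracy correlation of this kind can be computed with SciPy roughly as follows, assuming per-organism summaries of mean expert confidence and the share of correct guesses.

# Correlate mean expert confidence with the share of correct guesses per organism
# (hypothetical values; the study's actual data and aggregation may differ).
from scipy.stats import spearmanr

mean_confidence = [4.2, 2.1, 3.5, 4.8, 1.9, 3.0]      # mean reported confidence per organism
share_correct = [0.80, 0.20, 0.50, 0.90, 0.10, 0.40]  # fraction of experts who guessed correctly

rho, p_value = spearmanr(mean_confidence, share_correct)
print(f"Spearman's rho = {rho:.2f}, p = {p_value:.3f}")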
The results of the initial machine learning are consistent with those of the human experts. Using
common machine learning methods, we obtained an average species-level classification accuracy of
74.4% based on the attributes provided by non-experts. This is higher than the average accuracy of
40.7% achieved by the human experts (note, however, that the machine learning models had the
additional benefit of being trained on a given, finite set of target classes, whereas the experts were
only told that the organisms were ones that could be observed in the local area).
The results have important implications for data quality, data modeling, task design, and the
application of machine learning in crowdsourcing (especially citizen science). To our knowledge, this
study is the first to offer evidence of the utility of the instance-based model for traditional
decision-making tasks. This result can pave the way to broader application of the model in practice.
Our results also demonstrate the potential of using machine learning to classify organisms based on
the instance-based model of citizen science. This can be an effective practical solution, as human
expertise is often scarce, and it supports the general trend towards hybrid intelligence, which
combines the best abilities of humans and machines into a more powerful integrated solution.
REFERENCES
Yurong He and Andrea Wiggins. 2015. Community-as-a-Service: Data Validation in Citizen Science.
In ISWC 2015.
Eva Lewandowski and Hannah Specht. 2015. Influence of volunteer and project characteristics on
data quality of biological surveys. Conserv. Biol. 29, 3 (June 2015), 713–723.
DOI:http://dx.doi.org/10.1111/cobi.12481
Roman Lukyanenko, Jeffrey Parsons, and Yolanda Wiersma. 2014a. The Impact of Conceptual
Modeling on Dataset Completeness: A Field Experiment. In International Conference on
Information Systems. 1–18.
Roman Lukyanenko, Jeffrey Parsons, and Yolanda Wiersma. 2014b. The IQ of the Crowd:
Understanding and Improving Information Quality in Structured User-generated Content.
Inf. Syst. Res. 25, 4 (2014), 669–689.
Andrea Wiggins, Greg Newman, Robert D. Stevenson, and Kevin Crowston. 2011. Mechanisms for
Data Quality and Validation in Citizen Science. In “Computing for Citizen Science” workshop.
Stockholm, SE, 14–19.