WCBA - 2017 Winter Conference on Business Analytics 1
Finding value through instance-based data collection in citizen science
ROMAN LUKYANENKO, University of Saskatchewan
JEFFREY PARSONS and YOLANDA F. WIERSMA, Memorial University of Newfoundland
Online citizen science is a form of crowdsourcing that has received increased attention from
researchers. Despite significant potential, a key challenge in leveraging citizens to provide scientific
information is the quality of citizen-generated data, a form of user-generated content (UGC). In this
work, we present a study in which domain experts in biology were asked to infer classes based on
attributes of observed organisms generated by citizen scientists. In addition, because domain expertise
is a scarce resource and does not scale in large datasets, we also investigated the potential for
classification using machine learning. The results demonstrate that experts generally are able to
leverage the non-expert attributes to infer classes that are more specific than those familiar to the non-
expert participants. Our work provides evidence of the potential usefulness of the novel instance-based
approach to citizen science and suggests several strategies for refining this citizen science model.
KEYWORDS: user-generated content, citizen science, data quality, instance-based data, machine
learning, expert knowledge
Introduction: Research Question
There has been an explosive growth in online citizen science fuelled by rapid proliferation of content
and sharing technologies, including mobile devices. Citizen science is a type of crowdsourcing whereby
organizations (typically academic institutions, but also public agencies, environmental non-
governmental organizations and governments) seek to engage members of the general public in
research [1]–[3]. These initiatives can involve large numbers of participants (which we term “data
contributors”) – some now in the hundreds of thousands (e.g., Zooniverse.org, eBird.org) – and generate
massive data sets.
However, with this growth in projects (particularly online ones in which participation may be
anonymous), there are concerns about information quality (IQ) [4], [5]. As one researcher put it: “‘You
don't necessarily know who is on the other end of a data point,’ [it] could be a retired botany professor
reporting on wildflowers or a pure amateur with an untrained eye… As a result, it is difficult to
guarantee the quality of the data.” [6, p. 260].
The prevailing approach to data quality in such projects is consumer-centric, and posits that high quality
can be achieved by working with information contributors to address the needs of data consumers.
For example, in biology-related projects this consumer-centric approach is manifested through data
collection that is organized around classes that are useful to scientists – biological species. To ensure
that citizens provide accurate classifications, a variety of techniques have been explored, including
training, improving clarity of data collection protocols, identifying and incentivizing contributions from
experts within crowds, and employing advanced statistical methods to detect potential errors.
The consumer-centric approach, however, has limitations. If members of the general public do not have
the required species-identification expertise, they may resort to guessing or abandon a project out of
concern about providing incorrect classifications. We advocate an alternative, instance-based
model of citizen science data collection, in which projects do not constrain citizens to the classes of
interest to scientists and, instead, encourage them to provide any classes and attributes of observed
organisms (instances) irrespective of the classification structures needed by scientists. This removes the
requirement of traditional modeling that contributors understand and comply with an a priori defined
classification schema. Consequently, we argue that, compared with the traditional approach, the instance-based
alternative should be much easier for non-experts to use, resulting in a greater quantity of data that
better represent the range of phenomena citizens encounter and wish to contribute to a project.
Notwithstanding this potential, an important concern arising from collecting data in terms of free-form
attributes and classes is the extent to which the resulting sparse and heterogeneous data are useful to
scientists in classifying organisms at the desired classification levels (e.g., species). In this work, we
present a study in which domain experts in biology were asked to infer classes based on attributes of
observed organisms generated by citizen scientists. In addition, because domain expertise is a scarce
resource and does not scale easily in large datasets, we also investigated the potential for classification
using machine learning.
Approach and Findings
To obtain evidence of the potential of the instance-based model to generate data useful to scientists,
we used free-form attributes and the most frequent classes (always generic ones, such as bird, tree, fish)
provided by 390 non-experts in a laboratory setting with respect to 16 biological organisms, some
common and others uncommon in the region where the data were collected (the dataset is a subset of
the data reported in [11]).
Before employing machine learning, we wanted to determine whether human experts could use the
instance-based data to identify species. In one-on-one interviews, we asked 16 local biology experts to
provide a best guess as to what the organism was after we sequentially revealed the attributes provided by
the non-experts, in descending order of how frequently each attribute was reported.¹ The natural history
experts identified, on average, 59.4% (± 14.7 s.d.) of the organisms at the species level using only the
attributes provided – significantly higher than the percentage of correct species level classifications of
non-experts in the original study [11].
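The reveal procedure used in the expert interviews can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the function names and the example attribute strings are
hypothetical, each attribute is counted once per contributor, and ties between equally frequent attributes
are broken alphabetically, a choice the study does not specify.

```python
from collections import Counter


def reveal_order(reports):
    """Return attributes sorted from most to least frequently reported.

    `reports` holds one list of free-form attribute strings per contributor.
    """
    counts = Counter(attr for report in reports for attr in set(report))
    # Descending frequency; alphabetical tie-break is an assumption.
    return sorted(counts, key=lambda attr: (-counts[attr], attr))


def cumulative_reveals(reports):
    """Attribute sets an expert would see after 1, 2, ... reveals."""
    order = reveal_order(reports)
    return [order[:k] for k in range(1, len(order) + 1)]


# Hypothetical reports for a single organism (not data from the study).
reports = [
    ["red eyes", "black feathers"],
    ["red eyes", "long beak"],
    ["red eyes", "black feathers", "swims"],
]

for step, shown in enumerate(cumulative_reveals(reports), start=1):
    print(step, shown)
```

An expert would give a best guess once each successive set of attributes is on the table, so the first guess
here would rest on “red eyes” alone.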
We then used machine learning (ML) to simulate the process undertaken by the human experts. Even
after focusing on the top 10 attributes for each of the 16 organisms, the matrix of attributes remained
extremely sparse, with only 2.34% of its entries coded as “1”, which illustrates how difficult it is for
scientists to use such data directly. We tried a variety of common ML approaches, including neural networks,
support vector machines, random forests and boosting. The top-performing algorithm was a boosted Naïve
Bayes classifier based on AdaBoost [14], which achieved an average classification accuracy of 70.83% (±
4.04 c.e.) across 16 species (based on 10-fold cross-validation and 50 boosting iterations). This is higher
than the accuracy achieved by the human experts. However, a direct comparison between human and machine
performance is not meaningful, since ML worked with 16 finite targets, whereas experts had to draw
from all possible organisms in a local area. Nonetheless, the results point to the practical possibility of
asking non-experts to provide data at the level they are comfortable with, while automatically inferring
classes of interest to scientists. Further, the potential to automatically infer classes can be used in a
future machine-to-citizen dialog whereby an artificial agent asks additional confirmation or
verification questions to further increase confidence in the ML-generated classification judgements.
¹ For example, if the attribute “red eyes” was the most reported attribute for that organism, it was revealed first.
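To make the machine learning step more concrete, the following is a minimal, standard-library Python sketch
of a boosted Naïve Bayes classifier evaluated with 10-fold cross-validation on a sparse binary attribute
matrix. It is not the pipeline used in the study. The data are synthetic, all function names are
hypothetical, and the multi-class AdaBoost variant (SAMME), the Bernoulli event model and the Laplace
smoothing are assumptions, since the text specifies only AdaBoost, Naïve Bayes, 10 folds and 50 iterations.

```python
import math
import random


def fit_nb(X, y, w, classes, smooth=1.0):
    """Weighted Bernoulli Naive Bayes with Laplace smoothing."""
    n_feat = len(X[0])
    scale = len(X) / sum(w)  # rescale weights so they sum to the sample count
    mass = {c: 0.0 for c in classes}
    on = {c: [0.0] * n_feat for c in classes}
    for xi, yi, wi in zip(X, y, w):
        mass[yi] += wi * scale
        for j, v in enumerate(xi):
            if v:
                on[yi][j] += wi * scale
    total = sum(mass.values())
    model = {}
    for c in classes:
        log_prior = math.log((mass[c] + smooth) / (total + smooth * len(classes)))
        p = [(on[c][j] + smooth) / (mass[c] + 2 * smooth) for j in range(n_feat)]
        model[c] = (log_prior, [math.log(q) for q in p], [math.log(1 - q) for q in p])
    return model


def predict_nb(model, x):
    def score(c):
        log_prior, log_on, log_off = model[c]
        return log_prior + sum(a if v else b for a, b, v in zip(log_on, log_off, x))
    return max(model, key=score)


def fit_adaboost(X, y, rounds=50):
    """Multi-class AdaBoost (SAMME variant) over Naive Bayes base learners."""
    classes = sorted(set(y))
    k, n = len(classes), len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        model = fit_nb(X, y, w, classes)
        miss = [predict_nb(model, xi) != yi for xi, yi in zip(X, y)]
        err = sum(wi for wi, m in zip(w, miss) if m)
        if err >= 1 - 1 / k:  # no better than chance: stop boosting
            if not ensemble:
                ensemble.append((1.0, model))
            break
        alpha = math.log((1 - err) / max(err, 1e-10)) + math.log(k - 1)
        ensemble.append((alpha, model))
        if err < 1e-10:  # training data fit perfectly: nothing left to reweight
            break
        w = [wi * math.exp(alpha) if m else wi for wi, m in zip(w, miss)]
        total = sum(w)
        w = [wi / total for wi in w]
    return classes, ensemble


def predict_boosted(classes, ensemble, x):
    votes = {c: 0.0 for c in classes}
    for alpha, model in ensemble:
        votes[predict_nb(model, x)] += alpha
    return max(votes, key=votes.get)


def cross_val_accuracy(X, y, folds=10, seed=0):
    """Pooled accuracy over k folds (share of held-out rows classified correctly)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        held_out = set(idx[f::folds])
        train = [i for i in idx if i not in held_out]
        classes, ens = fit_adaboost([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict_boosted(classes, ens, X[i]) == y[i] for i in held_out)
    return correct / len(X)


def make_reports(n_species=4, attrs_per_species=5, per_species=30, seed=0):
    """Synthetic stand-in data: each report flags 1-2 attributes of its species."""
    rng = random.Random(seed)
    X, y = [], []
    for s in range(n_species):
        pool = range(s * attrs_per_species, (s + 1) * attrs_per_species)
        for _ in range(per_species):
            row = [0] * (n_species * attrs_per_species)
            for j in rng.sample(pool, rng.randint(1, 2)):
                row[j] = 1
            X.append(row)
            y.append(s)
    return X, y


X, y = make_reports()
density = sum(map(sum, X)) / (len(X) * len(X[0]))
accuracy = cross_val_accuracy(X, y)
print(f"share of entries coded 1: {density:.2%}; 10-fold accuracy: {accuracy:.2%}")
```

Because the synthetic species share no attributes, this toy problem is far easier than the real data, and
boosting will usually stop after the first round. With noisy, overlapping attributes the reweighting loop
would run for more of its 50 iterations.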
The results carry implications for data quality, data modeling, task design and application of machine
learning in crowdsourcing (especially citizen science). They show that data quality and user participation
can be improved by relaxing constraints on what data can be provided without necessarily sacrificing
data utility – thereby paving the way to data collection processes that are easier for ordinary people to
use. Our results also demonstrate the potential for using machine learning to address data quality issues
in citizen science.
References
[1] R. Bonney et al., “Next steps for citizen science,” Science, vol. 343, no. 6178, pp. 1436–1437, 2014.
[2] D. C. McKinley et al., “Citizen science can improve conservation science, natural resource
management, and environmental protection,” Biol. Conserv., In press 2016.
[3] A. Wiggins and K. Crowston, “From Conservation to Crowdsourcing: A Typology of Citizen
Science,” in 44th Hawaii International Conference on System Sciences, 2011, pp. 1–10.
[4] A. Alabri and J. Hunter, “Enhancing the Quality and Trust of Citizen Science Data,” in IEEE
eScience 2010, Brisbane, Australia, 2010, pp. 81–88.
[5] R. Lukyanenko, “Information Quality Research Challenge: Information Quality in the Age of
Ubiquitous Digital Intermediation,” J. Data Inf. Qual., vol. 7, no. 1–2, pp. 3:1–3:3, 2016.
[6] T. Gura, “Citizen science: amateur experts,” Nature, vol. 496, no. 7444, pp. 259–261, 2013.
[7] S. Engel and R. Voshell, “Volunteer Biological Monitoring: Can It Accurately Assess the
Ecological Condition of Streams?,” Am. Entomol., vol. 48, no. 3, pp. 164–177, 2002.
[8] M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, E. Bertino, and S. Dustdar,
“Quality control in crowdsourcing systems: Issues and directions,” IEEE Internet Comput., no. 2,
pp. 76–81, 2013.
[9] A. I. Chittilappilly, L. Chen, and S. Amer-Yahia, “A Survey of General-Purpose Crowdsourcing
Techniques,” IEEE Trans. Knowl. Data Eng., vol. 28, no. 9, pp. 2246–2266, 2016.
[10] A. Wiggins, G. Newman, R. D. Stevenson, and K. Crowston, “Mechanisms for Data Quality and
Validation in Citizen Science,” in “Computing for Citizen Science” workshop, Stockholm, SE,
2011, pp. 14–19.
[11] R. Lukyanenko, J. Parsons, and Y. Wiersma, “The IQ of the Crowd: Understanding and Improving
Information Quality in Structured User-generated Content,” Inf. Syst. Res., vol. 25, no. 4, pp. 669–
[12] J. Parsons, R. Lukyanenko, and Y. Wiersma, “Easier citizen science is better,” Nature, vol. 471, no.
7336, pp. 37–37, 2011.
[13] R. Lukyanenko, J. Parsons, and Y. Wiersma, “The Impact of Conceptual Modeling on Dataset
Completeness: A Field Experiment,” presented at the International Conference on Information
Systems, 2014, pp. 1–18.
[14] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an
application to boosting,” J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997.