Finding value through instance-based data collection in citizen science
ROMAN LUKYANENKO, University of Saskatchewan
JEFFREY PARSONS and YOLANDA F. WIERSMA, Memorial University of Newfoundland
ABSTRACT
Online citizen science is a form of crowdsourcing that has received increased attention from
researchers. Despite significant potential, a key challenge in leveraging citizens to provide scientific
information is the quality of citizen-generated data, a form of user-generated content (UGC). In this
work, we present a study in which domain experts in biology were asked to infer classes based on
attributes of observed organisms generated by citizen scientists. In addition, because domain expertise
is a scarce resource and does not scale to large datasets, we also investigated the potential for
classification using machine learning. The results demonstrate that experts generally are able to
leverage the non-expert attributes to infer classes that are more specific than those familiar to the non-
expert participants. Our work provides evidence of the potential usefulness of the novel instance-based
approach to citizen science and suggests several strategies for refining this citizen science model.
KEYWORDS: user-generated content, citizen science, data quality, instance-based data, machine
learning, expert knowledge
Introduction: Research Question
There has been explosive growth in online citizen science, fuelled by the rapid proliferation of content creation and sharing technologies, including mobile devices. Citizen science is a type of crowdsourcing whereby
organizations (typically academic institutions, but also public agencies, environmental non-
governmental organizations and governments) seek to engage members of the general public in
research [1]–[3]. These initiatives can involve large numbers of participants (which we term “data contributors”), some now in the hundreds of thousands (e.g., Zooniverse.org, eBird.org), and generate massive data sets.
However, with this growth in projects (particularly online ones in which participation may be
anonymous), there are concerns about information quality (IQ) [4], [5]. As one researcher put it: “‘You
don't necessarily know who is on the other end of a data point,’ [it] could be a retired botany professor
reporting on wildflowers or a pure amateur with an untrained eye… As a result, it is difficult to
guarantee the quality of the data.” [6, p. 260].
The prevailing approach to data quality in such projects is consumer-centric, and posits that high quality
can be achieved by working with information contributors to address the needs of data consumers [7].
For example, in biology-related projects this consumer-centric approach is manifested through data
collection that is organized around classes that are useful to scientists – biological species. To ensure
that citizens provide accurate classifications, a variety of techniques have been explored, including
training, improving clarity of data collection protocols, identifying and incentivizing contributions from
experts within crowds, and employing advanced statistical methods to detect potential errors [1], [2],
[8]–[10].
The consumer-centric approach, however, has limitations. If members of the general public do not have
the required species-identification expertise, they may resort to guessing or abandon a project out of
concern about providing incorrect classifications [11]–[13]. We advocate an alternative, instance-based
model of citizen science data collection, in which projects do not constrain citizens to the classes of
interest to scientists and, instead, encourage them to provide any classes and attributes of observed
organisms (instances), irrespective of the classification structures needed by scientists. This removes the traditional requirement that contributors understand and comply with an a priori defined classification schema. Consequently, we argue that, compared with the traditional approach, the instance-based alternative should be much easier for non-experts to use, resulting in a greater quantity of data that
better represent the range of phenomena citizens encounter and wish to contribute to a project.
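To make the idea concrete, the sketch below contrasts a class-based record, which forces the contributor to choose from a scientist-defined schema, with an instance-based record, which accepts whatever classes and attributes the contributor can supply. This is a minimal illustration in Python; the field names are hypothetical and not drawn from any particular project.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Class-based collection: the contributor must classify the observation
# using a fixed, scientist-defined schema (e.g., a species checklist).
@dataclass
class ClassBasedObservation:
    species: str                 # must match a controlled list, e.g. "Larus argentatus"
    location: Optional[str] = None

# Instance-based collection: the contributor freely describes the observed
# instance with any classes and attributes they are comfortable providing.
@dataclass
class InstanceBasedObservation:
    classes: List[str] = field(default_factory=list)     # e.g. ["bird", "gull"]
    attributes: List[str] = field(default_factory=list)  # e.g. ["white", "red spot on beak"]
    location: Optional[str] = None

# A non-expert can contribute a generic class plus salient attributes,
# without knowing the species.
obs = InstanceBasedObservation(
    classes=["bird"],
    attributes=["white", "grey wings", "red spot on beak"],
)
```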
Notwithstanding this potential, an important concern arising from collecting data in terms of free-form
attributes and classes is the extent to which the resulting sparse and heterogeneous data are useful to
scientists in classifying organisms at the desired classification levels (e.g., species). In this work, we
present a study in which domain experts in biology were asked to infer classes based on attributes of
observed organisms generated by citizen scientists. In addition, because domain expertise is a scarce
resource and does not scale easily to large datasets, we also investigated the potential for classification
using machine learning.
Approach and Findings
To obtain evidence of the potential of the instance-based model to generate data useful to scientists, we used the free-form attributes and the most frequent classes (always generic ones, such as bird, tree, or fish)
provided by 390 non-experts in a laboratory setting with respect to 16 biological organisms, some
common and others uncommon in the region where the data were collected (the dataset is a subset of
the data reported in [11]).
Before employing machine learning, we wanted to determine whether human experts could use the
instance-based data to identify species. In one-on-one interviews, we asked 16 local biology experts to
provide a best guess as to what the organism was after sequentially revealing the attributes provided by
the non-experts, ordered by the frequency with which each attribute was reported (for example, if “red eyes” was the most reported attribute for an organism, it was revealed first). The natural history
experts identified, on average, 59.4% (± 14.7 s.d.) of the organisms at the species level using only the
attributes provided – significantly higher than the percentage of correct species-level classifications by
non-experts in the original study [11].
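As a minimal sketch of this reveal procedure, assuming hypothetical attribute reports (the attribute names below are illustrative, not data from the study), the ordering can be computed as follows:

```python
from collections import Counter

# Hypothetical attribute reports for one organism, pooled across
# non-expert contributors (illustrative data only).
reports = [
    ["red eyes", "black head", "long neck"],
    ["red eyes", "long neck"],
    ["red eyes", "webbed feet"],
]

# Count how often each attribute was reported, then reveal attributes
# one at a time, most frequently reported first.
freq = Counter(attr for report in reports for attr in report)
for attr, count in freq.most_common():
    print(f"Revealed: {attr} (reported {count} times)")
    # In the interview, the expert would update their best-guess class here.
```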
We then used machine learning (ML) to simulate the process undertaken by the human experts. Even
after focusing on the top 10 attributes for each of the 16 organisms, the matrix of attributes remained
extremely sparse, with only 2.34% of entries coded as “1”, illustrating the difficulty scientists would face in using such data directly. We used a variety of common ML approaches, including neural networks, support
vector machines, random forests, and boosting. The top-performing algorithm was a boosted Naïve Bayes classifier based on AdaBoost [14], which achieved an average classification accuracy of 70.83% (± 4.04 c.e.) across the 16 species (based on 10-fold cross-validation and 50 boosting iterations). This is higher
than the accuracy achieved by the human experts. However, a direct comparison between human and machine performance is not meaningful, since the ML algorithms worked with a finite set of 16 targets, whereas the experts had to draw
from all possible organisms in a local area. Nonetheless, the results point to the practical possibility of
asking non-experts to provide data at the level with which they are comfortable, while automatically inferring classes of interest to scientists. Further, the potential to automatically infer classes can be used in a
future machine-to-citizen dialog whereby the artificial agent may ask additional confirmation or
verification questions to further increase the confidence in the ML-generated classification judgements.
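The paper does not specify the ML implementation, so the following is only a rough sketch of the classification setup described above, assuming scikit-learn (version 1.2 or later for the `estimator` keyword) as a stand-in and synthetic data that mimics the reported sparsity; the data-generation step is entirely hypothetical.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the study's data: 390 contributors, 16 organisms,
# and the top 10 attributes per organism (160 binary columns). Entry (i, j)
# is 1 if contributor i reported attribute j; most entries are 0, mirroring
# the extreme sparsity noted above.
n_obs, n_attrs, n_classes = 390, 160, 16
y = np.arange(n_obs) % n_classes                       # ~24 observations per organism
X = (rng.random((n_obs, n_attrs)) < 0.02).astype(int)  # sparse background noise
for i in range(n_obs):
    # Make each organism's own block of 10 attributes weakly informative.
    block = slice(y[i] * 10, y[i] * 10 + 10)
    X[i, block] |= (rng.random(10) < 0.4).astype(int)

# Boosted Naïve Bayes in the spirit of the top performer reported above:
# AdaBoost over Bernoulli Naive Bayes base learners, 50 boosting
# iterations, evaluated with 10-fold cross-validation.
clf = AdaBoostClassifier(estimator=BernoulliNB(), n_estimators=50)
scores = cross_val_score(clf, X, y, cv=10)
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

The accuracy obtained on such synthetic data is meaningless in itself; the sketch only shows the shape of the pipeline: a sparse binary attribute matrix, a boosted Naïve Bayes classifier, and cross-validated accuracy.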
Contributions
The results carry implications for data quality, data modeling, task design, and the application of machine learning in crowdsourcing (especially citizen science). They show that data quality and user participation can be improved by relaxing constraints on what data can be provided, without necessarily sacrificing data utility, thereby paving the way to data collection processes that are easier for ordinary people to
use. Our results also demonstrate the potential for using machine learning to address data quality issues
in citizen science.
References
[1] R. Bonney et al., “Next steps for citizen science,” Science, vol. 343, no. 6178, pp. 1436–1437, 2014.
[2] D. C. McKinley et al., “Citizen science can improve conservation science, natural resource
management, and environmental protection,” Biol. Conserv., in press, 2016.
[3] A. Wiggins and K. Crowston, “From Conservation to Crowdsourcing: A Typology of Citizen
Science,” in 44th Hawaii International Conference on System Sciences, 2011, pp. 1–10.
[4] A. Alabri and J. Hunter, “Enhancing the Quality and Trust of Citizen Science Data,” in IEEE
eScience 2010, Brisbane, Australia, 2010, pp. 81–88.
[5] R. Lukyanenko, “Information Quality Research Challenge: Information Quality in the Age of
Ubiquitous Digital Intermediation,” J. Data Inf. Qual., vol. 7, no. 1–2, pp. 3:1–3:3, 2016.
[6] T. Gura, “Citizen science: amateur experts,” Nature, vol. 496, no. 7444, pp. 259–261, 2013.
[7] S. Engel and R. Voshell, “Volunteer Biological Monitoring: Can It Accurately Assess the
Ecological Condition of Streams?,” Am. Entomol., vol. 48, no. 3, pp. 164–177, 2002.
[8] M. Allahbakhsh, B. Benatallah, A. Ignjatovic, H. R. Motahari-Nezhad, E. Bertino, and S. Dustdar,
“Quality control in crowdsourcing systems: Issues and directions,” IEEE Internet Comput., no. 2,
pp. 76–81, 2013.
[9] A. I. Chittilappilly, L. Chen, and S. Amer-Yahia, “A Survey of General-Purpose Crowdsourcing
Techniques,” IEEE Trans. Knowl. Data Eng., vol. 28, no. 9, pp. 2246–2266, 2016.
[10] A. Wiggins, G. Newman, R. D. Stevenson, and K. Crowston, “Mechanisms for Data Quality and
Validation in Citizen Science,” in “Computing for Citizen Science” workshop, Stockholm, SE,
2011, pp. 14–19.
[11] R. Lukyanenko, J. Parsons, and Y. Wiersma, “The IQ of the Crowd: Understanding and Improving
Information Quality in Structured User-generated Content,” Inf. Syst. Res., vol. 25, no. 4, pp. 669–
689, 2014.
[12] J. Parsons, R. Lukyanenko, and Y. Wiersma, “Easier citizen science is better,” Nature, vol. 471, no.
7336, p. 37, 2011.
[13] R. Lukyanenko, J. Parsons, and Y. Wiersma, “The Impact of Conceptual Modeling on Dataset
Completeness: A Field Experiment,” presented at the International Conference on Information
Systems, 2014, pp. 1–18.
[14] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an
application to boosting,” J. Comput. Syst. Sci., vol. 55, no. 1, pp. 119–139, 1997.