
Is Crowdsourced Attribute Data Useful in Citizen Science? A Study of Experts and Machines


Abstract

Despite significant potential, a key challenge in citizen science is data quality. Traditionally, citizen science data collection is organized around classes that are useful to scientists. An alternative instance-based model of data collection has been proposed, in which observations are elicited as instances with attributes and classes - independent of the classes needed by scientists. A concern arising from collecting data this way is the extent to which it is useful to scientists. In this paper we compare domain experts in biology and machines when they are asked to infer classes based on attributes of observed organisms generated by citizen scientists. Our work provides evidence of the usefulness of the novel instance-based approach to citizen science.
2016 Collective Intelligence Conference, NYU Stern School of Business, New York, NY, June 1–3, 2016
ROMAN LUKYANENKO, Florida International University
YOLANDA F. WIERSMA and JEFFREY PARSONS, Memorial University of Newfoundland
Online citizen science, a major type of crowdsourcing, is on the rise. However, concerns about data
quality remain a major obstacle to wider adoption of citizen science, particularly in online projects in
which participation is open and anonymous. For example, in a natural history context, members of the
general public may lack the species-identification expertise needed for a particular research project
and, when asked to classify observed organisms at the species level, may resort to guessing or avoid
participating out of concern about providing incorrect classifications. Typical solutions to increase
data quality include training volunteers, providing clear data collection instructions, exploiting
redundancy in the crowd when multiple observers report on the same phenomena, and, more recently,
employing artificial intelligence [Lewandowski and Specht 2015; Wiggins et al. 2011; He and Wiggins
2015].
In contrast, some have argued that part of the challenge of data quality may be rooted in approaching
data collection from the point of view of scientists, rather than of citizens; trying to hold citizens to
high scientific standards may dissuade many from participating. That view has led to the suggestion
that data about phenomena of interest should be collected in terms of attributes of instances, rather
than requiring classification that relies on a high level of domain knowledge. Such an approach has
been referred to as an instance-based model (of citizen science and other similar types of
crowdsourcing) [Lukyanenko et al. 2014b].
Under the instance-based model, representation is based on describing individual objects (termed
instances, e.g., organisms), rather than pre-specifying classes of interest into which individuals are
grouped. As each instance can be classified by the observer in different ways (depending on the
context, expected purpose, domain expertise, and familiarity with classification structures), it is
unrealistic to expect complete agreement between citizen scientists and domain experts on ways to
classify individuals. By eliciting free-form attributes and classes of the observed individual organisms,
non-experts are able to provide information at the level at which they are confident. When data
contributors are given this flexibility, the impetus to guess or opt not to participate will also be
minimized. However, a natural concern arising from collecting data in terms of free-form attributes and
classes, instead of predetermined classes and attributes, is the extent to which the attributes are
useful to scientists in classifying organisms at the desired level. We present a study with domain
experts in biology and machines to address this question.
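To make the contrast concrete, the following sketch shows how a single sighting might be recorded under each approach. The example is purely illustrative; the species name, attribute strings, and field names are invented here and do not come from the cited studies.

```python
# Purely illustrative: two hypothetical ways of recording the same sighting.

# Class-based collection: the contributor must choose from predetermined classes,
# even when unsure, and only attributes predefined for that class are captured.
class_based_record = {
    "species": "Larus argentatus",   # forced species-level choice; may be a guess
    "wing_span_cm": None,            # predefined attribute the observer cannot supply
}

# Instance-based collection: the contributor describes the individual organism freely,
# reporting whatever attributes and classes they are confident about.
instance_based_record = {
    "instance_id": 1017,
    "attributes": ["white", "grey wings", "webbed feet", "seen near the harbour"],
    "classes": ["bird", "gull"],     # generic classes the observer is sure of
}
```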
To provide evidence of the utility of the instance-based model, we used the dataset generated in the
original paper that proposed the instance-based approach for crowdsourcing [Lukyanenko et al.
2014b]. We conducted an experiment with 390 non-experts in natural history (undergraduate
business and engineering students) and collected free-form attributes and classes consistent with the
instance-based model. We replicated these results in a real context using a live online citizen science
project [Lukyanenko et al. 2014a]. These studies demonstrated the advantages of the instance-based
approach for data quality and contributor participation, but offered no evidence as to whether generic
classes and highly idiosyncratic attributes can be used for inferring scientifically useful species-level
classes.
To answer this second question, we conducted a study with experts in biology (e.g., biology professors,
members of natural history societies) to test whether the classes and attributes provided by non-
experts in our previous study [Lukyanenko et al. 2014b] could be used by experts to reliably identify
the organisms at the species level (the level typically sought in citizen science crowdsourcing). In one-
hour interviews, we asked the experts to provide a best guess as to what the organism was after
being given (one at a time) each of the most common attributes provided by the non-experts. We also
asked experts to think out loud and recorded their utterances after each attribute was provided.
We performed a similar procedure using machine learning. We converted the attribute data set used
with the experts into a matrix in which each attribute provided in [Lukyanenko et al.
2014b] was coded as 1 if a particular participant used that attribute to describe the
species of interest, and 0 otherwise. We then trained machines using a variety of common data mining
algorithms (e.g., neural networks) to classify the organisms in the study at the species level.
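As a rough sketch of this procedure (not the authors' actual pipeline), the 0/1 attribute matrix and a neural-network classifier could be built as follows; the toy observations, attribute strings, and scikit-learn workflow are assumptions made here for illustration.

```python
# Minimal sketch, not the study's code: encode free-form attributes as a 0/1 matrix
# and train a common classifier to predict species-level classes.
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical data: each row is one contributor's free-form attributes plus the
# species label established by the researchers.
observations = [
    (["red breast", "hops on ground", "medium size"], "American Robin"),
    (["red breast", "orange belly", "medium size"], "American Robin"),
    (["black cap", "small", "visits feeder"], "Black-capped Chickadee"),
    (["black cap", "white cheeks", "small"], "Black-capped Chickadee"),
    (["blue", "crested head", "loud call"], "Blue Jay"),
    (["blue", "white underside", "crested head"], "Blue Jay"),
] * 5  # repeated only so this toy example has enough rows for cross-validation

# 1 if a participant used the attribute to describe the organism, 0 otherwise.
encoder = MultiLabelBinarizer()
X = encoder.fit_transform([attrs for attrs, _ in observations])
y = [species for _, species in observations]

# Train a neural network (one of several common algorithms) and estimate
# species-level classification accuracy.
clf = MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean species-level accuracy: {scores.mean():.1%}")
```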
The natural history experts were able to identify an average of 40.7% (± 10.3% s.d.) of the organisms
at the species level based on the attributes provided, significantly higher than the percentage of
correct species-level classifications by non-experts in our previous study [Lukyanenko et al. 2014b].
There was also a high correlation between the confidence reported by natural history experts with
their final guess and the percentage of times the guess was correct (Spearman’s rho = 0.68, p < 0.01).
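For illustration only (not the authors' analysis script), such a rank correlation between confidence and correctness could be computed along these lines, assuming hypothetical per-guess confidence ratings and outcomes.

```python
# Illustrative only: rank correlation between expert confidence and correctness.
from scipy.stats import spearmanr

confidence = [0.9, 0.4, 0.7, 0.2, 0.8, 0.6]  # hypothetical confidence in each final guess
correct = [1, 0, 1, 0, 1, 1]                 # whether the corresponding guess was correct
rho, p_value = spearmanr(confidence, correct)
print(f"Spearman's rho = {rho:.2f}, p = {p_value:.3f}")
```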
The results of the initial machine learning analysis are consistent with those of the human experts. Using common
machine learning methods, we obtained an average species-level classification accuracy of 74.4% based
on the attributes provided by non-experts. This is higher than the average classification accuracy of
40.7% achieved by human experts (however, note that machine learning had the additional benefit of
training from a given and finite set of target classes, whereas experts were only told that the
organisms were those that could be observed in a local area).
The results have important implications for data quality, data modeling, task design, and the application
of machine learning in crowdsourcing (especially citizen science). To our knowledge, this study is the
first to offer evidence of the utility of the instance-based model for traditional decision-making tasks.
This result can pave the way for broader application of the model in practice. Our results also
demonstrate the potential of using machine learning to classify organisms based on the instance-based
model of citizen science. This can be an effective practical solution, as human expertise is often scarce.
It supports the general trend toward hybrid intelligence, which combines the best abilities of humans
and machines into a more powerful integrated solution.
REFERENCES
Yurong He and Andrea Wiggins. 2015. Community-as-a-Service: Data Validation in Citizen Science.
In ISWC 2015.
Eva Lewandowski and Hannah Specht. 2015. Influence of volunteer and project characteristics on
data quality of biological surveys. Conserv. Biol. 29, 3 (June 2015), 713–723.
DOI:http://dx.doi.org/10.1111/cobi.12481
Roman Lukyanenko, Jeffrey Parsons, and Yolanda Wiersma. 2014a. The Impact of Conceptual
Modeling on Dataset Completeness: A Field Experiment. In International Conference on
Information Systems. 1–18.
Roman Lukyanenko, Jeffrey Parsons, and Yolanda Wiersma. 2014b. The IQ of the Crowd:
Understanding and Improving Information Quality in Structured User-generated Content.
Inf. Syst. Res. 25, 4 (2014), 669–689.
Andrea Wiggins, Greg Newman, Robert D. Stevenson, and Kevin Crowston. 2011. Mechanisms for
Data Quality and Validation in Citizen Science. In “Computing for Citizen Science” workshop.
Stockholm, SE, 14–19.