Understanding Information Quality in Crowdsourced Data
Jeffrey Parsons1, Roman Lukyanenko1, Yolanda Wiersma2
1Faculty of Business Administration
Memorial University of Newfoundland
St. John’s, NL, Canada
jeffreyp@mun.ca
2Department of Biology
Memorial University of Newfoundland
St. John’s, NL, Canada
ywiersma@mun.ca
Research Question
The rise of user-generated content (UGC, or crowdsourced data) has opened important avenues for using new kinds of data in business decision making. This paper examines information quality (IQ) in crowdsourced data. Traditionally, IQ research has focused on the fitness of data for particular uses within an organization, implying that intended uses and data requirements are known prior to use (Lee 2003, Lee et al. 2004, Wang and Strong 1996, Zhu and Wu 2011). In traditional settings, IQ can be maintained using approaches such as access restrictions, training, and input controls (Redman 2001).
UGC breaks down organizational boundaries, challenging traditional approaches to IQ by opening information
collection to the general public. Access restrictions may severely inhibit the amount of UGC that can be collected
(Parsons et al., 2011). Training is sometimes used to maintain accuracy (Dickinson et al. 2010, Foster-Smith and
Evans 2003), but presumes both high motivation among crowd contributors and clearly defined information
requirements. In UGC applications, the motivation level of potential contributors can be low and information
requirements may evolve.
Input controls are widely used in UGC applications. For example, in citizen science (Hand, 2010), the task of the
crowd is often to classify phenomena, such as galaxies (www.galaxyzoo.org) or birds (www.ebird.org), according to
a predetermined schema (e.g., galaxy types or bird species). We argue that such an approach (1) often requires a level of
domain knowledge that members of the general public generally lack (resulting in contributed data of dubious
accuracy), and (2) constrains data collection based on the predetermined schema, thereby excluding potentially
useful, but unpredictable, data that contributors may wish to report.
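To make this contrast concrete (an illustration of the general point, not code from any of the projects cited), the following minimal Python sketch shows how a fixed species-list input control handles a report from a non-expert; the species list, field names, and example observation are all hypothetical.

    # Illustrative only: a toy fixed-schema input control vs. what a contributor can supply.
    # The species list, field names, and observation below are hypothetical.
    SPECIES_LIST = ["Larus argentatus", "Falco peregrinus", "Salmo salar"]

    def accept_fixed_schema(reported_class: str) -> bool:
        """A fixed-schema input control: reject anything outside the predetermined list."""
        return reported_class in SPECIES_LIST

    # A non-expert may only be confident at the basic level ("bird"), yet can still
    # report attributes that a species-only schema cannot capture.
    contributor_report = {
        "basic_level_class": "bird",
        "attributes": ["white head", "grey wings", "standing on a wharf"],
    }

    # False: the report is rejected or must be forced into a species-level guess.
    print(accept_fixed_schema(contributor_report["basic_level_class"]))

The point of the sketch is that the control discards both the accurate basic-level label and the attribute information the contributor was willing to provide.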
In view of these issues, we seek to answer the following general research questions:
1) Does constraining crowdsourced data using a fixed schema affect data accuracy and completeness?
2) How can these effects be mitigated?
Research Approach
Our work is grounded in two theoretical frames. First, we adopt the ontological position that reality is composed of “things” that can be described in terms of properties (Bunge, 1977). Things do not inherently belong to
predetermined classes, and no class fully describes the properties of any particular thing. Second, psychological
research on classification holds that classes are useful abstractions of phenomena that support communication and
reasoning (Posner, 1993, Rosch, 1978). Alternative classification schemes are equally valid and can be useful for
different purposes (Parsons and Wand, 2008). Additionally, basic-level classes are generally preferred by non-experts in a domain (Rosch et al., 1976). Basic-level categories are typically the first that non-experts think of when they encounter an instance (Jolicoeur et al. 1984) and are the most common categories in ordinary speech (Wisniewski and Murphy, 1989). For example, in biology the basic level (e.g., bird or fish) is more general than the species level that is of interest to biologists.
Two general implications follow from these theoretical principles. First, a schema-based approach to collecting
UGC may lead to information loss, as crowd members will observe and be capable of reporting information about
observations beyond what is implied by any particular schema. Second, requiring crowd members to report data
according to a fixed set of classes that does not match contributors’ conceptual schemata of a domain will negatively
affect classification accuracy.
We tested these propositions in two laboratory experiments in a citizen science project. In these experiments,
biology non-experts were exposed to a set of images of plants and animals.
In Experiment 1 (free-form data collection), participants were asked to either (a) name the organism in each
image, or (b) name and describe the organism in each image. Based on the principles above, we hypothesized:
H1 (Accuracy): Participants will make fewer errors classifying at the basic level than at the species level.
H2 (Information Loss): Participants will describe instances in terms of attributes subordinate to the classification
level at which they can identify instances.
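As an illustration of how H1 could be operationalized (a sketch under assumed data, not the scoring procedure used in the study), each response can be compared with the expert identification at both the species level and the basic level, using a hypothetical species-to-basic-level mapping:

    # Hypothetical data: score responses against expert labels at two levels.
    BASIC_LEVEL_OF = {"Larus argentatus": "bird", "Salmo salar": "fish"}  # toy mapping

    def to_basic(label: str) -> str:
        """Map a species label to its basic-level class; pass basic-level labels through."""
        return BASIC_LEVEL_OF.get(label, label)

    def accuracy(responses, expert_species, level="species"):
        """Proportion of responses that match the expert label at the given level."""
        correct = 0
        for resp, truth in zip(responses, expert_species):
            if level == "basic":
                correct += to_basic(resp).lower() == to_basic(truth).lower()
            else:
                correct += resp.lower() == truth.lower()
        return correct / len(responses)

    responses = ["bird", "Larus argentatus", "fish"]                 # participant reports
    truth = ["Larus argentatus", "Larus argentatus", "Salmo salar"]  # expert identifications
    print(accuracy(responses, truth, level="species"))  # 0.33: only the exact species match counts
    print(accuracy(responses, truth, level="basic"))    # 1.0: all responses correct at the basic level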
In Experiment 2 (constrained data collection), participants were asked to classify instances by selecting from a
predetermined schema, either (a) a species list, or (b) a multilevel list that included higher order classes in addition
to species. We hypothesized:
H3 (Accuracy): Participants will make fewer errors classifying in the multilevel condition than in the single
(species) level condition.
Main Findings & Expected Contributions
We conducted the studies above with several hundred students. Our findings provide strong support for all
hypotheses. Accuracy is contingent on classification level. In the free-form task, we found greater accuracy for
basic-level classification than for species level classification. In the constrained task, we found that, except for
familiar organisms, accuracy is higher when data collection is guided by classes at multiple levels (including the
basic level) as opposed to a single level.
Collectively, the experiments highlight the negative impact of class-based models on accuracy, as well as the
information loss that results when data collection focuses on classification tasks (at any level). The studies
demonstrate a data quality dilemma arising from the use of class-based models to capture UGC. The classes non-
experts are comfortable using tend to be general ones. However, for many applications more specific classes are
required. Thus, there is potentially low accuracy in real-world UGC datasets relying on specialized classification.
At the same time, participants can contribute substantial amounts of information (attributes) beyond what is implied by the high-level classes to which they can assign an observed phenomenon.
We have identified novel IQ challenges in UGC and shown how they can be addressed by collecting data in a
way that accommodates crowd capabilities. Addressing these challenges is important for collecting externally generated information to support decision making and business intelligence (BI). In short, to the extent possible, data modeling for UGC needs to be both expertise- and use-agnostic.
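One way to picture such expertise- and use-agnostic collection (a minimal sketch of the general idea, with hypothetical names, rather than a design reported in the manuscript) is to store each observation as an instance carrying whatever attributes and class labels the contributor volunteers, deferring use-specific classification to later analysis:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Observation:
        """Instance-based record: keep whatever the contributor can report,
        without forcing a predetermined class."""
        observer_id: str
        attributes: List[str] = field(default_factory=list)        # e.g., "red crest", "near water"
        reported_classes: List[str] = field(default_factory=list)  # any level: "bird", "woodpecker", ...

    # Use-specific classification (e.g., to species) can be applied later by
    # experts or inference rules, for whichever purpose arises.
    obs = Observation("user42", ["red crest", "pecking a tree trunk"], ["bird"])
    print(obs)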
Current Status of Manuscript
The manuscript is currently under review at a major journal. Feedback from conference attendees will be useful
in refining implications of the research and in guiding future work. In addition to discussing the laboratory
experiments during the presentation, we will also outline an ongoing field experiment that examines IQ in a real
citizen science application.
References
Bunge, M. 1977. Treatise on basic philosophy: Ontology I: the furniture of the world. Reidel, Boston, MA.
Dickinson, J. L., B. Zuckerberg, D. N. Bonter. 2010. Citizen science as an ecological research tool: challenges and
benefits. Annual Review of Ecology, Evolution, and Systematics 41 112-149.
Foster-Smith, J., S. M. Evans. 2003. The value of marine ecological data collected by volunteers. Biological
Conservation 113 (2) 199-213.
Hand, E. 2010. Citizen Science: People power. Nature 466 (7307) 685-687.
Jolicoeur, P., M. A. Gluck, S. M. Kosslyn. 1984. Pictures and names: Making the connection. Cognitive Psychology
16 (2) 243-275.
Lee, Y. W. 2003. Crafting rules: context-reflective data quality problem solving. Journal of Management
Information Systems 20 (3) 93-119.
Lee, Y. W., L. Pipino, D. M. Strong, R. Y. Wang. 2004. Process-embedded data integrity. Journal of Database
Management 15 (1) 87-103.
Parsons, J., R. Lukyanenko, Y. Wiersma. 2011. Easier citizen science is better. Nature 471 (7336) 37-37.
Parsons, J., Y. Wand. 2008. Using cognitive principles to guide classification in information systems modeling. MIS
Quarterly 32 (4) 839-868.
Posner, M. I. 1993. Foundations of cognitive science. MIT Press, Cambridge, MA.
Redman, T. C. 2001. Data Quality: The Field Guide. Digital Press, Woburn, MA.
Rosch, E. 1978. Principles of categorization. E. Rosch and B. Lloyd, eds. Cognition and Categorization. John Wiley
& Sons Inc, 27-48.
Rosch, E., C. B. Mervis, W. D. Gray, D. M. Johnson, P. Boyes-Braem. 1976. Basic objects in natural categories.
Cognitive Psychology 8 (3) 382-439.
Wang, R. Y., D. M. Strong. 1996. Beyond accuracy: what data quality means to data consumers. Journal of
Management Information Systems 12 (4) 5-33.
Wisniewski, E. J., G. L. Murphy. 1989. Superordinate and basic category names in discourse: A textual analysis.
Discourse Processes 12 (2) 245-261.
Zhu, H., H. Wu. 2011. Quality of data standards: framework and illustration using XBRL taxonomy and instances.
Electronic Markets 21 (2) 129-139.