Understanding Information Quality in Crowdsourced Data

Jeffrey Parsons1, Roman Lukyanenko1, Yolanda Wiersma2
1Faculty of Business Administration
Memorial University of Newfoundland
St. John’s, NL, Canada
jeffreyp@mun.ca
2Department of Biology
Memorial University of Newfoundland
St. John’s, NL, Canada
ywiersma@mun.ca
Research Question
The rise of user-generated content (UGC, or crowdsourced data) has created important avenues to use new kinds
of data in business decision making. This paper examines information quality (IQ) in crowdsourced data.
Traditionally, IQ research has focused on the fitness of data for particular uses within an organization, implying the
intended uses and data requirements are known prior to use (Lee 2003, Lee et al. 2004, Wang and Strong 1996, Zhu
and Wu 2011). In traditional settings, IQ can be maintained using approaches such as access restrictions, training,
and input controls (Redman 2001).
UGC breaks down organizational boundaries, challenging traditional approaches to IQ by opening information
collection to the general public. Access restrictions may severely inhibit the amount of UGC that can be collected
(Parsons et al., 2011). Training is sometimes used to maintain accuracy (Dickinson et al. 2010, Foster-Smith and
Evans 2003), but presumes both high motivation among crowd contributors and clearly defined information
requirements. In UGC applications, the motivation level of potential contributors can be low and information
requirements may evolve.
Input controls are widely used in UGC applications. For example, in citizen science (Hand, 2010), the task of the
crowd is often to classify phenomena, such as galaxies (www.galaxyzoo.org) or birds (www.ebird.org), according to
a predetermined schema (e.g., galaxy types or bird species). We argue such an approach (1) often requires a level of
domain knowledge that members of the general public generally lack (resulting in contributed data of dubious
accuracy), and (2) constrains data collection based on the predetermined schema, thereby excluding potentially
useful, but unpredictable, data that contributors may wish to report.
In view of these issues, we seek to answer the following general research questions:
1) Does constraining crowdsourced data using a fixed schema affect data accuracy and completeness?
2) How can these effects be mitigated?
Research Approach
Our work is grounded in two theoretical frames. First, we adopt the ontological position that reality is comprised
of “things” that can be described in terms of properties (Bunge, 1977). Things do not inherently belong to
predetermined classes, and no class fully describes the properties of any particular thing. Second, psychological
research on classification holds that classes are useful abstractions of phenomena that support communication and
reasoning (Posner, 1993, Rosch, 1978). Alternative classification schemes are equally valid and can be useful for
different purposes (Parsons and Wand, 2008). Additionally, basic-level classes are generally preferred by
non-experts in a domain (Rosch et al., 1976). Basic-level classes are typically the first categories non-experts think of
when they encounter an instance (Jolicoeur et al. 1984) and are the most common categories in ordinary speech
(Wisniewski and Murphy, 1989). For example, in biology the basic level is a more general level (e.g., bird or fish)
than the species level that is of interest to biologists.
Two general implications follow from these theoretical principles. First, a schema-based approach to collecting
UGC may lead to information loss, as crowd members will observe and be capable of reporting information about
observations beyond what is implied by any particular schema. Second, requiring crowd members to report data
according to a fixed set of classes that does not match contributors’ conceptual schemata of a domain will negatively
affect classification accuracy.
We tested these propositions in two laboratory experiments in a citizen science project. In these experiments,
biology non-experts were exposed to a set of images of plants and animals.
In Experiment 1 (free-form data collection), participants were asked to either (a) name the organism in each
image, or (b) name and describe the organism in each image. Based on the principles above, we hypothesized:
H1 (Accuracy): Participants will make fewer errors classifying at the basic level than at the species level.
H2 (Information Loss): Participants will describe instances in terms of attributes subordinate to the classification
level at which they can identify instances.
In Experiment 2 (constrained data collection), participants were asked to classify instances by selecting from a
predetermined schema, either (a) a species list, or (b) a multilevel list that included higher order classes in addition
to species. We hypothesized:
H3 (Accuracy): Participants will make fewer errors classifying in the multilevel condition than in the single
(species) level condition.
Main Findings & Expected Contributions
We conducted the studies above with several hundred students. Our findings provide strong support for all
hypotheses. Accuracy is contingent on classification level. In the free-form task, we found greater accuracy for
basic-level classification than for species-level classification. In the constrained task, we found that, except for
familiar organisms, accuracy is higher when data collection is guided by classes at multiple levels (including the
basic level) as opposed to a single level.
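To make the level-dependence of accuracy concrete, the following minimal Python sketch scores the same free-form responses at both the species level and the basic level. The taxonomy, the responses, and the names `BASIC_LEVEL`, `score`, and `accuracy` are hypothetical illustrations, not the study's materials, scoring procedure, or data.

```python
# Hypothetical mapping from species to basic-level class (illustrative only).
BASIC_LEVEL = {
    "american robin": "bird",
    "blue jay": "bird",
    "atlantic salmon": "fish",
}

def score(response: str, true_species: str) -> dict:
    """Judge one free-form response at the species level and at the basic level."""
    answer = response.strip().lower()
    species = true_species.strip().lower()
    basic_truth = BASIC_LEVEL[species]
    species_ok = answer == species
    # Correct at the basic level if the answer names the basic class directly,
    # or names any known species that belongs to that basic class.
    basic_ok = answer == basic_truth or BASIC_LEVEL.get(answer) == basic_truth
    return {"species": species_ok, "basic": basic_ok}

def accuracy(pairs, level: str) -> float:
    """Fraction of responses judged correct at the given level."""
    results = [score(response, truth)[level] for response, truth in pairs]
    return sum(results) / len(results)

# Made-up responses: (what a participant typed, true species).
responses = [
    ("bird", "American robin"),
    ("Blue jay", "Blue jay"),
    ("blue jay", "American robin"),   # wrong species, right basic-level class
    ("fish", "Atlantic salmon"),
    ("bird", "Atlantic salmon"),      # wrong at both levels
]

print(accuracy(responses, "species"), accuracy(responses, "basic"))
```

In this toy data, a response such as "blue jay" for an American robin counts as an error at the species level but as correct at the basic level, which is the kind of contingency on classification level that the findings describe.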
Collectively, the experiments highlight the negative impact of class-based models on accuracy, as well as the
information loss that results when data collection focuses on classification tasks (at any level). The studies
demonstrate a data quality dilemma arising from the use of class-based models to capture UGC. The classes non-
experts are comfortable using tend to be general ones. However, for many applications more specific classes are
required. Thus, there is potentially low accuracy in real-world UGC datasets relying on specialized classification.
However, participants can contribute substantial amounts of information (attributes) beyond what is implied by the
high-level classes to which they can assign an observed phenomenon.
We have identified novel IQ challenges in UGC and shown how they can be addressed by collecting data in a
way that accommodates crowd capabilities. Addressing these challenges is important for collecting externally
generated information to support decision making and business intelligence (BI). In short, to the extent possible, data modeling for UGC
needs to be both expertise- and use-agnostic.
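One way to read this recommendation is as a record structure that does not force a class at a fixed level. The Python sketch below is an assumption made for illustration, not the design used in the study: the `Observation` type, its field names, and the `matches` helper are hypothetical. A contributor supplies whatever class label they are confident in (possibly none) along with free-form attributes, and any mapping to more specific classes is deferred to whoever later uses the data.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Observation:
    """A crowd report that does not require a species-level class (hypothetical)."""
    # Free-form properties the contributor noticed.
    attributes: List[str] = field(default_factory=list)
    # Class at whatever level the contributor is comfortable with, or None.
    label: Optional[str] = None

def matches(obs: Observation, required: List[str]) -> bool:
    """A data consumer's query: does the report mention every required attribute?"""
    reported = {a.lower() for a in obs.attributes}
    return all(r.lower() in reported for r in required)

# A report classified only at the basic level, with attributes beyond that class.
report = Observation(
    attributes=["red breast", "yellow beak", "hopping on lawn"],
    label="bird",
)
# A report with no class at all is still a valid record.
unlabelled = Observation(attributes=["webbed feet"])

print(matches(report, ["Red breast", "yellow beak"]))
```

Because the attributes are stored independently of the label, a later user with a purpose the project did not anticipate can still query the reports, which is one sense in which such a structure is use-agnostic.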
Current Status of Manuscript
The manuscript is currently under review at a major journal. Feedback from conference attendees will be useful
in refining implications of the research and in guiding future work. In addition to discussing the laboratory
experiments during the presentation, we will also outline an ongoing field experiment that examines IQ in a real
citizen science application.
References
Bunge, M. 1977. Treatise on basic philosophy: Ontology I: the furniture of the world. Reidel, Boston, MA.
Dickinson, J. L., B. Zuckerberg, D. N. Bonter. 2010. Citizen science as an ecological research tool: challenges and
benefits. Annual Review of Ecology, Evolution, and Systematics 41 112-149.
Foster-Smith, J., S. M. Evans. 2003. The value of marine ecological data collected by volunteers. Biological
Conservation 113 (2) 199-213.
Hand, E. 2010. Citizen Science: People power. Nature 466 (7307) 685-687.
Jolicoeur, P., M. A. Gluck, S. M. Kosslyn. 1984. Pictures and names: Making the connection. Cognitive Psychology
16 (2) 243-275.
Lee, Y. W. 2003. Crafting rules: context-reflective data quality problem solving. Journal of Management
Information Systems 20 (3) 93-119.
Lee, Y. W., L. Pipino, D. M. Strong, R. Y. Wang. 2004. Process-embedded data integrity. Journal of Database
Management 15 (1) 87-103.
Parsons, J., R. Lukyanenko, Y. Wiersma. 2011. Easier citizen science is better. Nature 471 (7336) 37-37.
Parsons, J., Y. Wand. 2008. Using cognitive principles to guide classification in information systems modeling. MIS
Quarterly 32 (4) 839-868.
Posner, M. I. 1993. Foundations of cognitive science. MIT Press, Cambridge, MA.
Redman, T. C. 2001. Data Quality: The Field Guide. Digital Press, Woburn, MA.
Rosch, E. 1978. Principles of categorization. E. Rosch and B. Lloyd, eds. Cognition and Categorization. John Wiley
& Sons Inc, 27-48.
Rosch, E., C. B. Mervis, W. D. Gray, D. M. Johnson, P. Boyes-Braem. 1976. Basic objects in natural categories.
Cognitive Psychology 8 (3) 382-439.
Wang, R. Y., D. M. Strong. 1996. Beyond accuracy: what data quality means to data consumers. Journal of
Management Information Systems 12 (4) 5-33.
Wisniewski, E. J., G. L. Murphy. 1989. Superordinate and basic category names in discourse: A textual analysis.
Discourse Processes 12 (2) 245-261.
Zhu, H., H. Wu. 2011. Quality of data standards: framework and illustration using XBRL taxonomy and instances.
Electronic Markets 21 (2) 129-139.