Improving Usability of Social and Behavioral Sciences’ Evidence: A Call to
Action for a National Infrastructure Project for Mining Our Knowledge
Kai R. Larsen
Eric B. Hekler
Michael J. Paul1
Bryan S. Gibson
Over the last century, the social and behavioral sciences have accumulated a vast storehouse of
knowledge with the potential to transform society and all its constituents. Unfortunately, this knowledge
has accumulated in a form (e.g., journal articles) that makes it extremely difficult to search, categorize,
analyze and integrate across studies due to the scale of publications. In this panel report from a National
Science Foundation-funded workshop, we describe the social and behavioral sciences' knowledge
management problem. We focus on a scale of knowledge that is too big for any one person or team to
read and integrate, lack of a common language, lack of a common format for knowledge representation,
lack of a means for automated analysis and summarization, and missing approaches for large-scale
visualization of knowledge. We then describe a required program of collaborative research between
information systems, information science, and computer science (IICS) researchers and social and
behavioral science (SBS) researchers. Together, these teams would develop
information system artifacts to address this problem that is common across scientific disciplines, but
uniquely unaddressed in the social and behavioral sciences.
Keywords: Social and Behavioral Sciences, ontologies, National Science Foundation, knowledge bases,
reproducibility, organizing evidence.
1 University of Colorado, Boulder
2 University of California, San Diego
3 University of Utah
Accepted for publication in Communications of the Association for Information Systems on June 6th, 2019.
1 Introduction
The social and behavioral sciences (SBS), a set of disciplines studying human behaviors and social
processes, address critical questions such as how to use technology for useful purposes, how to prevent
chronic diseases, how to reduce poverty, how to tackle climate change, how to design more livable
cities, and how to help people get along better and work more productively. As such, it is not hyperbolic
to say that this knowledge plays a foundational role for helping individuals, practitioners, and policy-
makers make decisions that have real-world impact on all of our lives. If this knowledge were readily
accessible, it could be used to transform society for the better. Based on this, there is a strong societal
need to advance IS tools for supporting the SBS. While there is great need, there are numerous
challenges that make this a truly daunting task.
We describe the two primary reasons why SBS has a knowledge management problem. First, with SBS
research now tallying well above a million published
if the appropriate information infrastructure can be created to unlock the evidence and information,
we may achieve plausible generalizability of insights across individuals and contexts and,
simultaneously, an understanding of potential gaps in the literature. This could be advanced, in part,
by developing methods of extracting, organizing, and making readily accessible and usable the evidence
and information within publications, which could enable more rapid and potentially even automated
strategies for systematically reviewing and analyzing the SBS knowledge base. If done well, this could
allow a far more robust understanding of gaps in evidence and understanding, and clarity on open
questions and debates grounded in data. While great strides are being made to improve our methods
for future studies and for improving the
We are aware of no thorough evaluations of the total set of articles published in SBS, but state this as a highly conservative estimate given that
Google Scholar searches for “Likert Scale” and “Likert” on 3/4/2019 resulted in 500,000 and 1,000,000 articles, respectively. An examination of
theoretical construct research in the INN database (Larsen and Bong 2019) found that 18% of articles contained the term “Likert Scale” and 32%
contained “Likert.” Given that Likert scales represent a subset of the research examining theoretical constructs, which is again a subset of SBS,
we are confident that a million papers is a highly conservative lower bound, and that it is more likely that the subset of SBS theoretical construct
research alone constitutes around three million papers.
synergies and testing of issues such as reproducibility of findings (Open Science Collaboration 2015), it
would be a waste to not also commit to, integrate, and better understand that which was learned from
past research.
Creating robust systems for finding, integrating, and facilitating understanding of past work is no small
task. This problem becomes only more complex when these relationships require weighting of the
relative evidentiary value for a given relationship and questions (e.g., weighting insights on a relationship
more strongly if it was explicitly tested in a randomized controlled trial compared to one of many
correlations gleaned in an observational study or similarly upweighting research evaluating the
predictive validity of results). Beyond these design requirements, great hurdles in usability also need to
be overcome. For example, it is critical that these literature reviews can be conducted in a matter of
minutes rather than the current timescale of months to years, and that they improve drastically on
current comprehensiveness rates below 10% (Larsen and Bong 2016; Larsen et al. in press).
The purpose of this commentary is to summarize why organizing the SBS literature is important, what
some of the key complexities are that are unique to the SBS literature and how information systems,
information science, and computer science (IICS) researchers can support this effort towards the
creation of a literature review tool that supports rapid, iterative interrogation of the SBS literature.
With SBS research now tallying well above a million articles, it is important to acknowledge that most
authors who published an empirical article during the last decade likely did so with knowledge of less
than one percent of the past findings relevant to and likely considerably overlapping their ideas.
Unfortunately, the way articles are structured and the rapidly increasing pace of publishing in
combination with the constancy of available time and human memory have long ago overwhelmed
researchers’ capacity to absorb the knowledge that is relevant to them. The academic article structure,
arguably, was optimized for human-readability of each article over the integration of insights across
studies. Optimizing for human-readability makes sense when the literature is relatively small, and
academics are largely monitoring only key journals, theories, or professional societies relevant to their
work. However, optimizing for human-readability has the unintended consequence of making it harder
to integrate insights across studies, as evidenced by the incredible effort it takes to do robust systematic
reviews that, even at the time of publication, are already out of date. To achieve the goal of robust
knowledge collection, organization, and use to improve society, knowledge transfer must be optimized
for the integration of knowledge across articles.
A typical theory-contributing article may contain 50 references, even though the average paper examines more than a dozen hypotheses involving
as many as 20-40 distinct relationships when mediating and moderating relationships are included, each of which may have been examined by
dozens to tens of thousands of past articles. For example, any of the thousands of annual articles on the Technology Acceptance Model (Davis
1989; Davis et al. 1989) have over 40,000 papers to draw on, of which 27% are going to be empirical papers containing at least one of the core
TAM relationships (Larsen et al. in press). This means that to draw on 1% of the existing knowledge of even one of the core TAM relationships
would require a minimum of 100 references. This, of course, would assume that only this one theory could ever contain relevant past
knowledge, which can never be true given the many highly similar and overlapping theories related to TAM (Venkatesh et al. 2003). Similar
theoretical and article overlaps exist in all major SBS disciplines (Watts 2017).
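The arithmetic behind this estimate can be checked with a short sketch. The figures (40,000 papers, 27% empirical, 1% coverage) come from the note above; the function and variable names are illustrative assumptions, and integer arithmetic is used only to keep the result exact.

```python
def references_for_coverage(total_papers, empirical_percent, coverage_percent):
    """Return (empirical papers, references needed) for a given coverage target.

    Percentages are passed as whole numbers so the arithmetic stays exact.
    """
    empirical_papers = total_papers * empirical_percent // 100
    references_needed = empirical_papers * coverage_percent // 100
    return empirical_papers, references_needed


# Figures from the note: 40,000 TAM papers, 27% empirical, 1% coverage target.
empirical_papers, references_needed = references_for_coverage(40_000, 27, 1)
print(empirical_papers, references_needed)  # 10800 108
```

Under these figures, one percent of the empirical TAM literature is roughly 108 papers, which is consistent with the stated minimum of 100 references and about twice the 50 references of a typical article.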
While the need is clear, and the approach to advance is understandable, there are unique issues with
SBS work. In particular, the SBS must start to collect their “things” and organize them in a computable
format. The need for these tasks became clear from the workshop presentation by William Riley, the
Director of the National Institutes of Health’s Office of Behavioral and Social Sciences Research (Riley
2017). The key question is, what are the things that the SBS study? While there are many plausible
answers to this question, arguably, SBS scientists create or discover constructs that represent how and
why people behave and interact with one another within contexts (Deci and Ryan 2010). There is need
for a taxonomy of behaviors (how people behave), of proposed mediators of behaviors (why people
behave in a certain way), and how these behaviors interact with each other. These things are
encapsulated in the myriad theories, constructs, and hypotheses within articles, and are studied using a
wide range of research methods, such as randomized experiments, observation, and surveys. The goal of
these methods is often to estimate the probability that a concept meaningfully describes some aspect of how
people behave and interact within a given context. These concepts, the methods used to study them,
and the context whereby the concept was studied, are, arguably, the key things that must be extracted
from SBS articles.
Theories, another “thing,” encapsulate and represent current knowledge in terms of key constructs and
the relationships between them. It is not unusual for a theory to be tested and later extended in
thousands of follow-up studies wherein findings are believed to build incrementally towards a greater
understanding of a set of behavioral phenomena. However, recent findings cast shadows on this belief,
as even individual theories have expanded beyond comprehension (Larsen et al. in press). Theory review
studies are unable to find and integrate more than a small percent of current findings. Further, dozens of
theories exist that address the same phenomenon in almost identical ways. Davis et al. (2014) found 82
theories of behavior and behavior change that had substantial overlap, only three of which were
integrative. Yet as new and possibly redundant theories are proposed and as they gain adherents, old
theories continue to thrive. Without an ontology of behavioral knowledge-embeddedness and -
integration, researchers remain largely unaware of related findings, especially outside their own
discipline, but also within a discipline or even within narrow research areas. This lack of awareness
prevents the behavioral disciplines from becoming a front-player in new sciences, such as behavioral big
data. While studies building on big data are becoming more frequent (from examination of millions of
Facebook users, for example), no single existing behavioral theory is capable of serving an integrative
role, leading to a risk that big data research cannot build on our enormous theoretical base and will be
forced to ignore it, a risk that increases by the day. Further, the SBS study phenomena that are complex
in that the concepts are, often, highly dynamic, multi-causal, and, likely most difficult, manifest
idiosyncratically (Hekler et al. 2019). What this means is that the SBS are dealing with fundamental
phenomena that are extremely difficult to model over time, study causally, and measure with
standardized measures. While the “things” that SBS study are complex, insights are being gleaned from
this line of work and used to drive decision-making, such as the selection of evidence-based behavioral
interventions that are more effective than pharmaceuticals, such as cognitive behavioral therapy for
insomnia (Trauer et al. 2015). Extracting the signal from the noise from such a complex phenomenon will
only be possible if the appropriate IS infrastructure is advanced.
Even when focusing only on English-language research, our ability to communicate is severely limited.
First, human behavior is inherently messy and therefore difficult to categorize and quantify, which has
resulted in a lack of standardization of concepts across the literature. The lack of standardization creates
challenges for synthesizing new findings. Second, there are structural issues that impede progress on the
first point. We believe that the SBS would benefit from adopting the practices of openness that are
becoming widespread in the computational sciences, with movement in this direction coming from
within the SBS (Foster and Deardorff 2017; Nosek et al. 2015), though this must be done carefully and
with an understanding of the inherent complexities of the phenomenon under study. As of March 2017,
while still in a period of rapid growth, the Open Science Framework had 103,000 active users and
142,000 projects ( 2017), suggesting a strong movement towards open science. However, open
science, while making detailed information available on research projects and often enabling better
replication, does little to specify the progress of SBS as a whole.
Without both cultural changes towards openness and seeking standardization when possible for a given
phenomenon and corresponding IS tools, the process of knowledge organization across the SBS will
remain fragmented and likely driven by currently available search algorithms, particularly Google
Scholar (GS). GS has, in an unprecedented way, combined what seems like a vast majority of existing
research, including the SBS. This ability to see “all” in one search has created levels of trust in GS that are out of
proportion with its actual performance in terms of both precision and recall (Boeker et al. 2013; Bramer
et al. 2016; Gehanno et al. 2013). We are aware of few if any equally opaque literature systems. It has no
application programming interface, and as those of us doing systematic reviews have experienced,
employs anti-robot algorithms so aggressive that they often mistake researchers for robots. Further, like
the general Google search algorithm, GS employs network centrality measures to improve user
perception of high precision in search, which steers results towards older and primarily positive results at
the expense of new and negative findings. For progress, the research community must take back
ownership of its research content and create transparent search engines. As we will argue, search
engines may well be the solution to our problems, but only once the content of articles is better
delineated and tagged with context, for example by specifying when the term trust refers to a construct
rather than an organization, or by making available an ontology of different names for the same
construct to address the lack of shared language across the SBS.
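To make the tagging idea concrete, the following is a minimal sketch of how a search tool might combine role tags with a synonym ontology. Everything in it is an assumption for illustration: the `TaggedTerm` type, the role labels, and the synonym entries are hypothetical placeholders rather than an existing vocabulary or system.

```python
from dataclasses import dataclass

# Hypothetical mini-ontology: canonical construct name -> alternative labels.
# The entries are illustrative placeholders, not a published vocabulary.
CONSTRUCT_SYNONYMS = {
    "perceived usefulness": {"performance expectancy", "relative advantage"},
    "trust": {"trusting beliefs"},
}


@dataclass(frozen=True)
class TaggedTerm:
    """A term found in an article, tagged with the role it plays there."""
    text: str
    role: str  # assumed tag values, e.g. "construct" or "organization"


def canonical_construct(term):
    """Return the canonical construct name, or None if the term is not a known construct."""
    if term.role != "construct":
        return None  # e.g. "Trust" used as the name of an organization
    label = term.text.lower()
    for canonical, synonyms in CONSTRUCT_SYNONYMS.items():
        if label == canonical or label in synonyms:
            return canonical
    return None


def expand_query(construct_name):
    """Return every label a search should match for the given construct."""
    canonical = canonical_construct(TaggedTerm(construct_name, "construct"))
    if canonical is None:
        return {construct_name.lower()}
    return {canonical} | CONSTRUCT_SYNONYMS[canonical]


print(canonical_construct(TaggedTerm("Trust", "construct")))     # trust
print(canonical_construct(TaggedTerm("Trust", "organization")))  # None
print(sorted(expand_query("performance expectancy")))
```

The role tag prevents a search for the construct from matching an organization of the same name, while the synonym table lets one query retrieve work that uses different labels for the same construct.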
The IICS could play a significant role in helping to create the knowledge infrastructure needed to address
both the common and unique issues of the SBS, using the prior literature as a foundation. We believe the
Information Systems discipline is especially critical to this.
The existence of vast sets of overlapping and undiscovered research may be seen as a weakness, but is
actually a source of future strength for the SBS. Current SBS methods often rely on self-report and
human observation and interpretation, leading to low accuracy of any specific finding, but also a high
degree of variability if looked at across the literature, which could be valuable for unpacking the
complexity of dynamic, multi-causal and idiosyncratically manifesting phenomena. A central pathway
forward is to build tools to support examination of hundreds of results on any phenomenon or
relationship that enables this wide variance to be modeled and visualized when conducting iterative
literature reviews. Collecting and analyzing statistics on an unbiased set of hundreds of past
relationships between even two variables is a major research project in itself, and our evaluations
suggest that an SBS paper tests, on average, 148 relationships, with more than a dozen of
these being specified in hypotheses. Without support from IICS, understanding what has been learned
from prior work represents an insurmountable problem. With sufficient support from IICS, this high
degree of variance could, feasibly, be translated into tools that enable a researcher to rapidly understand
and explore how variations in constructs, construct definitions, operationalizations of constructs,
attributes of study participants, context of use, study methods, all co-influence to define potentially
trustworthy relationships and, simultaneously, systematic gaps in the literature from these myriad
relationships already tested and hidden in the current literature.
Table 1 shares a reasonable starting point for the kinds of knowledge that exist in SBS articles,
especially empirical ones. A particularly difficult challenge of the SBS is to determine what the “things” are of the
social and behavioral sciences that should be collected. In Table 1, we describe the types of knowledge
typically found in empirical SBS articles, define these, provide examples, and review the state-of-the-art
in ontology learning.
Table 1. Details on Knowledge Types in Behavioral Theories and Past Work.
Each entry lists the knowledge type (KT), its description, and past work where available.

1. Theory
   Description: Ontologically, it consists of constructs (Knowledge Type; KT 4), their associations (KT 6, 10), and the states it covers. A theory is generally specified and named in an initial article or set of articles. Often represented through diagrams or hypotheses (KT 6) depicting constructs (KT 4), construct relationships (KT 10) and research context.
   Past work: Soper et al. (2014) used n-gram (1-5) analysis to detect theory names from an existing list of 87 theories. The methodology was improved in Soper and Turmel (2015) through detection of instances of “theory” and “model.” The authors applied the n-gram analysis to Wikipedia titles under the assumption that “nearly any theory of even moderate renown would be likely to have an associated article in the English language Wikipedia” (p. 4950). A scoping review study was conducted by Davis et al. (2014) finding 82 theories of behavior and behavior change.

2. Theory instance
   Description: Successful theories (knowledge type 1) are extended in follow-up articles citing (knowledge type 11) the original article or set of articles. In an article, it is represented as name of theory extended in the focal article, diagrams or hypotheses (KT 6) depicting construct (KT 4), construct relationships (KT 10), and research context (KT 3).
   Past work: Larsen et al. proposed an approach for tracking and detecting articles containing theoretical instances (Larsen et al. 2014b).

3. Research context
   Description: Study’s boundary: industry, functional area, respondent backgrounds, sample size, survey focus.

4. Constructs
   Description: Construct name, synonyms, and definition.
   Past work: Larsen and Bong (2016) examined construct synonymy by examining item similarity using NLP algorithms. The MetaBUS team manually devised a construct taxonomy for the organizational behavior discipline (Bosco et al. 2014). The Human Behavior Project manually created a construct taxonomy for the MIS discipline (Larsen and Bong 2016). The Theoretical Domains Framework (Cane et al. 2012) was proposed as a framework for the Behavioral Medicine discipline.

5. Items
   Description: Construct measurement items (items).
   Past work: Larsen and Bong (2016) applied NLP similarity algorithms to detect the similarity among items. They applied Latent Semantic Analysis, Latent Dirichlet Allocation, WordNet-based approaches, and proposed a combination; the Construct Identity Detector (CID1) algorithm.

6. Hypotheses
   Description: The supposition or proposed explanation for the relationship between constructs (KT 10).
   Past work: Li and Larsen (2011) developed algorithms to automatically extract hypotheses from papers.

7. Construct
   Description: The individual construct relationships at a pairwise level. Some relationships are between constructs and existing relationships as is the case for moderating relationships. Mediating relationships should be broken up into two pairwise relationships, enabling a recreation of the original
   Past work: Li and Larsen (2011) developed algorithms to automatically extract relationships between constructs as they appeared inside hypotheses.

8. Methods
   Description: The type of approach used to analyze the data. For the theoretical approaches we selected for this article, these will generally be statistical methods.

9. Descriptive
   Description: Quantitative descriptions

10. Relationship
   Description: The statistical findings about relationships between the constructs (KT 4).
   Past work: The MetaBUS project employs light-weight NLP approaches to extract correlation matrices from academic papers and has developed a large-scale automatic meta-analysis approach on the resulting one million effect sizes (Bosco et al. 2014).

11. Citations
   Description: References to past work that informed the focal paper. Citation analysis and access to citations is common, but can only connect one article to another, a level of analysis that is not granular enough for most ontology learning applications. The real benefits from detecting citations and their location in a focal article are that they may now be connected to key content within the article. For example, connecting a given citation to a specified
   Past work: Li and Larsen (2013) proposed an “automatic construct-level citation extraction system to refine citations from the paper level to the construct level.” They were able to extract citation mentions with an F1-measure of .92.

12. Quality
   Description: Evidence of the extent to which the authors incorporated various standard quality evaluations.

13. Behavior Change Techniques (BCTs)
   Description: BCTs are theory-based approaches for changing determinants of behaviors.
   Past work: BCTTv1 taxonomy (Michie et al. 2013; Wood et al. 2015) provides a hierarchy of behavior change techniques. The BCTTv1 Interventions Database contains 300+ articles coded using BCTTv1.

14. Behaviors
   Description: Behaviors are sometimes considered a sub-class of constructs. They are observable actions by human beings that have consequences for that person’s life or life quality. These are often the dependent variable in a theory.
   Past work: Larsen et al. are currently engaged in an interdisciplinary study creating an ontology of behaviors building on the W.H.O. ICF taxonomy.
The F1-measure is calculated as the harmonic mean of precision (true positives / (true positives + false positives)) and recall (true positives / (true positives + false negatives)).
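The definition in this note can be written out as a short sketch. The counts in the example are hypothetical and chosen only to make the arithmetic easy to follow; they are not the results reported by Li and Larsen (2013). The sketch also assumes at least one true positive, so that no denominator is zero.

```python
def precision(true_pos, false_pos):
    """Share of extracted items that are correct."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Share of correct items that were extracted."""
    return true_pos / (true_pos + false_neg)


def f1_measure(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall."""
    p = precision(true_pos, false_pos)
    r = recall(true_pos, false_neg)
    return 2 * p * r / (p + r)


# Hypothetical counts: 90 correct extractions, 10 spurious, 10 missed.
score = f1_measure(true_pos=90, false_pos=10, false_neg=10)
print(round(score, 2))  # 0.9
```

Because the harmonic mean is pulled towards the smaller of its two inputs, a system cannot reach a high F1 by maximizing only precision or only recall.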
While Table 1 is arguably a good starting point for defining the things that need to be collected, an initial
challenge is to critically examine the assumptions made here and to ensure the right things are collected
and organized effectively to enable them to be readily usable for more rapid, iterative, literature reviews;
our key goal and aspiration motivating these other efforts.
For tools to succeed in enabling proper literature reviews of prior work it is essential that the iterative
feedback loop be tightened. For example, it is likely that robust literature review tools should enable rapid
exploration and calibration of questioning of the literature in a matter of minutes, instead of the current
default of months to years, producing reviews that, when done, are already out of date. This is true because, at present,
literature reviews by individuals cannot be completed at a pace that matches the pace of publication. A
second reason is that a researcher will need to be able to adjust thinking about a phenomenon to be
able to calibrate and understand their question(s) within the context of prior work, which requires rapid
iterative testing and feedback. For example, one could imagine a person exploring issues such as ‘what is
the strength of the relationship when tested via randomized experiments compared to observational
work?’ followed by a different, but complementary question such as, ‘How does the relationship seem to
differ with one population compared to another?’ Both questions, and many more, are valid and, if it were
possible to quickly ask them in succession, one could start to gain a more consilient understanding of prior
knowledge and evidence.
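The kind of rapid, successive questioning described here can be sketched as repeated filtering over a store of extracted findings. The record structure, field names, and all values below are hypothetical and serve only to show the interaction pattern; the sample-size-weighted mean is one simple pooling choice among several, not a recommended estimator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One extracted result for a single construct relationship (hypothetical schema)."""
    design: str      # "randomized" or "observational"
    population: str  # e.g. "students" or "employees"
    r: float         # reported correlation
    n: int           # sample size


# Illustrative values only; not drawn from any real study.
FINDINGS = [
    Finding("randomized", "students", 0.20, 100),
    Finding("randomized", "employees", 0.30, 100),
    Finding("observational", "students", 0.50, 200),
    Finding("observational", "employees", 0.40, 200),
]


def weighted_mean_r(findings, **criteria):
    """Sample-size-weighted mean correlation over findings matching all criteria."""
    selected = [
        f for f in findings
        if all(getattr(f, field) == value for field, value in criteria.items())
    ]
    if not selected:
        return None
    total_n = sum(f.n for f in selected)
    return sum(f.r * f.n for f in selected) / total_n


# First question: does the relationship differ by study design?
print(round(weighted_mean_r(FINDINGS, design="randomized"), 2))     # 0.25
print(round(weighted_mean_r(FINDINGS, design="observational"), 2))  # 0.45
# Follow-up question: does it differ by population?
print(round(weighted_mean_r(FINDINGS, population="students"), 2))   # 0.4
```

Each question is a single call with different criteria, which is the property that would allow a researcher to pose many complementary questions within minutes.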
In the next section, we further explore and unpack the challenges for IICS to support SBS and provide
some initial directions on available resources.
Other sciences have recognized that one central task of their science is the collection of “things” (objects
or processes of interest) and that, with a large enough collection of things, they can create organizational
structures of those things, such as taxonomies or ontologies (Ashburner et al. 2000). For example, starting
with Aristotle, Naturalists (biologists) famously collected organisms and specimens from their work. With a
large enough collection of things, Linnaeus, who reportedly personally collected 40,000 specimens, built
on Aristotle’s structure to create a taxonomy of living organisms that was flexible enough to allow the
addition of undiscovered species. Over time it has been improved to cover domains, kingdoms, phyla,
class, order, family, genus, and, ultimately, species as a way to organize life. This collection of things and
its subsequent organization has vastly expanded the absorptive capacity of researchers in the otherwise
overwhelmingly complex biological sciences by providing order and structure to the “things” of life. From
this point of view, the challenge for SBS may seem quite a bit easier given the focus on only one species.
There are also some existing resources that will be quite helpful in moving an effort like this forward:
The first type of resource is controlled vocabulary and developed ontologies. A controlled vocabulary
constitutes an agreed-upon set of words and phrases that can be used to tag information units or to
develop a shared understanding of the entities of interest. Ontologies include controlled vocabularies but
also specify their inter-relations. For example, in controlled vocabulary, an entity such as exercise may be
defined as "Activity that requires physical or mental exertion, especially when performed to develop or
maintain fitness." In an ontology that definition could be used along with additional information. For
example, the entity swimming is a type of exercise; exercise is a type of mobility; and mobility is part of
activities and participation. Relevant controlled vocabularies include the Psychology Ontology available on
the NIH Bioportal as well as the National Cancer Institute’s Thesaurus (NCIt). Parts of the Systematized
Nomenclature of Medicine--Clinical Terms (SNOMED-CT), such as the social context section, also qualify, as
do the Medical Subject Headings (MeSH)’s Psychiatry and Psychology sections.
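The difference between a controlled vocabulary and an ontology can be illustrated with the exercise example above. The sketch below is a toy data structure under stated assumptions: the dictionary layout, the relation labels `is_a` and `part_of`, and the function name are illustrative and do not reflect the format of any of the vocabularies named here.

```python
# Controlled vocabulary: agreed-upon terms with definitions.
DEFINITIONS = {
    "exercise": (
        "Activity that requires physical or mental exertion, especially "
        "when performed to develop or maintain fitness."
    ),
}

# Ontology: the same terms plus typed relations between them.
# Each entity maps to (relation, parent); relation labels are assumed names.
RELATIONS = {
    "swimming": ("is_a", "exercise"),
    "exercise": ("is_a", "mobility"),
    "mobility": ("part_of", "activities and participation"),
}


def ancestors(entity):
    """Follow relations upward and return the chain of (relation, parent) pairs."""
    chain = []
    seen = {entity}
    while entity in RELATIONS:
        relation, parent = RELATIONS[entity]
        if parent in seen:  # guard against accidental cycles
            break
        chain.append((relation, parent))
        seen.add(parent)
        entity = parent
    return chain


for relation, parent in ancestors("swimming"):
    print(relation, parent)
# is_a exercise
# is_a mobility
# part_of activities and participation
```

With the relations in place, a search for mobility could also retrieve work tagged only with swimming, which a flat vocabulary of definitions cannot support.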
A second type of available resource is what may be termed ontology portals. These sites provide key
content from theoretical articles, such as constructs, their definitions, relationships, and related measures.
The best-known theory portal in behavioral medicine is likely the NCI Grid-Enabled Measures (GEM)
portal. One goal of the site is quite in line with ontological thinking, namely
the creation of harmonized measures and constructs through community editing and voting. The site also
provides links to datasets and a team collaboration area. The TheoryMaps
site enables the manual drawing of a theory’s constructs and their hypothesized relationships, with visual
results when searching for theories containing a given construct. The final resource comes from the
University of Colorado’s Human Behavior Project (est. 2008): the InterNomological Network (INN) portal
provides access to constructs and measures from ten different disciplines and
fields including behavioral medicine. It provides synonymy and citation search and contains taxonomic
structure for some fields. An evolved version of INN, TheoryOn, shows how search
results may be improved and enriched with automatically extracted theory networks. This is done through
automatic detection of hypotheses, construct and relationship extraction, and visualization of the paper’s
construct network.
A third resource type is here termed data portals. By far the largest and most relevant is likely the Inter-
university Consortium for Political and Social Research (ICPSR). ICPSR maintains over 500,000
research-relevant files such as surveys and their results. It maintains over four million variables and
enables search for studies containing a pair of variables. Another data portal is the MetaBUS project,
which contains over one million effect sizes from studies in the management field.
The project uses a taxonomy to tie together the variables for which data is collected. The goal is to enable
automatic meta-analysis of relationships between variables.
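A search for studies containing a pair of variables can be sketched as an inverted index from variable names to study identifiers. This is an illustration of the general idea only, not a description of how ICPSR or MetaBUS is implemented; the study identifiers and variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical catalogue: study identifier -> variables measured in that study.
STUDIES = {
    "study_a": {"job satisfaction", "turnover intention", "age"},
    "study_b": {"job satisfaction", "age"},
    "study_c": {"turnover intention", "job satisfaction"},
}


def build_index(studies):
    """Map each variable name to the set of studies that measure it."""
    index = defaultdict(set)
    for study_id, variables in studies.items():
        for variable in variables:
            index[variable].add(study_id)
    return index


def studies_with_pair(index, first, second):
    """Return the studies that measure both variables, in sorted order."""
    return sorted(index.get(first, set()) & index.get(second, set()))


index = build_index(STUDIES)
print(studies_with_pair(index, "job satisfaction", "turnover intention"))
# ['study_a', 'study_c']
```

In practice the variable names would first be mapped to a shared taxonomy, as the MetaBUS project does, so that differently labeled measures of the same variable fall under one index key.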
The fourth resource type focuses on translational and practice-focused work, and includes the
psychometric instruments carefully developed for the NIH PROMIS project to cover patient-reported
outcomes in the three domains of physical, mental, and social health. Each instrument bank was created
based on a careful review of thousands of questionnaire items. Another valuable resource with behavioral
science implications is the International Classification of Functioning, Disability, and Health (ICF), which
provides a classification of health and health-related domains with a focus on body, individual, and
societal perspectives. Finally, the Nursing Care Coordination Ontology at the NIH Bioportal addresses the
coordination, people, places, and problems of delivering care.
With these resources, we propose that the time is right to advance a process of SBS ontology learning and
offer an initial framework on how this work could advance. Future progress in the SBS, we argue, depends
partially on two interrelated tasks: 1) moving the SBS toward computational, integrative approaches that
can capture knowledge embeddedness and illuminate existing knowledge, and 2) automatically
integrating new knowledge into knowledge bases. These new approaches will enable knowledge
integration across disciplines and identify scientifically grounded feedback on issues such as appropriate
behavior change techniques for a given person in context to produce the desired health goal or any of the
Accepted for publication in Communications of the Association for Information Systems on June 6th, 2019.
other myriad plausible societal benefits insights from the social and behavioral sciences could be used for.
To accomplish these objectives, the experts must bring together various types of behavioral theory
integration, ontology development, and natural language processing skills to coordinate endeavors to
move behavior sciences efforts to the next level. Big data is now impacting practically every area of
behavior change, and unless big data findings are integrated with existing theories of behavior change,
both the theories and the big data findings will continue to overwhelm researchers and lead to fractured
A project targeting those two tasks should cover all the SBS and could consist of a) careful sampling and
collection of a large set of manuscripts (100,000+) to be representative of all research in the social and
behavioral sciences and b) creating a virtual lab where ontology learning researchers may be credentialed
to access the manuscripts and develop the open-source software to extract and place into knowledge
bases all the individual knowledge types and instances in a manuscript. Developed software should be
executable in the virtual lab, results stored in NoSQL databases, accuracy measures automatically
reported, and software ranked in a set of leaderboards to stimulate the ability to compare different
approaches to solving the same problems. Any time new software outperforms past software on the
evaluative sets, the code would be executed on the entire sample or on subsets of the sample for which
the new code outperforms past code (for example, for the literature from Psychology, or even as focused
as a specific journal in the Psychology discipline).
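The leaderboard mechanism described above can be sketched in a few lines. This is an illustration under stated assumptions, not a specification of the proposed virtual lab: the F1 metric, the record layout, and both toy extractors (`standard_format`, `keyword_format`) are hypothetical stand-ins for submitted software and gold-standard annotations.

```python
from collections import defaultdict


def f1_score(predicted, gold):
    """F1 between a set of predicted items and a set of gold-standard items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def build_leaderboards(extractors, gold_standard):
    """Rank every extractor per discipline by its mean F1 on the gold standard."""
    scores = defaultdict(lambda: defaultdict(list))
    for article in gold_standard:
        for name, extract in extractors.items():
            score = f1_score(extract(article["text"]), article["gold"])
            scores[article["discipline"]][name].append(score)
    leaderboards = {}
    for discipline, by_extractor in scores.items():
        means = {name: sum(v) / len(v) for name, v in by_extractor.items()}
        leaderboards[discipline] = sorted(
            means.items(), key=lambda item: item[1], reverse=True
        )
    return leaderboards


def deployment_plan(leaderboards):
    """Run each discipline's top-ranked extractor on that subset of the sample."""
    return {discipline: ranking[0][0] for discipline, ranking in leaderboards.items()}


# Two toy hypothesis extractors (hypothetical): one expects "H1:"-style labels,
# the other looks for the word "expect" in a sentence.
def standard_format(text):
    return [s.strip() for s in text.split(".") if s.strip().startswith("H")]


def keyword_format(text):
    return [s.strip() for s in text.split(".") if "expect" in s]


gold_standard = [
    {"discipline": "Psychology",
     "text": "H1: Self-efficacy increases use. We surveyed students.",
     "gold": ["H1: Self-efficacy increases use"]},
    {"discipline": "Nursing",
     "text": "We expect training to reduce errors. Data came from two wards.",
     "gold": ["We expect training to reduce errors"]},
]
extractors = {"standard_format": standard_format, "keyword_format": keyword_format}
leaderboards = build_leaderboards(extractors, gold_standard)
plan = deployment_plan(leaderboards)
print(plan)
```

Keeping one leaderboard per discipline, rather than a single global ranking, is what allows a new submission to replace the incumbent only on the subsets where it actually performs better.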
Figure 1. Project Overview
A project of this scope could be advanced in a variety of ways. Indeed, there are complementary efforts
taking place in the UK where the Human Behavior Change Project (Michie et al. 2017) was built on ideas
from the Information Systems-based Human Behavior Project (Larsen 2010). Here is one possible
formulation, across three stages as outlined in Figure 1. In stage 1, the project establishes the
infrastructure necessary for the success of such projects in the future, as well as for fostering collaboration
and the dissemination of findings. Once this infrastructure is complete, stage 2 begins. In this stage,
information and computer scientists are invited to work within the infrastructure to address ontology
learning (for example, most effectively extracting constructs from the sample and the targeted sources
containing such constructs). Private and public funding agencies are brought on board, as well, to support
such efforts. Different teams may compete to provide the best software for a given problem. Evaluations
are automatic against gold standards and results ranked in a set of leaderboards. Once stage 2 has proven
the code capable of extracting high-quality knowledge types, stage 3 begins. In stage 3, the software is
developed that will enable publishers and owners of manuscripts to extract knowledge locally and develop
their own products. Alternatively, it will enable them to collaborate with the virtual lab team to create
software that will allow them to pool their knowledge types into large knowledge bases. These large
knowledge bases will then support existing business models and enable new ones while also making the
knowledge bases available to researchers through the kinds of contractual arrangements already in place
between content owners and academic libraries.
List of tasks likely associated with each stage:
Stage 1. Determine inclusion criteria: This stage starts with a knowledge graph of a set of SBS sources that
is as complete as possible. Sources are defined here to include gray literature, conference publications,
books, and journal articles, regardless of the citation scores of the journal. Alternatively, this work could
start with only journal articles to shortcut the process under the assumption that highest-quality
knowledge exists in journal articles. To integrate existing ontology learning resources, the ~20,000 articles
covered by the Human Behavior Project, the metaBUS project, and any other large projects should be
among the selected articles. Here it is important to work with content owners, librarians, and lawyers to
develop the legal framework under which this project may exist.
Stage 2. Collect articles and set up infrastructure: This step of the first stage would focus on collecting
machine-readable copies of the articles meeting the inclusion criteria. Under this setting, consistent
workflow apps and APIs should be set up for future collections of articles as the first round of collection
should over time transition into a flow of new articles as they become available in the source databases.
All these articles should then be stored in a database (e.g., NoSQL database). The same or a set of related
big data-capable databases should be set up for each of the defined knowledge types in Table 1. An
infrastructure should also allow the sharing of code designed to extract knowledge types and the ability to run
this code for all or subsets of journals, with leaderboards indicating which code units performed the best for
specific knowledge types. For example, it is possible that a team may have created a system that extracts
hypotheses from journals in Psychology better than any other code, but that the heavy reliance on
supervised machine learning behind this overall improvement decreases performance on
disciplines like Nursing, which adhere to a non-standard hypothesis format. The new software may then
be scaled back to only focus on journals from the disciplines it works best for, and the previously best
software may be adjusted to address only the non-standard disciplines, such as Nursing. Over time, we
would expect to see solutions that rely on existing knowledge bases for other types to improve ontology
learning. For example, the task of finding the constructs/variables within a paper may be approached in a
number of different ways. 1) They may be extracted from hypotheses extracted by other teams. 2) They
may be extracted from correlation tables. 3) They may be extracted from construct definitions sections.
And 4) They may be inferred based on the citation structure to existing theories. As the accuracy of
ontology learning for any one of these approaches increases, others that use the findings from those
algorithms are also likely to improve through a virtuous cycle.
One promising direction that may further improve that virtuous cycle, and that may integrate with machine
learning and expert ontology development, is the use of citizen science during the ontology development
process. Citizen science is an area of work specified by IS researchers (Levy and Germonprez 2017;
Lukyanenko et al. 2019), where a scientific task is split into smaller pieces and parceled out to citizens
(Lukyanenko and Parsons 2019). For example, the process of evaluating the success of algorithms engaged
in the extraction of knowledge types may be divided into small tasks and shared (via a specially designed
online platform) across a network of citizen scientists. Many of the tasks involved in the process are
tedious and difficult but are of a nature that lends itself to evaluation by non-experts. Citizen science
is accumulating an impressive track record of tackling complex and esoteric problems, for example, the
classification of galaxies and the folding of proteins (see Clery 2011; Khatib et al. 2011; Lintott et al. 2008), through
innovative approaches to platform and task design. For a review from an IS perspective, see Lukyanenko
et al. (in press).
In our domain, for example, a table may be displayed to an individual and that person asked to specify
whether this table contains statistical results or not. If it does, another citizen may be given that table and
asked to identify instances of variable names. A third person may be given the marked-up table and asked
to draw a grid around the statistical results allowing the algorithms better training material for extraction
of correlation matrices and other types of statistics.
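The chain of small table-annotation tasks above could be coordinated roughly as follows. This is a sketch only: the task names, the minimum number of votes, and the agreement threshold are assumptions chosen for illustration, and a real platform would need richer answer types than the yes/no vote aggregated here.

```python
from collections import Counter

# Hypothetical micro-task sequence for one table, mirroring the example in the text.
TASK_ORDER = ["contains_statistics", "mark_variable_names", "draw_result_grid"]


def aggregate_votes(votes, min_votes=3, min_agreement=0.7):
    """Return the majority answer if enough citizens agree, otherwise None."""
    if len(votes) < min_votes:
        return None  # not enough independent judgments yet
    answer, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return answer
    return None  # too much disagreement; collect more votes or escalate


def next_task(completed):
    """Pick the next micro-task for a table, given its accepted answers so far.

    `completed` maps a finished task name to its accepted answer.
    """
    if completed.get("contains_statistics") is False:
        return None  # no statistical results, so nothing further to annotate
    for task in TASK_ORDER:
        if task not in completed:
            return task
    return None  # all tasks done; the table is ready as training material


# Example: four citizens judge whether a table contains statistical results.
accepted = aggregate_votes([True, True, True, False])
print(accepted, next_task({"contains_statistics": accepted}))
```

Requiring agreement before moving a table to the next task is one simple way to let non-experts produce annotations that are reliable enough to train extraction algorithms.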
Finally, due to the value of articles and the requirements of journal owners, a process for approval of
research teams would need to be set up. As part of this, a shared Virtual Private Network (VPN) may be
needed as well as a security regime to control access to workflow systems and databases.
Stage 3. Tools for Ontologies. In this stage, processes are developed to help publishers and journal
owners use the results of the ontology learning in their databases and products. To enable this, it must be
simple for a publisher to adopt the code developed under the auspices of
the project, which large publishers with access to teams of programmers would be likely to do. To
ensure the success of the program, research teams may want to work with journal owners on creating
cutting-edge tools for next-generation science integration and search. Doing so would generate worthy
research projects likely to make a real difference for the social and behavioral sciences in and of
themselves, on top of any benefits derived from publishers improving their tools. One potential success
would be if small, medium, and even some large publishers decided to create combined search engines
built on top of the ontology knowledge developed for the project. Such engines would provide end-user access to
products created for micro-information and automatically pay the relevant article owners a
proportion based on the usefulness of their knowledge types in queries such as “create a network graph of
all hypothesized relationships in articles that test versions of the constructs self-efficacy and gender in the
context of health care information processing.”
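A query of the kind quoted above could be answered from a knowledge base of extracted hypotheses along the following lines. The record fields (`article`, `context`, `source`, `target`) and the sample data are hypothetical, and the sketch assumes construct names have already been harmonized, so that matching "versions of" a construct reduces to an exact name match.

```python
# Hypothetical hypothesis records, as an extraction pipeline might store them.
HYPOTHESES = [
    {"article": "A1", "context": "health care information processing",
     "source": "self-efficacy", "target": "information seeking"},
    {"article": "A1", "context": "health care information processing",
     "source": "gender", "target": "self-efficacy"},
    {"article": "A2", "context": "health care information processing",
     "source": "self-efficacy", "target": "adherence"},
    {"article": "A3", "context": "e-commerce",
     "source": "gender", "target": "self-efficacy"},
]


def hypothesis_network(hypotheses, required_constructs, context):
    """Edges of all hypotheses from articles that test every required construct."""
    required = set(required_constructs)

    # Group the hypotheses of each article within the requested context.
    by_article = {}
    for h in hypotheses:
        if h["context"] == context:
            by_article.setdefault(h["article"], []).append(h)

    edges = set()
    for article_hypotheses in by_article.values():
        constructs = {c for h in article_hypotheses for c in (h["source"], h["target"])}
        if required <= constructs:  # the article tests all required constructs
            edges.update((h["source"], h["target"]) for h in article_hypotheses)
    return sorted(edges)


network = hypothesis_network(
    HYPOTHESES, ["self-efficacy", "gender"], "health care information processing"
)
print(network)
```

The returned edge list can be handed to any graph-drawing tool. Recording which articles contributed edges to a query would also give a basis for the usage-proportional payments mentioned above.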
Through this, we attempt to make a difficult process possible and increasingly understandable, including
the need for collaborators from IICS and SBS. We see information systems researchers as the natural
leaders for such efforts due to the discipline’s focus on the intersection of technology adoption, systems
development, design science, and broad social and behavioral research.
An effort such as the project we justify in the main body of the article could transform society’s ability to
build on prior knowledge when making decisions that could be impacted by the SBS. It would do this by
tackling the absorptive capacity problem, making it possible for one person to stay up-to-date on all work
relevant to their interests. For example, it would enable far easier (possibly fully automated) literature
reviews, meta-analyses, and synthesis of insights across studies and scientific domains for advancing the
understanding of complex systems in the social and behavioral sciences. Experiences from smaller projects
in this domain, such as metaBUS (Bosco et al. 2015) and the Human Behavior Project (Larsen 2010; Larsen
and Bong 2016), have shown that collecting SBS things allows new modes of analysis and enables us to
answer questions we do not currently even know to ask. We believe this is the time to advance an SBS
national infrastructure project for improving the usability of SBS evidence and we believe the information
systems discipline must act as an integrator and collaborator in such work.
Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K.,
Dwight, S. S., and Eppig, J. T. 2000. "Gene Ontology: Tool for the Unification of Biology," Nature
Genetics (25:1), pp. 25-29.
Boeker, M., Vach, W., and Motschall, E. 2013. "Google Scholar as Replacement for Systematic Literature
Searches: Good Relative Recall and Precision Are Not Enough," BMC Medical Research
Methodology (13:1), pp. 1-12.
Bosco, F. A., Steel, P., Oswald, F. L., Uggerslev, K., and Field, J. G. 2015. "Cloud-Based Meta-Analysis to
Bridge Science and Practice: Welcome to Metabus," Personnel Assessment and Decisions (1:1).
Bramer, W. M., Giustini, D., and Kramer, B. M. R. 2016. "Comparing the Coverage, Recall, and Precision of
Searches for 120 Systematic Reviews in Embase, Medline, and Google Scholar: A Prospective
Study," Systematic Reviews (5:1), p. 39.
Clery, D. 2011. "Galaxy Zoo Volunteers Share Pain and Glory of Research," Science (333:6039), pp. 173-175.
2017. "Vital Stats." Retrieved May 18, 2019, 2019
Davis, F. D. 1989. "Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information
Technology," MIS Quarterly (13:3), pp. 319-340.
Davis, F. D., Bagozzi, R. P., and Warshaw, P. R. 1989. "User Acceptance of Computer Technology: A
Comparison of Two Theoretical Models," Management Science (35:8), pp. 982-1003.
Davis, R., Campbell, R., Hildon, Z., Hobbs, L., and Michie, S. 2014. "Theories of Behaviour and Behaviour
Change across the Social and Behavioural Sciences: A Scoping Review," Health Psychology
Review (ahead-of-print), pp. 1-22.
Deci, E. L., and Ryan, R. M. 2010. Self-Determination. Wiley Online Library.
Foster, E. D., and Deardorff, A. 2017. "Open Science Framework (Osf)," Journal of the Medical Library
Association: JMLA (105:2), p. 203.
Gehanno, J.-F., Rollin, L., and Darmoni, S. 2013. "Is the Coverage of Google Scholar Enough to Be Used
Alone for Systematic Reviews," BMC Med Inform Decis Mak (13).
Hekler, E. B., Hekler, K., P., Chevance, G., Golaszewski, N. M., Lewis, D., and Sim, I. 2019. "Why We Need a
Small Data Paradigm." Manuscript Submitted for Publication.
Khatib, F., DiMaio, F., Cooper, S., Kazmierczyk, M., Gilski, M., Krzywda, S., Zabranska, H., Pichova, I.,
Thompson, J., and Popović, Z. 2011. "Crystal Structure of a Monomeric Retroviral Protease Solved
by Protein Folding Game Players," Nature Structural & Molecular Biology (18:10), p. 1175.
Larsen, K. R. 2010. "Human Behavior Project." Retrieved 3/3/2019, 2019
Larsen, K. R., and Bong, C. H. 2016. "A Tool for Addressing Construct Identity in Literature Reviews and
Meta-Analyses," MIS Quarterly (40:3), pp. 529-551; A521-A521.
Larsen, K. R., and Bong, C. H. 2019. "Inter-Nomological Network Search Engine." Retrieved January 1st,
2019, from
Larsen, K. R., Hovorka, D. S., West, J. D., and Dennis, A. R. in press. "Understanding the Elephant: A
Discourse Approach to Corpus Identification for Theory Review Articles," Journal of the Association
for Information Systems.
Levy, M., and Germonprez, M. 2017. "The Potential for Citizen Science in Information Systems Research,"
Communications of the Association for Information Systems (40).
Lintott, C. J., Schawinski, K., Slosar, A., Land, K., Bamford, S., Thomas, D., Raddick, M. J., Nichol, R. C.,
Szalay, A., and Andreescu, D. 2008. "Galaxy Zoo: Morphologies Derived from Visual Inspection of
Galaxies from the Sloan Digital Sky Survey," Monthly Notices of the Royal Astronomical Society
(389:3), pp. 1179-1189.
Lukyanenko, R., and Parsons, J. 2019. "Beyond Micro-Tasks: Research Opportunities in Observational
Crowdsourcing," in Crowdsourcing: Concepts, Methodologies, Tools, and Applications. IGI Global,
pp. 1510-1535.
Lukyanenko, R., Parsons, J., Wiersma, Y. F., and Maddah, M. 2019. "Expecting the Unexpected: Effects of
Data Collection Design Choices on the Quality of Crowdsourced User-Generated Content,"
MIS Quarterly (forthcoming).
Lukyanenko, R., Wiggins, A., and Rosser, H. K. in press. "Citizen Science: An Information Quality Research
Frontier," Information Systems Frontiers, pp. 1-23.
Michie, S., Thomas, J., Johnston, M., Mac Aonghusa, P., Shawe-Taylor, J., Kelly, M. P., Deleris, L. A.,
Finnerty, A. N., Marques, M. M., and Norris, E. 2017. "The Human Behaviour-Change Project:
Harnessing the Power of Artificial Intelligence and Machine Learning for Evidence Synthesis and
Interpretation," Implementation Science (12:1), p. 121.
Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., Buck, S., Chambers, C. D.,
Chin, G., and Christensen, G. 2015. "Promoting an Open Research Culture," Science (348:6242), pp.
Open Science Collaboration. 2015. "Estimating the Reproducibility of Psychological Science," Science
(349:6251), p. aac4716.
Riley, W. T. 2017. "Keynote on Establishing the Fundamental Scientific Challenges That Must Be Resolved
to Advance Behavioral Science Research," in: NSF International Workshop on Behavioral Ontology
Learning. Estes Park, CO.
Trauer, J. M., Qian, M. Y., Doyle, J. S., Rajaratnam, S. M., and Cunnington, D. 2015. "Cognitive Behavioral
Therapy for Chronic Insomnia: A Systematic Review and Meta-Analysis," Annals of Internal
Medicine (163:3), pp. 191-204.
Venkatesh, V., Morris, M. G., Davis, G. B., and Davis, F. D. 2003. "User Acceptance of Information
Technology: Toward a Unified View," MIS Quarterly (27:3), pp. 425-478.
Watts, D. J. 2017. "Should Social Science Be More Solution-Oriented?," Nature Human Behaviour (1:1), p.
... Due to the increasing number of scientific articles it is difficult and time-consuming to follow the latest developments and to get an overview of a field of research that is not one's own. The latter, however, is often necessary as transdisciplinary research projects are common in Information Systems (IS) research [18]. As a consequence, attempts to automatically classify scientific publications and extract important concepts have been made. ...
... DeepCause [26] extends and improves CauseMiner by using different deep learning architectures for this task. A recent call for action in the journal CAIS [18] emphasizes the need for better tools to automatically extract evidences out of IS papers. They also present different knowledge types that could be extracted from papers. ...
... The results may contribute to the discussion on meta models and create a common ground for automatically analyzing and summarizing scientific insights. This helps to better promote relevant insights and find open research questions [18]. ...
... Presently, scientific knowledge is largely organized and digested via scientific articles. Searching the literature successfully may have occurred when a person paid attention to a few key journals on focused topics [139]. Precision health synergizes efforts across a wide range of stakeholders, which makes the classic strategy of staying informed untenable. ...
... While, of course, scientific articles will always have value for sharing information within scientific communities, other forms of knowledge organization and curation are necessary for communicating information and tools that will be necessary for precision health. One concept is development of codifiable knowledge representation tools that can effectively search, organize, and curate large amounts of information to facilitate translation of evidence and insights gleaned from one discipline or stakeholder to another discipline or stakeholder [139]. Without innovative tools to support cross-sector and disciplinary access and curation, the vision of precision health may become a proverbial Tower of Babel that is largely inaccessible and, therefore, of questionable value. ...
... Organizing and curating knowledge as complex as that associated with precision health is no small task with no simple answers. Space precludes a full discussion, but interested readers looking for steps forward could review the Human Behaviour Change Project out of University College London [140] and results from a recent National Science Foundation workshop in the USA focused on advancing knowledge curation in the social and behavioral sciences [139]. ...
Full-text available
Background In 2015, Collins and Varmus articulated a vision for precision medicine emphasizing molecular characterization of illness to identify actionable biomarkers to support individualized treatment. Researchers have argued for a broader conceptualization, precision health. Precision health is an ambitious conceptualization of health, which includes dynamic linkages between research and practice as well as medicine, population health, and public health. The goal is a unified approach to match a full range of promotion, prevention, diagnostic, and treatment interventions to fundamental and actionable determinants of health; to not just address symptoms, but to directly target genetic, biological, environmental, and social and behavioral determinants of health. Purpose The purpose of this paper is to elucidate the role of social and behavioral sciences within precision health. Main body Recent technologies, research frameworks, and methods are enabling new approaches to measure, intervene, and conduct social and behavioral science research. These approaches support three opportunities in precision health that the social and behavioral sciences could colead including: (a) developing interventions that continuously “tune” to each person’s evolving needs; (b) enhancing and accelerating links between research and practice; and (c) studying mechanisms of change in real-world contexts. There are three challenges for precision health: (a) methods of knowledge organization and curation; (b) ethical conduct of research; and (c) equitable implementation of precision health. Conclusions Precision health requires active coleadership from social and behavioral scientists. Prior work and evidence firmly demonstrate why the social and behavioral sciences should colead with regard to three opportunity and three challenge areas.
... They referred to Ramaprasad and Syn (2015) who developed an approach to identify under-researched areas by mapping literature to an ontological framework of the IS discipline. This relates to recent calls for novel approaches to improve the discoverability of knowledge in IS. Larsen et al. (2020) and Wagner et al. (2021) underline the importance of ontological indexing to enable automated knowledge mining from scientific articles. ...
Conference Paper
Full-text available
Manually analyzing large collections of research articles is a time-and resource-intensive activity, making it difficult to stay on top of the latest research findings. Limitations of automated solutions lie in limited domain knowledge and not being able to attribute extracted key terms to a focal article, related work, or background information. We aim to address this challenge by (1) developing a framework for classifying sentences in scientific publications, (2) performing several experiments comparing state-of-the-art sentence transformer algorithms with a novel few-shot learning technique and (3) automatically analyzing a corpus of articles and evaluating automated knowledge extraction capabilities. We tested our approach for combining sentence classification with ontological annotations on a manually created dataset of 1,000 sentences from Information Systems (IS) articles. The results indicate a high degree of accuracy underlining the potential for novel approaches in analyzing scientific publications.
... Although there is a need as well as an increase in solutions and approaches that support automated data extraction and synthesis (K. R. Larsen et al., 2020;Mueller et al., 2022;Mueller & Huettemann, 2018;Wagner et al., 2022), to the best of our knowledge no one yet developed a QA system for scientific publications in the IS discipline. ...
Conference Paper
Full-text available
Question Answering (QA) systems can significantly reduce manual effort of searching for relevant information. However, challenges arise from a lack of domain-specificity and the fact that QA systems usually retrieve answers from short text passages instead of long scientific articles. We aim to address these challenges by (1) exploring the use of transformer models for long sequence processing, (2) performing domain adaptation for the Information Systems (IS) discipline and (3) developing novel techniques by performing domain adaptation in multiple training phases. Our models were pre-trained on a corpus of 2 million sentences retrieved from 3,463 articles from the Senior Scholars' Basket and fine-tuned on SQuAD and a manually created set of 500 QA pairs from the IS field. In six experiments, we tested two transfer learning techniques for fine-tuning (TANDA and FANDO). The results show that fine-tuning with task-specific domain knowledge considerably increases the models' F1-and Exact Match-scores.
... The multiplicity of overlapping terms and measures creates a knowledge management problem that is increasingly intractable for the individual researcher to solve. Larsen, Hekler, Paul and Gibson (2020) argued that a solution is manageable, but we need a sustained ''collaborative research program between information systems, information science, and computer science researchers and social and behavioral science researchers to develop information system artifacts to address the problem'' (Larsen et al., 2020: 1). Once we have an organized system of knowledge, they concluded: ''it would enable scholars to more easily conduct (possibly in a fully automated manner) literature reviews, meta-analyses, and syntheses across studies and scientific domains to advance our understanding about complex systems in the social and behavioral sciences'' (Larsen et al., 2020: 9). ...
Full-text available
Meta-analyses summarize a field’s research base and are therefore highly influential. Despite their value, the standards for an excellent meta-analysis, one that is potentially award-winning, have changed in the last decade. Each step of a meta-analysis is now more formalized, from the identification of relevant articles to coding, moderator analysis, and reporting of results. What was exemplary a decade ago can be somewhat dated today. Using the award-winning meta-analysis by Stahl et al. (Unraveling the effects of cultural diversity in teams: A meta-analysis of research on multicultural work groups. Journal of International Business Studies, 41(4):690–709, 2010) as an exemplar, we adopted a multi-disciplinary approach (e.g., management, psychology, health sciences) to summarize the anatomy (i.e., fundamental components) of a modern meta-analysis, focusing on: (1) data collection (i.e., literature search and screening, coding), (2) data preparation (i.e., treatment of multiple effect sizes, outlier identification and management, publication bias), (3) data analysis (i.e., average effect sizes, heterogeneity of effect sizes, moderator search), and (4) reporting (i.e., transparency and reproducibility, future research directions). In addition, we provide guidelines and a decision-making tree for when even foundational and highly cited meta-analyses should be updated. Based on the latest evidence, we summarize what journal editors and reviewers should expect, authors should provide, and readers (i.e., other researchers, practitioners, and policymakers) should consider about meta-analytic reviews.
... In accordance with recent calls (e.g., Larsen et al. 2017;Larsen et al. 2020), we therefore propose a design framework named behavioral ontology learning from text (BOLT), to guide the development of systems for extracting behavioral knowledge encompassed in large-scale, multidisciplinary publication databases. According to the design science paradigm (Hevner et al. 2004;Walls et al. 1992), design is both a product and a process. ...
The accumulated literature base in the behavioral sciences represents the IS discipline’s greatest source of knowledge, and yet the same literature has grown beyond human comprehension. An experiment is conducted showing the inability of experts to retrieve relevant constructs using full-text search. To address this inability to access the body of theoretical behavioral science research we propose a novel IT artifact built on an information extraction approach to nomological network discovery. Based on the design science paradigm we develop a three-step process for extraction and assembly of nomological networks proceeding through article download, hypothesis extraction, variable extraction, and finally to variable integration. Rule-based vs. machine learning algorithms are evaluated and compared to determine the best approach for the extraction steps. A dataset of all the relevant behavioral studies from two top journals in Information Systems and Psychology is used to evaluate the approach in comparison to expert decisions, leading into a discussion of limitations and possible extensions.
Full-text available
The volume of scientific publications in organizational research becomes exceedingly overwhelming for human researchers who seek to timely extract and review knowledge. This paper introduces natural language processing (NLP) models to accelerate the discovery, extraction, and organization of theoretical developments (i.e., hypotheses) from social science publications. We illustrate and evaluate NLP models in the context of a systematic review of stakeholder value constructs and hypotheses. Specifically, we develop NLP models to automatically 1) detect sentences in scholarly documents as hypotheses or not (Hypothesis Detection), 2) deconstruct the hypotheses into nodes (constructs) and links (causal/associative relationships) (Relationship Deconstruction ), and 3) classify the features of links in terms causality (versus association) and direction (positive, negative, versus nonlinear) (Feature Classification). Our models have reported high performance metrics for all three tasks. While our models are built in Python, we have made the pre-trained models fully accessible for non-programmers. We have provided instructions on installing and using our pre-trained models via an R Shiny app graphic user interface (GUI). Finally, we suggest the next paths to extend our methodology for computer-assisted knowledge synthesis.
Full-text available
The emergence of crowdsourcing as an important mode of information production has attracted increasing research attention. In this article, the authors review crowdsourcing research in the data management field. Most research in this domain can be termed tasked-based, focusing on micro-tasks that exploit scale and redundancy in crowds. The authors' review points to another important type of crowdsourcing – which they term observational – that can expand the scope of extant crowdsourcing data management research. Observational crowdsourcing consists of projects that harness human sensory ability to support long-term data acquisition. The authors consider the challenges in this domain, review approaches to data management for crowdsourcing, and suggest directions for future research that bridges the gaps between the two research streams.
The rapid proliferation of online content-producing and content-sharing technologies has resulted in an explosion of user-generated content (UGC), which now extends to scientific data. Citizen science, in which ordinary people contribute information for scientific research, epitomizes UGC. Citizen science projects are typically open to everyone, engage diverse audiences, and challenge ordinary people to produce data of the highest quality to be usable in science. This also makes citizen science a very exciting area in which to study both traditional and innovative approaches to information quality management. With this paper we position citizen science as a leading information quality research frontier. We also show how citizen science opens a unique opportunity for the information systems community to contribute to a broad range of disciplines in the natural and social sciences and the humanities.
The goal of a review article is to present the current state of knowledge in a research area. Two important initial steps in writing a review article are boundary identification (identifying a body of potentially relevant past research) and corpus construction (selecting research manuscripts to include in the review). We present a theory-as-discourse approach which a) creates a theory ecosystem of potentially relevant prior research using a citation-network approach to boundary identification; and b) identifies manuscripts for consideration using machine learning or random selection. We demonstrate an instantiation of the theory-as-discourse approach through a proof-of-concept, which we call the Automated Detection of Implicit Theory (ADIT) technique. ADIT improves performance over the conventional approach as practiced in past Technology Acceptance Model reviews (i.e., keyword search, sometimes manual citation chaining); it identifies a set of research manuscripts that is more comprehensive and at least as precise. Our analysis shows that the conventional approach failed to identify a majority of past research. Like the three blind men examining the elephant, the conventional approach distorts the totality of the phenomenon. ADIT also enables researchers to statistically estimate the number of relevant manuscripts that were excluded from the resulting review article, thus enabling an assessment of the review article's representativeness.
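As a rough illustration of how the number of excluded relevant manuscripts might be estimated from a random sample, the sketch below scales a sample proportion up to the whole pool and attaches a normal-approximation interval. This is a generic sampling estimate and not the ADIT procedure itself; the function name `estimate_missed` and its parameters are hypothetical, and the interval ignores the finite-population correction.

```python
import math

def estimate_missed(pool_size, sample_size, relevant_in_sample, z=1.96):
    """Estimate how many relevant manuscripts remain in an unreviewed pool.

    Assumption (not from the abstract): a simple random sample of
    `sample_size` manuscripts was drawn from a pool of `pool_size`
    excluded manuscripts and `relevant_in_sample` were judged relevant.
    Returns (point_estimate, lower_bound, upper_bound) as counts.
    """
    if not 0 < sample_size <= pool_size:
        raise ValueError("sample_size must be in 1..pool_size")
    if not 0 <= relevant_in_sample <= sample_size:
        raise ValueError("relevant_in_sample must be in 0..sample_size")
    p = relevant_in_sample / sample_size
    # Standard error of a sample proportion (normal approximation).
    se = math.sqrt(p * (1 - p) / sample_size)
    low = max(0.0, p - z * se)
    high = min(1.0, p + z * se)
    return pool_size * p, pool_size * low, pool_size * high

if __name__ == "__main__":
    point, low, high = estimate_missed(pool_size=1000, sample_size=100,
                                       relevant_in_sample=10)
    print(f"estimated relevant manuscripts: {point:.0f} ({low:.0f} to {high:.0f})")
```

A small sample gives a wide interval, so the sample size limits how confidently a review's representativeness can be assessed.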
As crowdsourced user-generated content becomes an important source of data for organizations, a pressing question is how to ensure that data contributed by ordinary people outside of traditional organizational boundaries is of suitable quality to be useful for both known and unanticipated purposes. This research examines the impact of different information quality management strategies, and corresponding data collection design choices, on key dimensions of information quality in crowdsourced user-generated content. We conceptualize a contributor-centric information quality management approach focusing on instance-based data collection. We contrast it with the traditional consumer-centric fitness-for-use conceptualization of information quality that emphasizes class-based data collection. We present laboratory and field experiments conducted in a citizen science domain that demonstrate trade-offs between the quality dimensions of accuracy, completeness (including discoveries), and precision between the two information management approaches and their corresponding data collection designs. Specifically, we show that instance-based data collection results in higher accuracy, dataset completeness and number of discoveries, but this comes at the expense of lower precision. We further validate the practical value of the instance-based approach by conducting an applicability check with potential data consumers (scientists, in our context of citizen science). In a follow-up study, we show, using human experts and supervised machine learning techniques, that substantial precision gains on instance-based data can be achieved with post-processing. We conclude by discussing the benefits and limitations of different information quality and data collection design choices for information quality in crowdsourced user-generated content.
Background: Behaviour change is key to addressing both the challenges facing human health and wellbeing and to promoting the uptake of research findings in health policy and practice. We need to make better use of the vast amount of accumulating evidence from behaviour change intervention (BCI) evaluations and promote the uptake of that evidence into a wide range of contexts. The scale and complexity of the task of synthesising and interpreting this evidence, and increasing evidence timeliness and accessibility, will require increased computer support. The Human Behaviour-Change Project (HBCP) will use Artificial Intelligence and Machine Learning to (i) develop and evaluate a ‘Knowledge System’ that automatically extracts, synthesises and interprets findings from BCI evaluation reports to generate new insights about behaviour change and improve prediction of intervention effectiveness and (ii) allow users, such as practitioners, policy makers and researchers, to easily and efficiently query the system to get answers to variants of the question ‘What works, compared with what, how well, with what exposure, with what behaviours (for how long), for whom, in what settings and why?’. Methods: The HBCP will: a) develop an ontology of BCI evaluations and their reports linking effect sizes for given target behaviours with intervention content and delivery and mechanisms of action, as moderated by exposure, populations and settings; b) develop and train an automated feature extraction system to annotate BCI evaluation reports using this ontology; c) develop and train machine learning and reasoning algorithms to use the annotated BCI evaluation reports to predict effect sizes for particular combinations of behaviours, interventions, populations and settings; d) build user and machine interfaces for interrogating and updating the knowledge base; and e) evaluate all the above in terms of performance and utility.
Discussion: The HBCP aims to revolutionise our ability to synthesise, interpret and deliver evidence on behaviour change interventions that is up-to-date and tailored to user need and context. This will enhance the usefulness, and support the implementation of, that evidence.
In this paper, we explicate citizen science in information systems research. Citizen science in IS research is a partnership between IS researchers and people in their everyday lives. Citizen science projects in the IS field are defined by phenomena that interest both citizens and scientists and by the intervention of citizens in scientific processes for the purposes of scientific literacy and a more informed citizenry. We make the case for citizen science as part of a movement in IS research towards societally impactful research at the confluence of human behavior, technology, society, and environmental sustainability. We discuss the origins of citizen involvement in science and contemporary notions of citizen science from sociological, natural science, and public policy perspectives to build a working definition for the IS field. We provide examples of how one can leverage citizen science in IS research and discuss larger ideas for the possibility of citizen science.
The Open Science Framework (OSF) is a free, open source, research workflow web application developed and maintained by the Center for Open Science (COS).
Background: Previously, we reported on the low recall of Google Scholar (GS) for systematic review (SR) searching. Here, we test our conclusions further in a prospective study by comparing the coverage, recall, and precision of SR search strategies previously performed in Embase, MEDLINE, and GS. Methods: The original search results from Embase and MEDLINE and the first 1000 results of GS for librarian-mediated SR searches were recorded. Once the inclusion-exclusion process for the resulting SR was complete, search results from all three databases were screened for the SR's included references. All three databases were then searched post hoc for included references not found in the original search results. Results: We checked 4795 included references from 120 SRs against the original search results. Coverage of GS was high (97.2 %) but marginally lower than Embase and MEDLINE combined (97.5 %). MEDLINE on its own achieved 92.3 % coverage. Total recall of Embase/MEDLINE combined was 81.6 % for all included references, compared to GS at 72.8 % and MEDLINE alone at 72.6 %. However, only 46.4 % of the included references were among the downloadable first 1000 references in GS. When examining data for each SR, the traditional databases' recall was better than GS, even when taking into account included references listed beyond the first 1000 search results. Finally, precision of the first 1000 references of GS is comparable to searches in Embase and MEDLINE combined. Conclusions: Although overall coverage and recall of GS are high for many searches, the database does not achieve full coverage as some researchers found in previous research. Further, being able to view only the first 1000 records in GS severely reduces its recall percentages. If GS would enable the browsing of records beyond the first 1000, its recall would increase but not sufficiently to be used alone in SR searching. Time needed to screen results would also increase considerably. 
These results support our assertion that neither GS nor one of the other databases investigated, is on its own, an acceptable database to support systematic review searching.
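The three measures compared in this abstract can be illustrated with a short sketch. It assumes the usual definitions: coverage is the share of a review's included references that the database indexes at all, recall is the share that the search actually retrieved, and precision is the share of retrieved records that were included. The function `search_metrics` and the toy identifiers are hypothetical and are not data from the study.

```python
def search_metrics(retrieved, included, indexed):
    """Compute coverage, recall, and precision for one database search.

    retrieved: set of record IDs returned by the search strategy
    included:  set of record IDs included in the finished review
    indexed:   set of record IDs present anywhere in the database
    All three definitions are assumptions stated in the lead-in.
    """
    if not included or not retrieved:
        raise ValueError("included and retrieved must be non-empty")
    hits = retrieved & included  # included references the search found
    coverage = len(indexed & included) / len(included)
    recall = len(hits) / len(included)
    precision = len(hits) / len(retrieved)
    return coverage, recall, precision

if __name__ == "__main__":
    # Toy example: four included references, three of them indexed,
    # two of them retrieved among five search results.
    included = {"a", "b", "c", "d"}
    indexed = {"a", "b", "c", "x", "y", "z"}
    retrieved = {"a", "b", "x", "y", "z"}
    coverage, recall, precision = search_metrics(retrieved, included, indexed)
    print(f"coverage={coverage:.2f} recall={recall:.2f} precision={precision:.2f}")
```

Capping the retrieved set, as the 1000-record viewing limit does, can only shrink `hits`, which is why recall can fall well below coverage.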
Over the past 100 years, social science has generated a tremendous number of theories on the topics of individual and collective human behaviour. However, it has been much less successful at reconciling the innumerable inconsistencies and contradictions among these competing explanations, a situation that has not been resolved by recent advances in ‘computational social science’. In this Perspective, I argue that this ‘incoherency problem’ has been perpetuated by an historical emphasis in social science on the advancement of theories over the solution of practical problems. I argue that one way for social science to make progress is to adopt a more solution-oriented approach, starting first with a practical problem and then asking what theories (and methods) must be brought to bear to solve it. Finally, I conclude with a few suggestions regarding the sort of problems on which progress might be made and how we might organize ourselves to solve them.