Utilizing Big Data Analytics for Information Systems Research: Challenges, Promises and Guidelines


Abstract

This essay discusses the use of big data analytics (BDA) as a strategy of inquiry for advancing information systems (IS) research. In broad terms, we understand big data analytics as the statistical modelling of large, diverse, and dynamic datasets of user-generated content and digital traces. BDA, as a new paradigm for utilising big data sources and advanced analytics, has already found its way into some social science disciplines. Sociology and economics are two examples that have successfully harnessed BDA for scientific inquiry. Often, BDA draws on methodologies and tools that are unfamiliar for some IS researchers (e.g., predictive modelling, natural language processing). Following the phases of a typical research process, this article sets out to dissect BDA’s challenges and promises for IS research, and illustrates them by means of an exemplary study about predicting the helpfulness of 1.3 million online customer reviews. In order to assist IS researchers in planning, executing, and interpreting their own studies, and evaluating the studies of others, we propose an initial set of guidelines for conducting rigorous BDA studies in IS.
Utilizing big data analytics for information
systems research: challenges, promises and
guidelines

Oliver Müller, Iris Junglas, Jan vom Brocke and Stefan Debortoli
Institute of Information Systems, University of
Liechtenstein, Vaduz, Liechtenstein; College of
Business, Florida State University, Tallahassee,
FL, U.S.A.
Correspondence: Oliver Müller, Institute of
Information Systems, University of
Liechtenstein, Fürst-Franz-Josef-Strasse,
Vaduz 9490, Liechtenstein
Tel: +423 265 13 00;
Received: 18 November 2014
Revised: 7 December 2015
Accepted: 1 January 2016
European Journal of Information Systems advance online publication,
9 February 2016; doi:10.1057/ejis.2016.2
Keywords: big data; analytics; data source; methodology; information systems research
The online version of this article is available Open Access
Why worry about big data analytics (BDA)?
The proliferation of the web, social media, mobile devices, and sensor
networks, along with the falling costs for storage and computing resources,
has led to a near ubiquitous and ever-increasing digital record of computer-
mediated actions and communications, a record that has been termed big
data. Studies agree (e.g., Hilbert & López, 2011; IDC, 2011, 2014) that the
volume of data being generated and stored today is growing exponentially
(Kitchin, 2014). But big data is not just about volume (Constantiou &
Kallinikos, 2015; Yoo, 2015). Accompanying the increased size of data sets is
a growing variety of data sources and formats. In fact, IDC (2011) claims that
more than 80% of all digital content is unstructured and that two-thirds is
generated by individuals, rather than enterprises. The velocity of data has
increased too, resulting in reduced latency between the occurrence of a real-
world event and its mirroring digital footprint (vom Brocke et al, 2014).
As volume, variety, and velocity (Laney, 2001) increase, the veracity of
data is drawn into question (Bendler et al, 2014). Unlike research data
collected with a specific research question in mind and measured using
validated instruments, big data often just 'happens'. Private and government
European Journal of Information Systems (2016), 1–14
© 2016 Operational Research Society Ltd. All rights reserved 0960-085X/16
organisations increasingly collect big data without a concrete
purpose in mind, hoping that it might provide value someday
in the future (Constantiou & Kallinikos, 2015; Yoo,
2015). Marton et al (2013) see this shift to ex-post ordering
as a defining characteristic of big data:

"[Big data] can be ordered, analysed and thus made
potentially informative in ways that are not pre-
defined. The potential of big data to inform is based
on ordering as much data as possible through auto-
mated computational operations after its collection;
i.e., in an ex-post fashion" (p. 6).
Adopting such a data collection and analysis approach
for research purposes is questionable as the validity and
reliability of the data cannot be assumed (Agarwal & Dhar,
2015). Furthermore, the use of big data for research raises
ethical questions as individuals may not be aware that
their digital footprint is being analysed for scientific
purposes or that it is recorded at all (Zimmer, 2010).
Analysing data characterised by the 4Vs (i.e., volume,
velocity, variety, and veracity) necessitates advanced ana-
lytical tools. Big data is not self-explanatory, and without
the application of inferential computational techniques to
identify patterns in the data, it is just noise (Boyd &
Crawford, 2012; Dhar, 2013; Agarwal & Dhar, 2015; Yoo,
2015). But applying standard statistical methods to big
data sets or using advanced analytics for its analysis bears
potential pitfalls. For example, some data mining or
machine-learning techniques aim at prediction instead of
explanation (Shmueli & Koppius, 2011), and are fre-
quently labelled as 'black boxes' that hide their inner
workings and make it difficult to interpret their outputs
(Breiman, 2001b). For IS researchers, adopting such
methods can therefore be challenging.
The first scientific disciplines to add BDA to their
methodological repertoire were the natural sciences (see, e.g.,
Hey et al, 2009). Slowly, the social sciences also started
capitalising on the computational analysis of digital foot-
prints (Lazer et al, 2009). For instance, researchers studied
cultural trends, such as the adoption of new technologies,
by sequencing word frequencies in digitised texts (Michel
et al, 2011), used social media to predict daily swings of the
stock market (Bollen et al , 2011), and tested the effect of
manipulating the emotionality of social network news
feeds in a field experiment with hundreds of thousands of
participants (Kramer et al, 2014). Recently, the IS field has
also published its first BDA studies (e.g., Ko & Dennis,
2011; Kumar & Telang, 2012; Oestreicher-Singer &
Sundararajan, 2012; De et al, 2013; Goh et al, 2013;
Stieglitz & Dang-Xuan, 2013; Zeng & Wei, 2013). Still, the
novelty of BDA necessitates new norms and practices to be
established on how to successfully apply BDA in IS
research. Researchers who want to add BDA to their
methodological toolbox must be capable of planning,
executing, and interpreting their studies, and be equipped
to understand and evaluate the quality of others' studies (Agarwal
& Dhar, 2015). In addition, following the arguments laid
out in Sharma et al (2014), researchers have to
acknowledge that "insights do not emerge automatically
out of mechanically applying analytical tools to data"
(p. 435) and that "[i]nsights need to be leveraged by
analysts and managers into strategic and operational
decisions to generate value" (p. 436). Just like the
application of big data and advanced analytics in business does
not automatically lead to better decisions and increased
business value, researchers have to overcome a number of
challenges in order to leverage big data and advanced
analytics for generating new insights and translating them
into theoretical contributions.
This essay sets out to shed light on the challenges that
the application of BDA might bring, as well as the poten-
tials it may hold when IS researchers choose to utilise big
data and advanced analytical tools. Both are illustrated by
an exemplary study using natural language processing
(NLP) and machine-learning techniques to predict the
helpfulness of 1.3 million online customer reviews col-
lected from Amazon.com. On the basis of the discussion
and illustration of the challenges and promises of BDA, we
develop an initial set of guidelines for conducting rigorous
BDA studies in IS.
The remainder of this article roughly mirrors the phases
of a typical research process. We discuss issues related to
appropriately framing a BDA research question, the nature
of data collection, its computational analysis, and the
interpretation of results. For each phase, we discuss the
challenges and the promises of BDA for IS research. Subse-
quently, we provide an exemplary illustration of a BDA
study and a set of initial guidelines for IS researchers who
choose to apply BDA as part of their research endeavour.
Research question: appropriately framing a BDA study
Analytics methodologies
The literature on research methods and statistics tradition-
ally distinguishes two broad types of approaches to model-
ling (see, e.g., Breiman, 2001b; Gregor, 2006). Explanatory
modelling aims to statistically test theory-driven hypoth-
eses using empirical data, while predictive models aim to
make predictions about future or unknown events, using
models trained on historical data. In general, the quantita-
tive realm of IS research has a strong emphasis on develop-
ing and testing explanatory causal models; in fact,
compared with predictive approaches, "[e]xtant IS literature
relies nearly exclusively on explanatory statistical
modeling" (Shmueli & Koppius, 2011, p. 553). In contrast
to explanatory studies, predictive studies allow for an
inductive and data-driven discovery of relationships
between variables in a given data set. A researcher does not
have to explicitly formulate testable a priori hypotheses
about causal relationships, but can leave it to the algorithm
to uncover correlations between variables (Breiman, 2001b;
Shmueli, 2010; Shmueli & Koppius, 2011; Dhar, 2013;
Agarwal & Dhar, 2015). In this process, theory merely plays
the role of an analytical lens providing a broad direction for
the data collection and analysis process.
BDA studies can follow an explanatory or predictive
approach. Yet, researchers need to be aware of the chal-
lenges and potential pitfalls that can arise from the use of
big data sets or data-driven analytical methods. For exam-
ple, in explanatory studies researchers have to take care
when interpreting P-values of statistical hypothesis tests
on very large samples. As Lin et al (2013) outline, when
using traditional hypothesis tests in large samples
(n > 10,000), "the increase in statistical power lets P-values
converge quickly to 0 so that even minuscule effects can
become statistically significant" (p. 1). This may lead to
findings that have high statistical, but little practical,
significance. Hence, when interpreting explanatory BDA
studies, "[t]he question is not whether differences are
significant (they nearly always are in large samples), but
whether they are interesting" (Chatfield, 1995, p. 70; as
quoted in Lin et al, 2013).
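Lin et al's point can be verified with a back-of-the-envelope calculation. The following sketch (ours, not part of the cited study) computes the two-sided p-value of a two-sample z-test for one and the same minuscule observed effect at two sample sizes:

```python
import math

def two_sample_z_p(effect_size: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test for unit-variance groups,
    given a fixed observed standardised mean difference (Cohen's d)."""
    se = math.sqrt(2.0 / n_per_group)          # standard error of the mean difference
    z = effect_size / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability

d = 0.02  # a minuscule effect: 2% of a standard deviation
print(two_sample_z_p(d, 100))        # small sample: clearly non-significant (~0.89)
print(two_sample_z_p(d, 1_000_000))  # big-data sample: p converges to 0
```

The identical effect is nowhere near significance at n = 100 per group, yet overwhelmingly 'significant' at n = 1,000,000, which is precisely why effect sizes, not p-values, should carry the interpretation in large-sample studies.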
When it comes to predictive studies, many researchers
apply data mining or machine-learning algorithms. Owing
to their data-driven nature, they are especially well suited to
autonomously sift through data sets that are characterised
by massive volume, a broad variety of attributes, and high
velocity (e.g., real-time data streams). However, such a 'data
dredging' approach is not quite in line with the traditional
scientific method. Hence, predictive analytics has been
described as 'unscientific' and 'fixated on applied utility'
(Breiman, 2001b; Shmueli, 2010). In fact, many of the
algorithms used in BDA were designed for practical applica-
tions, such as credit risk scoring or recommending products
to individual customers, but "science, unlike advertising, is
not about finding patterns (although that is certainly part
of the process), it is about finding explanations for those
patterns" (Pigliucci, 2009, p. 534). Relying solely on correlation
instead of causality might lead to the identification of
patterns that not only provide little theoretical contribu-
tion, but are also vulnerable to changes and anomalies in
the underlying data. The errors in flu prediction made by
Google Flu Trends are a prime example of this pitfall. What
used to be praised as a ground-breaking example of BDA
missed the A-H1N1 pandemic in 2009 and dramatically
overestimated flu prevalence in 2012 and 2013. Google Flu
Trends was overfitting historical data, because the statistical
model behind its predictions relied exclusively on induc-
tively identified correlations between search terms and flu
levels, and not on biological cause-and-effect relationships; it
had also been trained only on seasonal data containing no
anomalies (e.g., pandemics), and it had not been updated
on an ongoing basis to reflect changes in users' information-
seeking behaviours (Lazer et al, 2014). Since the raison d'être
of the IS field is to study socio-technical systems, and not
systems governed by natural laws, it seems
unreasonable to rely solely on inductively uncovered corre-
lations that can change over time and contexts.
BDA for theory development
Despite the challenges, however, BDA might still be a
valuable tool for IS research, especially for theory
development. The massive size of big data sets allows for
the detection of small effects, the investigation of complex
relationships (e.g., interactions), a comparison of sub-
samples with one another (e.g., different geographies or
time frames), the incorporation of control variables into
the analysis, and the study of effects so rare that they are
seldom found in small samples (e.g., fraud) (Lin et al,
2013). Data-driven predictive algorithms can also help to
advance theory (Dhar, 2013; Agarwal & Dhar, 2015) as
"patterns [often] emerge before reasons for them become
apparent" (Dhar & Chou, 2001, p. 907). In the medical
field, for example, the C-Path (Computational Pathologist)
system makes data-driven predictions about malignant
breast tissue by examining microscopic images (Beck &
Sangoi, 2011). Its predictions are not only as accurate
as those of experts, but C-Path has also been able to
discover new markers that are indicative of breast cancer
(stromal morphologic structures) and that had been
ignored by human pathologists thus far (Rimm, 2011;
Martin, 2012).
IS researchers considering BDA might have to grow
comfortable with the idea that research can start with data
or data-driven discoveries, rather than with theory. For
qualitative-interpretive researchers and those familiar with
methodologies such as grounded theory, this way of
thinking is not unfamiliar. Grounded theory, per defini-
tion, is concerned with the inductive generation of new
theory from data, and thus less concerned about existing
theory (Glaser & Strauss, 1967). It utilises a rigid methodo-
logical approach to arrive at new theory and is predomi-
nantly based upon the empirical analysis of (mostly)
qualitative data, sometimes termed "empirically
grounded induction" (Berente & Seidel, 2014, p. 4). IS
researchers choosing to apply a BDA approach might want
to consider some of the principles of grounded theory. Like
grounded theorists, BDA researchers will spend an extra-
ordinary amount of time on understanding the nature of
the data, particularly if they have not collected them
themselves. While they do not have to manually code the
data, the ideas behind open, iterative, and selective coding
still apply (Yu et al, 2011). Researchers need to be open to
surprises in the data, iteratively analyse and re-analyse it,
and zoom in on relevant concepts that can make a
contribution. Likewise, theoretical coding, where codes
and concepts are woven into hypotheses in order to form
a model, seems equally applicable for BDA researchers. As a
model is said to emerge freely (and not by force), applying
the principles of grounded theory can aid researchers in
turning a research interest into a suitable research ques-
tion. A dedicated exploratory phase where a researcher can
probe the feasibility and vitality of his or her interest is not
only helpful, but ensures a rigorously chosen research
objective. Apart from an exploratory phase, BDA research-
ers, like grounded theorists, might also want to consider a
phase of theoretical triangulation in which the various
concepts identified are compared against extant theory.
Testing and re-testing the emerging research question
against existing theory can help to identify overlaps, as
well as white spots, and will ensure that the emerging
research question is truly unique and novel in nature.
Data collection: the nature of big data
More than size
There is more to big data than the 4Vs (Yoo, 2015). Big
data tries to be exhaustive in breadth and depth and more
fine-grained in resolution than traditional data, often
indexing individual persons or objects instead of aggregat-
ing data at the group level (Kitchin, 2014). In economics,
for example, big data collected by the Billion Prices Project
complements traditional economic
statistics by providing faster and more fine-granular
data about economic activities (Einav & Levin, 2014).
The project's daily price index, considering more than
5 million items from more than 300 online shops in more
than 70 countries, not only correlates highly with the
traditional consumer price index, but also provides pricing
data for countries in which official statistics are not readily
available (Cavallo, 2012).
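As a rough illustration of the idea behind such an index (a simplified sketch with invented item names and prices; the Billion Prices Project's actual methodology is far more elaborate), a daily price index can be computed as the geometric mean of item-level price relatives against a base day:

```python
import math

# Hypothetical scraped prices for the same basket on a base day and today
base_prices = {"milk": 1.00, "bread": 2.00, "fuel": 1.50}
today_prices = {"milk": 1.05, "bread": 2.10, "fuel": 1.44}

def price_index(base: dict, today: dict) -> float:
    """Geometric mean of price relatives, scaled so the base day equals 100."""
    rels = [today[item] / base[item] for item in base]
    return math.exp(sum(math.log(r) for r in rels) / len(rels)) * 100

print(round(price_index(base_prices, today_prices), 2))  # → 101.91
```

Scaling this computation from three items to millions of daily scraped online prices is what turns the toy calculation into a near-real-time economic indicator.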
Another characteristic that sets big data apart is related
to its purpose (Marton et al, 2013; Constantiou &
Kallinikos, 2015; Yoo, 2015). Big data is a by-product of
our technological advancements and, more often than
not, collected without a specific purpose in mind. Amazon's
CEO Jeff Bezos, for instance, is known for saying "we
never throw away data" (Davenport & Kim, 2013). But for
researchers there are big differences in access levels to
those data sources. For example, Twitter grants access to
the complete stream of tweets for selected partner businesses
and a handful of research groups only (Boyd & Crawford,
2012). Others have access to only small and idiosyncrati-
cally selected sub-samples of tweets (about 10% of all
public tweets). Hence, for the broader scientific commu-
nity it is virtually impossible to replicate studies using
Twitter data. Also, differences in accessibility could possibly
create a 'data divide' (Marton et al, 2013) and split the
research community into those that have access to big
data and those that have not. Already today, the authors
of many of the best-known BDA studies are affiliated with
big data companies, including Google (Ginsberg et al,
2009; Michel et al, 2011 ) and Facebook (Kramer et al,
2014). Finally, the use of big data for research purposes
also prompts ethical questions, especially because of its
fine-granular, indexical nature (Kitchin, 2014). While
some individual-level data is willingly, and often gladly,
provided by users, other data is not. Hence, pursuing research
objectives simply because the data is
available and accessible should prompt some ethical
considerations from IS researchers. For instance, a study
conducted by Kramer et al (2014) caused a public outcry
when it was published. The authors conducted a field
experiment on Facebook, in which they secretly manipu-
lated the news feeds of almost 700,000 users by filtering
out messages with emotional content in order to test if
this had an effect on the sentiments expressed in users'
own posts. The results showed that when messages with
positive emotions were cut out, users subsequently
posted less positive and more negative messages. The
opposite happened when the number of messages with
negative emotions was reduced. While both the method
and the results were highly interesting from a scientific
standpoint, the study caused numerous discussions about
informed consent and real-world online experiments.
Naturally occurring and open
Do these challenges imply that IS researchers should
refrain from using big data for research purposes entirely?
Up to now, the IS field has almost exclusively relied on self-
reports or observations collected via instruments like
surveys, experiments, or case studies (Hedman et al,
2013). While these strategies of enquiry have many
advantages (including tight controls), they also are costly
and subject to biases.
In contrast to data provoked by researchers through
interviews, surveys, or experiments, big data is predomi-
nantly user-generated, or 'naturally occurring' (Speer,
2002; Silvermann, 2006; Hedman et al, 2013). It is not
collected with a specific research purpose in mind and,
hence, may be less contrived. As a result, naturally
occurring data has the potential to alleviate many of
the methodological biases originating from the use of
data collection instruments (Speer, 2002). Whereas tra-
ditional instruments rely on self-reports, collect data
after the fact, and draw from a confined pool of
subjects, naturally occurring data could possibly reduce
instrument biases, as behaviours and attitudes are gath-
ered in an unobtrusive way and at the time and context
in which they occur, from a sampling pool that possibly
comprises the entire population, or at least a pool that
spans a wide variety of geographical, societal, and orga-
nisational borders. Simultaneously, we have to acknowl-
edge that even social networks, such as Facebook and
Twitter with millions of active users, are self-selected
tools and thus prone to non-response biases. Also, not
everybody uses these tools to the same extent, and some
users are not even humans, but bots (Boyd & Crawford,
2012). But irrespective of these shortcomings, the pool
for sampling is by far greater than that targeted with
traditional methods.
Naturally occurring data may or may not be open. In the
natural sciences, several open research data repositories
have emerged in recent years (Murray-Rust, 2008). The
Global Biodiversity Information Facility (GBIF), for exam-
ple, hosts more than 15,000 data sets with more than
400 million indexed records published by over 600 institu-
tions, capturing the occurrence of organisms over time
and across the planet. Since 2008, more than 900 peer-
reviewed research publications identified the use of GBIF
data, allowing a wide variety of researchers to scrutinise its
content and draw conclusions about its validity. Having a
similar platform for IS-related data sets available would not
only increase the validity of data captured in the field, but
also boost the replicability of studies. Such a data library
could even become part of the publication process. The
Journal of Biostatistics, for example, encourages authors to
publish their data set and the code for analysing the data
alongside their accepted paper. A dedicated Associate
Editor for Reproducibility then runs the code on the data
to check if it indeed produces the results presented in the
article (Peng, 2011). The next step would be to publicly
provide "linked and executable code and data" (Peng, 2011,
p. 5), for example, in the form of an interactive computa-
tional environment (e.g., IPython Notebook) that allows
readers to inspect and explore the original data and
reproduce all data analysis steps.
Data analysis: the computational enquiry of
unstructured data
Quantifying text
Without automated statistical inference, big data turns
into a burden rather than an opportunity. Yet, statistical
analysis only applies to structured numeric and catego-
rical data; and it is estimated that more than 80% of
today's data is captured in an unstructured format (e.g.,
text, image, audio, video) (IDC, 2011), much of it
expressed in ambiguous natural language. The analysis
of such data would usually prompt the use of qualitative
data analysis approaches, such as reading and manual
coding, but the size of data sets obtained from sources
like Twitter or Facebook renders any kind of manual
analysis virtually impracticable. Hence, researchers have
turned to NLP techniques to enable the automated
analysis of large corpora of textual data (Chen et al, 2012).
the surface of ho w humans assign meaning to n atural
language. For example, most topic models (i.e., algo-
rithms for discovering latent topic s in a collection of
documents) treat texts as simple unordere d sets of words ,
disregarding any syn ta x or word ord er (Ble i et al, 2003;
Blei, 2012). Consequently, such techniques struggle with
linguistic conc epts, su ch as compo sit ionali ty, ne gation ,
irony, or sarcasm (Lau et al, 2013). Further challenges
arise from a lack of information about the context in
which data have been generated or c ollected. Every unit
of tex t is treated in an egalitarian way irresp ective of who
authored it or what the situation was like when the text
was created (Lycett, 2013). While the c urrent properties
of NLP are in stark contrast to the assumptions and
practices of t raditional qualit ative social science research,
which emphasises features like natural se ttings, local
groundedness, context, richness, and holism (Miles &
Huberman, 1994), the application of NLP in social
science research has alrea dy produced inte resting nd-
ings. For example, Michel et al (2011) sequenced the
development of cultural trends based on the content of
more than 5 million digitised books, representing about
4% of all books ever printed. The texts were examined
using a simple, yet insightful, technique called n-gram
analysis that simply counts how often a word is used over
time. By computing the yearly relative frequency of all
words found, the authors were able to quantify cultural
phenomena, such as the adoption of technology. For
example, the study found that modern societies are
adopting inventions at an accelerating rate. While early
inventions (in the timeframe from 1800 to 1840) took
over 66 years from invention to widespread adoption
(measured by the shape of the frequency distribution of
the corresponding word over time), the average time-to-
adoption dropped to 50 years for the timeframe from
1840 to 1880, and to 27 years for the timeframe from
1880 to 1920.
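At its core, this kind of unigram analysis reduces to counting. A minimal sketch (with an invented two-entry toy corpus standing in for millions of digitised books) might look as follows:

```python
from collections import Counter

def yearly_relative_frequency(corpus, term):
    """For each year, the count of `term` divided by the total number of
    word tokens published that year (a unigram relative frequency)."""
    totals = Counter()  # year -> total tokens that year
    hits = Counter()    # year -> occurrences of `term` that year
    for year, text in corpus:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        hits[year] += sum(1 for t in tokens if t == term)
    return {year: hits[year] / totals[year] for year in totals}

# A toy (year, text) corpus; real n-gram studies aggregate millions of books
corpus = [
    (1900, "the telephone was a novelty"),
    (1950, "the telephone rang and the telephone rang again"),
]
print(yearly_relative_frequency(corpus, "telephone"))  # → {1900: 0.2, 1950: 0.25}
```

Plotting such relative frequencies over time is what lets the shape of a word's usage curve serve as a proxy for a technology's adoption.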
Numerous empirical studies and real-world applications
have shown that simple NLP models fed with sufficiently
large data sets can produce surprisingly accurate results
(Halevy et al, 2009). An experimental evaluation of the
Latent Semantic Analysis (LSA) topic modelling algorithm,
for example, has shown that the agreement between LSA
and human raters when coding clinical interview tran-
scripts reaches 90% (Dam & Kaufmann, 2008). And a study
on neural sentiment analysis found agreements between
80 and 85% for sentiment classifications of movie reviews,
performed independently by human raters and machine
algorithms (Socher et al, 2013).
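To make the idea of machine-human coding agreement concrete, the following sketch classifies review sentiment with a toy word-count lexicon (the word lists are our own illustration, not the neural model of Socher et al); its labels are exactly the kind of output one would compare against human raters' codes:

```python
# Illustrative sentiment lexicons (hypothetical, deliberately tiny)
POSITIVE = {"great", "excellent", "helpful", "love", "good"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "waste"}

def sentiment(text: str) -> str:
    """Classify a text by counting positive vs negative lexicon hits."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

reviews = [
    "Great movie, excellent acting.",  # a human rater would code: positive
    "A terrible waste of time.",       # a human rater would code: negative
]
print([sentiment(r) for r in reviews])  # → ['positive', 'negative']
```

Agreement between such automated labels and independent human codes, as in the studies cited above, is then measured with standard inter-coder reliability statistics.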
Overcoming the quantitative-qualitative divide
An increase in inter-coder reliability between human
raters and machines also has the potential to close the
divide between quantitative and qualitative methods, a
dilemma that has long captured the attention of IS
researchers (e.g., Venkatesh et al, 2013). Implementing
techniques like topic modelling or sentiment analysis
into software tools for qualitative data analysis (e.g.,
NVivo, Atlas.ti) could reduce the time required for quali-
tative data analysis, permit researchers to work with
much larger samples, and enable the discovery of subtle
patterns. For IS researchers who decide to apply a BDA approach
for studying qualitative data, it is important to dedicate a
significant amount of time to triangulating their empirical
results. This additional phase not only ensures the validity
of the analysis by placing it in the context of other
studies, but also adds, over time, to a growing body of
studies that apply a particular BDA approach. Meticu-
lously documenting the data analysis procedure is there-
fore critical. As the spectrum of NLP algorithms is already
extensive, and continuously in flux and advancing, ensuring
the traceability of the underlying algorithm is key. Like-
wise, IS researchers might also want to consider collabor-
ating with non-IS researchers when conducting BDA
studies, particularly with colleagues from the fields of
computer science and machine learning. Apart from an
advanced understanding of the phenomenon, this
ensures the development of new skill sets and the
advancement of the IS field through newly emerging
methodological tools.
Result interpretation: opening up the black box
of analytics
Accuracy vs interpretability
To analyse big data, be it qualitative or quantitative in
nature, researchers not only apply traditional inferential
statistics (e.g., analysis of variance, regression analysis),
but also increasingly make use of data mining and
machine-learning algorithms (see, e.g., Wu et al, 2008;
KDnuggets, 2011; Rexer, 2013 for an overview), especially
when the study's objective is prediction (Breiman, 2001b;
Shmueli, 2010; Kuhn & Johnson, 2013). It is widely
documented that such algorithmic methods can outper-
form simpler statistical models (e.g., linear regression) in
terms of their predictive accuracy, because they make fewer
statistical assumptions, can work with high-dimensional
data sets, are able to capture non-linear relationships
between variables, and automatically consider interaction
effects between variables (Breiman, 2001b; Shmueli, 2010;
Kuhn & Johnson, 2013).
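The advantage of relaxing linearity assumptions can be seen even in a toy setting. The sketch below (our illustration, not from the article) fits an ordinary least squares line and a 1-nearest-neighbour predictor, standing in for the 'algorithmic' model class, to data generated from a non-linear function:

```python
# Non-linear ground truth y = x^2, which a straight line cannot capture
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x * x for x in xs]

# Ordinary least squares fit of y = a + b*x (closed form)
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

def linear_pred(x):
    return a + b * x

def knn_pred(x):
    """1-nearest-neighbour: predict the observed y of the closest training x."""
    return min(zip(xs, ys), key=lambda p: abs(p[0] - x))[1]

x_new = 1.9
print(abs(linear_pred(x_new) - x_new ** 2))  # linear model: large error
print(abs(knn_pred(x_new) - x_new ** 2))     # nearest neighbour: small error
```

On this symmetric parabola the fitted slope is essentially zero, so the linear model predicts the overall mean everywhere, while the assumption-free neighbour lookup tracks the curvature; this is the flexibility-versus-assumptions trade-off the cited authors describe.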
Yet, this improvement in predictive accuracy comes at a
cost. Some of the most accurate algorithms, such as
support vector machines or random forests, are largely
incomprehensible (Martens & Provost, 2014). In other
words, these 'black box' algorithms are good at predicting
future or unknown events, but unable to provide explana-
tions for their predictions. When analysis results are
intended to be consumed by machines, for instance, when
placing online ads or ranking search results, this lack of
transparency may be coped with (Lycett, 2013). Yet, for
human decision makers, or when stakes are higher, the
interpretability and accountability of an algorithm grows
in importance (Diakopoulos, 2014). In the financial indus-
try, for example, regulatory requirements demand that all
automated decisions made by credit scoring algorithms are
transparent and justifiable, for example, to avoid discrimi-
nation against individuals. As a consequence, today's credit
scoring systems are based on simple linear models with a
limited number of well-understood variables, even
though these models could easily be outperformed by
non-linear models working on high-dimensional data sets
(Martens et al, 2007). Likewise, in the medical field it is
paramount for doctors to understand the decisions made
by algorithms and to be able to explain and defend them
against patients, colleagues, and insurance providers
(Breiman, 2001b): "Doctors can interpret logistic regres-
sion. There is no way they can interpret a black box
containing fifty [decision] trees hooked together. In a
choice between accuracy and interpretability, they'll go
for interpretability" (Breiman, 2001b, p. 209). These exam-
ples illustrate that good decisions are decisions that are
both of high quality and accepted by those responsible for
implementing them (Sharma et al, 2014).
Explaining predictions
The difficulties of interpreting the results and comprehending
the underlying algorithm are well-documented
drawbacks of data mining and machine-learning algorithms.
Researchers have long called upon fellow researchers to
provide explanations for their systems' outputs (Gregor &
Benbasat, 1999). Today, there is a growing body of
research seeking to disclose the inner workings of black
box data mining and machine-learning algorithms by
bolstering their highly accurate predictions with easy-to-comprehend
explanations (Martens et al, 2007; Robnik-Sikonja
& Kononenko, 2008; Kayande et al, 2009).
A dedicated explanatory phase is therefore vital for ensuring
the interpretability of predictive models. Such explanations
can take many forms, including comparing
and contrasting predictions with existing theories and
related explanatory empirical studies (Shmueli & Koppius,
2011), replaying predictions for extreme or typical cases
(Robnik-Sikonja & Kononenko, 2008), performing a sensitivity
analysis for selected model inputs and visualising the
corresponding model inputs and outputs (Cortez &
Embrechts, 2013), or extracting comprehensible decision
rules from complex ensemble models (Martens et al, 2007).
In this way, a researcher can first use data mining or
machine-learning algorithms to identify correlated variables
and then use more traditional statistics on the most
relevant variables to read and interpret the identified
relationships.
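The two-stage approach described above, an algorithmic model to surface relevant variables followed by a traditional statistical model to interpret them, can be sketched as follows. The data and feature indices are synthetic and purely illustrative, not from any study discussed here:

```python
# Sketch: use a random forest to rank candidate predictors, then fit an
# interpretable logistic regression on the most important ones.
# All data and feature choices below are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=5, random_state=0)

# Stage 1: the algorithmic model identifies the most predictive variables
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:5]

# Stage 2: a traditional statistical model on the selected variables,
# whose coefficients can be read and interpreted
logit = LogisticRegression().fit(X[:, top], y)
print(dict(zip(top.tolist(), logit.coef_[0].round(2))))
```

The second-stage coefficients are readable in the usual way (sign and magnitude), which is exactly what the black box model cannot offer on its own.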
Illustrative BDA study
In the previous sections, we have discussed the challenges
and promises of BDA along the various phases of the
research process, including the definition of the research
question, the nature of data collection, its computational
analysis, and the interpretation of results. In this section,
we illustrate how BDA can be applied in IS research by
presenting an integrated, exemplary BDA study from the
area of online customer reviews. The presentation of this
study mirrors the structure of the previous sections and,
with it, the various phases of the research process.
Research question
As our research objective, we have chosen to revisit the
question of 'What makes a helpful online review?'
(Mudambi & Schuff, 2010). Online customer reviews,
defined as 'peer-generated product evaluations posted on
company or third-party websites' (Mudambi & Schuff,
2010, p. 186), are ubiquitous testaments of product experiences
and are shown to influence a customer's decision-making
process (BrightLocal, 2014). While most studies on
review helpfulness in IS research have been explanatory
(e.g., Mudambi & Schuff, 2010; Cao et al, 2011), our
exemplary BDA study is predictive in nature. Specifically,
it predicts perceived review helpfulness based on the
characteristics of the review itself, that is, users' numerical
product ratings and textual comments.
A predictive model for review helpfulness might be
valuable for practical and theoretical reasons. It might be
able to determine the proper sorting and filtering of new
reviews or to pinpoint how to write more effective reviews.
In addition, and because of the inductive and data-driven
nature of many machine-learning and data mining algorithms
used for prediction, a predictive model for review
helpfulness might also hold the potential to contribute to
theory by discovering to-date unknown relationships
between review characteristics and perceived review
helpfulness.
Utilizing big data analytics for IS research Oliver Müller et al
European Journal of Information Systems
Data collection
Leading e-commerce platforms like Amazon, TripAdvisor,
or Yelp allow users not only to write reviews, but also to
rate the helpfulness of other users' reviews. To collect this
type of data for analysis, it can either be crawled and
scraped directly from the originating websites (which may
be against the terms of use of some websites), or downloaded
from open data repositories. For this example, we
used the open Amazon product reviews data set curated by
McAuley and colleagues (McAuley et al, 2015a, b) and
publicly available at
amazon/. The data set comprises 143.7 million reviews of
9.4 million products, spanning a period of 18 years from
May 1996 to July 2014. In the following, we focus on the
1.3 million reviews available for the video games product
category. We chose video game reviews as the unit of
analysis as video games represent a type of software that
customers happily discuss online; also, video games are
representative of a strong market with a total revenue of
more than U.S.$22 billion in 2014 (Entertainment
Software Association, 2015).
While customers assign high relevance to peer-generated
online reviews and even trust them as much as personal
recommendations (BrightLocal, 2014), it is important to
note that reviews can be subject to biases. For example, it is
well-known that online reviewers tend to be extreme in
their rating, leaning either to the very positive or very
negative end of the spectrum (Hu et al, 2009). Nonetheless,
the use of real reviews as a data source is common practice
in the IS field (e.g., Mudambi & Schuff, 2010; Cao et al,
2011; Ghose & Ipeirotis, 2011), thereby acknowledging their
validity and reliability for research purposes. The richness of
information that online reviews provide, combined with
the high volume of online reviews available on the web,
makes them ideal for studying customer information-seeking
and decision-making behaviours.
Although McAuley and colleagues' data set has already
been cleaned to some extent (i.e., duplicate reviews have
been removed), we pre-processed and transformed the
data even further for our illustrative example. First, an
exploratory data analysis revealed that about half of the
reviews did not possess a single helpfulness rating (neither
positive nor negative). We excluded reviews with fewer than
two helpfulness ratings in order to increase the reliability
of our analysis. This reduced our data set to 495,358 video
game reviews. A second issue we encountered was related
to the distribution of the dependent variable, that is,
review helpfulness. Following prior research (e.g.,
Mudambi & Schuff, 2010), we measured review helpfulness
as the ratio of helpful votes to total votes for a given
review. The resulting measure showed a w-shaped distribution.
In order to avoid statistical concerns arising from the
extreme distribution of values, we dichotomised the
review helpfulness variable (i.e., reviews with a helpfulness
ratio of > 0.5 were recoded as helpful, and reviews with a
ratio ≤ 0.5 as not helpful) (Lesaffre et al, 2007).
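The filtering and dichotomisation step described above can be sketched in a few lines; the column names and toy values are assumptions for illustration, not the actual data set schema:

```python
# Sketch of the pre-processing step: drop reviews with fewer than two
# helpfulness ratings, then dichotomise the helpfulness ratio.
# Column names and values are hypothetical.
import pandas as pd

reviews = pd.DataFrame({
    "helpful_votes": [0, 1, 5, 9, 2],
    "total_votes":   [0, 1, 6, 10, 4],
})

# Keep only reviews with at least two helpfulness ratings
reviews = reviews[reviews["total_votes"] >= 2].copy()

# Helpfulness ratio; > 0.5 is recoded as helpful (1), <= 0.5 as not helpful (0)
reviews["ratio"] = reviews["helpful_votes"] / reviews["total_votes"]
reviews["helpful"] = (reviews["ratio"] > 0.5).astype(int)
print(reviews)
```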
Data analysis
While early research on review helpfulness mostly focused
on quantitative review characteristics, especially the
numerical star rating, more recent studies have started to
analyse their textual parts (e.g., Mudambi & Schuff, 2010;
Cao et al, 2011; Ghose & Ipeirotis, 2011; Pan & Zhang,
2011; Korfiatis et al, 2012). Yet, these studies largely focus
on easily quantifiable aspects, including syntactic or stylistic
features such as review length, sentiment or readability,
and mostly neglect the actual content of the review text
(i.e., what topics is a reviewer writing about?). In order to
capture a review's content and its impact on review helpfulness,
we applied probabilistic topic modelling using the
Latent Dirichlet Allocation (LDA) algorithm.
Probabilistic topic models are unsupervised machine-learning
algorithms that are able to discover topics running
through a collection of documents and annotate
individual documents with topic labels (Blei et al, 2003;
Blei, 2012). The underlying idea is the distributional
hypothesis of statistical semantics, that is, words that
co-occur in similar contexts tend to have similar
meanings (Turney & Pantel, 2010). Consequently, sets of
highly co-occurring words (e.g., 'work', 'doesn't', 'not',
'download', 'install', 'windows', 'run') can be interpreted
as a topic (e.g., installation problems) and can be used to
annotate documents with labels (Boyd-Graber et al, 2014).
In statistical terms, each topic is defined by a probability
distribution over a controlled vocabulary of words, and
each document is assigned a probabilistic distribution over
all the topics. The per-document topic distribution vector
represents a document's content at a high level and can be
used as input for further statistical analysis (e.g., predictive
modelling).
Using LDA we extracted the 100 most prevalent topics
from our corpus of video game reviews and annotated
each review with a vector of topic probabilities.
(We used the LDA implementation provided by
MineMyText.com and pre-processed all texts by filtering out
standard stop words (e.g., 'and', 'the', 'I') as well as a short
list of custom stop words (e.g., 'game', 'play') and lemmatizing
all words of the corpus. Overall, the computation
took about 6 h on a server with 24 CPU cores
and 120 GB main memory. The results of the analysis
are publicly available at
amazon-video-games-reviews.) These topic probabilities,
along with review rating and review length, two variables
that were consistently found to be associated with review
helpfulness in prior research (Trenz & Berger, 2013), were
used as predictors for determining whether a new review
would be perceived as helpful or not.
For training the predictive model we used random forests,
an ensemble supervised-learning technique that is able to
process high-dimensional data sets, model non-linear
relationships between variables, consider complex variable
interactions, and is robust against data anomalies (e.g.,
non-normally distributed variables, outliers) (Kuhn &
Johnson, 2013). A random forest model is constructed by
generating a multitude of decision trees based on bootstrapped
sub-samples, such that at each split of a tree only a
random sample of the available variables is considered as
potential split candidates (Breiman, 2001a). Model training
was performed on 80% of the overall data set. We used
the implementation provided by the scikit-learn Python
package and set the number of trees to 128. (The code and
data are publicly available at http://www.minemytext.
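The training setup, an 80/20 split and a scikit-learn random forest with 128 trees, can be sketched as follows; synthetic features stand in for the topic probabilities plus rating and length:

```python
# Sketch of the training setup: 80/20 split, random forest with 128 trees.
# The synthetic features stand in for the 100 topic probabilities plus
# review rating and review length; data is illustrative only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=102,
                           n_informative=10, random_state=0)

# Hold out 20% of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

forest = RandomForestClassifier(n_estimators=128, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```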
Figure 1 (left) shows the Receiver Operating Characteristic
(ROC) curve, illustrating the predictive performance
of our binary classification on the holdout test set (20% of
the overall data set) by plotting the true positive rate
against the false positive rate. The area under the ROC
curve (AUC) amounts to 0.73, which means that the
model can distinguish between a randomly drawn helpful
review and a randomly drawn unhelpful review with an
accuracy of 73% (Fawcett, 2006). (Note that a random
classifier has an AUC score of 0.5, whereas perfect classification
is represented by a score of 1.0.)
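Computing the ROC curve and its AUC on a holdout set follows a standard recipe; the labels and scores below are a toy example:

```python
# Sketch: ROC curve and AUC for a binary classifier on a holdout set.
# Labels and predicted scores are a small toy example.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55])

# True positive rate vs false positive rate over all score thresholds
fpr, tpr, _ = roc_curve(y_true, y_score)

# AUC = probability that a random positive is ranked above a random negative
auc = roc_auc_score(y_true, y_score)
print(round(auc, 2))
```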
Random forests usually have a high predictive accuracy,
yet it can be difficult to interpret their output as they lack
the regression coefficients and P-values known from regression
analysis. The only diagnostic parameters available to interpret
a random forest model are the variable importance
measures (i.e., mean decrease in Gini coefficient), which
measure how each variable contributes to the homogeneity
of the nodes and leaves in the random forest. Figure 1
(right) shows the 20 most influential variables for
predicting the helpfulness of a video game review. The
two most important variables are review length and star
rating; the remaining variables comprise topics extracted
through LDA.
In order to take a closer look at the topics extracted for
predicting review helpfulness, Table 1 illustrates the most
probable words associated with each topic and our interpretation
of the respective topic, derived by co-examining
the per-topic word distributions and per-document topic
distributions. Overall, the list reveals a broad range of
topics that impact a review's helpfulness, including, for
example, overall affective judgements (e.g., 19, 43, 82),
economic factors (e.g., 10, 86), game features (e.g., 44, 72),
technical issues (e.g., 89, 9), topics related to specific
audiences (e.g., 63, 100), comparison with other games
(e.g., 2, 5), and service-related aspects (e.g., 40, 24). One
topic (52) represented Spanish reviews, mainly for the
game Grand Theft Auto.
To ensure the measurement validity of our study, we
empirically triangulated the LDA results, that is, the per-topic
word distributions and the per-document topic distributions,
with judgments of human coders by adapting a
method proposed by Chang et al (2009). In a first step, we
employed a word intrusion task to measure the semantic
coherence of topics. Since topics are represented by words
that co-occur with high probability, the idea behind the
word intrusion task is to insert a randomly chosen word
(intruder) into a set of words representative of a topic and
ask human judges to identify the intruder. For each topic,
we generated five randomly ordered sets of six words: the
five most probable words for the given topic plus one
randomly chosen word with low probability for the
respective topic. We presented these sets to three independent
human coders via the crowdsourcing platform Amazon
Mechanical Turk and prompted them to identify the
intruder. In total, this procedure resulted in 270
comparisons between algorithm and human coders; averaged
over all topics, the two agreed in 71% of the cases (SD:
19%).
Figure 1 ROC curve and variable importance measures of random forest classifier.
In a second step, we conducted a best topic task to
validate the topic assignments for each review. (The task is
a variation of the topic intrusion task developed by Chang
et al (2009). Instead of identifying an intruder among a set
of highly probable topics, we chose to identify the best
match of a topic. We applied the best topic task because
the LDA algorithm had shown that the majority of reviews
were dominated by one topic. Hence, it was impossible to
create random sets of three or four highly probable topics
for any given review.) Again, for each topic we randomly
selected reviews and presented them to human coders on
Amazon Mechanical Turk. For each review, three topics
were presented: one that was most probable and two that
were randomly selected low-probability topics. Each topic
was described by its 10 most probable words. Again, for
each of the 18 topics we generated 5 tasks and 3 independent
workers were prompted to identify which of the 3
topics best matched the review presented (again, resulting
in 270 comparisons overall). The agreement between
algorithm and coders, averaged over all topics, amounted
to 85% (SD: 17%). (For both tasks, testing for differences in
mean percentage agreement for randomly selected sub-samples
(50% of cases) of the 270 comparisons showed no
significant differences, indicating that the experiments
produced reliable results.)
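Generating a single word-intrusion item from a topic's word probabilities can be sketched as follows; the topic and its word probabilities are hypothetical, not taken from the study:

```python
# Sketch: build one word-intrusion item from a topic's word probabilities.
# A low-probability "intruder" is mixed into the topic's five top words;
# the topic shown here is hypothetical.
import random

random.seed(0)

# Hypothetical per-topic word probabilities
topic = {"work": 0.20, "install": 0.18, "windows": 0.15, "download": 0.12,
         "run": 0.10, "kid": 0.001, "gift": 0.001, "story": 0.001}

ranked = sorted(topic, key=topic.get, reverse=True)
top_five = ranked[:5]
intruder = random.choice(ranked[-3:])  # a word improbable for this topic

item = top_five + [intruder]
random.shuffle(item)  # coders must spot the intruder in the shuffled set
print(item, "| intruder:", intruder)
```

If a topic is semantically coherent, human coders should find the intruder easily; low agreement flags an incoherent topic.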
Result interpretation
Interpreting the results of a black-box algorithm like
random forests can be challenging, as the results of model
fitting are not as easy to read as, for example, the
outputs of a logistic regression. Variable importance measures
only provide a hint about the predictive power of
individual variables. Without further analysis it remains
unclear whether a variable has a positive or negative
influence on the probability of belonging to a certain class
(i.e., helpful or unhelpful). One way to shed more light on
a random forest model is to plot the values of a selected
independent variable against the class probabilities predicted
by the model (i.e., predictions of the dependent
variable) (Friedman et al, 2013). Figure 2 shows such partial
dependence plots of three selected topic probabilities,
charted against the predicted probability of being
helpful. Topic 40 (customer service), for example, exhibits
a strong negative correlation with review helpfulness. If a
reviewer spends more than 10% of his or her words on
customer service issues, it is highly likely that the review,
on average, will be perceived as unhelpful. In contrast, the
more reviewers write about problems with digital rights
management (Topic 89) or the appropriateness for kids
(Topic 100), the more likely a review receives positive
rather than negative helpfulness ratings.
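The partial dependence idea can be computed by hand: fix one feature at each value of a grid for all observations and average the model's predicted class probability. The fitted forest and data below are synthetic stand-ins:

```python
# Sketch: manual partial dependence of the predicted class probability
# on one feature of a fitted random forest. Data is synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Fix feature 0 at each grid value for all rows, and average the predicted
# probability of the positive class (marginalising over the other features)
grid = np.linspace(X[:, 0].min(), X[:, 0].max(), num=20)
avg_pred = []
for value in grid:
    X_mod = X.copy()
    X_mod[:, 0] = value
    avg_pred.append(forest.predict_proba(X_mod)[:, 1].mean())

print(list(zip(grid.round(2)[:3], np.round(avg_pred[:3], 3))))
```

Plotting `grid` against `avg_pred` yields a curve of the kind shown in Figure 2, making the direction of a variable's influence visible.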
The final step of the results interpretation phase is to
compare and contrast the data-driven discoveries with
extant theory and literature. While a full discussion is out
of scope for this illustrative study, we want to highlight two
interesting findings. First, as mentioned before, Topic 40
(customer service) has a strong negative impact on review
helpfulness. A potential explanation for this observation
might be that a considerable portion of people use Amazon
as a source of product reviews but not as a purchasing
platform. Consequently, judgments about Amazon's customer
service, such as speed of delivery or return policies, are
irrelevant (or not helpful) for these users, as they are not
related to the quality of the product. Second, Topics 89
(digital rights management) and 100 (appropriateness for
kids) show a strong positive relationship with review helpfulness.
Both topics are connected to the concept of
experience goods, that is, products that have qualities that are
difficult for a consumer to evaluate in advance, but can be
easily assessed after consumption (Nelson, 1970). In our
case, customers find it helpful that other gamers or parents
include warnings about digital rights management restrictions
(especially the need to always stay connected to a
remote server when playing) or inappropriate content (e.g.,
violence, sex). This reinforces the findings of prior research
that has compared helpfulness ratings of experience goods
vs search goods (e.g., Mudambi & Schuff, 2010).

Table 1 Topics for predicting review helpfulness, ordered by importance

Topic | Most probable words, ordered by probability | Interpretation
19 | bad buy make money graphic terrible horrible waste good before | Negative affect
2 | call duty cod map multiplayer ops good black battlefield campaign | Call of Duty series
59 | dont good buy thing alot didnt bad make people doesnt | Don't buy it
89 | ea city drm server buy internet online simcity make connection | Digital rights management
72 | great graphic amaze love awesome recommend story good gameplay buy | Gameplay and storyline
52 | gta city de mission la grand auto theft car el | Spanish reviews
63 | love son gift buy christmas great year grandson purchase daughter | Gift for a family member
44 | good pretty graphic fun great bad nice cool thing lot | Graphics
9 | work download window computer install run steam software problem instal | Installation problems
10 | money buy worth waste time spend save pay good work | Not worth the money
82 | fun great lot love good time recommend enjoy challenge easy | Fun and enjoyment
40 | amazon customer send work product return back service support receive | Customer service
5 | halo mutliplayer xbox campaign good reach great stroy map weapon | Halo series
70 | great work good love buy prices product recommend awesome deal | Quality of console hardware
86 | price buy great worth good pay amazon deal cheap find | Good deal
43 | love awesome cool buy fun great good thing graphic make | Positive affect
24 | review star give read buy people rate write good bad | Other customers' reviews
100 | kid love fun year child young great adult age enjoy | Appropriateness for kids
When reflecting on the illustrative BDA study, a few
findings should be noted. With regards to the research
question and contribution, we have showcased how a
BDA study that is data-driven and predictive in nature
can nonetheless be the basis for theorisation. In terms of
data collection, we illustrated the value of open data
repositories and their remaining need for exploratory
data analysis and data preparation. In the data analysis
phase, we showed the potential of NLP and machine-learning
techniques for discovering patterns and relationships
in large and unstructured data sets, and how
the results of such techniques can be triangulated with
human judgements. And finally, we demonstrated the
use of intuitive visualisations and theoretical triangulation
for result interpretation. In the next section, we will
reflect on the broader changes that BDA might bring to
the IS discipline and propose a set of initial guidelines for
conducting rigorous BDA studies in IS.
Initial BDA guidelines for IS researchers
Big data analytics signifies a novel and complementary
approach to using large, diverse, and dynamic sets of user-generated
content and digital trace data as well as new
methods of analytics to augment scientific knowledge. But
for BDA to take place, the emphasis we put on the various
phases of the traditional research process might have to be
re-weighted. During the initial phase, this means that
researchers should spend more time framing the research
question and exploring its feasibility; during the data
collection phase, it means that researchers, while able to
reduce the amount of effort needed for the actual data
collection, have to invest more time in understanding
and preparing raw data; during the analysis phase, it
means that researchers should verify their results based
on new evaluation criteria (e.g., inter-coder reliability
between humans and algorithms); and lastly, during the
interpretation phase, it means that researchers need to
spend an extraordinary amount of time making sense of
the results, sometimes even necessitating an additional
explanatory phase. Explicit periods of triangulation should
enter the research process in more salient ways than
before. As to theoretical triangulation, researchers will
need to pay close attention to existing theoretical constructs
in order to ensure the legitimacy of the research
question as well as the interpretability of findings. Ensuring
that studies are not merely built on volatile correlational
relationships without enhancing theory is of vital
importance. As to empirical triangulation, researchers will
have to place their own empirical results in relation to
others. Until sufficient evaluation criteria are developed,
findings of first-generation BDA studies should be contrasted
with findings from studies that have applied more
traditional strategies of enquiry. Table 2 summarises our
initial set of guidelines for IS researchers applying BDA.
Note that each of the guidelines has been addressed, to a
certain extent, in the illustrative BDA study presented in
the previous section. While we cannot claim that the
Figure 2 Topic probabilities vs predicted class probabilities (panels: Topic 40, Topic 89, Topic 100; x-axis: topic probability; y-axis: predicted probability of being helpful).
proposed set of guidelines is complete, nor that all of the
presented guidelines will stand the test of time, we nevertheless
contend that they represent a valid starting point
for jointly and iteratively testing and revising quality
criteria for BDA in IS research. Reflecting on the guidelines,
we can observe that each phase of the research process
requires a revised set of actions and abilities. A shift in
weight and emphasis for the various phases of the research
process will most probably also result in a skill set change
for IS researchers. Stronger emphasis needs to be placed on
developing skills for data preparation and the deployment
of analytical tools and cross-instrumental evaluation criteria.
IS researchers should extend their repertoire of
statistical methods to also include approaches that go
beyond statistical hypothesis testing. Among these are
data mining and machine-learning algorithms, NLP techniques,
as well as graphical methods for visualising large
data sets in intuitive ways.
In the natural sciences, the evolution of the scientific
method is often portrayed as four eras (Bell et al, 2009;
Hey et al, 2009). In the early times, research was based on
empirical observation; this was followed by an era of
theoretical science, in which building causal models was
cultivated. As models became more complex, an era of
computational research using simulations emerged.
Today, many natural science disciplines find themselves
in an era of massive data exploration, in which data is
captured in real time via ubiquitous sensors and continuously
analysed through advanced statistical models.
Jim Gray, the recipient of the Association for Computing
Machinery's 1998 Turing Award, referred to this latter
epoch as data-intensive scientific discovery, or simply
the fourth paradigm (Hey et al, 2009). The IS discipline
is not a natural science, but rather studies socio-technical
systems that are more difficult to measure
and theorise. Hence, we do not argue that the advent of
BDA represents an evolutionary step or paradigm shift
for IS research. However, we must not ignore the potential
that BDA, as a novel and complementary data
source and data analysis methodology, can offer. In order
to foster its diffusion, in this essay we discussed and
illustrated its challenges and promises and proposed a set
of initial guidelines intended to assist IS researchers in
conducting rigorous BDA studies, and reviewers, editors,
and readers in evaluating such studies. At the same time,
we advise against a blind application of these guidelines,
and recommend checking their applicability for each
specific research project, and adapting and extending them
where necessary.
About the authors
Oliver Müller is an Assistant Professor at the Institute of
Information Systems at the University of Liechtenstein. He
holds a Ph.D. and Master's degree from the University of
Münster, Germany. His research interests are decision
Table 2 A summary of guidelines for IS researchers applying BDA

Research question
- Grow comfortable that research can start with data instead of theory
- Undergo a phase of theoretical triangulation in order to verify the vitality of a data-driven research question
- Position your study as an explanatory or predictive study and follow the corresponding methodology
- Particularly for predictive studies, do not only aim for practical utility, but plan to make a theoretical contribution
- Plan to re-adjust the time and effort spent on the various phases of the research process

Data collection
- Justify the selection of big data as the primary data source
- Discuss the nature of the data with regards to its validity and reliability
- Document the data collection process in detail; ensure transparency about the type and nature of data used
- If applicable, provide access to the data used in the form of a database that accompanies the research paper

Data analysis
- Document the data pre-processing steps in detail, especially for studies applying NLP
- Algorithms evolve and are in flux at all times; rely on other disciplines, particularly computer science and machine learning, to ensure their validity
- If possible, provide access to the algorithms used for data analysis
- Apply empirical triangulation in order to ensure the statistical validity of the analysis; select appropriate evaluation criteria in order to ensure comparability with other studies

Result interpretation
- Make black box algorithms transparent by adding an explicit explanatory phase to the research process
- Theoretically triangulate the results by drawing on existing theory; at the very least, discuss the results against the backdrop of existing studies
- If applicable, try to replicate findings using traditional data analysis methods
support systems, business analytics, text mining, and big
data. His work has been published in internationally
renowned journals (e.g., Journal of the Association for
Information Systems, Communications of the Association for
Information Systems, Business & Information Systems Engineering,
Computers & Education, IEEE Transactions on Engineering
Management) and presented at major international conferences.
Iris Junglas is an Associate Professor for Information
Systems at Florida State University. Her research interest
captures a broad spectrum of topics, most prominently the
areas of E-, M- and U-Commerce, health-care information
systems, the consumerization of IT, and business
analytics. Her research has been published in the European
Journal of Information Systems, Information Systems Journal,
Journal of the Association for Information Systems, Management
Information Systems Quarterly, Journal of Strategic Information
Systems, and various others. She serves on the
editorial board of the Management Information Systems
Quarterly Executive and the Journal of Strategic Information
Systems and is also a senior associate editor for the European
Journal of Information Systems.
Jan vom Brocke is professor for Information Systems at
the University of Liechtenstein. He is the Hilti Endowed
Chair of Business Process Management, Director of the
Institute of Information Systems, Co-Director of the Inter-
national Master Program in Information Systems, Director
of the Ph.D. Program in Business Economics, and Vice-
President Research and Innovation at the University of
Liechtenstein. In his research he focuses on digital innova-
tion and transformation capturing business process man-
agement, design science research, Green IS, and Neuro IS,
in particular. His research has been published in Manage-
ment Information Systems Quarterly, Journal of Management
Information Systems, Business & Information Systems Engi-
neering, Communications of the Association for Information
Systems, Information & Management and others. He is
author and editor of seminal books, including the International
Handbook on Business Process Management as well as
the book BPM – Driving Innovation in a Digital World. He has
held various editorial roles and leadership positions in
Information Systems research and education.
Stefan Debortoli is a research assistant and Ph.D. candidate
at the Institute of Information Systems at the
University of Liechtenstein. His doctoral studies focus on
applying Big Data Analytics as a new strategy of inquiry in
Information Systems research. Within the field of Big Data
Analytics, he focuses on the application of text mining
techniques for research purposes. His work has been published
in the Business & Information Systems Engineering
journal, the Communications of the Association for Information
Systems, and the European Conference on Information Systems.
Before joining the team, he gained over 5 years of
working experience in the field of software engineering and
IT project management across various industries.
AGARWAL RandDHAR V (2015) Editorial big data, data science, and
analytics: the opportunity and challenge for IS research. Information
Systems Research 25(3), 443448.
ECK AandSANGOI A (2011) Systematic analysis of breast cancer
morphology uncovers stromal features associated with survival. Science
Translational Medicine 3(108), 108113.
ELL G, HEY TandSZALAY A (2009) Beyond the data deluge. Science
323(5919), 12971298.
uncertainty in big data. Business & Information Systems Engineering
6(5), 279288.
ERENTE NandSEIDEL S (2014) Big data & inductive theory development:
towards computational grounded theory? In Proceedings of the
Americas Conference on Information Systems (T
Eds), Association for Information Systems, Savannah, USA.
BLEI D (2012) Probabilistic topic models. Communications of the ACM 55(4), 77-84.
BLEI D, NG A and JORDAN M (2003) Latent Dirichlet allocation. Journal of Machine Learning Research 3(1), 993-1022.
BOLLEN J, MAO H and ZENG X (2011) Twitter mood predicts the stock market. Journal of Computational Science 2(1), 1-8.
BOYD D and CRAWFORD K (2012) Critical questions for big data: provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society 15(5), 662-679.
BOYD-GRABER J, MIMNO D and NEWMAN D (2014) Care and feeding of topic models: problems, diagnostics, and improvements. In Handbook of Mixed Membership Models and Their Applications (AIROLDI EM, BLEI D, EROSHEVA EA and FIENBERG SE, Eds), pp. 3-34, CRC Press, Boca Raton.
BREIMAN L (2001a) Random forests. Machine Learning 45(1), 5-32.
BREIMAN L (2001b) Statistical modeling: the two cultures. Statistical Science 16(3), 199-231.
BRIGHTLOCAL (2014) Local consumer review survey 2014. [WWW document] (accessed 14 January 2016).
CAO Q, DUAN W and GAN Q (2011) Exploring determinants of voting for the helpfulness of online user reviews: a text mining approach. Decision Support Systems 50(2), 511-521.
CAVALLO A (2012) Scraped data and sticky prices. [WWW document] (accessed 6 July 2015).
CHANG J, BOYD-GRABER J, GERRISH S, WANG C and BLEI D (2009) Reading tea leaves: how humans interpret topic models. In Proceedings of the Advances in Neural Information Processing Systems Conference, pp. 1-9, Neural Information Processing Systems, Vancouver, Canada.
CHATFIELD C (1995) Problem Solving: A Statistician's Guide. Chapman & Hall, Boca Raton.
CHEN H, CHIANG R and STOREY V (2012) Business intelligence and analytics: from big data to big impact. MIS Quarterly 36(4), 1165-1188.
CONSTANTIOU I and KALLINIKOS J (2015) New games, new rules: big data and the changing context of strategy. Journal of Information Technology 30(1), 44-57.
CORTEZ P and EMBRECHTS M (2013) Using sensitivity analysis and visualization techniques to open black box data mining models. Information Sciences 225(1), 1-17.
DAM G and KAUFMANN S (2008) Computer assessment of interview data using latent semantic analysis. Behavior Research Methods 40(1), 8-20.
DAVENPORT T and KIM J (2013) Keeping Up with the Quants: Your Guide to Understanding and Using Analytics. Harvard Business School Press, Boston.
DE P, HU Y and RAHMAN M (2013) Product-oriented web technologies and product returns: an exploratory study. Information Systems Research 24(4), 998-1010.
Utilizing big data analytics for IS research Oliver Müller et al
European Journal of Information Systems
DHAR V (2013) Data science and prediction. Communications of the ACM 56(12), 64-73.
DHAR V and CHOU D (2001) A comparison of nonlinear models for financial prediction. IEEE Transactions on Neural Networks 12(4), 907-921.
DIAKOPOULOS N (2014) Algorithmic accountability reporting: on the investigation of black boxes. [WWW document] (accessed 3 July 2015).
EINAV L and LEVIN J (2014) Economics in the age of big data. Science 346(6210), 715-721.
ENTERTAINMENT SOFTWARE ASSOCIATION (2015) Essential facts about the computer and video game industry: 2015 sales, demographic and usage data. [WWW document] uploads/2015/04/ESA-Essential-Facts-2015.pdf (accessed 14 January 2016).
FAWCETT T (2006) An introduction to ROC analysis. Pattern Recognition Letters 27(8), 861-874.
FRIEDMAN J, HASTIE T and TIBSHIRANI R (2013) The Elements of Statistical Learning. Springer, New York.
GHOSE A and IPEIROTIS P (2011) Estimating the helpfulness and economic impact of product reviews: mining text and reviewer characteristics. IEEE Transactions on Knowledge and Data Engineering 23(10), 1498-1512.
GINSBERG J, MOHEBBI MH, PATEL RS, BRAMMER L, SMOLINSKI MS and BRILLIANT L (2009) Detecting influenza epidemics using search engine query data. Nature 457(7232), 1012-1014.
GLASER B and STRAUSS A (1967) The Discovery of Grounded Theory: Strategies for Qualitative Research. Aldine Pub, Chicago.
GOH K, HENG C and LIN Z (2013) Social media brand community and consumer behavior: quantifying the relative impact of user- and marketer-generated content. Information Systems Research 24(1), 88-107.
GREGOR S (2006) The nature of theory in information systems. MIS Quarterly 30(3), 611-642.
GREGOR S and BENBASAT I (1999) Explanations from intelligent systems: theoretical foundations and implications for practice. MIS Quarterly 23(4), 497-530.
HALEVY A, NORVIG P and PEREIRA F (2009) The unreasonable effectiveness of data. IEEE Intelligent Systems 24(2), 8-12.
HEDMAN J, SRINIVASAN N and LINDGREN R (2013) Digital traces of information systems: sociomateriality made researchable. In Proceedings of the International Conference on Information Systems, Association for Information Systems, Milan, Italy.
HEY T, TANSLEY S and TOLLE K, Eds (2009) The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond.
HILBERT M and LÓPEZ P (2011) The world's technological capacity to store, communicate, and compute information. Science 332(6025), 60-65.
HU N, ZHANG J and PAVLOU PA (2009) Overcoming the J-shaped distribution of product reviews. Communications of the ACM 52(10), 144-147.
IDC (2011) The 2011 digital universe study. [WWW document] (accessed 14 January 2016).
IDC (2014) The 2014 digital universe study. [WWW document] (accessed 14 January 2016).
KAYANDE U, DE BRUYN A, LILIEN G, RANGASWAMY A and VAN BRUGGEN G (2009) How incorporating feedback mechanisms in a DSS affects DSS evaluations. Information Systems Research 20(4), 527-546.
KDNUGGETS (2011) Algorithms for data analysis/data mining. [WWW document] /2011/algorithms-analytics-data-mining.html (accessed 14 January 2016).
KITCHIN R (2014) The Data Revolution: Big Data, Open Data, Data Infrastructures and their Consequences. Sage, Los Angeles.
KO D-G and DENNIS A (2011) Profiting from knowledge management: the impact of time and experience. Information Systems Research 22(1), 134-152.
KORFIATIS N, GARCÍA-BARIOCANAL E and SÁNCHEZ-ALONSO S (2012) Evaluating content quality and helpfulness of online product reviews: the interplay of review helpfulness vs. review content. Electronic Commerce Research and Applications 11(3), 205-217.
KRAMER A, GUILLORY J and HANCOCK J (2014) Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences of the United States of America 111(24), 8788-8790.
KUHN M and JOHNSON K (2013) Applied Predictive Modeling. Springer, New York.
KUMAR A and TELANG R (2012) Does the web reduce customer service cost? Empirical evidence from a call center. Information Systems Research 23(3), 721-737.
LANEY D (2001) 3D data management: controlling data volume, velocity, and variety. [WWW document] (accessed 14 January 2016).
LAU J, BALDWIN T and NEWMAN D (2013) On collocations and topic models. ACM Transactions on Speech and Language Processing 10(3).
LAZER D, KENNEDY R, KING G and VESPIGNANI A (2014) The parable of Google Flu: traps in big data analysis. Science 343(6176), 1203-1205.
LAZER D et al (2009) Life in the network: the coming age of computational social science. Science 323(5915), 721-723.
LESAFFRE E, RIZOPOULOS D and TSONAKA R (2007) The logistic transform for bounded outcome scores. Biostatistics 8(1), 72-85.
LIN M, LUCAS HC and SHMUELI G (2013) Too big to fail: large samples and the p-value problem. Information Systems Research 24(4), 906-917.
LYCETT M (2013) Datafication: making sense of (big) data in a complex world. European Journal of Information Systems 22(4), 381-386.
MARTENS D, BAESENS B, VAN GESTEL T and VANTHIENEN J (2007) Comprehensible credit scoring models using rule extraction from support vector machines. European Journal of Operational Research 183(3), 1466-1476.
MARTENS D and PROVOST F (2014) Explaining data-driven document classifications. MIS Quarterly 38(1), 73-99.
MARTIN M (2012) C-Path: updating the art of pathology. Journal of the National Cancer Institute 104(16), 1202-1204.
MARTON A, AVITAL M and JENSEN TB (2013) Reframing open big data. In Proceedings of the European Conference on Information Systems, Association for Information Systems, Utrecht, The Netherlands.
MCAULEY J, PANDEY R and LESKOVEC J (2015a) Inferring networks of substitutable and complementary products. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (LESKOVEC J, WANG W and GHANI R, Eds), Association for Computing Machinery, Sydney.
MCAULEY J, TARGETT C, SHI Q and VAN DEN HENGEL A (2015b) Image-based recommendations on styles and substitutes. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, Chile.
MICHEL J-B et al (2011) Quantitative analysis of culture using millions of digitized books. Science 331(6014), 176-182.
MILES M and HUBERMAN A (1994) Qualitative Data Analysis: An Expanded Sourcebook. Sage Publications, Thousand Oaks.
MUDAMBI S and SCHUFF D (2010) What makes a helpful online review? A study of customer reviews on Amazon.com. MIS Quarterly 34(1), 185-200.
MURRAY-RUST P (2008) Open data in science. Serials Review 34(1), 52-64.
NELSON P (1970) Information and consumer behavior. Journal of Political Economy 78(2), 311-329.
OESTREICHER-SINGER G and SUNDARARAJAN A (2012) Recommendation networks and the long tail of electronic commerce. Management Information Systems Quarterly 36(1), 65-83.
PAN Y and ZHANG J (2011) Born unequal: a study of the helpfulness of user-generated product reviews. Journal of Retailing 87(4), 598-612.
PENG R (2011) Reproducible research in computational science. Science 334(6060), 1226-1227.
PIGLIUCCI M (2009) The end of theory in science? EMBO Reports 10(6), 534.
REXER K (2013) 2013 data miner survey summary report. [WWW document] (accessed 14 January 2016).
RIMM D (2011) C-path: a Watson-like visit to the pathology lab. Science Translational Medicine 3(108), 108fs8.
ROBNIK-SIKONJA M and KONONENKO I (2008) Explaining classifications for individual instances. IEEE Transactions on Knowledge and Data Engineering 20(5), 589-600.
SHARMA R, MITHAS S and KANKANHALLI A (2014) Transforming decision-making processes: a research agenda for understanding the impact of business analytics on organisations. European Journal of Information Systems 23(4), 433-441.
SHMUELI G (2010) To explain or to predict? Statistical Science 25(3), 289-310.
SHMUELI G and KOPPIUS O (2011) Predictive analytics in information systems research. MIS Quarterly 35(3), 553-572.
SILVERMAN D (2006) Interpreting Qualitative Data. Sage Publications, London.
SOCHER R et al (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Seattle, Washington.
SPEER S (2002) Natural and contrived data: a sustainable distinction? Discourse Studies 4(4), 511-525.
STIEGLITZ S and DANG-XUAN L (2013) Emotions and information diffusion in social media: sentiment of microblogs and sharing behavior. Journal of Management Information Systems 29(4), 217-247.
TRENZ M and BERGER B (2013) Analyzing online customer reviews: an interdisciplinary literature review and research agenda. In Proceedings of the European Conference on Information Systems, Association for Information Systems, Utrecht, The Netherlands.
TURNEY P and PANTEL P (2010) From frequency to meaning: vector space models of semantics. Journal of Artificial Intelligence Research 37(1), 141-188.
VENKATESH V, BROWN S and BALA H (2013) Bridging the qualitative-quantitative divide: guidelines for conducting mixed methods research in information systems. MIS Quarterly 37(1), 21-54.
VOM BROCKE J, DEBORTOLI S, MÜLLER O and REUTER N (2014) How in-memory technology can create business value: insights from the Hilti case. Communications of the Association for Information Systems 34(7), 151-168.
WU X et al (2008) Top 10 algorithms in data mining. Knowledge and Information Systems 14(1), 1-37.
YOO Y (2015) It is not about size: a further thought on big data. Journal of Information Technology 30(1), 63-65.
YU C, JANNASCH-PENNELL A and DIGANGI S (2011) Compatibility between text mining and qualitative research in the perspectives of grounded theory, content analysis, and reliability. The Qualitative Report 16(3), 730-744.
ZENG X and WEI L (2013) Social ties and user content generation: evidence from Flickr. Information Systems Research 24(1), 71-87.
ZIMMER M (2010) "But the data is already public": on the ethics of research in Facebook. Ethics and Information Technology 12(4), 313-325.
This work is licensed under a Creative Commons Attribution 3.0 Unported License. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/3.0/.