Obtaining Data from the Internet:
A Guide to Data Crawling in Management Research
Jörg Claussen
Munich School of Management, LMU Munich, Kaulbachstr. 45, 80539 Munich, Germany and
Department of Strategy and Innovation, Copenhagen Business School, Kilevej 14a, 2000
Frederiksberg, Denmark, j.claussen@lmu.de
Christian Peukert
Católica Lisbon School of Business and Economics, 1649-023 Lisbon, Portugal and
ETH Zürich, Center for Law and Economics, 8092 Zürich, Switzerland,
christian.peukert@gmail.com
Abstract: The increasing availability of data on the Internet opens new opportunities for
management research, and the method of data crawling can be used for automated large-scale data
extraction. We show that data crawling has quickly gained popularity and is used for a wide variety
of purposes, but has so far gained less traction in the field of management. We argue that many
data sets used in other disciplines could also be used to answer questions in management research,
and we show that setting up a data crawler does not require advanced programming skills.
However, many pitfalls can challenge the success of using crawled data for research. We therefore
develop a guideline for crawling projects and discuss how many of the regularly occurring
challenges can be overcome.
Keywords: Crawler, Spider, Scrape, Bot, Data
Acknowledgments: We thank participants of data crawling workshops at Baruch College, Georgia
Tech, LMU Munich, the AOM Conference on Big Data, and the DRUID Academy for valuable
feedback and Laura Krahe-Steinke for excellent research assistance. Peukert acknowledges
support from FCT - Portuguese Foundation of Science and Technology for the project
UID/GES/00407/2013 and FCT-PTDC/EGE-OGE/27968/2017.
1. INTRODUCTION
As larger parts of the economy go through the process of digital transformation, more and more
data about market interactions of agents such as firms and consumers are generated, collected, and
combined. Much of these data are directly or indirectly available on the Internet, for example
because some of the interactions happen online, but there is also a plethora of private and public
organizations, such as market research firms and statistical offices, that are in the business of
collecting, aggregating, and distributing data. It goes without saying that these data are highly
relevant for management research. For example, product-level data can be informative about how
firms compete, transaction-level data can be informative about how customers perceive the quality
of products, and some forms of metadata can be informative about how firms organize. The issue
of course is that data is very often not readily available in dedicated databases and accessible with
a simple one-click download. Quite the opposite is the case. Data come in unstructured formats
and are cluttered across many webpages. For example, if one were interested in all studying the
competitive dynamics on a marketplace like eBay, even a single product category would spread
across thousands of subpages, making it infeasible to manually download all pages and extract
relevant information like product characteristics, price, and seller ratings by copying it to a
spreadsheet.
Data crawling, sometimes also called web scraping or spidering, is the method that allows
researchers to automatically extract data from the Internet. Using automated systems (“bots”) to extract data has
many practical applications. Popular services such as search engines, price comparison websites,
or news aggregators are essentially huge data crawling operations, but bots are also used for
malicious means, e.g. advertising fraud and cyber attacks. Industry reports suggest that the better
half of web traffic is non-human (see https://www.theatlantic.com/technology/archive/2017/01/bots-bots-bots/515043/,
accessed April 16, 2019). It is therefore perhaps not surprising that the economics
and business research community has also started to use crawling as a data collection method. We
conduct an extensive literature review of all articles published in journals on the Financial Times
50 list between 2000 and 2018 and show that the number of papers using data crawling has grown
from a handful to about 30 per year, reaching a total of 183 papers.
The variety of use cases and types of data is striking. We also identify important variation by field.
While the share of papers that use data crawling is relatively large in Operations & Information
Systems and Innovation & Entrepreneurship, it is much smaller in Management. Because we
believe that there are ample possibilities for management researchers to gain significant insights
from data that is available on the Internet, this “lagging behind” motivates us to provide a guide
to setting up data crawling in this paper.
We provide an intuitive roadmap of the crawling process, starting with a discussion of software
tools and boundary conditions and some best practices for identifying patterns and automating
tasks. We then give some guidance on parsing data that is embedded in the source code of
webpages. Finally, we discuss some heuristic solutions to common challenges of the automated
data collection process, including the matching of multiple datasets without common identifiers,
optimizing run-time, and the crawling of panel data.
We do not aim to substitute for the large volume of very good technical resources that are helpful
to master data crawling (such as Munzert et al., 2014), but rather to complement this literature
with a hands-on guide full of best practices that we deem especially relevant for management
researchers who want to enhance their methodological skill set.
2. USE OF DATA CRAWLING IN BUSINESS RESEARCH
We start by giving an overview of the use of data crawling in different areas of business
research and derive implications for how data crawling can be used in management research. To do
this, we identified papers published in the leading economics and business journals that build on
data that is crawled from the Internet. For our analysis, we conducted a full-text search within all
50 journals included in the Financial Times list for the search patterns ‘crawl*’,
‘scrapi*’, or ‘scrape*’. We then manually checked all search results to exclude false
positives and identified a total of 183 papers published between 2000 and 2018 (three papers
forthcoming in 2018 were published in print in 2019 and are therefore included) that clearly state
that the authors used data crawling to obtain data. While these 183 papers are only 0.21% of all
papers published in the period, Figure 1 highlights that the number of published crawling papers
has increased strongly within the last couple of years.
------------ INSERT FIGURE 1 HERE ------------
Because we can only classify those articles that explicitly mention that data has been obtained
through crawling, we expect the true number of papers that used crawling methods to be
significantly higher. An example of a false negative would be a case where authors simply state
that they obtained data from a specific website, but do not give further details on the data collection
method.
Next, we report the fields in which crawling papers have been published in Table 1. In relative
terms, most crawling papers have been published in Operations & Information Systems (0.76% of
all published papers in this field), Innovation & Entrepreneurship (0.67%) as well as in Marketing
(0.25%). The field of Management is clearly lagging behind, with only 0.04% of all published
articles explicitly stating the use of crawled data. Among the five identified crawling papers in
management journals, two have been published in the Strategic Management Journal. Compared
to that, there are 21 published articles in Information Systems Research, 12 published articles in
Research Policy, and 12 published articles in Marketing Science.
We will now discuss the five identified papers published in management journals to exemplify
how data crawling can be used in the field.
Ren et al. (2011) study the impact of local competition on a vendor’s decision to provide product
variety, using the example of the competition between Best Buy and Circuit City. They use data
crawling to obtain store-specific product variety data for digital cameras from the websites of Best
Buy and Circuit City and find that competition in the same market generally increases product
variety, except if the two competing stores are collocated. In a follow-up study, Ren et al. (2019)
study responses to rival exit, using the example of how Best Buy reacted to the exit of Circuit City.
They crawl additional data from Best Buy’s website after the exit of Circuit City and find that Best
Buy reacted to the exit of a close-by Circuit City shop by offering more digital cameras.
Gehman and Grimes (2017) and Haans (2019) use text and picture data scraped from corporate
websites to study the strategic positioning of firms relative to their competitive environment. The
first paper investigates the characteristics of firms that seek formal membership in an
organizational category. The latter paper focuses on the performance implications with respect to
6
a trade-off between distinctiveness and legitimacy, which they show to depend on how
homogenous an organizational category is.
Calic and Mosakowski (2016) study innovative forms of getting access to capital with a specific
focus on social entrepreneurship. Using scraped data on crowdfunding campaigns from
Kickstarter, they compare the funding success of projects with and without sustainability goals,
and look into the role that legitimacy and creativity play in this relationship.
As the examples above show, data crawling can be used to obtain both product-level and
firm-level data to answer relevant questions for management and strategy scholars. In addition
to these examples that are situated in the fields of competitive strategy and innovation strategy,
one could also use individual-level crawled data to study questions of strategic human capital and
stakeholder strategy. Examples of individual-level data available on the Internet are salary
information of public-sector employees (Mas, 2017), contributor information from open source
software (Hahn, Moon, and Zhang, 2008) and user innovation communities (Wooten and Ulrich,
2017), or customer feedback (Chau and Xu, 2012).
------------ INSERT TABLE 1 HERE ------------
We additionally report all crawling papers published in journals included in the FT50 list in Table
A.1 in the appendix. We have furthermore categorized each paper into one of 14
settings: advertisement, between-industry, crowdsourcing, crowdfunding, eCommerce,
online content, online recommendation systems, open source, search engine, social networks, user-
generated-content, within-industry, and other. We believe that many of these settings could
provide interesting data to examine important questions that are relevant for a management
audience. As examples, the setting of crowdfunding could be used to study knowledge and
innovation, the setting of open source to study strategic human capital, or social networks to study
behavioral strategy. The eCommerce and online recommendation systems settings lend themselves
to the study of platform strategy.
We furthermore investigated the importance of crawled data for the published papers and found
that in 61.3% of the cases, the papers relied exclusively on crawled data, while additional data
sources were used in the other cases. We expect the share of crawling papers relying on multiple
data sources to go up in the future, as more of the first-order questions that can be addressed with
single data sets will already have been addressed, and as combining different data sources allows
examining new and interesting research questions. These additional data sources
could range from proprietary firm data to public registries.
Finally, we checked if the authors mentioned the use of application programming interfaces (APIs)
to obtain the data, which was the case in 10.2% of the published papers. As we discuss below,
APIs are increasingly provided by platform operators aiming to open up to third parties, but can
also be used by researchers as an easy way to acquire data. We would therefore also expect more
API usage over time.
3. THE CRAWLING PROCESS
In this section, we introduce a step-by-step guideline for setting up the crawling process. There are
usually many degrees of freedom for how to implement the crawling process. In order to give an
overview, we provide a guiding framework that shows the different choices that can be taken as
well as the respective advantages and disadvantages.
A number of programming languages offer possibilities for crawling data from the Internet and
within these languages, many packages offer additional crawling-related functionalities. The most
popular language for data crawling is probably Python, a multi-purpose programming language
used in web development and data analytics that offers powerful packages for facilitating the
crawling process, such as requests, BeautifulSoup, or Scrapy. But there are also other languages,
such as R, Java, or Perl, that serve well for data crawling. The choice of programming language
should take into account one’s own prior programming experience, so it is difficult to give a clear
recommendation. Furthermore, new functionalities are added to programming languages all the
time, so one should always do an up-to-date comparison before selecting a language. Further,
firms offer ready-to-use tools, e.g. import.io or Parsehub, which, depending on the complexity of
the crawling task, might provide an efficient solution.
It is of course also possible to outsource data crawling tasks to freelancers, who are easy to locate
on micro-labor platforms such as Freelancer.com or Upwork. But even in this case, it is important
to build a general understanding of how data crawling works in order to be able to specify and
supervise the task and assess the quality of the obtained data.
3.1. Boundaries for data crawling
The first and most important step of each crawling project is to determine if it actually makes sense
to use crawling to obtain the desired data. We discuss the most important boundaries that would
rule out using data crawling. Considering these boundaries at the beginning of the project is
important as this might save a lot of resources.
A first consideration before the start of a project is to assess whether data crawling is really the
most efficient way to obtain the desired data or whether there are alternative sources. Some
websites make parts or all of their data available for direct download. An example would be the
Internet Movie Database, which provides much of its data on movies as direct downloads
(available at https://www.imdb.com/interfaces/, accessed December 12, 2018). One should
therefore always check whether a website provides direct data access. If this is not the case, one
could also contact the website operators directly and ask whether they would be willing to provide
direct access to their data. Furthermore, many of the most prominent websites are crawled by a lot
of parties, and some of them make their crawling results available to others. These can be
non-commercial initiatives (such as InsideAirbnb, http://insideairbnb.com/, which provides
crawled data from Airbnb free of charge) or firms selling the data they crawled (such as AirDNA,
https://www.airdna.co/, which also regularly crawls Airbnb).
Second, one should consider the number of observations that need to be collected. If the number
of observations is relatively low, perhaps it is more efficient to collect the data manually instead
of writing a crawler to extract it automatically. The cut-off value of when it is more efficient to
write a crawler depends of course on individual circumstances such as programming skills or
access to research assistants.
Third, data crawling is more attractive the more structured a website and the data embedded in
it are. If the data of interest is embedded in free text, it can become really difficult to extract
the desired pieces of information. However, if a website uses a tabular structure to present its
data, the same piece of information will always be presented at the same position of the website
and extraction becomes much easier. Very often, there are multiple websites that collect data on
the same industry and that vary in how structured their data is. For example, Wikipedia contains
information on a lot of video games, but this data is for the most part not very standardized. In
contrast, the platform MobyGames (https://www.mobygames.com/) provides a lot of information
on video games in a highly structured way. One should therefore invest some time in searching
for the website that provides the desired kind of data in the most structured way.
Fourth, many website owners do not want third parties to automatically extract their content even
if they make it freely available to the human reader. They then try to block attempts to crawl their
data in an automated way. Even if there are possibilities to circumvent some of these measures
(e.g. switching IP addresses, hiding behind proxy servers, using the Tor network), this still makes
the crawling process more complex and uncertain. A problem is that most of the time it is not
possible to determine ex-ante whether any anti-crawling measures are employed and how effective
they are. One way to assess the probability of anti-crawling measures being in place is to consider
the economic incentives of the website owner for keeping third parties from extracting their data
in an automated way. For example, if copying the data would allow a competitor to compete more
closely, or if the operator also wants to sell the data to interested parties, the odds that the website employs
anti-crawling measures would be high. In contrast, it is unlikely that a non-commercial project
would employ these measures.
The fifth key factor for determining if a website should be considered for crawling is whether the
website requires a login to access the data of interest. Requiring a login has legal as well as
technical implications. From a legal perspective, users have to create an account before they can
log in, and nearly all websites require users to proactively agree to their terms and conditions
before the account can be created. As the terms and conditions of many websites explicitly
prohibit data crawling, crawling such a website would result in a breach of contract. In addition
to these legal implications, writing a crawler that first logs in to a website before downloading the
content can be slightly more complex from a technical perspective. (Many less sophisticated
websites use simple authentication methods such as Htaccess, with which logging in is as easy as
adding a specific header to the HTTP request. For other authentication methods, an emerging
standard seems to be OAuth, for which packages exist in all major programming languages.)
3.2. Identification of observations
To be able to arrive at the desired data, it is usually necessary to download a large number of
subpages from a website. These subpages could contain anything from user profiles and product
reviews to firm-specific financial information. A key challenge in many crawling projects is to
determine the addresses (URLs) of the subpages from which content should be extracted. We will
now discuss several alternative strategies for identifying the focal observations. We rank these
options by increasing effort, i.e. one should start with the first option and only resort to the more
complex alternatives if the simpler ones are not successful.
First, one should check if the website offers data access through an application programming
interface (API). Many websites offer these interfaces to their data if they are operating a platform
business model in which they rely on third parties joining their platform. It is often possible to
obtain much of the data of these platform markets through the API and the API usually also
includes functions to search for or list the data sets included in the platform.
Second, the subpages of a website are sometimes numbered by an increasing identifier. If the
URL of the website has a form such as http://domain.com/objects/x and x is a number
that counts up by one for each subpage, it is sufficient to loop through x until no more results are
returned.
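As a minimal sketch in Python, such a loop could look as follows (the URL pattern and the stopping rule are hypothetical and need to be adapted to the target website):

import requests

x = 1
pages = {}
while True:
    # Hypothetical URL pattern; stop when the server no longer returns results
    response = requests.get(f'http://domain.com/objects/{x}')
    if response.status_code != 200:
        break
    pages[x] = response.text
    x += 1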
Third, many (but not all) websites offer a machine readable directory of all subpages in a sitemap
file. These sitemap files are standardized and are provided in the XML format. As each XML file
is only allowed to contain the URLs of up to 50,000 subpages, there is often a two-tier structure
of sitemaps. The first tier just consists of a directory of all further sitemaps, while the second tier
contains the actual URLs of the subpages. An advantage of this two-tier structure is that the
sitemaps are usually sorted by the type of subpages and one can then directly obtain the desired
type of objects. Sitemaps are often located at http://domain.com/sitemap.xml, but this
location is not standardized. The location of the sitemap is often reported in the robots.txt file that
is located at http://domain.com/robots.txt and that provides directions for crawlers.
Finally, one can also use an Internet search engine and search for sitemaps within the domain of
the website.
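As a minimal sketch, the URLs listed in a sitemap can be collected in Python as follows (assuming the sitemap resides at the common default location):

import re
import requests

# The default location is common but not standardized; if this fails,
# check the robots.txt file for a 'Sitemap:' entry
sitemap = requests.get('http://domain.com/sitemap.xml').text
# Sitemap files list each subpage URL inside <loc> tags
urls = re.findall(r'<loc>(.*?)</loc>', sitemap)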
Fourth, many websites offer their users directories that allow listing and iterating through all of
their subpages. One can then go through each of these directory pages and extract the URLs of the
subpages. The URL of the directory pages usually has a structure like this:
http://domain.com/directory?page=x&results=y. Iterating through x allows then
to call the subpages of the directory. Also note that many websites allow their users to change the
number of results they can observe on each directory page (such as choosing between 10, 20, and
50 results). It is sometimes possible to manually set this parameter y to much larger values, which
may even allow returning all the subpages in one go.
Fifth, if none of the above works, one could use Internet search engines to obtain a website’s
subpages. This can be achieved by a search term such as
site:http://domain.com/profile/. Note, however, that search engines usually have
strong protection against crawling and will block requests from a crawler relatively soon. So it
might be possible to extract a three- to four-digit number of subpages from a search engine, but
not millions.
Sixth, an iterative approach might allow retrieving a large number of subpages if different subpages
are linked to each other. This might for example be the case for a social network, where each
profile page also contains information about the focal user’s network of connections. With this
iterative approach, one might first get an initial set of seed subpages by using the Internet search
approach mentioned above and then determine all first-order links from there. Next, the links
from all newly found subpages are retrieved, and this process is repeated until no new subpages
are added. The drawback of this iterative approach is that subpages which are not linked to any of
the initial seed sites will not be identified. Therefore, the iterative approach might work best if the
underlying network structure between the subpages is relatively dense.
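A minimal sketch of this iterative approach in Python, where extract_links() is a hypothetical helper that parses profile URLs out of a downloaded page:

import re
import requests

def extract_links(html):
    # Hypothetical pattern: profile URLs embedded in anchor tags
    return set(re.findall(r'href="(http://domain\.com/profile/[^"]+)"', html))

seeds = {'http://domain.com/profile/alice'}  # e.g. obtained via a search engine
visited = set()
frontier = set(seeds)
while frontier:
    url = frontier.pop()
    visited.add(url)
    html = requests.get(url).text
    # Queue all newly discovered profiles that have not been crawled yet
    frontier |= extract_links(html) - visited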
3.3. Obtaining the content
Having identified all relevant subpages, downloading these subpages from the webserver and
extracting the desired content is the next step in the crawling process. The actual process of
downloading the contents behind a URL is surprisingly simple and can, with most programming
languages, be achieved with a single line of code. Using the requests module of Python, one can
for example download the contents of a website by calling
html_content = requests.get('http://domain.com/subpage.htm').text. This type
of command can then be integrated into a loop that iterates through all subpages.
Extracting the desired pieces of information from the subpage is usually a much bigger challenge
compared to downloading the content in the first place. This process is called parsing and the
challenge lies in uniquely identifying the location within the website that contains the desired piece
of information. As we have already discussed above, using crawling for automated data retrieval
from the Internet requires that the desired pieces of information are structured in the same way
within each subpage. While the focal piece of information will vary from subpage to subpage, one
can still identify the desired content through its location within each page. Websites are written
in the HyperText Markup Language (HTML), which is a language that tells web browsers how
content should be displayed. A key element of HTML are tags, which are enclosed in angle
brackets. Many tags come in opening and closing pairs: everything between the opening tag <I>
and the closing tag </I> will for example be displayed in italics, but some tags also stand alone,
such as the <BR> tag that indicates a line break. If one is not yet familiar with HTML, we suggest
reading through one of the many freely available online tutorials (e.g.
https://www.w3schools.com/html, accessed December 12, 2018).
There are two main methods of parsing a website: directly through regular
expressions, or through higher-level parsing frameworks such as BeautifulSoup for
Python. Regular expressions are available in most programming languages and in many text
editors and are a powerful tool for defining search patterns (a good introduction is available at
https://www.regular-expressions.info/quickstart, accessed December 12, 2018). Let us assume that we want to extract
the revenue information from a website where the revenue is embedded as follows: ‘<font
size="4">Revenue: <b>$21,587,519</b></font>’. The regular expression ‘<font
size="4">Revenue: <b>(.*?)</b></font>’ would then return the string
‘$21,587,519’. This search pattern tells the regular expression algorithm to search for all
occurrences starting with ‘<font size="4">Revenue: <b>‘ and ending with
‘</b></font>’. The expression ‘(.*?)’ consists of the parentheses ‘()’ that specify where
the searched item is located, the dot ‘.’, specifying to search for arbitrary characters, the asterisk
‘*’, specifying that the operator before can be repeated zero or more times, and the question mark
‘?’, specifying to search in a non-greedy way, i.e. to stop searching at the first and not at the last
occurrence of ‘</b></font>’ within the HTML code. When specifying the regular expression,
it is important to make sure that it uniquely identifies the search term. If we had used the regular
expression ‘<b>(.*?)</b>’, we would for example have obtained all occurrences of bold text
on the page. Sometimes it is also necessary to nest multiple calls of
regular expressions: for example, in a first step one might want to extract all rows from a table
with ‘<TR>(.*?)</TR>’ and then iterate over all columns of the table with
‘<TD>(.*?)</TD>’. Alternatively, one can also parse websites with higher-level frameworks
such as BeautifulSoup for Python or rvest for R. In these frameworks, the hierarchy created by the
opening and closing tags throughout each website is used to create a tree-like structure. One can
then identify each element within a website by specifying the location within the tree.
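As a minimal sketch, the revenue example from above could be implemented in Python in both ways (the HTML snippet is of course illustrative):

import re
from bs4 import BeautifulSoup

html = '<font size="4">Revenue: <b>$21,587,519</b></font>'

# Variant 1: regular expression with a non-greedy capture group
match = re.search(r'<font size="4">Revenue: <b>(.*?)</b></font>', html)
revenue = match.group(1) if match else None  # '$21,587,519'

# Variant 2: BeautifulSoup navigates the tag tree instead of matching patterns
soup = BeautifulSoup(html, 'html.parser')
revenue = soup.find('b').get_text()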
While we have so far assumed that the contents downloaded from a URL are HTML code, we
sometimes obtain other types of information. If we use an API to access the information from a
website, the API will usually return the data directly as JSON or XML. These formats are machine
readable data formats that can be easily imported with most modern programming languages.
Many websites even offer packages for different programming languages that take care of these
steps. The techniques we discuss can in principle systematically access and store information that
is embedded in any type of content. Packages for Python and R make it easy to access and process
text in PDF documents, and also provide access to Google’s and Facebook’s machine learning
frameworks TensorFlow and Torch, which can be used to translate text or recognize objects in
images and videos.
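As an illustration, a hypothetical JSON API could be queried in Python as follows (the endpoint and field names are made up; real APIs document their own URLs and parameters):

import requests

response = requests.get('http://domain.com/api/products', params={'page': 1})
data = response.json()  # parse the JSON payload into Python objects
for product in data['products']:
    print(product['name'], product['price'])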
For many research questions, it is sufficient to crawl a website once and use the time structure
embedded in the website to construct a panel. One could for example use the time stamps of
product reviews as proxies for sales. In other cases, the website might not store all historical
changes that are important for the desired analysis and this would then require repeated crawls of
the website. If, for example, prices of products in an online shop are important and only current
prices are shown, one would have to crawl the website regularly. In order to re-run a
crawler in specified time intervals without requiring manual restarts of the crawling process, it is
possible to use the scheduling functionalities built into the operating system (Task Scheduler for
Windows, Cron for MacOS and Linux). For repeated crawls, it might also be useful to separate
downloading the HTML code from parsing it, since changes in the structure of the HTML might
result in parse errors that can render the whole crawling effort worthless if discovered too late.
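For example, a crontab entry on Linux or MacOS that re-runs a crawler script every night at 3 a.m. could look like this (the paths are placeholders):

# minute hour day-of-month month day-of-week  command
0 3 * * * /usr/bin/python3 /home/user/crawler.py >> /home/user/crawl.log 2>&1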
Once the desired data is extracted from the downloaded content, the last step is to save it so that it
can be further processed. Even though many programming languages now offer rich functionality
for statistical analysis, such as the Pandas library for Python, most management scholars will
prefer to use statistical software such as STATA or SPSS to continue working with the obtained
data. The default way to transfer data between languages
is to save the data in text format using packages that are capable of writing CSV files and then
importing these text files into statistical software. It is of course also possible to save the obtained
data to a database such as MySQL, which might be especially useful in cases where crawling has
to be conducted repeatedly.
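A minimal sketch using Pandas (the records are illustrative stand-ins for parsed observations):

import pandas as pd

# Illustrative parsed observations; in practice filled by the parsing loop
records = [{'product': 'Camera A', 'price': 249.99},
           {'product': 'Camera B', 'price': 199.00}]
df = pd.DataFrame(records)
# Write a CSV file that can then be imported into STATA, SPSS, etc.
df.to_csv('crawled_data.csv', index=False, encoding='utf-8')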
4. SOLUTIONS TO CHALLENGES OF CRAWLING PROJECTS
4.1. Character encoding
The Internet allows researchers to access international data, which sometimes means that
information is only available in local languages. Text data can therefore include a huge variety of
characters beyond the basic Latin (ASCII) set, such as language-specific alphabets, currency
symbols, or emoticons. The Unicode standard ensures correct representation and handling of these
characters across devices and software systems. Characters are encoded according to standardized
rules, much like in pre-computer systems such as Morse code. For example, the Unicode code
point of the ampersand “&” is “U+0026”. The Unicode code space comprises more than one
million code points, but a variety of encoding schemes exist that trade off storage needs against
the number of characters they cover. Older encoding schemes such as Latin-1 (ISO-8859-1) only
cover the first 256 code points of Unicode, whereas UTF-8 can represent the full Unicode code
space. How does this relate to data crawling? When reading text from a web source or local file,
it is important to know in which scheme it is encoded to avoid misrepresentations when processing
the information. Here it helps of course that more than 90% of the world wide web is encoded in
UTF-8 (see https://w3techs.com/technologies/cross/character_encoding/ranking, accessed
December 12, 2018). When
saving the information to a local file or database system, it is important to select the appropriate
encoding scheme, or to convert between encoding schemes when needed. Sometimes it will not matter to
the researcher whether some special characters are adequately represented because the underlying
information can still be sufficiently disambiguated. Often, however, encoding matters. First, it is
crucial to have a common encoding scheme when two datasets are matched on string variables.
Failing to have a “common denominator” can cause otherwise identical strings to look different
and 1:1 matching to fail. Second, it is important to recognize that encoding issues can occur at
various steps of the data collection, preparation, and analysis workflow because not all
programming languages and software packages automatically support Unicode (i.e. allow
specifying the encoding scheme for input and output files). STATA introduced Unicode
capabilities in version 14 in 2015; users of R and Python may need to import specific packages.
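In Python, for example, encodings can be made explicit when decoding downloaded bytes and when writing files (a minimal sketch):

# Decode raw bytes from the web with an explicit encoding
raw_bytes = 'Müller & Söhne €'.encode('utf-8')  # stand-in for crawled content
text = raw_bytes.decode('utf-8')

# State the encoding explicitly when persisting text to disk
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(text)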
4.2. A machine learning approach to joining data from multiple sources
Researchers in management are very often confronted with the problem of matching datasets,
perhaps complementing proprietary data with data crawled from the web. For a project on
competition in online retail, for example, we would need to combine price data from various
outlets. In many cases, these datasets would come without common identifiers. For example, the
same product may appear as “iRobot Roomba 675 Robot Vacuum with Wi-Fi Connectivity, Works
with Alexa, Good for Pet Hair, Carpets, Hard Floors” on the website of retailer A and as “iRobot
- Roomba 675 App-Controlled Self-Charging Robot Vacuum - Black” on the website of retailer
B. In some cases, careful pre-processing and data cleaning, involving string manipulation that
removes non-discriminating words (e.g. regular expressions that delete commonly used words)
and disambiguation algorithms (e.g. phonetic encoding), can help to define a common identifier.
In many practical cases, abbreviations or typos (especially when the underlying database is user-
generated) additionally complicate the process as they are much less easy to consistently spot and
correct, especially in large datasets. Most limiting, however, is that matching in most database
systems and statistical packages is binary and deterministic, i.e. data points either match or not.
An approach to solve these issues is an easy-to-implement machine learning application, often
referred to as fuzzy matching. We want to train a model to predict the continuous likelihood that
two data points match, and then define a threshold value of the estimated likelihood at which we
are willing to follow the model’s recommendation and accept a pair of observations as a match.
To implement this, we would first draw random subsamples N*_A and N*_B of dataset A and
dataset B, and invest manual effort to compare each observation in A to each observation in B and
identify matching observations. Hold on before you hire an army of RAs, though (sometimes it is
more efficient to use microwork platforms like Amazon MTurk or Upwork). Because most
observations will have a very low likelihood of being a match ex-ante, there is often no need to go
through all N*_A × N*_B comparisons. We can use indexing to reduce the complexity by defining
consideration sets, i.e. not compare observations that have a very low likelihood of overlap ex-
ante. For example, defining a price range of ±500% can limit the number of products to be
compared to Ñ_A × Ñ_B if we are willing to assume that the price of the same product will not differ
by more than that range. We then generate features that reflect the similarity of text identifiers in
both datasets, such as the number of common characters, phonetic similarity, and perhaps the
overlap in other dimensions of the data (in our product example these may be color, dimensions,
price range, etc.). Now we split the dataset into two parts: a training dataset and a test dataset. With
these features we can estimate the parameters of a logistic regression where we explain matching
observations in the training data. (We choose a logistic regression model because it can be easily
implemented in software packages like STATA or SPSS, with which some readers may be most
familiar. Of course, any other supervised machine learning method, e.g. Naive Bayes, decision
trees, random forests, or support vector machines, can be used here; these methods are readily
available as packages and straightforward to implement in software like R and Python.) When
testing the model’s predictions on a dataset with known
matches, we can optimize the model and decide on a threshold value that minimizes the occurrence
of Type I and Type II errors. Finally, we can use the trained model to predict matching probabilities
for all (optionally indexed) observations in the full sample, and define a combined dataset
based on the optimal likelihood threshold value we have discovered.
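A minimal sketch of such a fuzzy-matching workflow in Python, using scikit-learn and a simple string-similarity feature (the labeled product pairs and features are purely illustrative):

from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

# Illustrative hand-labeled pairs: (title in dataset A, title in dataset B, match?)
pairs = [
    ('iRobot Roomba 675 Robot Vacuum', 'iRobot - Roomba 675 Robot Vacuum', 1),
    ('iRobot Roomba 675 Robot Vacuum', 'Dyson V8 Cordless Vacuum', 0),
    ('Canon EOS 2000D DSLR Camera', 'Canon EOS 2000D Camera Kit', 1),
    ('Canon EOS 2000D DSLR Camera', 'Nikon D3500 DSLR Camera', 0),
]

def features(a, b):
    # String similarity and relative length difference as simple features
    sim = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    len_diff = abs(len(a) - len(b)) / max(len(a), len(b))
    return [sim, len_diff]

X = [features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
model = LogisticRegression().fit(X, y)

# Predicted matching probability for a new candidate pair; in practice the
# acceptance threshold is tuned on held-out test data
prob = model.predict_proba([features('iRobot Roomba 675', 'Roomba 675 by iRobot')])[0, 1]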
4.3. Dynamic websites
So far, we have only considered the case of static websites, i.e. where the entire content is loaded
immediately, or contents do not vary systematically for different users. Most modern websites,
however, are dynamic. That is, individual pages are built on demand, often using a combination
of server-side (e.g. retrieving content from a database) and client-side technologies. For the latter,
the local browser runs a script, e.g. JavaScript or DHTML, that ultimately assembles a
personalized version of the webpage. Some content is then loaded dynamically when the user
takes action, e.g. scrolling down in the Facebook feed. The simple web crawler that we introduced
in section 3 is essentially a very basic web browser that can only retrieve text and does not have
any capabilities to execute client-side scripts. As a result, this tool is often not powerful
enough to extract all desired data from a dynamic website.
However, before giving up too quickly, there are two practical workarounds one can always
try. First, we can study the raw input that is fed into the crawler. In some cases, the content to be
dynamically displayed is already part of the HTML source - the input that even our simplistic crawler
can process. Although this content is not directly visible when navigating to the website in a
standard web browser, it is still there and can be captured by our simple crawler. Second, we can
use tools like the network traffic analysis in Chrome’s developer tools (similar extensions are
available for other browsers) to study the external content that is dynamically loaded. This will
sometimes lead us directly to the source, i.e. a URL that returns XML or JSON formatted data.
In many other cases, we will need to upgrade our crawling technology and add scripting
capabilities to our web crawler. An easy method to do so is to “remote control” a fully-featured
web browser like Chrome or Firefox using the Selenium package in Python or R. This package
allows moving the heavy lifting to the browser, which runs in the background, while our crawler
can access the resulting content. While this obviously opens up a large number of possibilities to
access data that would otherwise be very difficult to obtain in a structured way, it comes at the cost
of dramatically increased memory usage and run times. We will introduce some ideas to reduce
run time in the next section.
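A minimal Selenium sketch in Python (assuming Chrome and a matching chromedriver are installed; the URL is a placeholder):

from selenium import webdriver

# Run Chrome in the background without a visible window
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('http://domain.com/dynamic-page')
# page_source contains the HTML after client-side scripts have run
html = driver.page_source
driver.quit()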
4.4. Speeding up the crawling process
Crawling one page at a time can be too time consuming for a variety of reasons. First and foremost,
the number of pages/observations may be too large to capture all desired data in acceptable time,
or before the information is changed, updated, or disappears. Given enough hardware resources,
the solution is to run multiple instances of the crawler in parallel.
The simplest method is to manually distribute jobs across multiple instances of the web crawler.
For example, instead of one instance handling 100 URLs sequentially, two instances would handle
batches of 50 each. With simple tests on smaller subsets of inputs, we can make an informed
decision about a reasonable number of instances and the size of individual batches. This do-it-
yourself version of parallel computing, however, can only lead to significant performance
increases if the memory and CPU usage of each instance is small enough. For larger projects, it
makes much more sense to use more sophisticated solutions to run multiple instances of the crawler
in parallel. High-level packages in Python and R make it easy to optimize parallelization by
utilizing multiple CPU cores at the same time. In practice this means that for many applications,
the researcher does not need fancy dedicated server hardware to substantially improve run time:
most consumer-level CPUs that were manufactured in the last decade have multiple CPU cores.
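As a minimal sketch, parallel crawling with Python's built-in multiprocessing package could look as follows (the URLs are placeholders):

import multiprocessing
import requests

def fetch(url):
    # Each worker downloads one page; errors could be logged and retried
    return url, requests.get(url).text

if __name__ == '__main__':
    urls = [f'http://domain.com/objects/{x}' for x in range(1, 101)]
    # Four worker processes share the URL list
    with multiprocessing.Pool(processes=4) as pool:
        results = dict(pool.map(fetch, urls))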
Network speed, network congestion, and server-side limitations are often the bottleneck. Many
websites block access for an IP address if we send too many page requests too frequently. While
it is certainly possible to periodically switch IP addresses by “hiding” behind a proxy or VPN
server, it is good practice to limit the number of requests (only crawl what is really necessary for
our research project) or pause for a certain amount of time between requests.
4.5. Crawling panel data
In many research applications, it is interesting to study how variables or cross-sections evolve over
time. In section 3.3, we have already introduced tools that allow systematic repetition of the
crawling process, i.e. frequent crawls of a specific URL that allow creating panel datasets over
time. However, this of course only works for forward-looking data. Often we need
historical data, i.e. from points in time before we started to collect our data. Services like
the Internet Archive’s Wayback Machine and CommonCrawl provide archived copies of websites,
sometimes going back multiple years with a relatively high frequency of snapshots. Here, we can
either scrape individual URLs directly from those services or download monthly snapshots of the
data in bulk.
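For example, the Wayback Machine offers a simple availability API that returns the closest archived snapshot of a URL as JSON (a minimal sketch; the queried website is a placeholder):

import requests

# Ask the Wayback Machine for the snapshot closest to January 2015
response = requests.get('http://archive.org/wayback/available',
                        params={'url': 'domain.com', 'timestamp': '20150101'})
snapshot = response.json().get('archived_snapshots', {}).get('closest')
if snapshot:
    archived_html = requests.get(snapshot['url']).text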
In addition to the content itself, metadata, e.g. on the frequency of content updates, can be
interesting as well. A specific example could be a study of the frequency of price changes on an
eCommerce website. Further, metadata also allows observing a website’s relationship with third
parties. When scraping a large number of websites, such metadata can be helpful to study market
shares of tracking and advertising services, such as Google Analytics or Facebook’s Like- and
Share-buttons. Also for this kind of data, there exist databases that contain historical information
on third-party requests, e.g. the HTTPArchive or the Princeton Web Census.
Finally, some research questions might be related to understanding personalized firm behavior,
e.g. targeted advertising, first degree price discrimination, or personalized recommendation. So
far, the crawler we introduced in section 3 was stateless, i.e. we didn’t transfer any specific data to
the website. In a stateful crawling approach, we can for example accept cookies that allow the
website to identify us in subsequent scrapes and match information that it potentially obtained
from third-party sources. For example, mimicking the browsing behavior of a human user, we can
leave traces with tracking and advertising services and therefore study, almost like in a lab setting,
how user-specific information “travels” across the web and which type of information websites
and third-party services share. Once we mimic different types of robot-users, it becomes feasible
to understand the contingencies, i.e. geographical origin, types of visited websites, etc., of the way
personal data is tracked and shared (REF to CS Lit). This allows a very different approach to doing
management and strategy research. Instead of using natural experiments to study consumer and
firm reactions to changes in the external environment, we can run controlled field experiments
where we randomly change the type of customer the firm is serving and therefore understand the
contingencies of firm behavior and strategy.
5. CONCLUSION
We have demonstrated that the field of management research is still lagging behind other fields
when it comes to data crawling, but believe that crawled data has a lot of untapped potential to
inform management research. We therefore developed clear guidelines that management scholars
can consider if they want to use crawled data for their own projects and that also make it easier to
judge whether data crawling has been used appropriately in other research projects.
The question of when data crawling is useful depends on weighing several advantages and
disadvantages. On the plus side, crawled data can give researchers more exclusivity than databases
used by many other researchers, but this advantage can also be limited if a website experiences
increasing interest from other researchers. Furthermore, crawling can be much faster than manual
data collection once one has obtained the necessary skills.
Another key advantage of using data crawling is that it can make researchers less dependent on
publication biases that are difficult to avoid when obtaining data through contractual arrangements
with firms. If it is instead possible to base a research project on crawls of publicly available data,
researchers do not have to implicitly or explicitly consider third-party interests before publishing
their results. Finally, many crawled databases have an impressively high number of observations,
which makes it easier to cleanly identify effects that would be estimated much less precisely in
smaller data sets. The challenge of working with these large datasets is then of course not to be
satisfied with statistical significance alone, but to focus much more on economic effect sizes.
When it comes to the disadvantages of data crawling, a first point is that successful data crawling
comes with certain learning costs. However, as we have shown in sections 3 and 4, data crawling
does not necessarily require advanced programming skills, and we hope that our guidelines help
prevent many avoidable problems, therefore making data crawling more attractive for a wider
set of management scholars. Second, data crawling can be prone to errors. Errors can both emerge
from low data quality on the crawled website as well as from errors in the crawler’s code.
The problem of low underlying data quality must always be considered, and it is helpful to
regularly conduct sanity checks of the underlying data. This can for example be done by
triangulating the data with other data sources. Data problems stemming from the crawling process
itself can best be avoided by testing how multiple examples from the data on the crawled website
are affected by each data transformation step. These tests should be conducted side-by-side with
coding the data crawler, as each step can then be immediately checked for errors, making it much
easier to identify problems compared to debugging multiple transformation steps at once.
Finally, data that is made freely available on the Internet is often not very well documented and
we might therefore struggle to obtain proper knowledge of the data. It is therefore important not to
immediately jump to crawling the data, but instead to first develop a good understanding of the
data that is made available on the target website, preferably both by interacting with the website
and by interacting with the website’s community.
Like any research method, data crawling should also be evaluated from an ethical and legal
perspective. While website operators make their data freely available on the Internet, many
operators still try to ban third parties from automatically downloading their contents. With ever-
decreasing costs of Internet traffic, it is becoming more and more unlikely that the desire to prevent
data crawling is driven by the consideration that data crawlers create costs but do not contribute
to the monetization of the website. Instead, the motive is likely that the website operator is fine
with individual humans accessing small portions of the overall data, but does not want third parties
to get permanent access to the data. One standardized way
how website operators express their preferences when it comes to crawling are robots.txt
files that are located on most websites. Furthermore, the terms and conditions of a website often
also state the website operator’s preferences with regard to data crawling. It is however unclear
to what extent these statements have legal implications, and given the complexity of this subject
matter, this clearly goes beyond the scope of this paper. For researchers, it would be beneficial to
have a clearer understanding of when data crawling is permissible. While it is not clear whether
initiatives like the European Union’s General Data Protection Regulation create more clarity for
researchers, we expect that journals and professional organizations will give more guidance when
it comes to specifying the rules for using crawled data. The journal Management Science for
example introduced a rule that website operators can demand the retraction of published
papers if the data crawling was explicitly banned and if they can demonstrate that the data
collection (not the results of the analysis) inflicted significant material harm (Simchi-Levi, 2019).
References
Aaltonen A, Seiler S. 2016. Cumulative Growth in User-Generated Content Production:
Evidence from Wikipedia. Management Science 62(7): 2054–2069.
Abrahams AS, Fan W, Wang GA, Zhang Z (John), Jiao J. 2015. An Integrated Text Analytic
Framework for Product Defect Discovery. Production & Operations Management 24(6):
975–990.
Agarwal A, Hosanagar K, Smith MD. 2015. Do Organic Results Help or Hurt Sponsored Search
Performance? Information Systems Research 26(4): 695–713.
Aguiar L, Claussen J, Peukert C. 2018. Catch Me If You Can: Effectiveness and Consequences
of Online Copyright Enforcement. Information Systems Research 29(3): 656–678.
Arora S, ter Hofstede F, Mahajan V. 2017. The Implications of Offering Free Versions for the
Performance of Paid Mobile Apps. Journal of Marketing 81(6): 62–78.
Augenblick N. 2016. The Sunk-Cost Fallacy in Penny Auctions. Review of Economic Studies
83(1): 58–86.
Bandyopadhyay S, Bandyopadhyay S. 2009. Estimating Time Required to Reach Bid Levels in
Online Auctions. Journal of Management Information Systems 26(3): 275–301.
Bao Y, Datta A. 2014. Simultaneously Discovering and Quantifying Risk Types from Textual
Risk Disclosures. Management Science 60(6): 1371–1391.
Bapna R, Ramaprasad J, Umyarov A. 2018. Monetizing Freemium Communities: Does Paying
for Premium Increase Social Engagement? MIS Quarterly 42(3): 719–735.
Bapna R, Umyarov A. 2015. Do Your Online Friends Make You Pay? A Randomized Field
Experiment on Peer Influence in Online Social Networks. Management Science 61(8):
1902–1920.
Bauer J, Franke N, Tuertscher P. 2016. Intellectual Property Norms in Online Communities:
How User-Organized Intellectual Property Regulation Supports Innovation. Information
Systems Research 27(4): 724–750.
Bellezza S, Paharia N, Keinan A. 2017. Conspicuous Consumption of Time: When Busyness and
Lack of Leisure Time Become a Status Symbol. Journal of Consumer Research 44(1):
118–138.
Ben-David I, Palvia A, Spatt C. 2017. Banks’ Internal Capital Markets and Deposit Rates.
Journal of Financial & Quantitative Analysis 52(5): 1797–1826.
Berger J, Milkman KL. 2012. What Makes Online Content Viral? Journal of Marketing
Research (JMR) 49(2): 192–205.
Bertsimas D, Brynjolfsson E, Reichman S, Silberholz J. 2014. Tenure Analytics: Models for
Predicting Research Impact.
Bianchi AJ, Kang SM, Stewart D. 2012. The Organizational Selection of Status Characteristics:
Status Evaluations in an Open Source Community. Organization Science 23(2): 341–354.
Bockstedt J, Druehl C, Mishra A. 2015. Problem-solving effort and success in innovation
contests: The role of national wealth and national culture. Journal of Operations
Management 36: 187–200.
Bodea T, Ferguson M, Garrow L. 2009. Choice-Based Revenue Management: Data from a
Major Hotel Chain. Manufacturing & Service Operations Management 11(2): 356–361.
Bronnenberg BJ, Kim JB, Mela CF. 2016. Zooming In on Choice: How Do Consumers Search
for Cameras Online? Marketing Science 35(5): 693–712.
Brynjolfsson E, Hu Y (Jeffrey), Rahman MS. 2009. Battle of the Retail Channels: How Product
Selection and Geography Drive Cross-Channel Competition. Management Science
55(11): 1755–1765.
Bucklin RE, Sismeiro C. 2003. A Model of Web Site Browsing Behavior Estimated on
Clickstream Data. Journal of Marketing Research (JMR) 40(3): 249–267.
Busse JA, Goyal A, Wahal S. 2014. Investing in a Global World*. Review of Finance 18(2):
561–590.
Cachon GP, Gallino S, Olivares M. 2018. Does Adding Inventory Increase Sales? Evidence of a
Scarcity Effect in U.S. Automobile Dealerships. Management Science. Available at:
http://pubsonline.informs.org/doi/10.1287/mnsc.2017.3014.
Cai Y, Sevilir M. 2012. Board connections and M&A transactions. Journal of Financial
Economics 103(2): 327–349.
Calic G, Mosakowski E. 2016. Kicking off social entrepreneurship: How a sustainability
orientation influences crowdfunding success. Journal of Management Studies 53(5): 738–
767.
Campello M, Lin C, Ma Y, Zou H. 2011. The real and financial implications of corporate
hedging. The journal of finance 66(5): 1615–1647.
Cao Z, Hui K-L, Xu H. 2018. When Discounts Hurt Sales: The Case of Daily-Deal Markets.
Information Systems Research 29(3): 567–591.
Carmi E, Oestreicher-Singer G, Stettner U, Sundararajan A. 2017. Is Oprah Contagious? The
Depth of Diffusion of Demand Shocks in a Product Network. MIS Quarterly 41(1): 207-
A13.
Cavallo A, Neiman B, Rigobon R. 2014. Currency Unions, Product Introductions, and the Real
Exchange Rate*. Quarterly Journal of Economics 129(2): 529–595.
Chau M, Xu J. 2012. Business Intelligence in Blogs: Understanding Consumer Interactions and
Communities. MIS Quarterly 36(4): 1189–1216.
Chen MK. 2013. The Effect of Language on Economic Behavior: Evidence from Savings Rates,
Health Behaviors, and Retirement Assets. American Economic Review 103(2): 690–731.
Chen P-Y, Hong Y, Liu Y. 2018. The Value of Multidimensional Rating Systems: Evidence
from a Natural Experiment and Randomized Experiments. Management Science 64(10):
4629–4647.
Claussen J, Essling C, Kretschmer T. 2015. When less can be more–Setting technology levels in
complementary goods markets. Research Policy 44(2): 328–339.
Claussen J, Kretschmer T, Mayrhofer P. 2013. The Effects of Rewarding User Engagement: The
Case of Facebook Apps. Information Systems Research 24(1): 186–200.
Corredoira RA, Goldfarb BD, Shi Y. 2018. Federal funding and the rate and direction of
inventive activity. Research Policy 47(9): 1777–1800.
Das SR, Chen MY. 2007. Yahoo! for Amazon: Sentiment Extraction from Small Talk on the
Web. Management Science 53(9): 1375–1388.
Datta H, Knox G, Bronnenberg BJ. 2018. Changing Their Tune: How Consumers’ Adoption of
Online Streaming Affects Music Consumption and Discovery. Marketing Science 37(1):
5–21.
De Bakker FGA, Hellsten I. 2013. Capturing Online Presence: Hyperlinks and Semantic
Networks in Activist Group Websites on Corporate Social Responsibility. Journal of
Business Ethics 118(4): 807–823.
De Langhe B, Fernbach PM, Lichtenstein DR. 2016. Navigating by the Stars: Investigating the
Actual and Perceived Validity of Online User Ratings. Journal of Consumer Research
42(6): 817–833.
Dellavigna S, Hermle J. 2017. Does Conflict of Interest Lead to Biased Coverage? Evidence
from Movie Reviews. Review of Economic Studies 84(4): 1510–1550.
Dhar V, Geva T, Oestreicher-Singer G, Sundararajan A. 2014. Prediction in Economic
Networks. Information Systems Research 25(2): 264–284.
Dissanayake I, Zhang J, Gu B. 2015. Task Division for Team Success in Crowdsourcing
Contests: Resource Allocation and Alignment Effects. Journal of Management
Information Systems 32(2): 8–39.
Engelberg J, Gao P. 2011. In search of attention. The Journal of Finance 66(5): 1461–1499.
Feldman M, Lowe N. 2015. Triangulating regional economies: Realizing the promise of digital
data. Research Policy 44(9): 1785–1793.
Fischer E, Reuber AR. 2014. Online entrepreneurial communication: Mitigating uncertainty and
increasing differentiation via Twitter. Journal of Business Venturing 29(4): 565–583.
Fisher M, Gallino S, Li J. 2018. Competition-Based Dynamic Pricing in Online Retailing: A
Methodology Validated with Field Experiments. Management Science 64(6): 2496–2514.
Galak J, Small D, Stephen AT. 2011. Microfinance Decision Making: A Field Study of Prosocial
Lending. Journal of Marketing Research (JMR) 48: S130–S137.
Gao M, Huang J. 2016. Capitalizing on Capitol Hill: Informed trading by hedge fund managers.
Journal of Financial Economics 121(3): 521–545.
Ge R, Feng J, Gu B, Zhang P. 2017. Predicting and Deterring Default with Social Media
Information in Peer-to-Peer Lending. Journal of Management Information Systems 34(2):
401–424.
Gehman J, Grimes M. 2017. Hidden Badge of Honor: How Contextual Distinctiveness Affects
Category Promotion Among Certified B Corporations. Academy of Management Journal
60(6): 2294–2320.
Geuna A et al. 2015. SiSOB data extraction and codification: A tool to analyze scientific careers.
Research Policy 44(9): 1645–1658.
Giglio S, Maggiori M, Stroebel J. 2016. No-bubble condition: Model-free tests in housing
markets. Econometrica 84(3): 1047–1091.
Giglio S, Shue K. 2014. No News Is News: Do Markets Underreact to Nothing? Review of
Financial Studies 27(12): 3389–3440.
Goldenberg J, Han S, Lehmann DR, Hong JW. 2009. The Role of Hubs in the Adoption Process.
Journal of Marketing 73(2): 1–13.
Goldenberg J, Oestreicher-Singer G, Reichman S. 2012. The Quest for Content: How User-
Generated Links Can Facilitate Online Exploration. Journal of Marketing Research
(JMR) 49(4): 452–468.
Green TC, Jame R. 2013. Company name fluency, investor recognition, and firm value. Journal
of Financial Economics 109(3): 813–834.
Greenwood BN, Gopal A. 2015. Research Note—Tigerblood: Newspapers, Blogs, and the
Founding of Information Technology Firms. Information Systems Research 26(4): 812–
828.
Greenwood BN, Gopal A. 2017. Ending the Mending Wall: Herding, Media Coverage, and
Colocation in It Entrepreneurship. MIS Quarterly 41(3): 989-A14.
Gu B, Ye Q. 2014. First Step in Social Media: Measuring the Influence of Online Management
Responses on Customer Satisfaction. Production & Operations Management 23(4): 570–
582.
Guo H, Cheng HK, Kelley K. 2016. Impact of Network Structure on Malware Propagation: A
Growth Curve Perspective. Journal of Management Information Systems 33(1): 296–325.
Guo S, Guo X, Fang Y, Vogel D. 2017. How Doctors Gain Social and Economic Returns in
Online Health-Care Communities: A Professional Capital Perspective. Journal of
Management Information Systems 34(2): 487–519.
Haans RF. 2019. What’s the value of being different when everyone is? The effects of
distinctiveness on performance in homogeneous versus heterogeneous categories.
Strategic Management Journal 40(1): 3–27.
Hahn J, Moon JY, Zhang C. 2008. Emergence of New Project Teams from Open Source
Software Developer Networks: Impact of Prior Collaboration Ties. Information Systems
Research 19(3): 369–391.
Hanley KW, Hoberg G. 2010. The Information Content of IPO Prospectuses. Review of
Financial Studies 23(7): 2821–2864.
Hanley KW, Hoberg G. 2012. Litigation risk, strategic disclosure and the underpricing of initial
public offerings. Journal of Financial Economics 103(2): 235–254.
He L. 2016. Service Region Design for Urban Electric Vehicle Sharing Systems.
Heim GR, Field JM. 2007. Process drivers of e-service quality: Analysis of data from an online
rating site. Journal of Operations Management 25(5): 962–984.
Heimbach I, Hinz O. 2018. The Impact of Sharing Mechanism Design on Content Sharing in
Online Social Networks. Information Systems Research 29(3): 592–611.
Heimeriks G, Van den Besselaar P, Frenken K. 2008. Digital disciplinary differences: An
analysis of computer-mediated science and ‘Mode 2’ knowledge production. Research
Policy 37(9): 1602–1615.
Helveston JP, Wang Y, Karplus VJ, Fuchs ER. 2019. Institutional complementarities: The
origins of experimentation in China’s plug-in electric vehicle industry. Research Policy
48(1): 206–222.
Henkel AP, Boegershausen J, Hoegg J, Aquino K, Lemmink J. 2018. Discounting humanity:
When consumers are price conscious, employees appear less human. Journal of
Consumer Psychology 28(2): 272–292.
Herzenstein M, Sonenshein S, Dholakia UM. 2011. Tell Me a Good Story and I May Lend You
Money: The Role of Narratives in Peer-to-Peer Lending Decisions. Journal of Marketing
Research (JMR) 48: S138–S149.
Hinz O, Skiera B, Barrot C, Becker JU. 2011. Seeding Strategies for Viral Marketing: An
Empirical Comparison. Journal of Marketing 75(6): 55–71.
Hoberg G, Maksimovic V. 2015. Redefining Financial Constraints: A Text-Based Analysis.
Review of Financial Studies 28(5): 1312–1352.
Hoberg G, Phillips G. 2010. Product market synergies and competition in mergers and
acquisitions: A text-based analysis. The Review of Financial Studies 23(10): 3773–3811.
Hoberg G, Phillips G. 2016. Text-Based Network Industries and Endogenous Product
Differentiation. Journal of Political Economy 124(5): 1423–1465.
Hoberg G, Phillips G. 2018. Conglomerate Industry Choice and Product Language. Management
Science 64(8): 3735–3755.
Hong Y, Hu Y, Burtch G. 2018. Embeddedness, Prosociality, and Social Influence: Evidence
from Online Crowdfunding. MIS Quarterly 42(4): 1211–1224.
Huang J. 2018. The customer knows best: The investment value of consumer opinions. Journal
of Financial Economics 128(1): 164–182.
Huang J, Boh WF, Goh KH. 2017. A Temporal Study of the Effects of Online Opinions:
Information Sources Matter. Journal of Management Information Systems 34(4): 1169–
1202.
Huang L, Tan C-H, Ke W, Wei K-K. 2013. Comprehension and Assessment of Product
Reviews: A Review-Product Congruity Proposition. Journal of Management Information
Systems 30(3): 311–343.
Islam M, Miller J, Park HD. 2017. But what will it cost me? How do private costs of
participation affect open source software projects? Research Policy 46(6): 1062–1070.
Jain N, Girotra K, Netessine S. 2014. Managing Global Sourcing: Inventory Performance.
Management Science 60(5): 1202–1222.
Jancsary D, Meyer RE, Höllerer MA, Barberio V. 2017. Toward a Structural Model of
Organizational-Level Institutional Pluralism and Logic Interconnectedness. Organization
Science 28(6): 1150–1167.
Jegadeesh N, Wu D. 2013. Word power: A new approach for content analysis. Journal of
Financial Economics 110(3): 712–729.
Jerath K, Ma L, Park Y-H, Srinivasan K. 2011. A ‘Position Paradox’ in Sponsored Search
Auctions. Marketing Science 30(4): 612–627.
Mo J, Sarkar S, Menon S. 2018. Know When to Run: Recommendations in Crowdsourcing
Contests. MIS Quarterly 42(3): 919–944.
Ketcham JD, Lucarelli C, Miravete EJ, Roebuck MC. 2012. Sinking, Swimming, or Learning to
Swim in Medicare Part D. American Economic Review 102(6): 2639–2673.
Khansa L, Ma X, Liginlal D, Kim SS. 2015. Understanding Members’ Active Participation in
Online Question-and-Answer Communities: A Theory and Empirical Analysis. Journal
of Management Information Systems 32(2): 162–203.
Kim K, Gopal A, Hoberg G. 2016. Does Product Market Competition Drive CVC Investment?
Evidence from the U.S. IT Industry. Information Systems Research 27(2): 259–281.
Kim N, Lee H, Kim W, Lee H, Suh JH. 2015. Dynamic patterns of industry convergence:
Evidence from a large amount of unstructured data. Research Policy 44(9): 1734–1748.
Kornish LJ, Ulrich KT. 2014. The Importance of the Raw Idea in Innovation: Testing the Sow’s
Ear Hypothesis. Journal of Marketing Research (JMR) 51(1): 14–26.
Kozlenkova IV, Palmatier RW, Fang E, Xiao B, Huang M. 2017. Online Relationship
Formation. Journal of Marketing 81(3): 21–40.
Kuhn P, Shen K. 2013. Gender Discrimination in Job Ads: Evidence from China. Quarterly
Journal of Economics 128(1): 287–336.
Kupor D, Tormala Z. 2018. When Moderation Fosters Persuasion: The Persuasive Power of
Deviatory Reviews. Journal of Consumer Research 45(3): 490–510.
Lash MT, Zhao K. 2016. Early Predictions of Movie Success: The Who, What, and When of
Profitability. Journal of Management Information Systems 33(3): 874–903.
Lau RYK, Liao SSY, Wong KF, Chiu DKW. 2012. Web 2.0 Environmental Scanning and
Adaptive Decision Support for Business Mergers and Acquisitions. MIS Quarterly 36(4):
1239-A6.
Lee D, Hosanagar K, Nair HS. 2018. Advertising Content and Consumer Engagement on Social
Media: Evidence from Facebook. Management Science 64(11): 5105–5131.
Lee K, Lee B, Oh W. 2015a. Thumbs Up, Sales Up? The Contingent Effect of Facebook Likes
on Sales Performance in Social Commerce. Journal of Management Information Systems
32(4): 109–143.
Lee K, Oh W-Y, Kim N. 2013. Social Media for Socially Responsible Firms: Analysis of
Fortune 500’s Twitter Profiles and their CSR/CSIR Ratings. Journal of Business Ethics
118(4): 791–806.
Lee S-H, Mun HJ, Park KM. 2015b. When is dependence on other organizations burdensome?
The effect of asymmetric dependence on internet firm failure. Strategic Management
Journal 36(13): 2058–2074.
Li C, Luo X, Zhang C, Wang X. 2017. Sunny, Rainy, and Cloudy with a Chance of Mobile
Promotion Effectiveness. Marketing Science 36(5): 762–779.
Li J, Granados N, Netessine S. 2014. Are Consumers Strategic? Structural Estimation from the
Air-Travel Industry. Management Science 60(9): 2114–2137.
Li W, Chen H, Nunamaker JF. 2016. Identifying and Profiling Key Sellers in Cyber Carding
Community: AZSecure Text Mining System. Journal of Management Information
Systems 33(4): 1059–1086.
Liang H, Marquis C, Renneboog L, Sun SL. 2018. Future-Time Framing: The Effect of
Language on Corporate Future Orientation. Organization Science. Available at:
http://pubsonline.informs.org/doi/10.1287/orsc.2018.1217.
Liu TX, Yang J, Adamic LA, Chen Y. 2014. Crowdsourcing with All-Pay Auctions: A Field
Experiment on Taskcn. Management Science 60(8): 2020–2037.
Lobschat L, Osinga EC, Reinartz WJ. 2017. What Happens Online Stays Online? Segment-
Specific Online and Offline Effects of Banner Advertisements. Journal of Marketing
Research (JMR) 54(6): 901–913.
Lu Y, Jerath K, Singh PV. 2013. The Emergence of Opinion Leaders in a Networked Online
Community: A Dyadic Model with Time Dynamics and a Heuristic for Fast Estimation.
Management Science 59(8): 1783–1799.
Luo X, Zhang J. 2013. How Do Consumer Buzz and Traffic in Social Media Marketing Predict
the Value of the Firm? Journal of Management Information Systems 30(2): 213–238.
Luo X, Zhang J, Duan W. 2013. Social Media and Firm Equity Value. Information Systems
Research 24(1): 146–163.
Wu L. 2013. Social Network Effects on Productivity and Job Security: Evidence from the
Adoption of a Social Networking Tool. Information Systems Research 24(1): 30–51.
Mai F, Shan Z, Bai Q, Wang X, Chiang RHL. 2018. How Does Social Media Impact
Bitcoin Value? A Test of the Silent Majority Hypothesis. Journal of Management
Information Systems 35(1): 19–52.
Malesky E, Taussig M. 2017. The Danger of Not Listening to Firms: Government
Responsiveness and the Goal of Regulatory Compliance. Academy of Management
Journal 60(5): 1741–1770.
Marino A, Aversa P, Mesquita L, Anand J. 2015. Driving performance via exploration in
changing environments: Evidence from formula one racing. Organization Science 26(4):
1079–1100.
Marquis C, Bird Y. 2018. The Paradox of Responsive Authoritarianism: How Civic Activism
Spurs Environmental Penalties in China. Organization Science 29(5): 948–968.
Martin KD, Borah A, Palmatier RW. 2017. Data Privacy: Effects on Customer and Firm
Performance. Journal of Marketing 81(1): 36–58.
Mas A. 2017. Does Transparency Lead to Pay Compression? Journal of Political Economy
125(5): 1683–1721.
Massimino B, Gray JV, Boyer KK. 2017. The Effects of Agglomeration and National Property
Rights on Digital Confidentiality Performance. Production & Operations Management
26(1): 162–179.
Mayzlin D, Dover Y, Chevalier J. 2014. Promotional reviews: An empirical investigation of
online review manipulation. American Economic Review 104(8): 2421–55.
McDevitt RC. 2014. ‘A’ Business by Any Other Name: Firm Name Choice as a Signal of Firm
Quality. Journal of Political Economy 122(4): 909–944.
Mogilner C, Aaker J, Kamvar SD. 2012. How Happiness Affects Choice. Journal of Consumer
Research 39(2): 429–443.
Moon JY, Sproull LS. 2008. The Role of Feedback in Managing the Internet-Based Volunteer
Work Force. Information Systems Research 19(4): 494–515.
Munzert S, Rubba C, Meißner P, Nyhuis D. 2014. Automated data collection with R: A
practical guide to web scraping and text mining. John Wiley & Sons.
Nelson AJ. 2009. Measuring knowledge spillovers: What patents, licenses and publications
reveal about innovation diffusion. Research Policy 38(6): 994–1005.
Nishida M, Remer M. 2018. The Determinants and Consequences of Search Cost Heterogeneity:
Evidence from Local Gasoline Markets. Journal of Marketing Research (JMR) 55(3):
305–320.
Oertel S, Thommes K. 2018. History as a source of organizational identity creation.
Organization Studies 39(12): 1709–1731.
Oestreicher-Singer G, Sundararajan A. 2012a. Recommendation Networks and the Long Tail of
Electronic Commerce. MIS Quarterly 36(1): 65-A4.
Oestreicher-Singer G, Sundararajan A. 2012b. The Visible Hand? Demand Effects of
Recommendation Networks in Electronic Markets. Management Science 58(11): 1963–
1981.
Oestreicher-Singer G, Zalmanson L. 2013. Content or Community? A Digital Business Strategy
for Content Providers in the Social Age. MIS Quarterly 37(2): 591–616.
Olivares M, Cachon GP. 2009. Competing Retailers and Inventory: An Empirical Investigation
of General Motors’ Dealerships in Isolated U.S. Markets. Management Science 55(9):
1586–1604.
Paik Y, Zhu F. 2016. The Impact of Patent Wars on Firm Strategy: Evidence from the Global
Smartphone Industry. Organization Science 27(6): 1397–1416.
Palmer JW. 2002. Web Site Usability, Design, and Performance Metrics. Information Systems
Research 13(2): 151–167.
Pancer E, Chandler V, Poole M, Noseworthy TJ. 2018. How Readability Shapes Social Media
Engagement. Journal of Consumer Psychology.
Pant G, Sheng ORL. 2015. Web Footprints of Firms: Using Online Isomorphism for Competitor
Identification. Information Systems Research 26(1): 188–209.
Pant G, Srinivasan P. 2010. Predicting Web Page Status. Information Systems Research 21(2):
345–364.
Pant G, Srinivasan P. 2013. Status Locality on the Web: Implications for Building Focused
Collections. Information Systems Research 24(3): 802–821.
Paravisini D, Rappoport V, Schnabl P, Wolfenzon D. 2015. Dissecting the Effect of Credit
Supply on Trade: Evidence from Matched Credit-Export Data. Review of Economic
Studies 82(1): 333–359.
Perren L, Sapsed J. 2013. Innovation as politics: The rise and reshaping of innovation in UK
parliamentary discourse 1960–2005. Research Policy 42(10): 1815–1828.
Phan HV. 2014. Inside Debt and Mergers and Acquisitions. Journal of Financial & Quantitative
Analysis 49(5/6): 1365–1401.
Quariguasi-Frota-Neto J, Bloemhof J. 2012. An Analysis of the Eco-Efficiency of
Remanufactured Personal Computers and Mobile Phones. Production & Operations
Management 21(1): 101–114.
Rabinovich E, Sinha R, Laseter T. 2011. Unlimited shelf space in Internet supply chains:
Treasure trove or wasteland? Journal of Operations Management 29(4): 305–317.
Rauh JD. 2009. Risk Shifting versus Risk Management: Investment Policy in Corporate Pension
Plans. Review of Financial Studies 22(7): 2687–2733.
Reich T, Kupor DM, Smith RK. 2018. Made by Mistake: When Mistakes Increase Product
Preference. Journal of Consumer Research 44(5): 1085–1103.
Ren CR, Hu Y, Cui TH. 2019. Responses to rival exit: Product variety, market expansion, and
preexisting market structure. Strategic Management Journal 40(2): 253–276.
Ren CR, Hu Y, Hausman J, Hu YJ. 2011. Managing Product Variety and Collocation in
a Competitive Environment: An Empirical Investigation of Consumer Electronics
Retailing. Management Science 57(6): 1009–1024.
Samtani S, Chinn R, Chen H, Nunamaker JF. 2017. Exploring Emerging Hacker Assets and Key
Hackers for Proactive Cyber Threat Intelligence. Journal of Management Information
Systems 34(4): 1023–1053.
Sayedi A, Jerath K, Srinivasan K. 2014. Competitive Poaching in Sponsored Search Advertising
and Its Strategic Impact on Traditional Advertising. Marketing Science 33(4): 586–608.
Seiler S, Yao S, Wang W. 2017. Does Online Word of Mouth Increase Demand? (And How?)
Evidence from a Natural Experiment. Marketing Science 36(6): 838–861.
Setia P, Rajagopalan B, Sambamurthy V, Calantone R. 2012. How Peripheral Developers
Contribute to Open-Source Software Development. Information Systems Research 23(1):
144–163.
Yang S, Ghose A. 2010. Analyzing the Relationship Between Organic and Sponsored Search
Advertising: Positive, Negative, or Zero Interdependence? Marketing Science 29(4):
602–623.
Simchi-Levi D. 2019. From the Editor. Management Science 65(2): v–vi.
Sismeiro C, Bucklin RE. 2004. Modeling Purchase Behavior at an E-Commerce Web Site: A
Task-Completion Approach. Journal of Marketing Research (JMR) 41(3): 306–323.
Sonnier GP, McAlister L, Rutz OJ. 2011. A Dynamic Model of the Effect of Online
Communications on Firm Sales. Marketing Science 30(4): 702–716.
Sosa ME, Mihm J, Browning TR. 2013. Linking Cyclicality and Product Quality. Manufacturing
& Service Operations Management 15(3): 473–491.
Stango V, Zinman J. 2009. What Do Consumers Really Pay on Their Checking and Credit Card
Accounts? Explicit, Implicit, and Avoidable Costs. American Economic Review 99(2):
424–429.
Stango V, Zinman J. 2014. Limited and Varying Consumer Attention: Evidence from Shocks to
the Salience of Bank Overdraft Fees. Review of Financial Studies 27(4): 990–1030.
Thies F, Wessel M, Benlian A. 2016. Effects of Social Interaction Dynamics on Platforms.
Journal of Management Information Systems 33(3): 843–873.
Toubia O, Netzer O. 2017. Idea Generation, Creativity, and Prototypicality. Marketing Science
36(1): 1–20.
Trusov M, Ma L, Jamal Z. 2016. Crumbs of the Cookie: User Profiling in Customer-Base
Analysis and Behavioral Targeting. Marketing Science 35(3): 405–426.
Tucker C, Zhang J. 2011. How Does Popularity Information Affect Choices? A Field
Experiment. Management Science 57(5): 828–842.
Van Osch W, Steinfield CW. 2018. Strategic Visibility in Enterprise Social Media: Implications
for Network Formation and Boundary Spanning. Journal of Management Information
Systems 35(2): 647–682.
Villarroel Ordenes F, Ludwig S, De Ruyter K, Grewal D, Wetzels M. 2017. Unveiling What Is
Written in the Stars: Analyzing Explicit, Implicit, and Discourse Patterns of Sentiment in
Social Media. Journal of Consumer Research 43(6): 875–894.
Wang X, Mai F, Chiang RHL. 2014. Database Submission—Market Dynamics and
User-Generated Content About Tablet Computers. Marketing Science 33(3): 449–458.
Dong W, Liao S, Zhang Z. 2018. Leveraging Financial Social Media Data for
Corporate Fraud Detection. Journal of Management Information Systems 35(2): 461–487.
Wooten JO, Ulrich KT. 2017. Idea Generation and the Role of Feedback: Evidence from Field
Experiments with Innovation Tournaments. Production & Operations Management
26(1): 80–99.
Wu J, Shi M, Hu M. 2015. Threshold Effects in Online Group Buying. Management Science
61(9): 2025–2040.
Zeng X, Wei L. 2013. Social Ties and User Content Generation: Evidence from
Flickr. Information Systems Research 24(1): 71–87.
Li X, Wu L. 2018. Herding and Social Media Word-of-Mouth: Evidence from Groupon.
MIS Quarterly 42(4): 1331–1351.
Yu S, Johnson S, Lai C, Cricelli A, Fleming L. 2017. Crowdfunding and regional entrepreneurial
investment: an application of the CrowdBerkeley database. Research Policy 46(10):
1723–1737.
Shi Z, Rui H, Whinston AB. 2014. Content Sharing in a Social Broadcasting
Environment: Evidence from Twitter. MIS Quarterly 38(1): 123-A6.
Zhang D, Zhou L, Kehoe JL, Kilic IY. 2016. What Online Reviewer Behaviors Really Matter?
Effects of Verbal and Nonverbal Behaviors on Detection of Fake Online Reviews.
Journal of Management Information Systems 33(2): 456–481.
Zheng Z, Pavlou PA, Gu B. 2014. Latent Growth Modeling for Information Systems:
Theoretical Extensions and Practical Applications. Information Systems Research 25(3):
547–568.
Figures and Tables
Figure 1: Publication of crawling papers by year
Table 1: Number and share of published crawling papers by field

Field | Crawling Papers | Share of Publications
Accounting: Journal of Accounting Research (6), Contemporary Accounting Research (2), Accounting, Organizations and Society (1), Review of Accounting Studies (1) | 10 | 0.15%
Economics: American Economic Review (4), Journal of Political Economy (3), Review of Economic Studies (3), Quarterly Journal of Economics (2), Econometrica (1), Behavioral Economics Department of Management Science (1) | 14 | 0.07%
Innovation & Entrepreneurship: Research Policy (12), Journal of Business Venturing (1) | 13 | 0.67%
Ethics: Journal of Business Ethics (2) | 2 | 0.04%
Finance: Review of Financial Studies (7), Journal of Financial Economics (6), Journal of Finance (2), Journal of Financial & Quantitative Analysis (2), Finance Department of Management Science (2), Review of Finance (1) | 20 | 0.16%
Management: Strategic Management Journal (2), Academy of Management Journal (1), Business Strategy Department of Management Science (1), Journal of Management Studies (1) | 5 | 0.04%
Marketing: Marketing Science (12), Journal of Marketing Research (9), Journal of Consumer Research (6), Journal of Marketing (5), Marketing Department of Management Science (3), Journal of Consumer Psychology (2) | 37 | 0.25%
Operations & Information Systems: Information Systems Research (21), Journal of Management Information Systems (18), Information Systems Department of Management Science (12), MIS Quarterly (11), Production and Operations Management (6), Journal of Operations Management (3), Manufacturing & Service Operations Management (3), Operations Research (1) | 75 | 0.76%
Organizational Behavior: Organization Science (6), Organization Studies (1) | 7 | 0.12%

Notes: Numbers in parentheses denote the number of published crawling papers in each journal.
Appendix
Table A.1: Published crawling papers by field and setting
Notes: Numbers in parentheses denote setting of paper: 1: Advertisement 2: Between-Industry
3: Crowdsourcing 4: Crowdfunding 5: eCommerce 6: ICT 7: Online Content 8: Online
recommendation systems 9: Open Source 10: Other 11: Search Engine 12: Social Networks
13: User-Generated-Content 14: Within-Industry
Economics: Cavallo et al. (2014,14), Chen (2013,10), Mas (2017,2), DellaVigna and Hermle
(2017,14), Hoberg and Phillips (2016,2), Augenblick (2016,5), Paravisini et al. (2015,2),
McDevitt (2014,14), Mayzlin et al. (2014,8), Kuhn and Shen (2013,10), Ketcham et al.
(2012,14), Stango and Zinman (2009,14), Liu et al. (2014,3), Giglio et al. (2016,10)
Ethics: De Bakker and Hellsten (2013,10), Lee et al. (2013,2)
Finance: Da et al. (2014,2), Ben-David et al. (2017,14), Hoberg and Maksimovic (2015,2),
Giglio and Shue (2014,2), Phan (2014,2), Stango and Zinman (2014,14), Das and Chen
(2007,2), Hoberg and Phillips (2010,2), Hanley and Hoberg (2010,2), Rauh (2009,2), Busse et
al. (2014,14), Hoberg and Phillips (2018,2), Campello et al. (2011,2), Engelberg and Gao
(2011,2), Huang (2018,5), Jegadeesh and Wu (2013,2), Gao and Huang (2016,14), Green and
Jame (2013,2), Hanley and Hoberg (2012,2), Cai and Sevilir (2012,2)
Innovation & Entrepreneurship: Yu et al. (2017,4), Geuna et al. (2015,10), Fischer and
Reuber (2014,2), Kim et al. (2015,2), Feldman and Lowe (2015,2), Corredoira et al.
(2018,10), Islam et al. (2017,9), Helveston et al. (2019,10), Heimeriks et al. (2008,6),
Nelson (2009,10), Perren and Sapsed (2013,10), Claussen et al. (2015,6)
Management: Gehman and Grimes (2017,2), Malesky and Taussig (2017,2), Ren et al.
(2011,5), Calic and Mosakowski (2016,4), Haans (2019,10), Lee et al. (2015b,6), Ren et al.
(2019,5)
Marketing: Kupor and Tormala (2018,2), Reich et al. (2018,2), Nishida and Remer (2018,14),
Lobschat et al. (2017,5), Arora et al. (2017,6), Kozlenkova et al. (2017,5), Bellezza et al.
(2017,12), Goldenberg et al. (2009,12), Villarroel Ordenes et al. (2017,8), Martin et al.
(2017,2), De Langhe et al. (2016,5), Kornish and Ulrich (2014,14), Goldenberg et al. (2012,7),
Berger and Milkman (2012,7), Hinz et al. (2011,14), Mogilner et al. (2012,13), Sonnier et al.
(2011,5), Sismeiro and Bucklin (2004,5), Bucklin and Sismeiro (2003,5), Oestreicher-Singer
and Sundararajan (2012b,5), Jerath et al. (2011,11), Tucker and Zhang (2011,14),
Yang and Ghose (2010,5), Galak et al. (2011,4), Herzenstein et al. (2011,4),
Seiler et al. (2017,13), Datta et al. (2018,7), Wang et al. (2014,6), Sayedi et al.
(2014,14), Trusov et al. (2016,11), Bronnenberg et al. (2016,6), Toubia and Netzer (2017,3),
Li et al. (2017,14), Wu et al. (2015,5), Henkel et al. (2018,10), Pancer et al. (2018,12)
Operations & Information Systems: Aguiar et al. (2018,3), Massimino et al. (2017,6),
Greenwood and Gopal (2017,6), Mo et al. (2018,3), Bapna et al. (2018,7), Li and
Wu (2018,5), Hong et al. (2018,4), Nishida and Remer (2018,14), Van Osch and
Steinfield (2018,14), Dong et al. (2018,12), Mai et al. (2018,6), Pant and Srinivasan
(2013,12), Huang et al. (2017,7), Samtani et al. (2017,12), Bandyopadhyay and
Bandyopadhyay (2009,5), Carmi et al. (2017,5), Guo et al. (2017,12), Ge et al. (2017,4), Li et
al. (2016,12), Lash and Zhao (2016,14), Thies et al. (2016,4), Zhang et al. (2016,8), Guo et al.
(2016,6), Lee et al. (2015a,5), Dissanayake et al. (2015,3), Bodea et al. (2009,14),
Shi et al. (2014,13), Huang et al. (2013,5), Oestreicher-Singer and Zalmanson (2013,7),
Lau et al. (2012,2), Chau and Xu (2012,13), Luo and Zhang (2013,6), Toubia and Netzer
(2017,2), Oestreicher-Singer and Sundararajan (2012a,5), Nishida and Remer (2018,3), Pant
and Srinivasan (2010,11), Abrahams et al. (2015,14), Gu and Ye (2014,14), Zeng and
Wei (2013,13), Wu (2013,14), Lu et al. (2013,12), Sosa et al. (2013,9), Setia et
al. (2012,9), Quariguasi-Frota-Neto and Bloemhof (2012,6), Brynjolfsson et al. (2009,5),
Moon and Sproull (2008,12), Olivares and Cachon (2009,14), Hahn et al. (2008,9), Palmer
(2002,5), Pant and Srinivasan (2010,12), Li et al. (2014,14), Cachon et al. (2018,14), Luo et al.
(2013,6), Claussen et al. (2013,12), Dhar et al. (2014,5), Zheng et al. (2014,5), Pant and Sheng
(2015,2), Agarwal et al. (2015,5), Agarwal et al. (2015,6), Kim et al. (2016,6), Bauer et al.
(2016,3), Heimbach and Hinz (2018,12), Cao et al. (2018,5), Aguiar et al. (2018,1), Jain et al.
(2014,14), Bao and Datta (2014,2), Bapna and Umyarov (2015,12), Aaltonen and Seiler
(2016,3), Fisher et al. (2018,5), Chen et al. (2018,8), Lee et al. (2018,12), He (2016,10),
Bertsimas et al. (2014,10), Heim and Field (2007,8), Rabinovich et al. (2011,5), Bockstedt et
al. (2015,3), Wooten and Ulrich (2017,3)
Organizational Behavior: Bianchi et al. (2012,9), Paik and Zhu (2016,6), Liang et al.
(2018,2), Marino et al. (2015,14), Jancsary et al. (2017,14), Marquis and Bird (2018,2), Oertel
and Thommes (2018,10)