Obtaining Data from the Internet: A Guide to Data Crawling in Management Research

Jörg Claussen
Munich School of Management, LMU Munich, Kaulbachstr. 45, 80539 Munich, Germany and
Department of Strategy and Innovation, Copenhagen Business School, Kilevej 14a, 2000
Frederiksberg, Denmark, j.claussen@lmu.de
Christian Peukert
Católica Lisbon School of Business and Economics, 1649-023 Lisbon, Portugal and
ETH Zürich, Center for Law and Economics, 8092 Zürich, Switzerland,
christian.peukert@gmail.com
Abstract: The increasing availability of data on the Internet opens new opportunities for
management research, and the method of data crawling can be used for automated large-scale data
extraction. We show that data crawling has quickly gained popularity and is used for a wide variety
of purposes, but has so far gained less traction in the field of management. We argue that many
data sets used in other disciplines could also be used to answer questions in management research
and show that setting up a data crawler does not require advanced programming skills. However,
many pitfalls can undermine the success of using crawled data for research. We develop a guideline
for crawling projects and discuss how the most frequently occurring challenges can be addressed.
Keywords: Crawler, Spider, Scrape, Bot, Data
Acknowledgments: We thank participants of data crawling workshops at Baruch College, Georgia
Tech, LMU Munich, the AOM Conference on Big Data, and the DRUID Academy for valuable
feedback and Laura Krahe-Steinke for excellent research assistance. Peukert acknowledges
support from FCT - Portuguese Foundation of Science and Technology for the project
UID/GES/00407/2013 and FCT-PTDC/EGE-OGE/27968/2017.
1. INTRODUCTION
As larger parts of the economy go through the process of digital transformation, more and more
data about market interactions of agents such as firms and consumers are generated, collected, and
combined. Much of these data are directly or indirectly available on the Internet, for example
because some of the interactions happen online, but there is also a plethora of private and public
organizations, such as market research firms and statistical offices, that are in the business of
collecting, aggregating, and distributing data. It goes without saying that these data are highly
relevant for management research. For example, product-level data can be informative about how
firms compete, transaction-level data can be informative about how customers perceive the quality
of products, and some forms of metadata can be informative about how firms organize. The issue
of course is that data is very often not readily available in dedicated databases and accessible with
a simple one-click download. Quite the opposite is the case. Data come in unstructured formats
and are cluttered across many webpages. For example, if one were interested in all studying the
competitive dynamics on a marketplace like eBay, even a single product category would spread
across thousands of subpages, making it infeasible to manually download all pages and extract
relevant information like product characteristics, price, and seller ratings by copying it to a
spreadsheet.
Data crawling, sometimes also called web scraping or spidering, is the method that allows
researchers to automatically extract data from the Internet. Using automated systems (“bots”) to
extract data has many practical applications. Popular services such as search engines, price
comparison websites, or news aggregators are essentially huge data crawling operations, but bots
are also used for malicious ends, e.g. advertising fraud and cyber attacks. Industry reports suggest
that the better half of web traffic is non-human.[1] It is therefore perhaps not surprising that the
economics and business research community, too, has started to use crawling as a data collection
method. We
conduct an extensive literature review of all articles published in journals on the Financial Times
50 list between 2000 and 2018 and show that there is an overall increase in papers that use data
crawling, increasing from a handful to about 30 papers per year, reaching a total of 183 papers.
The variety of use cases and types of data is striking. We also identify important variation by field.
While the share of papers that use data crawling is relatively large in Operations & Information
Systems and Innovation & Entrepreneurship, it is much smaller in Management. Because we
believe that there are ample possibilities for management researchers to gain significant insights
from data that is available on the Internet, this lagging behind motivates us to provide a guide to
setting up data crawling in this paper.
We provide an intuitive roadmap of the crawling process, starting with a discussion of software
tools and boundary conditions and some best practices for identifying patterns and automating
tasks. We then give some guidance on parsing data that is embedded in the source code of
webpages. Finally, we discuss some heuristic solutions to common challenges of the automated
data collection process, including the matching of multiple datasets without common identifiers,
optimizing run-time, and the crawling of panel data.
We do not aim to substitute for the large volume of very good technical resources that are helpful
to master data crawling (such as Munzert et al., 2014), but rather to complement this literature
with a hands-on guide full of best practices that we deem especially relevant for management
researchers who want to enhance their methodological skill set.

[1] Available at https://www.theatlantic.com/technology/archive/2017/01/bots-bots-bots/515043/, accessed April 16, 2019.
2. USE OF DATA CRAWLING IN BUSINESS RESEARCH
We start by giving an overview about the use of data crawling in different areas of business
research and derive implications of how data crawling can be used in management research. To do
this, we identified papers published in the leading economics and business journals that build on
data that is crawled from the Internet. For our analysis, we conducted a full-text search within all
50 journals included in the Financial Times list for the search patterns 'crawl*',
'scrapi*', or 'scrape*'. We then manually checked all search results to exclude false
positives and identified a total of 183 papers published between 2000 and 2018[2] that clearly state
that the authors used data crawling to obtain data. While these 183 papers are only 0.21% of all
papers published in the period, Figure 1 highlights that the number of published crawling papers has
seen a strong increase within the last couple of years.
------------ INSERT FIGURE 1 HERE ------------
Because we can only classify those articles that explicitly mention that data has been obtained
through crawling, we expect the true number of papers that used crawling methods to be
significantly higher. An example of a false negative would be a case where authors simply state
that they obtained data from a specific website, but do not give further details on the data collection
method.
[2] Three papers were forthcoming in 2018 but were later published in print in 2019 and are therefore included.
Next, we report the fields in which crawling papers have been published in Table 1. In relative
terms, most crawling papers have been published in Operations & Information Systems (0.76% of
all published papers in this field), Innovation & Entrepreneurship (0.67%) as well as in Marketing
(0.25%). The field of Management is clearly lagging behind, with only 0.04% of all published
articles explicitly stating the use of crawling methods. Among the five identified crawling papers in
management journals, two have been published in the Strategic Management Journal. Compared
to that, there are 21 published articles in Information Systems Research, 12 published articles in
Research Policy, and 12 published articles in Marketing Science.
We will now discuss the five identified papers published in management journals to exemplify
how data crawling can be used in the field.
Ren et al. (2011) study the impact of local competition on a vendor’s decision to provide product
variety, using the example of the competition between Best Buy and Circuit City. They use data
crawling to obtain store-specific product variety data for digital cameras from the websites of Best
Buy and Circuit City and find that competition in the same market generally increases product
variety, except when the two competing stores are collocated. In a follow-up study, Ren et al. (2019)
examine responses to rival exit, looking at how Best Buy reacted to the exit of Circuit City.
They crawl additional data from Best Buy’s website after the exit of Circuit City and find that Best
Buy reacted to the exit of a close-by Circuit City shop by offering more digital cameras.
Gehman and Grimes (2017) and Haans (2019) use text and picture data scraped from corporate
websites to study the strategic positioning of firms relative to their competitive environment. The
first paper investigates the characteristics of firms that seek formal membership in an
organizational category. The second paper focuses on the performance implications of the
trade-off between distinctiveness and legitimacy, which it shows to depend on how
homogeneous an organizational category is.
Calic and Mosakowski (2016) study innovative forms of getting access to capital with a specific
focus on social entrepreneurship. Using scraped data on crowdfunding campaigns from
Kickstarter, they compare the funding success of projects with and without sustainability goals,
and look into the role that legitimacy and creativity play in this relationship.
As we see from the examples above, data crawling can be used to obtain both product-level as well
as firm-level data to answer relevant questions for management and strategy scholars. In addition
to these examples that are situated in the fields of competitive strategy and innovation strategy,
one could also use individual-level crawled data to study questions of strategic human capital and
stakeholder strategy. Examples for individual-level data available on the Internet are salary
information of public-sector employees (Mas, 2017), contributor information from open source
software (Hahn, Moon, and Zhang, 2008) and user innovation communities (Wooten and Ulrich,
2017), or customer feedback (Chau and Xu, 2012).
------------ INSERT TABLE 1 HERE ------------
We additionally report all crawling papers published in journals included in the FT50 list in Table
A.1 in the appendix. We have furthermore categorized each paper according to one of these 14
different settings: advertisement, between-industry, crowdsourcing, crowdfunding, eCommerce,
online content, online recommendation systems, open source, search engine, social networks, user-
generated-content, within-industry, and other. We believe that many of these settings could
provide interesting data to examine important questions that are relevant for a management
audience. As examples, the setting of crowdfunding could be used to study knowledge and
innovation, the setting of open source to study strategic human capital, or social networks to study
behavioral strategy. The eCommerce and online recommendation systems settings lend themselves
to the study of platform strategy.
We furthermore investigated the importance of crawled data for the published papers and found
that in 61.3% of the cases, the papers relied exclusively on crawled data, while additional data
sources were used in the other cases. We expect that the share of crawling papers relying on
multiple data sources will go up in the future as more of the first-order questions that can be
addressed with single data sets will have already been addressed and as combining different data
sources allows examining interesting and new research questions. These additional data sources
could range from proprietary firm data to public registries.
Finally, we checked if the authors mentioned the use of application programming interfaces (APIs)
to obtain the data, which has been the case in 10.2% of the published papers. As we discuss below,
APIs are increasingly provided by platform operators aiming to open up to third parties, but can
also be used by researchers as an easy way to acquire data. We would therefore also expect more
API usage over time.
3. THE CRAWLING PROCESS
In this section, we introduce a step-by-step guideline for setting up the crawling process. There are
usually many degrees of freedom in how to implement the crawling process. To give an
overview, we provide a guiding framework that shows the different choices that can be made as
well as their respective advantages and disadvantages.
A number of programming languages offer possibilities for crawling data from the Internet, and
within these languages, many packages offer additional crawling-related functionality. The most
popular language for data crawling is probably Python, a multi-purpose programming language
used in web development and data analytics that offers powerful packages for facilitating the
crawling process such as requests, BeautifulSoup, or Scrapy. But there are also many other tools,
such as R, Java, or Perl, that serve well for data crawling. The choice of programming language
should take into account one's own prior experience with programming languages, so it is difficult
to give a clear recommendation. Furthermore, new functionality is added to programming
languages all the time, so one should always do an up-to-date comparison before selecting a
language. In addition, firms offer ready-to-use tools, e.g. import.io or Parsehub, which depending
on the complexity of the crawling task might provide an efficient solution.
It is of course also possible to outsource data crawling tasks to freelancers, who are easy to find
on micro-labor platforms such as Freelancer.com or Upwork. But even in this case, it is important
to build a general understanding of how data crawling works in order to be able to specify and
supervise the task and assess the quality of the obtained data.
3.1. Boundaries for data crawling
The first and most important step of each crawling project is to determine if it actually makes sense
to use crawling to obtain the desired data. We discuss the most important boundaries that would
rule out using data crawling. Considering these boundaries at the beginning of the project is
important as this might save a lot of resources.
A first consideration before the start of a project would be to assess if data crawling is really the
most efficient way to obtain the desired data or if there would also be alternative sources. Some
websites make parts or all of their data available for direct download. An example would be the
Internet Movie Database, which provides much of its data on movies as direct downloads.[3]
One should therefore always check whether a website provides direct data access. If this is
not the case, one could also contact the website operators directly and ask if they would be
willing to provide direct access to their data. Furthermore, many of the most prominent websites
are crawled by a lot of parties, and some of them make their crawling results available to others.
These could either be non-commercial initiatives (such as InsideAirbnb,[4] which provides
crawled data from Airbnb free of charge) or firms that sell the data they crawled (such as
AirDNA,[5] which also regularly crawls Airbnb).
Second, one should consider the number of observations that need to be collected. If the number
of observations is relatively low, it may be more efficient to collect the data manually instead
of writing a crawler to extract it automatically. The cut-off value at which writing a crawler
becomes more efficient depends of course on individual circumstances such as programming skills or
access to research assistants.
Third, data crawling is more attractive the more structured a website and the data embedded in
it are. If the data of interest is embedded in free text, it can become really difficult to extract
the desired pieces of information. However, if a website presents its data in a tabular structure,
the same piece of information will always be displayed at the same position on the page and
extraction of information becomes much easier. Very often, there are multiple websites that
collect data on the same industry and that vary in how structured their data is. For example,
Wikipedia contains information on a lot of video games, but this data is for the most part not
very standardized. In contrast, the platform MobyGames[6] provides a lot of information on video
games in a highly structured way. One should therefore invest some time in searching for the
website that provides the desired kind of data in the most structured way.

[3] Available at https://www.imdb.com/interfaces/, accessed December 12, 2018.
[4] Available at http://insideairbnb.com/, accessed December 12, 2018.
[5] Available at https://www.airdna.co/, accessed December 12, 2018.
Fourth, many website owners do not want third parties to automatically extract their content even
if they make it freely available to the human reader. They then try to block attempts to crawl their
data in an automated way. Even if there are possibilities to circumvent some of these measures
(e.g. switching IP addresses, hiding behind proxy servers, using the Tor network), this still makes
the crawling process more complex and uncertain. A problem is that most of the time it is not
possible to determine ex ante whether any anti-crawling measures are employed and how effective
they are. One way to assess the probability of anti-crawling measures being in place is to consider
the economic incentives of the website owner for keeping third parties from extracting their data
in an automated way. For example, if copying the data would allow a competitor to compete more
closely, or if the owner also wants to sell the data to interested parties, the odds that the website
employs anti-crawling measures are high. In contrast, it is unlikely that a non-commercial project
would employ such measures.
The fifth key factor in determining whether a website should be considered for crawling is whether
the website requires a login to access the data of interest. Requiring a login has legal as well as
technical implications. From a legal perspective, users have to create an account before they can
log in, and nearly all websites require users to proactively agree to their terms and conditions
before the account can be created. As the terms and conditions of many websites explicitly prohibit
data crawling, crawling anyway would result in a breach of contract. In addition to these legal
implications, writing a crawler that first logs in to a website before downloading the content can
be slightly more complex from a technical perspective.[7]

[6] Available at https://www.mobygames.com/, accessed December 12, 2018.
3.2. Identification of observations
To arrive at the desired data, it is usually necessary to download a large number of subpages
from a website. These subpages could contain anything from user profiles to product reviews to
firm-specific financial information. A key challenge in many crawling projects is to determine
the addresses (URLs) of the subpages from which content should be extracted. We will now discuss
several alternative strategies for identifying the focal observations. We rank these options by
increasing effort, i.e. one should always try the first options and only resort to the more complex
alternatives if the earlier ones are not successful.
First, one should check if the website offers data access through an application programming
interface (API). Many websites offer these interfaces to their data if they are operating a platform
business model in which they rely on third parties joining their platform. It is often possible to
obtain much of the data of these platform markets through the API and the API usually also
includes functions to search for or list the data sets included in the platform.
Second, the subpages of a websites are sometimes numbered by an increasing identifier. If the
URL of the website has a form such as http://domain.com/objects/x and x is a number
that counts up by one for each subpage, it is sufficient to loop through x until no more results are
returned.
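Such a loop over numeric identifiers can be sketched in a few lines of Python. The URL pattern and the stopping rule are assumptions for illustration, and `fetch` stands for any function that retrieves a URL and returns the HTTP status code and page text (e.g. a thin wrapper around requests.get); making it a parameter keeps the loop logic testable without network access.

```python
# Sketch: loop over numeric subpage identifiers until the site stops
# returning results. URL pattern and stopping rule are assumptions.
def collect_by_id(fetch, base_url, max_misses=5):
    pages, page_id, misses = {}, 1, 0
    while misses < max_misses:  # tolerate small gaps in the ID sequence
        status, text = fetch(f"{base_url}/objects/{page_id}")
        if status == 200:
            pages[page_id] = text
            misses = 0
        else:
            misses += 1
        page_id += 1
    return pages
```

Tolerating a few consecutive misses (rather than stopping at the first) guards against deleted objects leaving holes in the identifier sequence.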
[7] Many less sophisticated websites use simple authentication methods such as Htaccess, with which logging in is as easy as adding a specific header to the HTTP request. For other authentication methods, an emerging standard seems to be OAuth, for which packages exist in all major programming languages.
Third, many (but not all) websites offer a machine readable directory of all subpages in a sitemap
file. These sitemap files are standardized and are provided in the XML format. As each XML file
is only allowed to contain the URLs of up to 50,000 subpages, there is often a two-tier structure
of sitemaps. The first tier just consists of a directory of all further sitemaps, while the second tier
contains the actual URLs of the subpages. An advantage of this two-tier structure is that the
sitemaps are usually sorted by the type of subpages and one can then directly obtain the desired
type of objects. Sitemaps are often located at http://domain.com/sitemap.xml, but this
location is not standardized. The location of the sitemap is often reported in the robots.txt file that
is located at http://domain.com/robots.txt and that provides directions for crawlers.
Finally, one can also use an Internet search engine and search for sitemaps within the domain of
the website.
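Because sitemap files are standardized XML, extracting their URLs needs only Python's built-in XML parser. A minimal sketch that collects all <loc> entries; it works for both tiers of a two-tier sitemap (applied to the first tier, it returns the URLs of the second-tier sitemap files):

```python
import xml.etree.ElementTree as ET

# Sitemap files use this standardized XML namespace.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_sitemap_urls(xml_text):
    """Return all <loc> entries of a sitemap (or sitemap index) file."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]
```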
Fourth, many websites offer their users directory pages that list all of their subpages. One can
then iterate through each of these directory pages and extract the URLs of the subpages. The URL
of a directory page usually has a structure like this:
http://domain.com/directory?page=x&results=y. Iterating through x then calls the
successive pages of the directory. Also note that many websites allow their users to change the
number of results shown on each directory page (such as choosing between 10, 20, and
50 results). It is sometimes possible to manually tune this parameter y to much larger values, which
sometimes allows returning all subpages in one go.
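Once a directory page is downloaded, the subpage URLs can be pulled out of its HTML. The link pattern /profile/ and the domain below are made-up examples; the actual pattern must be read off the target website's source code.

```python
import re

# Assumed link pattern; adapt to the target site's HTML.
HREF_PATTERN = re.compile(r'href="(/profile/[^"]+)"')

def profile_links(directory_html):
    """Collect the subpage URLs linked from one directory page."""
    return ["http://domain.com" + path
            for path in HREF_PATTERN.findall(directory_html)]
```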
Fifth, if none of the above works, one could use Internet search engines to obtain a website’s
subpages. This can be achieved by a search term such as
site:http://domain.com/profile/. Note however, that search engines usually have
strong protection against crawling and will relatively soon block requests from a crawler. It might
therefore be possible to extract a three- to four-digit number of subpages from a search engine,
but not millions.
Sixth, an iterative approach can return a large number of subpages if different subpages
are linked to each other. This might for example be the case for a social network, where each
profile page also contains information about the focal user’s network of connections. With this
iterative approach, one might first get an initial set of seed subpages by using the Internet search
approach mentioned above and then determine all the first-order links from there. Then, all the
links from all new subpages will be retrieved. This process will be repeated until no new subpages
are added. The drawback of this iterative process is that subpages which are not linked to any of
the initial seed sites will not be identified. Therefore, the iterative approach might work best if the
underlying network structure between the subpages is relatively dense.
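The iterative approach is essentially a breadth-first traversal of the link network. A minimal sketch, where get_links stands for a function that downloads a page and parses out the URLs it links to:

```python
from collections import deque

def snowball(seeds, get_links):
    """Breadth-first expansion from seed pages. `get_links` maps a URL
    to the list of URLs linked from that page."""
    seen, queue = set(seeds), deque(seeds)
    while queue:
        url = queue.popleft()
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

Note that any subpage that no seed links to, directly or indirectly, is never discovered, which is exactly the drawback of the iterative approach.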
3.3. Obtaining the content
Having identified all relevant subpages, downloading these subpages from the webserver and
extracting the desired content is the next step in the crawling process. The actual process of
downloading the contents behind a URL is surprisingly simple and can, with most programming
languages, be achieved with a single line of code. Using the requests module of Python, one can
for example download the contents of a website by calling
html_content = requests.get('http://domain.com/subpage.htm').text. This type
of command can then be integrated into a loop that iterates through all subpages.
Extracting the desired pieces of information from the subpage is usually a much bigger challenge
compared to downloading the content in the first place. This process is called parsing and the
challenge lies in uniquely identifying the location within the website that contains the desired piece
of information. As we have already discussed above, using crawling for automated data retrieval
from the Internet requires that the desired pieces of information are structured in the same way
within each subpage. As the focal piece of information will vary from subpage to subpage, one
can still identify the desired content through the location within each website. Websites are written
in the HyperText Markup Language (HTML), which is a language that tells web browsers how
content should be displayed. A key element of HTML is the tag, which is enclosed in angle
brackets. Many tags come in opening and closing pairs: everything between the opening tag <I>
and the closing tag </I>, for example, will be displayed in italics, while some tags stand alone,
such as the <BR> tag that indicates a line break. If one is not yet familiar with HTML, we suggest
reading through one of the many freely available online tutorials.[8]
Parsing is the process of extracting the desired pieces of information from the downloaded contents
of a website. There are two main methods of parsing a website: the direct way through regular
expressions, or alternatively through higher-level parsing frameworks such as BeautifulSoup for
Python. Regular expressions are available in most programming languages and in many text
editors and are a powerful tool for defining search patterns.[9] Let us assume that we want to extract
the revenue information from a website where the revenue is embedded as follows: '<font
size="4">Revenue: <b>$21,587,519</b></font>'. The regular expression '<font
size="4">Revenue: <b>(.*?)</b></font>' would then return the string
'$21,587,519'. This search pattern tells the regular expression algorithm to search for all
occurrences starting with '<font size="4">Revenue: <b>' and ending with
'</b></font>'. The expression '(.*?)' consists of the parentheses '()' that specify where
the searched item is located, the dot '.', specifying to search for arbitrary characters, the asterisk
'*', specifying that the preceding operator can be repeated zero or more times, and the question mark
'?', specifying to search in a non-greedy way, i.e. to stop searching at the first and not at the last
occurrence of '</b></font>' within the HTML code. When specifying the regular expression,
it is important to make sure that it uniquely identifies the search term. Had we used the regular
expression '<b>(.*?)</b>', we would for example have obtained all occurrences of bold text
on the page. Sometimes it is also necessary to nest multiple calls of regular expressions: for
example, in a first step one might extract all rows from a table with '<TR>(.*?)</TR>' and
then iterate over all columns of the table with '<TD>(.*?)</TD>'. Alternatively, one can also
parse websites with higher-level frameworks such as BeautifulSoup for Python or rvest for R. In
these frameworks, the hierarchy created by the opening and closing tags throughout each website
is used to create a tree-like structure. One can then identify each element within a website by
specifying its location within the tree.

[8] One easily accessible HTML tutorial is available at https://www.w3schools.com/html, accessed December 12, 2018.
[9] A good introduction to regular expressions is available at https://www.regular-expressions.info/quickstart, accessed December 12, 2018.
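The revenue example and the nested table extraction can be reproduced with Python's built-in re module. The revenue snippet is the one from the text; the table snippet is a hypothetical example in the same style:

```python
import re

html = '<font size="4">Revenue: <b>$21,587,519</b></font>'

# The non-greedy group (.*?) captures only the revenue figure.
revenue = re.search(r'<font size="4">Revenue: <b>(.*?)</b></font>', html).group(1)

# Nested calls: first split the table into rows, then each row into cells.
table = ('<TR><TD>2017</TD><TD>$19,500,000</TD></TR>'
         '<TR><TD>2018</TD><TD>$21,587,519</TD></TR>')
rows = re.findall(r'<TR>(.*?)</TR>', table)
cells = [re.findall(r'<TD>(.*?)</TD>', row) for row in rows]
```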
While we have so far assumed that the contents downloaded from a URL are HTML code, we
sometimes also obtain other types of information. If we use an API to access the information from a
website, the API will usually return the data directly as JSON or XML. These are machine-readable
data formats that can easily be imported with most modern programming languages.
Many websites even offer packages for different programming languages that take care of these
steps. The techniques we discuss are in principle able to systematically access and store
information that is embedded in any type of content. Packages for Python and R make it easy to
access and process text in PDF documents, and also enable access to machine learning frameworks
such as Google's TensorFlow and Facebook's Torch, which can be used to translate text or
recognize objects in images and videos.
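For illustration, here is a hypothetical JSON response as an API might return it, parsed with Python's built-in json module. The field names are made up; real APIs document their own response formats.

```python
import json

# Hypothetical API response text (field names are assumptions).
api_response = ('{"product": "Camera X", "price": 299.99,'
                ' "reviews": [{"stars": 5}, {"stars": 4}]}')

record = json.loads(api_response)  # JSON text becomes nested dicts and lists
average_rating = sum(r["stars"] for r in record["reviews"]) / len(record["reviews"])
```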
For many research questions, it is sufficient to crawl a website once and use the time structure
embedded in the website to construct a panel. One could, for example, use the time stamps of
product reviews as proxies for sales. In other cases, the website might not store all historical
changes that are important for the desired analysis, which then requires repeated crawls of
the website. If, for example, the prices of products in an online shop were important and only
current prices are shown, one would have to crawl the website regularly. To re-run a
crawler at specified time intervals without manual restarts of the crawling process, one can
use the scheduling functionality built into the operating system (Task Scheduler on
Windows, cron on macOS and Linux). In the case of repeated crawls, it is also useful to
separate downloading the HTML code from parsing it, since changes in the structure of the HTML
might result in parse errors that can render the whole crawling effort worthless if
discovered too late.
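Separating downloading from parsing can be as simple as writing each raw page to disk under a stable file name derived from its URL, so that parsing can be re-run later without re-downloading. A sketch (the folder name is an arbitrary choice):

```python
import hashlib
import os

def store_raw(html, url, folder="raw_pages"):
    """Save the raw HTML of one subpage under a file name derived
    from a hash of its URL, and return the file path."""
    os.makedirs(folder, exist_ok=True)
    name = hashlib.sha1(url.encode("utf-8")).hexdigest() + ".html"
    path = os.path.join(folder, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return path
```

Hashing the URL avoids file-system issues with characters like "/" or "?" in URLs while keeping the mapping from URL to file deterministic.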
Once the desired data is extracted from the downloaded content, the last step is saving it so that it
can be processed further. Even though many programming languages now offer rich
opportunities for statistical analysis, such as the pandas library for Python, most management
scholars will prefer using statistical software such as Stata or SPSS to continue working with
the obtained data. The default way to transfer data between languages is to save the data in text
format using packages that are capable of writing CSV files and then to import these text files into
the statistical software. It is of course also possible to save the obtained data to a database such as
MySQL, which might be especially useful in cases where crawling has to be conducted repeatedly.
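Writing the parsed records to such a CSV file takes only a few lines with Python's built-in csv module. The records and the output file name are made-up examples:

```python
import csv

# Hypothetical parsed records; real crawls produce these from the parsing step.
records = [
    {"product": "Camera X", "price": "299.99", "rating": "4.5"},
    {"product": "Camera Y", "price": "199.99", "rating": "4.1"},
]

# Write a header row plus one row per record, importable by Stata, SPSS, or R.
with open("crawl_output.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["product", "price", "rating"])
    writer.writeheader()
    writer.writerows(records)
```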
4. SOLUTIONS TO CHALLENGES OF CRAWLING PROJECTS
4.1. Character encoding
The internet allows researchers to access international data, which sometimes means that
information is only available in local languages. This means that text data can include a huge
variety of non-ASCII characters, such as language-specific alphabets, currency symbols, or
emoticons. The Unicode standard ensures correct representation and handling of these characters
across devices and software systems. Characters are encoded according to standardized rules,
much like in pre-computer systems such as Morse code. For example, the Unicode code point of
the ampersand “&” is “U+0026”. Unicode defines more than a million code points, but a variety
of encoding scheme implementations exist that trade off storage needs against the number of
characters they cover. Older schemes such as Latin-1 (ISO-8859-1) only cover the first 256 code
points of Unicode, while UTF-8 can encode every Unicode code point. How does this relate to
data crawling? When reading text from a web source or local file, it is important to know in
which scheme it is encoded to avoid misrepresentations when processing the information. Here it
helps of course that more than 90% of the world wide web is encoded in UTF-8.[10] When
saving the information to a local file or database system, it is important to select the appropriate
encoding scheme, or to convert between schemes where needed. Sometimes it will not matter to
the researcher whether some special characters are adequately represented because the underlying
information can still be sufficiently disambiguated. Often, however, encoding matters. First, it is
crucial to have a common encoding scheme when two datasets are matched on string variables.
Failing to have a “common denominator” can cause otherwise identical strings to look different
[10] “Usage of character encodings broken down by ranking” – W3Techs.
https://w3techs.com/technologies/cross/character_encoding/ranking, accessed December 12, 2018.
and 1:1 matching to fail. Second, it is important to recognize that encoding issues can occur at
various steps of the data collection, preparation, and analysis workflow because not all
programming languages and software packages automatically support Unicode (i.e. allow users
to specify the encoding scheme for input and output files). STATA introduced Unicode
capabilities in version 14 in 2015; users of R and Python need to import specific packages.
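To illustrate the point, the following Python sketch (the file name and example string are hypothetical) writes a string containing non-ASCII characters in UTF-8 and then shows how decoding the same bytes with the wrong scheme garbles the text:

```python
import os
import tempfile

def save_text(path, text, encoding="utf-8"):
    # Always state the encoding explicitly when writing crawled text.
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(text)

def load_text(path, encoding="utf-8"):
    # The encoding must match the one used when the file was written.
    with open(path, "r", encoding=encoding) as f:
        return f.read()

scraped = "Müller & Söhne – 12€"  # hypothetical scraped text with non-ASCII characters
path = os.path.join(tempfile.gettempdir(), "demo_utf8.txt")

save_text(path, scraped)                         # stored as UTF-8 bytes
roundtrip = load_text(path)                      # decoded with the same scheme: intact
mojibake = load_text(path, encoding="latin-1")   # wrong scheme: multi-byte characters break apart
```

Decoding with Latin-1 never raises an error, because every byte maps to some character; the damage only shows up later as garbled strings, which is why it is worth checking encodings early.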
4.2. A machine learning approach to joining data from multiple sources
Researchers in management are very often confronted with the problem of matching datasets,
perhaps complementing proprietary data with data crawled from the web. For a project on
competition in online retail, for example, we would need to combine price data from various
outlets. In many cases, these datasets would come without common identifiers. For example, the
same product may appear as “iRobot Roomba 675 Robot Vacuum with Wi-Fi Connectivity, Works
with Alexa, Good for Pet Hair, Carpets, Hard Floors” on the website of retailer A and as “iRobot
- Roomba 675 App-Controlled Self-Charging Robot Vacuum - Black” on the website of retailer
B. In some cases, careful pre-processing and data cleaning, involving string manipulation that
removes non-discriminatory words (e.g. regular expressions that replace commonly used words)
and disambiguation algorithms (e.g. phonetic encoding) can help to define a common identifier.
In many practical cases, abbreviations or typos (especially when the underlying database is user-
generated) complicate the process further, as they are much harder to spot and correct
consistently, especially in large datasets. Most limiting, however, is that matching in most
database systems and statistical packages is binary and deterministic, i.e. data points either match
or they do not.
An approach to solve these issues is an easy-to-implement machine learning application, often
referred to as fuzzy matching. We want to train a model to predict the continuous likelihood that
two data points match, and then define a threshold value of the estimated likelihood at which we
are willing to follow the model’s recommendation and accept a pair of observations as a match.
To implement this, we would first draw random subsamples of size N_A and N_B from dataset A
and dataset B, and invest manual effort to compare each observation in A to each observation in
B and identify matching observations. Hold on before you hire an army of RAs though.[11]
Because most observations will have a very low likelihood of being a match ex ante, there is
often no need to go through all N_A × N_B comparisons. We can use indexing to reduce the
complexity by defining consideration sets, i.e. not comparing observations that have a very low
likelihood of overlap ex ante. For example, defining a price range of ±500% can limit the number
of products to be compared to Ñ_A × Ñ_B if we are willing to assume that the price of the same
product will not differ by more than that range.
by more than that range. We then generate features that reflect the similarity of text identifiers in
both datasets, such as the number of common characters, phonetic similarity, and perhaps the
overlap in other dimensions of the data (in our product example these may be color, dimensions,
price range, etc.). Now we split the dataset in two parts: a training dataset and a test dataset. With
these features, we can estimate the parameters of a logistic regression that explains matching
observations in the training data.[12] When testing the model’s predictions on a dataset with known
matches, we can optimize the model and decide on a threshold value that minimizes the occurrence
of Type I and Type II errors. Finally, we can use the trained model to predict matching
probabilities for all (optionally indexed) observations in the full sample, and define a combined
dataset based on the optimal likelihood threshold value we have discovered.
[11] Sometimes it is more efficient to use microwork platforms like Amazon MTurk or Upwork.
[12] We choose a logistic regression model because it can be easily implemented in software packages like
STATA or SPSS, with which some readers may be most familiar. Of course, any other more or less
sophisticated supervised machine learning method, e.g. Naive Bayes, decision trees, random forests, or
support vector machines, can be used here. These methods are readily available as packages and
straightforward to implement in software like R and Python.
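A minimal sketch of the scoring step may help fix ideas. For brevity it replaces the trained logistic regression with a hand-weighted score over two illustrative features (string similarity and word overlap, computed with Python's standard library); the product titles are the ones from our example, while the weights and the stop-word list are made up for illustration:

```python
import difflib
import re

STOPWORDS = {"with", "for", "and", "the"}  # hypothetical list of non-discriminatory words

def normalize(name):
    # Lowercase, strip punctuation, drop uninformative words
    words = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(w for w in words if w not in STOPWORDS)

def features(a, b):
    # Two similarity features between two product titles
    sim = difflib.SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    shared = len(set(normalize(a).split()) & set(normalize(b).split()))
    return sim, shared

def match_score(a, b):
    # Stand-in for a trained logistic regression: a weighted sum of features in [0, 1]
    sim, shared = features(a, b)
    return 0.7 * sim + 0.3 * min(shared, 5) / 5

a = "iRobot Roomba 675 Robot Vacuum with Wi-Fi Connectivity"
b = "iRobot - Roomba 675 App-Controlled Self-Charging Robot Vacuum - Black"
c = "Dyson V8 Cordless Stick Vacuum"

# The two Roomba listings score higher than the unrelated product,
# so a well-chosen threshold would accept (a, b) and reject (a, c).
```

In a real application, the weights would come from the estimated regression coefficients and the threshold from the Type I/Type II trade-off on the test data.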
4.3. Dynamic websites
So far, we have only considered the case of static websites, i.e. where the entire content is loaded
immediately, or contents do not vary systematically for different users. Most modern websites,
however, are dynamic. That is, individual pages are built on demand, often using a combination
of server-side (e.g. retrieving content from a database) and client-side technologies. For the latter,
the local browser runs a script, e.g. JavaScript or DHTML, that ultimately assembles a
personalized version of the webpage. Some content will then be dynamically loaded when the user
takes action, e.g. scrolling down in the Facebook feed. The web crawler that we introduced in
section 3 is essentially a very simple web browser that can only process text and has no
capabilities to execute client-side scripts. As a result, this tool is often not powerful
enough to extract all desired data from a dynamic website.
However, before giving up too quickly, there are two practical workarounds that one can always
try. First, we can study the raw input that is fed into the crawler. In some cases, the content to be
dynamically displayed is already part of the HTML source, i.e. the input that even our simplistic crawler
can process. Although this content is not directly visible when navigating to the website in a
standard web browser, it is still there and can be captured by our simple crawler. Second, we can
use tools like the network traffic analysis in Chrome’s developer tools (similar extensions are
available for other browsers), to study the external content that is dynamically loaded. This will
sometimes lead us directly to the source, i.e. a URL that returns XML or JSON formatted data.
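When network inspection does reveal such a URL, its response is typically straightforward to process. The following sketch parses a hypothetical JSON payload of the kind such an endpoint might return, using Python's standard library:

```python
import json

# Hypothetical JSON payload as returned by a dynamically loaded endpoint,
# e.g. a URL discovered via the browser's network traffic inspector
payload = """
{"products": [
  {"id": 675, "name": "Roomba 675", "price": 199.99},
  {"id": 960, "name": "Roomba 960", "price": 399.99}
]}
"""

records = json.loads(payload)["products"]
# Flatten into rows ready to be written to CSV or a database
rows = [(p["id"], p["name"], p["price"]) for p in records]
```

Because such endpoints return structured data directly, no HTML parsing is needed at all, which makes this route both faster and more robust than scraping the rendered page.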
In many other cases, we will need to upgrade our crawling technology and add scripting
capabilities to our web crawler. An easy method to do so is to “remote control” a fully-featured
web browser like Chrome or Firefox using the Selenium package in Python or R. This package
allows us to move the heavy lifting to the browser, which runs in the background, so that our crawler
can access the resulting content. While this obviously opens up a large number of possibilities to
access data that would otherwise be very difficult to obtain in a structured way, it comes at the cost
of dramatically increased memory usage and run times. We will introduce some ideas to reduce
run time in the next section.
4.4. Speeding up the crawling process
Crawling one page at a time can be too time-consuming, first and foremost when the number of
pages/observations is too large to capture all desired data in acceptable time, or before the
information is changed, updated, or disappears. Given enough hardware resources, the
solution is to run multiple instances of the crawler in parallel.
The simplest method is to manually distribute jobs across multiple instances of the web crawler.
For example, instead of one instance handling 100 URLs sequentially, two instances would handle
batches of 50 each. With simple tests on smaller subsets of inputs, we can make an informed
decision about a reasonable number of instances and the size of individual batches. This do-it-
yourself version of parallel computing, however, can only lead to significant performance
increases if the memory and CPU usage of each instance is small enough. For larger projects, it
makes much more sense to use more sophisticated solutions to run multiple instances of the crawler
in parallel. High-level packages in Python and R make it easy to optimize parallelization by
utilizing multiple CPU cores at the same time. In practice this means that for many applications,
the researcher does not need fancy dedicated server hardware to substantially improve run time:
most consumer-level CPUs that were manufactured in the last decade have multiple CPU cores.
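The batching logic described above can be sketched with Python's standard library. Here a placeholder fetch function sleeps briefly instead of issuing a real request, and a thread pool works through the (hypothetical) URLs in parallel:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def fetch(url):
    # Placeholder for a real page request; the sleep mimics network latency
    time.sleep(0.05)
    return f"<html>content of {url}</html>"

urls = [f"https://example.com/page/{i}" for i in range(8)]

start = time.time()
# Four workers process the eight URLs concurrently instead of one after the other
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, urls))
elapsed = time.time() - start
```

Because crawling is mostly I/O-bound (waiting for servers), a thread pool already yields large speedups; for CPU-heavy parsing, a process pool utilizes multiple cores.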
Network speed, network congestion, and server-side limitations are often the bottleneck. Many
websites block access for an IP address if we send too many page requests too frequently. While
it is certainly possible to periodically switch IP addresses by “hiding” behind a proxy or VPN
server, it is good practice to limit the number of requests (only crawl what is really necessary for
our research project) or pause for a certain amount of time between requests.
4.5. Crawling panel data
In many research applications, it is interesting to study how variables or cross-sections evolve over
time. In section 3.3, we have already introduced tools that allow systematic repetition of the
crawling process, i.e. frequent crawls of a specific URL that allow us to create panel datasets over
time. However, this of course only works for forward-looking data. Often we also need
historical data, i.e. data from points in time before we started our own collection. Services like
the Internet Archive’s Wayback Machine and CommonCrawl provide archived copies of websites,
sometimes going back multiple years with a relatively high frequency of snapshots. Here, we can
either scrape individual URLs directly from those services or download monthly snapshots of the
data in bulk.
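As an illustration, Wayback Machine snapshot URLs follow a predictable pattern, so individual historical pages can be addressed directly; the target URL and timestamp below are hypothetical:

```python
def wayback_url(target, timestamp):
    # Wayback Machine snapshots are addressed as
    # https://web.archive.org/web/<YYYYMMDDhhmmss>/<original URL>
    return f"https://web.archive.org/web/{timestamp}/{target}"

# Hypothetical example: the pricing page as archived on January 1, 2018 at noon
url = wayback_url("https://example.com/pricing", "20180101120000")
```

A URL constructed this way can then be fed to the same crawler used for live pages; the archive redirects to the snapshot closest in time to the requested timestamp.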
In addition to the content itself, metadata, e.g. on the frequency of content updates, can be
interesting as well. A specific example could be a study of the frequency of price changes on an
eCommerce website. Further, metadata also allows us to observe a website’s relationship with
third parties. When scraping a large number of websites, such metadata can be helpful to study market
shares of tracking and advertising services, such as Google Analytics or Facebook’s Like- and
Share-buttons. Also for this kind of data, there exist databases that contain historical information
on third-party requests, e.g. the HTTPArchive or the Princeton Web Census.
Finally, some research questions might be related to understanding personalized firm behavior,
e.g. targeted advertising, first degree price discrimination, or personalized recommendation. So
far, the crawler we introduced in section 3 was stateless, i.e. we did not transfer any specific data to
the website. In a stateful crawling approach, we can for example accept cookies that allow the
website to identify us in subsequent scrapes and match information that it potentially obtained
from third-party sources. For example, mimicking the browsing behavior of a human user, we can
leave traces with tracking and advertising services and therefore study, almost like in a lab setting,
how user-specific information “travels” across the web and which type of information websites
and third-party services share. Once we mimic different types of robot-users, it becomes feasible
to understand the contingencies, i.e. geographical origin, types of visited websites, etc., of the way
personal data is tracked and shared (REF to CS Lit). This allows a very different approach to
management and strategy research. Instead of using natural experiments to study consumer and
firm reactions to changes in the external environment, we can run controlled field experiments
where we randomly change the type of customer the firm is serving and therefore understand the
contingencies of firm behavior and strategy.
5. CONCLUSION
We have demonstrated that the field of management research is still lagging behind other fields
when it comes to data crawling, but believe that crawled data has a lot of untapped potential to
inform management research. We therefore developed clear guidelines that management scholars
can consider if they want to use crawled data for their own projects, and that also make it easier
to judge to what extent data crawling has been used appropriately in other research projects.
Whether data crawling is useful depends on weighing several advantages and disadvantages. On
the plus side, crawled data can give researchers more exclusivity than databases used by many
other researchers, although this advantage can be limited if a website experiences increasing
interest from other researchers. Furthermore, crawling can be much faster than manual data
collection once the necessary skills are obtained.
Another key advantage of using data crawling is that it can make researchers less dependent on
publication biases that are difficult to avoid when obtaining data through contractual arrangements
with firms. If it is instead possible to base a research project on crawls of publicly available data,
researchers do not have to implicitly or explicitly consider third-party interests before publishing
their results. Finally, many crawled databases have an impressively high number of observations,
which makes it easier to cleanly identify effects that would be estimated much less precisely in
smaller datasets. The challenge of working with these large datasets is then of course not to be
satisfied with statistical significance alone, but to focus much more on economic effect sizes.
When it comes to the disadvantages of data crawling, a first point is that successful data crawling
comes with certain learning costs. However, as we have shown in sections 3 and 4, data crawling
does not necessarily require advanced programming skills, and we hope that our guidelines help
prevent many avoidable problems, thereby making data crawling more attractive for a wider set
of management scholars. Second, data crawling can be prone to errors. Errors can emerge both
from low data quality on the crawled website and from incorrect programming of the crawler.
The problem of low underlying data quality must always be considered, and it is helpful to
regularly conduct sanity checks of the underlying data. This can for example be done by
triangulating the data with other data sources. Data problems stemming from the crawling
process itself can best be avoided by testing how multiple examples from the crawled website
are affected by each data transformation step. These tests should be conducted side-by-side with
coding the data crawler, as each step can then be immediately checked for errors, making it
much easier to identify problems compared to debugging multiple transformation steps at once.
Finally, data that is made freely available on the Internet is often not very well documented, and
we might therefore struggle to obtain proper knowledge of the data. It is therefore important to not
immediately jump to crawling the data, but instead to first develop a good understanding of the
data that is made available on the target website, preferably both by interacting with the website
and by interacting with the website’s community.
Like any research method, data crawling should also be evaluated from an ethical and legal
perspective. While website operators make their data freely available on the Internet, many
operators still try to ban third parties from automatically downloading their contents. With ever-
decreasing costs of Internet traffic, it is becoming more and more unlikely that the desire to
prevent data crawling is driven by the consideration that data crawlers create costs but do not
contribute to the monetization of the website. Instead, the motive is likely that the website
operator is fine with individual humans accessing small portions of the overall data, but does not
want third parties to get permanent access to all of it. One standardized way for website
operators to express their preferences with regard to crawling are robots.txt files, which are
located on most websites. Furthermore, the terms and conditions of a website often also mention
the operator’s preferences with regard to data crawling. It is however unclear to what extent
these statements have legal implications, and given the complexity of this subject matter, a full
treatment goes clearly beyond the scope of this paper. For researchers, it would be beneficial to
have a clearer understanding of when data crawling is permissible. While it is not clear whether
initiatives like the European Union’s General Data Protection Regulation create more clarity for
researchers, we expect that journals and professional organizations will give more guidance
when it comes to specifying the rules for using crawled data. The journal Management Science,
for example, imposed a rule that website operators can demand the retraction of published
papers if the data crawling was explicitly banned and if they can demonstrate that the data
collection (not the results of the analysis) inflicted significant material harm (Simchi-Levi, 2019).
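The preferences expressed in robots.txt files can be checked programmatically before crawling. A minimal sketch with Python's standard library and a hypothetical robots.txt:

```python
from urllib import robotparser

# Hypothetical robots.txt content as it might be served at
# https://example.com/robots.txt
robots = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = robotparser.RobotFileParser()
rp.parse(robots.splitlines())

# Paths under /private/ are off-limits; everything else may be fetched,
# with a requested pause of 10 seconds between requests
allowed = rp.can_fetch("*", "https://example.com/products/1")
blocked = rp.can_fetch("*", "https://example.com/private/data")
delay = rp.crawl_delay("*")
```

In a real project, the parser would read the live file via rp.set_url(...) and rp.read(); honoring both the Disallow rules and the crawl delay is an easy way to respect the operator's stated preferences.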
References
Aaltonen A, Seiler S. 2016. Cumulative Growth in User-Generated Content Production: Evidence from Wikipedia. Management Science 62(7): 2054–2069.
Abrahams AS, Fan W, Wang GA, Zhang Z (John), Jiao J. 2015. An Integrated Text Analytic Framework for Product Defect Discovery. Production & Operations Management 24(6): 975–990.
Agarwal A, Hosanagar K, Smith MD. 2015. Do Organic Results Help or Hurt Sponsored Search Performance? Information Systems Research 26(4): 695–713.
Aguiar L, Claussen J, Peukert C. 2018. Catch Me If You Can: Effectiveness and Consequences of Online Copyright Enforcement. Information Systems Research 29(3): 656–678.
Arora S, ter Hofstede F, Mahajan V. 2017. The Implications of Offering Free Versions for the Performance of Paid Mobile Apps. Journal of Marketing 81(6): 62–78.
Augenblick N. 2016. The Sunk-Cost Fallacy in Penny Auctions. Review of Economic Studies 83(1): 58–86.
Bandyopadhyay S, Bandyopadhyay S. 2009. Estimating Time Required to Reach Bid Levels in Online Auctions. Journal of Management Information Systems 26(3): 275–301.
Bao Y, Datta A. 2014. Simultaneously Discovering and Quantifying Risk Types from Textual Risk Disclosures. Management Science 60(6): 1371–1391.
Bapna R, Ramaprasad J, Umyarov A. 2018. Monetizing Freemium Communities: Does Paying for Premium Increase Social Engagement? MIS Quarterly 42(3): 719–735.
Bapna R, Umyarov A. 2015. Do Your Online Friends Make You Pay? A Randomized Field Experiment on Peer Influence in Online Social Networks. Management Science 61(8): 1902–1920.
Bauer J, Franke N, Tuertscher P. 2016. Intellectual Property Norms in Online Communities: How User-Organized Intellectual Property Regulation Supports Innovation. Information Systems Research 27(4): 724–750.
Bellezza S, Paharia N, Keinan A. 2017. Conspicuous Consumption of Time: When Busyness and Lack of Leisure Time Become a Status Symbol. Journal of Consumer Research 44(1): 118–138.
Ben-David I, Palvia A, Spatt C. 2017. Banks’ Internal Capital Markets and Deposit Rates. Journal of Financial & Quantitative Analysis 52(5): 1797–1826.
Berger J, Milkman KL. 2012. What Makes Online Content Viral? Journal of Marketing Research (JMR) 49(2): 192–205.
Bertsimas D, Brynjolfsson E, Reichman S, Silberholz J. 2014. Tenure Analytics: Models for Predicting Research Impact: 28.
Bianchi AJ, Kang SM, Stewart D. 2012. The Organizational Selection of Status Characteristics: Status Evaluations in an Open Source Community. Organization Science 23(2): 341–354.
Bockstedt J, Druehl C, Mishra A. 2015. Problem-solving effort and success in innovation contests: The role of national wealth and national culture. Journal of Operations Management 36: 187–200.
Bodea T, Ferguson M, Garrow L. 2009. Choice-Based Revenue Management: Data from a Major Hotel Chain. Manufacturing & Service Operations Management 11(2): 356–361.
Bronnenberg BJ, Kim JB, Mela CF. 2016. Zooming In on Choice: How Do Consumers Search for Cameras Online? Marketing Science 35(5): 693–712.
Brynjolfsson E, Hu Y (Jeffrey), Rahman MS. 2009. Battle of the Retail Channels: How Product Selection and Geography Drive Cross-Channel Competition. Management Science 55(11): 1755–1765.
Bucklin RE, Sismeiro C. 2003. A Model of Web Site Browsing Behavior Estimated on Clickstream Data. Journal of Marketing Research (JMR) 40(3): 249–267.
Busse JA, Goyal A, Wahal S. 2014. Investing in a Global World. Review of Finance 18(2): 561–590.
Cachon GP, Gallino S, Olivares M. 2018. Does Adding Inventory Increase Sales? Evidence of a Scarcity Effect in U.S. Automobile Dealerships. Management Science. Available at: http://pubsonline.informs.org/doi/10.1287/mnsc.2017.3014.
Cai Y, Sevilir M. 2012. Board connections and M&A transactions. Journal of Financial Economics 103(2): 327–349.
Calic G, Mosakowski E. 2016. Kicking off social entrepreneurship: How a sustainability orientation influences crowdfunding success. Journal of Management Studies 53(5): 738–767.
Campello M, Lin C, Ma Y, Zou H. 2011. The real and financial implications of corporate hedging. The Journal of Finance 66(5): 1615–1647.
Cao Z, Hui K-L, Xu H. 2018. When Discounts Hurt Sales: The Case of Daily-Deal Markets. Information Systems Research 29(3): 567–591.
Carmi E, Oestreicher-Singer G, Stettner U, Sundararajan A. 2017. Is Oprah Contagious? The Depth of Diffusion of Demand Shocks in a Product Network. MIS Quarterly 41(1): 207–A13.
Cavallo A, Neiman B, Rigobon R. 2014. Currency Unions, Product Introductions, and the Real Exchange Rate. Quarterly Journal of Economics 129(2): 529–595.
Chau M, Xu J. 2012. Business Intelligence in Blogs: Understanding Consumer Interactions and Communities. MIS Quarterly 36(4): 1189–1216.
Chen MK. 2013. The Effect of Language on Economic Behavior: Evidence from Savings Rates, Health Behaviors, and Retirement Assets. American Economic Review 103(2): 690–731.
Chen P-Y, Hong Y, Liu Y. 2018. The Value of Multidimensional Rating Systems: Evidence from a Natural Experiment and Randomized Experiments. Management Science 64(10): 4629–4647.
Claussen J, Essling C, Kretschmer T. 2015. When less can be more – Setting technology levels in complementary goods markets. Research Policy 44(2): 328–339.
Claussen J, Kretschmer T, Mayrhofer P. 2013. The Effects of Rewarding User Engagement: The Case of Facebook Apps. Information Systems Research 24(1): 186–200.
Corredoira RA, Goldfarb BD, Shi Y. 2018. Federal funding and the rate and direction of inventive activity. Research Policy 47(9): 1777–1800.
Das SR, Chen MY. 2007. Yahoo! for Amazon: Sentiment Extraction from Small Talk on the Web. Management Science 53(9): 1375–1388.
Datta H, Knox G, Bronnenberg BJ. 2018. Changing Their Tune: How Consumers’ Adoption of Online Streaming Affects Music Consumption and Discovery. Marketing Science 37(1): 5–21.
De Bakker FGA, Hellsten I. 2013. Capturing Online Presence: Hyperlinks and Semantic Networks in Activist Group Websites on Corporate Social Responsibility. Journal of Business Ethics 118(4): 807–823.
De Langhe B, Fernbach PM, Lichtenstein DR. 2016. Navigating by the Stars: Investigating the Actual and Perceived Validity of Online User Ratings. Journal of Consumer Research 42(6): 817–833.
Dellavigna S, Hermle J. 2017. Does Conflict of Interest Lead to Biased Coverage? Evidence from Movie Reviews. Review of Economic Studies 84(4): 1510–1550.
Dhar V, Geva T, Oestreicher-Singer G, Sundararajan A. 2014. Prediction in Economic Networks. Information Systems Research 25(2): 264–284.
Dissanayake I, Zhang J, Gu B. 2015. Task Division for Team Success in Crowdsourcing Contests: Resource Allocation and Alignment Effects. Journal of Management Information Systems 32(2): 8–39.
Engelberg J, Gao P. 2011. In search of attention. The Journal of Finance 66(5): 1461–1499.
Feldman M, Lowe N. 2015. Triangulating regional economies: Realizing the promise of digital data. Research Policy 44(9): 1785–1793.
Fischer E, Reuber AR. 2014. Online entrepreneurial communication: Mitigating uncertainty and increasing differentiation via Twitter. Journal of Business Venturing 29(4): 565–583.
Fisher M, Gallino S, Li J. 2018. Competition-Based Dynamic Pricing in Online Retailing: A Methodology Validated with Field Experiments. Management Science 64(6): 2496–2514.
Galak J, Small D, Stephen AT. 2011. Microfinance Decision Making: A Field Study of Prosocial Lending. Journal of Marketing Research (JMR) 48: S130–S137.
Gao M, Huang J. 2016. Capitalizing on Capitol Hill: Informed trading by hedge fund managers. Journal of Financial Economics 121(3): 521–545.
Ge R, Feng J, Gu B, Zhang P. 2017. Predicting and Deterring Default with Social Media Information in Peer-to-Peer Lending. Journal of Management Information Systems 34(2): 401–424.
Gehman J, Grimes M. 2017. Hidden Badge of Honor: How Contextual Distinctiveness Affects Category Promotion Among Certified B Corporations. Academy of Management Journal 60(6): 2294–2320.
Geuna A et al. 2015. SiSOB data extraction and codification: A tool to analyze scientific careers. Research Policy, The New Data Frontier 44(9): 1645–1658.
Giglio S, Maggiori M, Stroebel J. 2016. No-bubble condition: Model-free tests in housing markets. Econometrica 84(3): 1047–1091.
Giglio S, Shue K. 2014. No News Is News: Do Markets Underreact to Nothing? Review of Financial Studies 27(12): 3389–3440.
Goldenberg J, Han S, Lehmann DR, Hong JW. 2009. The Role of Hubs in the Adoption Process. Journal of Marketing 73(2): 1–13.
Goldenberg J, Oestreicher-Singer G, Reichman S. 2012. The Quest for Content: How User-Generated Links Can Facilitate Online Exploration. Journal of Marketing Research (JMR) 49(4): 452–468.
Green TC, Jame R. 2013. Company name fluency, investor recognition, and firm value. Journal of Financial Economics 109(3): 813–834.
Greenwood BN, Gopal A. 2015. Research Note – Tigerblood: Newspapers, Blogs, and the Founding of Information Technology Firms. Information Systems Research 26(4): 812–828.
Greenwood BN, Gopal A. 2017. Ending the Mending Wall: Herding, Media Coverage, and Colocation in IT Entrepreneurship. MIS Quarterly 41(3): 989–A14.
Gu B, Ye Q. 2014. First Step in Social Media: Measuring the Influence of Online Management Responses on Customer Satisfaction. Production & Operations Management 23(4): 570–582.
Guo H, Cheng HK, Kelley K. 2016. Impact of Network Structure on Malware Propagation: A Growth Curve Perspective. Journal of Management Information Systems 33(1): 296–325.
Guo S, Guo X, Fang Y, Vogel D. 2017. How Doctors Gain Social and Economic Returns in Online Health-Care Communities: A Professional Capital Perspective. Journal of Management Information Systems 34(2): 487–519.
Haans RF. 2019. What’s the value of being different when everyone is? The effects of distinctiveness on performance in homogeneous versus heterogeneous categories. Strategic Management Journal 40(1): 3–27.
Hahn J, Moon JY, Zhang C. 2008. Emergence of New Project Teams from Open Source Software Developer Networks: Impact of Prior Collaboration Ties. Information Systems Research 19(3): 369–391.
Hanley KW, Hoberg G. 2010. The Information Content of IPO Prospectuses. Review of Financial Studies 23(7): 2821–2864.
Hanley KW, Hoberg G. 2012. Litigation risk, strategic disclosure and the underpricing of initial public offerings. Journal of Financial Economics 103(2): 235–254.
He L. 2016. Service Region Design for Urban Electric Vehicle Sharing Systems: 49.
Heim GR, Field JM. 2007. Process drivers of e-service quality: Analysis of data from an online rating site. Journal of Operations Management 25(5): 962–984.
Heimbach I, Hinz O. 2018. The Impact of Sharing Mechanism Design on Content Sharing in Online Social Networks. Information Systems Research 29(3): 592–611.
Heimeriks G, Van den Besselaar P, Frenken K. 2008. Digital disciplinary differences: An analysis of computer-mediated science and ‘Mode 2’ knowledge production. Research Policy 37(9): 1602–1615.
Helveston JP, Wang Y, Karplus VJ, Fuchs ER. 2019. Institutional complementarities: The origins of experimentation in China’s plug-in electric vehicle industry. Research Policy 48(1): 206–222.
Henkel AP, Boegershausen J, Hoegg J, Aquino K, Lemmink J. 2018. Discounting humanity: When consumers are price conscious, employees appear less human. Journal of Consumer Psychology 28(2): 272–292.
Herzenstein M, Sonenshein S, Dholakia UM. 2011. Tell Me a Good Story and I May Lend You Money: The Role of Narratives in Peer-to-Peer Lending Decisions. Journal of Marketing Research (JMR) 48: S138–S149.
Hinz O, Skiera B, Barrot C, Becker JU. 2011. Seeding Strategies for Viral Marketing: An Empirical Comparison. Journal of Marketing 75(6): 55–71.
Hoberg G, Maksimovic V. 2015. Redefining Financial Constraints: A Text-Based Analysis. Review of Financial Studies 28(5): 1312–1352.
Hoberg G, Phillips G. 2010. Product market synergies and competition in mergers and acquisitions: A text-based analysis. The Review of Financial Studies 23(10): 3773–3811.
Hoberg G, Phillips G. 2016. Text-Based Network Industries and Endogenous Product Differentiation. Journal of Political Economy 124(5): 1423–1465.
Hoberg G, Phillips G. 2018. Conglomerate Industry Choice and Product Language. Management Science 64(8): 3735–3755.
Hong Y, Hu Y, Burtch G. 2018. Embeddedness, Prosociality, and Social Influence: Evidence from Online Crowdfunding. MIS Quarterly 42(4): 1211–1224.
Huang J. 2018. The customer knows best: The investment value of consumer opinions. Journal of Financial Economics 128(1): 164–182.
Huang J, Boh WF, Goh KH. 2017. A Temporal Study of the Effects of Online Opinions:
Information Sources Matter. Journal of Management Information Systems 34(4): 1169
1202.
Huang L, Tan C-H, Ke W, Wei K-K. 2013. Comprehension and Assessment of Product
Reviews: A Review-Product Congruity Proposition. Journal of Management Information
Systems 30(3): 311343.
Islam M, Miller J, Park HD. 2017. But what will it cost me? How do private costs of
participation affect open source software projects? Research Policy 46(6): 10621070.
Jain N, Girotra K, Netessine S. 2014. Managing Global Sourcing: Inventory Performance.
Management Science 60(5): 12021222.
Jancsary D, Meyer RE, Höllerer MA, Barberio V. 2017. Toward a Structural Model of
Organizational-Level Institutional Pluralism and Logic Interconnectedness. Organization
Science 28(6): 11501167.
Jegadeesh N, Wu D. 2013. Word power: A new approach for content analysis. Journal of
Financial Economics 110(3): 712729.
Jerath K, Ma L, Park Y-H, Srinivasan K. 2011. A ‘Position Paradox’ in Sponsored Search
Auctions. Marketing Science 30(4): 612627.
Jiahui Mo, Sarkar S, Menon S. 2018. Know When to Run: Recommendations in Crowdsourcing
Contests. MIS Quarterly 42(3): 919944.
Ketcham JD, Lucarelli C, Miravete EJ, Roebuck MC. 2012. Sinking, Swimming, or Learning to
Swim in Medicare Part D. American Economic Review 102(6): 26392673.
Khansa L, Ma X, Liginlal D, Kim SS. 2015. Understanding Members’ Active Participation in
Online Question-and-Answer Communities: A Theory and Empirical Analysis. Journal
of Management Information Systems 32(2): 162203.
Kim K, Gopal A, Hoberg G. 2016. Does Product Market Competition Drive CVC Investment? Evidence from the U.S. IT Industry. Information Systems Research 27(2): 259–281.
Kim N, Lee H, Kim W, Lee H, Suh JH. 2015. Dynamic patterns of industry convergence: Evidence from a large amount of unstructured data. Research Policy 44(9): 1734–1748.
Kornish LJ, Ulrich KT. 2014. The Importance of the Raw Idea in Innovation: Testing the Sow’s Ear Hypothesis. Journal of Marketing Research (JMR) 51(1): 14–26.
Kozlenkova IV, Palmatier RW, Fang E (Er), Xiao B, Huang M. 2017. Online Relationship Formation. Journal of Marketing 81(3): 21–40.
Kuhn P, Shen K. 2013. Gender Discrimination in Job Ads: Evidence from China. Quarterly Journal of Economics 128(1): 287–336.
Kupor D, Tormala Z. 2018. When Moderation Fosters Persuasion: The Persuasive Power of Deviatory Reviews. Journal of Consumer Research 45(3): 490–510.
Lash MT, Zhao K. 2016. Early Predictions of Movie Success: The Who, What, and When of Profitability. Journal of Management Information Systems 33(3): 874–903.
Lau RYK, Liao SSY, Wong KF, Chiu DKW. 2012. Web 2.0 Environmental Scanning and Adaptive Decision Support for Business Mergers and Acquisitions. MIS Quarterly 36(4): 1239-A6.
Lee D, Hosanagar K, Nair HS. 2018. Advertising Content and Consumer Engagement on Social Media: Evidence from Facebook. Management Science 64(11): 5105–5131.
Lee K, Lee B, Oh W. 2015a. Thumbs Up, Sales Up? The Contingent Effect of Facebook Likes on Sales Performance in Social Commerce. Journal of Management Information Systems 32(4): 109–143.
Lee K, Oh W-Y, Kim N. 2013. Social Media for Socially Responsible Firms: Analysis of Fortune 500’s Twitter Profiles and their CSR/CSIR Ratings. Journal of Business Ethics 118(4): 791–806.
Lee S-H, Mun HJ, Park KM. 2015b. When is dependence on other organizations burdensome? The effect of asymmetric dependence on internet firm failure. Strategic Management Journal 36(13): 2058–2074.
Li C, Luo X, Zhang C, Wang X. 2017. Sunny, Rainy, and Cloudy with a Chance of Mobile Promotion Effectiveness. Marketing Science 36(5): 762–779.
Li J, Granados N, Netessine S. 2014. Are Consumers Strategic? Structural Estimation from the Air-Travel Industry. Management Science 60(9): 2114–2137.
Li W, Chen H, Nunamaker JF. 2016. Identifying and Profiling Key Sellers in Cyber Carding Community: AZSecure Text Mining System. Journal of Management Information Systems 33(4): 1059–1086.
Liang H, Marquis C, Renneboog L, Sun SL. 2018. Future-Time Framing: The Effect of Language on Corporate Future Orientation. Organization Science. Available at: http://pubsonline.informs.org/doi/10.1287/orsc.2018.1217.
Liu TX, Yang J, Adamic LA, Chen Y. 2014. Crowdsourcing with All-Pay Auctions: A Field Experiment on Taskcn. Management Science 60(8): 2020–2037.
Lobschat L, Osinga EC, Reinartz WJ. 2017. What Happens Online Stays Online? Segment-Specific Online and Offline Effects of Banner Advertisements. Journal of Marketing Research (JMR) 54(6): 901–913.
Lu Y, Jerath K, Singh PV. 2013. The Emergence of Opinion Leaders in a Networked Online Community: A Dyadic Model with Time Dynamics and a Heuristic for Fast Estimation. Management Science 59(8): 1783–1799.
Luo X, Zhang J. 2013. How Do Consumer Buzz and Traffic in Social Media Marketing Predict the Value of the Firm? Journal of Management Information Systems 30(2): 213–238.
Luo X, Zhang J, Duan W. 2013. Social Media and Firm Equity Value. Information Systems Research 24(1): 146–163.
Lynn Wu. 2013. Social Network Effects on Productivity and Job Security: Evidence from the Adoption of a Social Networking Tool. Information Systems Research 24(1): 30–51.
Mai F, Shan Z, Bai Q, Wang X (Shane), Chiang RHL. 2018. How Does Social Media Impact Bitcoin Value? A Test of the Silent Majority Hypothesis. Journal of Management Information Systems 35(1): 19–52.
Malesky E, Taussig M. 2017. The Danger of Not Listening to Firms: Government Responsiveness and the Goal of Regulatory Compliance. Academy of Management Journal 60(5): 1741–1770.
Marino A, Aversa P, Mesquita L, Anand J. 2015. Driving performance via exploration in changing environments: Evidence from formula one racing. Organization Science 26(4): 1079–1100.
Marquis C, Bird Y. 2018. The Paradox of Responsive Authoritarianism: How Civic Activism Spurs Environmental Penalties in China. Organization Science 29(5): 948–968.
Martin KD, Borah A, Palmatier RW. 2017. Data Privacy: Effects on Customer and Firm Performance. Journal of Marketing 81(1): 36–58.
Mas A. 2017. Does Transparency Lead to Pay Compression? Journal of Political Economy 125(5): 1683–1721.
Massimino B, Gray JV, Boyer KK. 2017. The Effects of Agglomeration and National Property Rights on Digital Confidentiality Performance. Production & Operations Management 26(1): 162–179.
Mayzlin D, Dover Y, Chevalier J. 2014. Promotional reviews: An empirical investigation of online review manipulation. American Economic Review 104(8): 2421–2455.
McDevitt RC. 2014. ‘A’ Business by Any Other Name: Firm Name Choice as a Signal of Firm Quality. Journal of Political Economy 122(4): 909–944.
Mogilner C, Aaker J, Kamvar SD. 2012. How Happiness Affects Choice. Journal of Consumer Research 39(2): 429–443.
Moon JY, Sproull LS. 2008. The Role of Feedback in Managing the Internet-Based Volunteer Work Force. Information Systems Research 19(4): 494–515.
Munzert S, Rubba C, Meißner P, Nyhuis D. 2014. Automated data collection with R: A practical guide to web scraping and text mining. John Wiley & Sons.
Nelson AJ. 2009. Measuring knowledge spillovers: What patents, licenses and publications reveal about innovation diffusion. Research Policy 38(6): 994–1005.
Nishida M, Remer M. 2018. The Determinants and Consequences of Search Cost Heterogeneity: Evidence from Local Gasoline Markets. Journal of Marketing Research (JMR) 55(3): 305–320.
Oertel S, Thommes K. 2018. History as a source of organizational identity creation. Organization Studies 39(12): 1709–1731.
Oestreicher-Singer G, Sundararajan A. 2012a. Recommendation Networks and the Long Tail of Electronic Commerce. MIS Quarterly 36(1): 65-A4.
Oestreicher-Singer G, Sundararajan A. 2012b. The Visible Hand? Demand Effects of Recommendation Networks in Electronic Markets. Management Science 58(11): 1963–1981.
Oestreicher-Singer G, Zalmanson L. 2013. Content or Community? A Digital Business Strategy for Content Providers in the Social Age. MIS Quarterly 37(2): 591–616.
Olivares M, Cachon GP. 2009. Competing Retailers and Inventory: An Empirical Investigation of General Motors’ Dealerships in Isolated U.S. Markets. Management Science 55(9): 1586–1604.
Paik Y, Zhu F. 2016. The Impact of Patent Wars on Firm Strategy: Evidence from the Global Smartphone Industry. Organization Science 27(6): 1397–1416.
Palmer JW. 2002. Web Site Usability, Design, and Performance Metrics. Information Systems Research 13(2): 151–167.
Pancer E, Chandler V, Poole M, Noseworthy TJ. 2018. How Readability Shapes Social Media Engagement. Journal of Consumer Psychology.
Pant G, Sheng ORL. 2015. Web Footprints of Firms: Using Online Isomorphism for Competitor Identification. Information Systems Research 26(1): 188–209.
Pant G, Srinivasan P. 2010. Predicting Web Page Status. Information Systems Research 21(2): 345–364.
Pant G, Srinivasan P. 2013. Status Locality on the Web: Implications for Building Focused Collections. Information Systems Research 24(3): 802–821.
Paravisini D, Rappoport V, Schnabl P, Wolfenzon D. 2015. Dissecting the Effect of Credit Supply on Trade: Evidence from Matched Credit-Export Data. Review of Economic Studies 82(1): 333–359.
Perren L, Sapsed J. 2013. Innovation as politics: The rise and reshaping of innovation in UK parliamentary discourse 1960–2005. Research Policy 42(10): 1815–1828.
Phan HV. 2014. Inside Debt and Mergers and Acquisitions. Journal of Financial & Quantitative Analysis 49(5/6): 1365–1401.
Quariguasi-Frota-Neto J, Bloemhof J. 2012. An Analysis of the Eco-Efficiency of Remanufactured Personal Computers and Mobile Phones. Production & Operations Management 21(1): 101–114.
Rabinovich E, Sinha R, Laseter T. 2011. Unlimited shelf space in Internet supply chains: Treasure trove or wasteland? Journal of Operations Management 29(4): 305–317.
Rauh JD. 2009. Risk Shifting versus Risk Management: Investment Policy in Corporate Pension Plans. Review of Financial Studies 22(7): 2687–2733.
Reich T, Kupor DM, Smith RK. 2018. Made by Mistake: When Mistakes Increase Product Preference. Journal of Consumer Research 44(5): 1085–1103.
Ren CR, Hu Y, Cui TH. 2019. Responses to rival exit: Product variety, market expansion, and preexisting market structure. Strategic Management Journal 40(2): 253–276.
Ren CR, Ye Hu, Hausman J, Hu Y (Jeffrey). 2011. Managing Product Variety and Collocation in a Competitive Environment: An Empirical Investigation of Consumer Electronics Retailing. Management Science 57(6): 1009–1024.
Samtani S, Chinn R, Chen H, Nunamaker JF. 2017. Exploring Emerging Hacker Assets and Key Hackers for Proactive Cyber Threat Intelligence. Journal of Management Information Systems 34(4): 1023–1053.
Sayedi A, Jerath K, Srinivasan K. 2014. Competitive Poaching in Sponsored Search Advertising and Its Strategic Impact on Traditional Advertising. Marketing Science 33(4): 586–608.
Seiler S, Yao S, Wang W. 2017. Does Online Word of Mouth Increase Demand? (And How?) Evidence from a Natural Experiment. Marketing Science 36(6): 838–861.
Setia P, Rajagopalan B, Sambamurthy V, Calantone R. 2012. How Peripheral Developers Contribute to Open-Source Software Development. Information Systems Research 23(1): 144–163.
Sha Yang, Ghose A. 2010. Analyzing the Relationship Between Organic and Sponsored Search Advertising: Positive, Negative, or Zero Interdependence? Marketing Science 29(4): 602–623.
Simchi-Levi D. 2019. From the Editor. Management Science 65(2): v–vi.
Sismeiro C, Bucklin RE. 2004. Modeling Purchase Behavior at an E-Commerce Web Site: A Task-Completion Approach. Journal of Marketing Research (JMR) 41(3): 306–323.
Sonnier GP, McAlister L, Rutz OJ. 2011. A Dynamic Model of the Effect of Online Communications on Firm Sales. Marketing Science 30(4): 702–716.
Sosa ME, Mihm J, Browning TR. 2013. Linking Cyclicality and Product Quality. Manufacturing & Service Operations Management 15(3): 473–491.
Stango V, Zinman J. 2009. What Do Consumers Really Pay on Their Checking and Credit Card Accounts? Explicit, Implicit, and Avoidable Costs. American Economic Review 99(2): 424–429.
Stango V, Zinman J. 2014. Limited and Varying Consumer Attention: Evidence from Shocks to the Salience of Bank Overdraft Fees. Review of Financial Studies 27(4): 990–1030.
Thies F, Wessel M, Benlian A. 2016. Effects of Social Interaction Dynamics on Platforms. Journal of Management Information Systems 33(3): 843–873.
Toubia O, Netzer O. 2017. Idea Generation, Creativity, and Prototypicality. Marketing Science 36(1): 1–20.
Trusov M, Ma L, Jamal Z. 2016. Crumbs of the Cookie: User Profiling in Customer-Base Analysis and Behavioral Targeting. Marketing Science 35(3): 405–426.
Tucker C, Juanjuan Zhang. 2011. How Does Popularity Information Affect Choices? A Field Experiment. Management Science 57(5): 828–842.
Van Osch W, Steinfield CW. 2018. Strategic Visibility in Enterprise Social Media: Implications for Network Formation and Boundary Spanning. Journal of Management Information Systems 35(2): 647–682.
Villarroel Ordenes F, Ludwig S, De Ruyter K, Grewal D, Wetzels M. 2017. Unveiling What Is Written in the Stars: Analyzing Explicit, Implicit, and Discourse Patterns of Sentiment in Social Media. Journal of Consumer Research 43(6): 875–894.
Wang X (Shane), Mai F, Chiang RHL. 2014. Database Submission – Market Dynamics and User-Generated Content About Tablet Computers. Marketing Science 33(3): 449–458.
Wei Dong, Shaoyi Liao, Zhongju Zhang. 2018. Leveraging Financial Social Media Data for Corporate Fraud Detection. Journal of Management Information Systems 35(2): 461–487.
Wooten JO, Ulrich KT. 2017. Idea Generation and the Role of Feedback: Evidence from Field Experiments with Innovation Tournaments. Production & Operations Management 26(1): 80–99.
Wu J, Shi M, Hu M. 2015. Threshold Effects in Online Group Buying. Management Science 61(9): 2025–2040.
Xiaohua Zeng, Liyuan Wei. 2013. Social Ties and User Content Generation: Evidence from Flickr. Information Systems Research 24(1): 71–87.
Xitong Li, Lynn Wu. 2018. Herding and Social Media Word-of-Mouth: Evidence from Groupon. MIS Quarterly 42(4): 1331–1351.
Yu S, Johnson S, Lai C, Cricelli A, Fleming L. 2017. Crowdfunding and regional entrepreneurial investment: an application of the CrowdBerkeley database. Research Policy 46(10): 1723–1737.
Zhan Shi, Huaxia Rui, Whinston AB. 2014. Content Sharing in a Social Broadcasting Environment: Evidence from Twitter. MIS Quarterly 38(1): 123-A6.
Zhang D, Zhou L, Kehoe JL, Kilic IY. 2016. What Online Reviewer Behaviors Really Matter? Effects of Verbal and Nonverbal Behaviors on Detection of Fake Online Reviews. Journal of Management Information Systems 33(2): 456–481.
Zheng Z (Eric), Pavlou PA, Gu B. 2014. Latent Growth Modeling for Information Systems: Theoretical Extensions and Practical Applications. Information Systems Research 25(3): 547–568.
Figures and Tables
Figure 1: Publication of crawling papers by year
Table 1: Number and share of published crawling papers by field

Crawling Papers    Share of Publications
10                 0.15%
14                 0.07%
13                 0.67%
2                  0.04%
20                 0.16%
5                  0.04%
37                 0.25%
75                 0.76%
7                  0.12%
Appendix
Table A.1: Published crawling papers by field and setting
Notes: Numbers in parentheses denote setting of paper: 1: Advertisement; 2: Between-Industry; 3: Crowdsourcing; 4: Crowdfunding; 5: eCommerce; 6: ICT; 7: Online Content; 8: Online Recommendation Systems; 9: Open Source; 10: Other; 11: Search Engine; 12: Social Networks; 13: User-Generated Content; 14: Within-Industry.
Economics: Cavallo et al. (2014,14), Chen (2013,10), Mas (2017,2), Dellavigna and Hermle (2017,14), Hoberg and Phillips (2016,2), Augenblick (2016,5), Paravisini et al. (2015,2), McDevitt (2014,14), Mayzlin et al. (2014,8), Kuhn and Shen (2013,10), Ketcham et al. (2012,14), Stango and Zinman (2009,14), Liu et al. (2014,3), Giglio et al. (2016,10)
Ethics: De Bakker and Hellsten (2013,10), Lee et al. (2013,2)
Finance: Da et al. (2014,2), Ben-David et al. (2017,14), Hoberg and Maksimovic (2015,2),
Giglio and Shue (2014,2), Phan (2014,2), Stango and Zinman (2014,14), Das and Chen
(2007,2), Hoberg and Phillips (2010,2), Hanley and Hoberg (2010,2), Rauh (2009,2), Busse et
al. (2014,14), Hoberg and Phillips (2018,2), Campello et al. (2011,2), Engelberg and Gao
(2011,2), Huang (2018,5), Jegadeesh and Wu (2013,2), Gao and Huang (2016,14), Green and
Jame (2013,2), Hanley and Hoberg (2012,2), Cai and Sevilir (2012,2)
Innovation & Entrepreneurship: Yu et al. (2017,4), Geuna et al. (2015,10), Fischer and Reuber (2014,2), Kim et al. (2015,2), Feldman and Lowe (2015,2), Corredoira et al. (2018,10), Islam et al. (2017,9), Helveston et al. (2019,10), Heimeriks et al. (2008,6), Nelson (2009,10), Perren and Sapsed (2013,10), Claussen et al. (2015,6)
Management: Gehman and Grimes (2017,2), Malesky and Taussig (2017,2), Ren et al.
(2011,5), Calic and Mosakowski (2016,4), Haans (2019,10), Lee et al. (2015b,6), Ren et al.
(2019,5)
Marketing: Kupor and Tormala (2018,2), Reich et al. (2018,2), Nishida and Remer (2018,14), Lobschat et al. (2017,5), Arora et al. (2017,6), Kozlenkova et al. (2017,5), Bellezza et al. (2017,12), Goldenberg et al. (2009,12), Villarroel Ordenes et al. (2017,8), Martin et al. (2017,2), De Langhe et al. (2016,5), Kornish and Ulrich (2014,14), Goldenberg et al. (2012,7), Berger and Milkman (2012,7), Hinz et al. (2011,14), Mogilner et al. (2012,13), Sonnier et al. (2011,5), Sismeiro and Bucklin (2004,5), Bucklin and Sismeiro (2003,5), Oestreicher-Singer and Sundararajan (2012b,5), Jerath et al. (2011,11), Tucker and Juanjuan Zhang (2011,14), Sha Yang and Ghose (2010,5), Galak et al. (2011,4), Herzenstein et al. (2011,4), Seiler et al. (2017,13), Datta et al. (2018,7), Wang et al. (2014,6), Sayedi et al. (2014,14), Trusov et al. (2016,11), Bronnenberg et al. (2016,6), Toubia and Netzer (2017,3), Li et al. (2017,14), Wu et al. (2015,5), Henkel et al. (2018,10), Pancer et al. (2018,12)
(continued on next page)
Table A.1 (continued)
Operations & Information Systems: Aguiar et al. (2018,3), Massimino et al. (2017,6), Greenwood and Gopal (2017,6), Jiahui Mo et al. (2018,3), Bapna et al. (2018,7), Xitong Li and Lynn Wu (2018,5), Hong et al. (2018,4), Nishida and Remer (2018,14), Van Osch and Steinfield (2018,14), Wei Dong et al. (2018,12), Mai et al. (2018,6), Pant and Srinivasan (2013,12), Huang et al. (2017,7), Samtani et al. (2017,12), Bandyopadhyay and Bandyopadhyay (2009,5), Carmi et al. (2017,5), Guo et al. (2017,12), Ge et al. (2017,4), Li et al. (2016,12), Lash and Zhao (2016,14), Thies et al. (2016,4), Zhang et al. (2016,8), Guo et al. (2016,6), Lee et al. (2015a,5), Dissanayake et al. (2015,3), Bodea et al. (2009,14), Zhan Shi et al. (2014,13), Huang et al. (2013,5), Oestreicher-Singer and Zalmanson (2013,7), Lau et al. (2012,2), Chau and Xu (2012,13), Luo and Zhang (2013,6), Toubia and Netzer (2017,2), Oestreicher-Singer and Sundararajan (2012,5), Nishida and Remer (2018,3), Pant and Srinivasan (2010,11), Abrahams et al. (2015,14), Gu and Ye (2014,14), Xiaohua Zeng and Liyuan Wei (2013,13), Lynn Wu (2013,14), Lu et al. (2013,12), Sosa et al. (2013,9), Setia et al. (2012,9), Quariguasi-Frota-Neto and Bloemhof (2012,6), Brynjolfsson et al. (2009,5), Moon and Sproull (2008,12), Olivares and Cachon (2009,14), Hahn et al. (2008,9), Palmer (2002,5), Pant and Srinivasan (2010,12), Li et al. (2014,14), Cachon et al. (2018,14), Luo et al. (2013,6), Claussen et al. (2013,12), Dhar et al. (2014,5), Zheng et al. (2014,5), Pant and Sheng (2015,2), Agarwal et al. (2015,5), Agarwal et al. (2015,6), Kim et al. (2016,6), Bauer et al. (2016,3), Heimbach and Hinz (2018,12), Cao et al. (2018,5), Aguiar et al. (2018,1), Jain et al. (2014,14), Bao and Datta (2014,2), Bapna and Umyarov (2015,12), Aaltonen and Seiler (2016,3), Fisher et al. (2018,5), Chen et al. (2018,8), Lee et al. (2018,12), He (2016,10), Bertsimas et al. (2014,10), Heim and Field (2007,8), Rabinovich et al. (2011,5), Bockstedt et al. (2015,3), Wooten and Ulrich (2017,3)
Organizational Behavior: Bianchi et al. (2012,9), Paik and Zhu (2016,6), Liang et al.
(2018,2), Marino et al. (2015,14), Jancsary et al. (2017,14), Marquis and Bird (2018,2), Oertel
and Thommes (2018,10)
... Content analysis is used in this research to unravel antiradical memes uploaded in the two accounts. The data collection process was conducted with the data crawling technique (Claussen & Peukert, 2019) between August and December 2022. After investigating the general contents of both accounts, we started collecting data manually based onthe hashtag (on X) and captions (on Instagram). ...
Article
Full-text available
Memes have become an important medium for expressing multiple intentions on the internet. Social media has advanced increasingly, making memes a contestation zone, an active hook for delivering information, and an expression of counterradicalism. Memes are a very effective way to take a jab at radicalism in a laid-back or even humorous manner so the public can refreshingly capture the messages. As a part of the digital way, the counterradical group also benefits from the same medium and feature. This research aims to investigate the data on the field on how the memes spreading on social media fight against radicalism in their ways. Therefore, the antiradicalism movement through memes is conducted to look for the patterns, forms, and meanings, especially on X and Instagram accounts of NU Garis Lucu (NUGL) and Muhammadiyah Garis Lucu (MuGL). Using a qualitative approach with the content analysis method, the memes posted in the two accounts were collected between August and December 2022. We found that NUGL and MuGL are actively plotting the antiradicalism movement by criticizing radicalism and fighting against religious indoctrination. Apart from that, for those two accounts, memes function as a medium to raise awareness on multiculturalism and nurture nationalism.
... The data collected was obtained using the snscrape library from the Python programming language [14], [15]. The data used in this study is Indonesian-language Twitter data. ...
Article
Full-text available
Twitter is a medium of communication, transmission of information, and exchange of opinions on a topic with an extensive reach. Twitter has a tweet with a text message of 280 characters. Because text messages can only be written briefly, tweets often use slang and may not follow structured grammar. The diverse vocabulary in tweets leads to word discrepancies, so tweets are difficult to understand. The problem often found in classifying topics in tweets is that they need higher accuracy due to these factors. Therefore, the authors used the GloVe feature expansion to reduce vocabulary discrepancies by building a corpus from Twitter and IndoNews. Research on the classification of topics in previous tweets has been done extensively with various Machine Learning or Deep Learning methods using feature expansion. However, To the best of our knowledge, Hybrid Deep Learning has not been previously used for topic classification on Twitter. Therefore, the study conducted experiments to analyze the impact of Hybrid Deep Learning and the expansion of GloVe features on classification topics. The total data used in this study was 55,411 datasets in Indonesian-language text. The methods used in this study are Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Hybrid CNN-RNN. The results show that the topic classification system with GloVe feature expansion using the CNN method achieved the highest accuracy of 92.80%, with an increase of 0.40% compared to the baseline. The RNN followed it with an accuracy of 93.72% and a 0.23% improvement. The CNN-RN Hybrid Deep Learning model achieved the highest accuracy of 94.56%, with a significant increase of 2.30%. The RNN-CNN model also achieved high accuracy, reaching 94.39% with a 0.95% increase. Based on the accuracy results, the Hybrid Deep Learning model, with the addition of feature expansion, significantly improved the system's performance, resulting in higher accuracy.
... Web crawling on Airbnb and other intermediaries' websites requires creating a user to access the data of interest. This is associated with agreeing to terms of service, which, in the case of Airbnb and many other platforms, explicitly prohibit data crawling (Scassa 2019;Scassa 2018;Claussen and Peukert 2019). ...
Conference Paper
Full-text available
Purpose – This paper aims to identify major supply data sources for short-term rental market research and to provide their advantages and limitations. Methodology – In the paper a grounded approach was used based on a literature review. This review comprised two steps with the first being the query in major databases that was supplemented by academic search engine that resulted in 170 articles. The second step was to investigate the papers’ methodological sections to identify characteristics and limitations of all data sources. Findings – This study identifies three major data sources for the short-term rental market: web scraping with the use of self-made bots, Inside Airbnb and Airdna. A majority (e.g. 74% of papers using Airdna as a source) did not mention any limitations and provide no discussion about the data source, while the remainder gave only superfluous information about possible limitations of its use. Their characteristics and limitations are extensively discussed using a proposed framework that consists of three levels: intermediary, web scraping, and source-specific. Contribution – Very limited number of studies have focused on the short-term rental data sources and this is the first one that discusses advantages and limitation of their use. This paper may be of help to academics or professionals in identifying the right source of data to suit their technical knowledge, financial and technical resources and research areas.
Article
The paper is aimed at investigating data mining technologies by acquiring tweets from Nigerian University students on Twitter on how they feel about the current state of the Nigerian university system. The study for this paper was conducted in a way that the tweet data collected using the Twitter Application was pre-processed before being translated from text to vector representation using a feature extraction technique such Bag-of-Words. In the paper, the proposed sentiment analysis architecture was designed using UML and the Naïve Bayes classifier (NBC) approach, which is a simple but effective classifier to determine the polarity of the education dataset, was applied to compute the probabilities of the classes. Furthermore, Naïve Bayes classifier polarized the tweets' wording as negative or positive for polarity. Based on our investigation, the experiment revealed after data cleaning that 4016 of the total data obtained were utilized. Also, Positive attitudes accounted for 40.56%, while negative sentiments accounted for 59.44% of the total data having divided the dataset into 70:30 training and testing ratio, with the Naïve Bayes classifier being taught on the training set and its performance being evaluated on the test set. Because the models were trained on unbalanced data, we employed more relevant evaluation metrics such as precision, recall, F1-score, and balanced accuracy for model evaluation. The classifier's prediction accuracy, misclassification error rate, recall, precision, and f1-score were 63 %, 37%, 63%, 62%, and 62% respectively. All of the analyses were completed using the Python programming language and the Natural Language Tool Kit packages. Finally, the outcome of this prediction is the highest likelihood class. These forecasts can be used by Nigerian Government to improve the educational system and assist students to receive a better education.
Article
Full-text available
RESEARCH SUMMARY Is moderate distinctiveness optimal for performance? Answers to this question have been mixed, with both inverted U‐ and U‐shaped relationships being argued for and found in the literature. I show how nearly identical mechanisms driving the distinctiveness‐performance relationship can yield both U‐ and inverted U‐shaped effects due to differences in relative strength, rather than their countervailing nature. Incorporating distinctiveness heterogeneity, I theorize a U‐shaped effect in homogeneous categories that flattens into an inverted U in heterogeneous categories. Results combining a topic model of 69,188 organizational websites with survey data from 2,279 participants in the Dutch creative industries show a U‐shaped effect in homogeneous categories, flattening and then disappearing in more heterogeneous categories. How distinctiveness affects performance thus depends entirely on how distinct others are. MANAGERIAL SUMMARY A core strategy recommendation is to be different from competitors. Recent work highlights the notion of optimal distinctiveness—being different enough to escape competition yet similar enough to be legitimate, thus yielding highest performance. This paper challenges the notion that one “optimal” level of distinctiveness exists and focuses on distinctiveness heterogeneity (representing variation in firm positions in a category) as a key contextual factor. Results from a sample of firms in the Dutch creative industries show that either being entirely different or entirely the same to competitors pays off when one's category is very homogeneous. However, being different loses its performance effects entirely when heterogeneity in firm positions is higher. Being different from competitors therefore no longer pays when others tend to be different, too. This article is protected by copyright. All rights reserved.
Article
Full-text available
We examine how international variation in corporate future-oriented behavior, such as corporate social responsibility and research and development investment, could partially stem from characteristics of the languages spoken at firms. We develop a future-time framing perspective rooted in the literatures on organizational categorization and framing. Our theory and hypotheses focus on how companies with working languages that obligatorily separate the future tense and the present tense engage less in future-oriented behaviors, and this effect is attenuated by exposure to multilingual environments. The results based on a large global sample of firms from 39 countries support our theory, highlighting the importance of language in affecting organizational behavior around the world. The online appendix is available at https://doi.org/10.1287/orsc.2018.1217.
Article
Full-text available
We suggest that text readability plays an important role in driving consumer engagement on social media. Consistent with a processing fluency account, we find that easy‐to‐read posts are more liked, commented on, and shared on social media. We analyze over 4,000 Facebook posts from Humans of New York, a popular photography blog on social media, over a 3‐year period to see how readability shapes social media engagement. The results hold when controlling for photo features, story valence, and other content‐related characteristics. Experimental findings further demonstrate the causal impact of readability and the processing fluency mechanism in the context of a fictitious brand community. This research articulates the impact of processing fluency on brief word‐of‐mouth transmissions in the real world while empirically demonstrating that readability as a message feature matters. It also extends the impact of processing fluency to a novel behavioral outcome: commenting and sharing actions. This article is protected by copyright. All rights reserved.
Article
Full-text available
Research Summary This study investigates incumbent responses to a main rival’s exit. We argue that long‐time rivals have developed an equilibrium by offering a mix of overlapping and unique products and by choosing geographic proximity to each other. A rival’s exit, however, disrupts this equilibrium and motivates surviving firms to expand in both product and geographic spaces to seek a new equilibrium. Using data from all U.S. Best Buy stores before and after the exit of Circuit City, we find that Best Buy uses product variety expansion as its major response in markets where Circuit City was colocated, but it more often responds by opening new stores in non‐colocated markets. Regardless of preexisting market structures, the magnitude of product variety expansion decreases with the opening of new stores. Managerial Summary How do surviving firms respond to a major rival’s exit? By studying Best Buy’s responses to Circuit City’s withdrawal, we find the survivor expands in both product space (increasing product variety) and geographic space (opening new stores), due to two motives. First, the survivor strives to fill in “holes” left in the market. Second, the survivor experiences uncertainty in the post‐exit world wherein its reference point is gone, threat of potential entry looms, and it lacks information about new entrants. Thus, it must deter potential entry ex ante by preempting many prime product and geographic locations. Best Buy also responds according to preexisting market structures, primarily through product variety expansion in markets wherein Circuit City was colocated and through opening new stores in non‐colocated markets.