Legality and Ethics of Web Scraping
Emergent Research Forum (ERF)
Vlad Krotov
Murray State University
vkrotov@murraystate.edu
Leiser Silva
University of Houston
lsilva@uh.edu
Abstract
Automatic retrieval of data from the Web (often referred to as Web Scraping) for industry and academic
research projects is becoming a common practice. A variety of tools and technologies have been developed
to facilitate Web Scraping. Unfortunately, the legality and ethics of using these Web Scraping tools are often
overlooked. This work in progress reviews legal literature together with Information Systems literature on
ethics and privacy to identify broad areas of concern together with a list of specific questions that need to
be addressed by researchers employing Web Scraping for data collection. Reflecting on these questions and
concerns can potentially help the researchers decrease the likelihood of ethical and legal controversies in
their work. Further research is needed to refine the intricacies of legal and ethical issues surrounding Web
Scraping and devise strategies and tactics that researchers can use for addressing these issues.
Keywords
Big data, web data, web scraping, web crawling, law, ethics
Introduction
In the past, social scientists struggled to find data for their research (Munzert et al. 2015). Today, the increasing digitalization and virtualization of social processes have resulted in zettabytes (billions of terabytes) of data available on the World Wide Web (the Web) (Cisco Systems 2016). This data provides a granular, real-time representation of numerous processes, relationships, and interactions in the social space (Krotov and Tennyson 2018). These vast volumes of Web data present academic researchers with opportunities to collect data for answering new and old research questions with rigor, precision, and timeliness, as well as for improving organizational performance (Constantiou and Kallinikos 2015). Practitioners can leverage Web data to develop a better understanding of their customers and to formulate better strategies based on these findings (Ives et al. 2016).
Unfortunately, harnessing these vast volumes of Web data presents serious technical, legal, and ethical challenges. While there has been a proliferation of tools and technologies that can be used for Web Scraping (Munzert et al. 2015), the legality and ethics of data collection from the Web are still a "grey area" (Snell and Menaldo 2016). While existing legal frameworks can be applied, to some extent, to the emerging practice of Web Scraping, the ethical issues surrounding Web Scraping have largely been ignored. This work in progress reviews the legal literature together with the Information Systems literature on ethics and privacy to identify a preliminary set of ethical and legal considerations, together with specific questions that need to be addressed when collecting data from the Web using automated tools. Compliance with these legal and ethical requirements can help industry and academic researchers decrease the likelihood of legal problems and ethical controversies in their work and, overall, foster research relying on Web data.
Literature Review
Big Web Data
The data available on the Web comprises structured, semi-structured, and unstructured quantitative and qualitative data distributed in the form of Web pages, HTML tables, Web databases, emails, tweets, blog posts, photos, videos, etc. (Watson 2014).
Harnessing Web data requires addressing a number of technical issues related to the volume, variety, velocity, and veracity of data on the Web (Goes 2014).
First, the data on the Web is often characterized by vast volume, measured in zettabytes (billions of terabytes) (Cisco Systems 2016). Second, these vast data repositories available on the Web come in a variety of formats and rely on a variety of technological and regulatory standards (Basoglu and White 2015). Third, the data on the Web is not static; it is generated with extreme velocity. The final characteristic of Big Data is its veracity (Goes 2014). Due to the open, voluntary, and often anonymous interactions on the Web, there is an inherent uncertainty associated with the availability and quality of Web data. A researcher can never be completely sure whether the needed data is or will be available on the Web and whether this data is reliable enough to be used in research (IBM 2018).
Web Scraping
Given the volume, variety, velocity, and veracity of Big Data available on the Web, collection and organization of this data can hardly be done manually by individual researchers or even large research teams (Krotov and Tennyson 2018). Because of that, researchers often resort to various technologies and tools to automate some aspects of data collection and organization. This emerging practice of using technology for collecting data from the Web is often referred to as Web Scraping (Landers et al. 2016). Web Scraping is defined here as the use of technology tools for automatic extraction and organization of data from the Web for the purpose of further analysis of this data (Krotov and Tennyson 2018). Web Scraping consists of three main, intertwined phases: website analysis, website crawling, and data organization (see Figure 1). Website analysis involves examining the underlying structure of a website or a Web repository (e.g., an online database) for the purpose of understanding how the needed data is stored. This requires a basic understanding of the World Wide Web architecture, mark-up languages (e.g., HTML, CSS, XML, XBRL), and Web database technologies (e.g., MySQL). Website crawling involves developing and running a script that automatically browses the website and retrieves the needed data. These crawling applications (or scripts) are often developed using programming languages such as R and Python. This has to do with the overall popularity of these languages in Data Science and the availability of libraries (e.g., the rvest package in R or the Beautiful Soup library in Python) for automatic crawling and parsing of Web data. After the necessary data is parsed from the selected Web repository, it needs to be cleaned, pre-processed, and organized in a way that enables further analysis. Given the volume of data involved, a programmatic approach may also be necessary to save time. Many programming languages, such as R and Python, offer Natural Language Processing (NLP) libraries and data manipulation functions that are useful for cleaning and organizing data. Oftentimes, these three phases of Web Scraping cannot be fully automated and require at least some degree of human involvement and supervision.
Figure 1: Web Scraping (Adapted from Krotov and Tennyson 2018)
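To make these phases more concrete, the following minimal Python sketch illustrates the crawling and data-organization steps using the Beautiful Soup library mentioned above. The URL, the CSS selectors, and the output file name are hypothetical placeholders; a real project would first complete the website-analysis phase to determine how the target site actually structures its pages.

```python
# A minimal, illustrative Web Scraping sketch (hypothetical URL and selectors).
# Phases: (1) website analysis is assumed to have identified the relevant
# HTML structure; (2) website crawling retrieves and parses the pages;
# (3) data organization writes the cleaned records to a CSV file.
import csv
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.example.com/reviews?page={page}"  # hypothetical target

records = []
for page in range(1, 4):  # crawl a small, fixed number of pages
    response = requests.get(BASE_URL.format(page=page), timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # The CSS class names below are placeholders identified during website analysis.
    for review in soup.select("div.review"):
        title = review.select_one("h3.review-title")
        body = review.select_one("p.review-text")
        records.append({
            "title": title.get_text(strip=True) if title else "",
            "text": body.get_text(strip=True) if body else "",
        })

    time.sleep(2)  # pause between requests to avoid overloading the server

# Data organization: store the parsed records in a structured, analyzable format.
with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "text"])
    writer.writeheader()
    writer.writerows(records)
```

An equivalent workflow in R would typically rely on the rvest package's read_html(), html_elements(), and html_text() functions.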
Legality of Web Scraping
While numerous tools and technologies are available to assist researchers with Web Scraping (Munzert et al. 2015), the legality of Web Scraping is still a "grey area" in the legal field (Snell and Menaldo 2016).
Here we define legality as compliance with applicable laws and legal theories. There is no legislation that addresses Web Scraping directly. As of now, Web Scraping is guided by a set of related, fundamental legal theories and laws, such as copyright infringement, breach of contract, the Computer Fraud and Abuse Act (CFAA), and trespass to chattels (Dryer and Stockton 2013; Snell and Menaldo 2016). Some specific details of how these fundamental legal theories apply to Web Scraping are provided below.
Terms of Use
It has often been argued in the legal field that a website owner can effectively prevent programmatic access to a website by explicitly prohibiting it in the "terms of use" policy posted on the website. Failure to comply with these terms may constitute a breach of contract on the part of the website's user (Dryer and Stockton 2013). However, to hold someone liable for violating the "terms of use," the website user must have entered an explicit agreement with the website owner to comply with the "terms of use" policy (e.g., by clicking on a checkbox). Thus, simply prohibiting Web Crawling and Web Scraping on the website may not preclude someone from crawling the website from a legal standpoint.
Copyrighted Material
Scraping and republishing data or information that is owned and explicitly copyrighted by the website owner can lead to a copyright infringement case (Dryer and Stockton 2013). However, a website does not necessarily own the data generated by its users. For example, a website devoted to product reviews does not necessarily own the reviews generated by the users of this website. Moreover, ideas cannot be copyrighted; only the specific form or representation of those ideas can be. Thus, one can use copyrighted data, for example, to create summaries of that data. Finally, one can still use copyrighted material on a limited scale under the "fair use" principle.
Purpose of Web Scraping
Any illegal or fraudulent use of data obtained through Web Scraping is prohibited by several laws. For example, a person accessing data from the Web that is known to be confidential and protected may be prosecuted under the Computer Fraud and Abuse Act if the resulting damage exceeds $5,000 (Dryer and Stockton 2013). On the Web, this often occurs when somebody knowingly accesses "premium content" and then resells it, or continues accessing the content via an unauthorized channel after receiving a "cease and desist" letter from the owner of the website (Snell and Menaldo 2016).
Damage to the Website
If Web Scraping overloads or damages a website or a Web server, then the person responsible for the damage can be held liable under the "trespass to chattels" doctrine (Dryer and Stockton 2013). However, the damage needs to be material and easy to prove in court in order for the owner of the Web server to be eligible for financial compensation.
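One practical way for researchers to reduce the risk of overloading a server is to throttle their crawling scripts. The sketch below, which assumes a hypothetical list of URLs, illustrates one simple approach: issuing requests sequentially with a fixed delay and backing off when the server signals that it is overloaded.

```python
# A minimal throttling sketch for polite crawling (hypothetical URLs).
# Requests are issued one at a time with a fixed delay; if the server
# responds with HTTP 429 or 503, the script backs off before retrying.
import time

import requests

URLS = ["https://www.example.com/page/{}".format(i) for i in range(1, 6)]
DELAY_SECONDS = 5        # pause between consecutive requests
BACKOFF_SECONDS = 60     # longer pause if the server signals overload

for url in URLS:
    response = requests.get(url, timeout=30)
    if response.status_code in (429, 503):
        # Too Many Requests / Service Unavailable: wait and retry once.
        time.sleep(BACKOFF_SECONDS)
        response = requests.get(url, timeout=30)
    # ... parse and store response.text here ...
    time.sleep(DELAY_SECONDS)
```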
Ethics of Web Scraping
While existing laws and legal theories have been applied to Web Scraping in both courts and the legal literature, the ethics of Web Scraping has received little attention in prior literature. While there are many perspectives on ethics, for the purposes of this research project we view ethics as "a set of concepts and principles that guide us in determining what behavior helps or harms sentient creatures" (Paul and Elder 2006). In addition to violating existing laws, Web Scraping can result in unintended harm to the "sentient creatures" associated with a particular website, such as the website's owners or customers. These harmful consequences are, by definition, hard to predict (Light and McGrath 2010). Yet some possible harmful consequences of Web Scraping are discussed below.
Individual Privacy
A research project relying on data collected from a website may unintentionally compromise the privacy of individuals participating in the activities afforded by the website (Mason 1986). For example, by matching the data collected from a website with other online and offline sources, a researcher can unintentionally reveal the identity of those who created the data (Ives and Krotov 2006).
Even if individual privacy is not violated, a website's customers may not have consented to any third-party use of their data. Thus, using this data without consent is a violation of the rights of research subjects (Buchanan 2017). These privacy and rights violations can lead to serious consequences for the website owner, given the heightened concern with online privacy in light of the recent privacy scandals involving such organizations as Facebook and Cambridge Analytica.
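Where a project must retain user-generated content, one possible (though by itself insufficient) safeguard is to pseudonymize direct identifiers before the scraped data is stored. The sketch below is purely illustrative: the field names and the salt are hypothetical placeholders, and pseudonymization alone does not prevent re-identification through matching with other data sources.

```python
# A hypothetical sketch of pseudonymizing user identifiers in scraped records.
# Hashing usernames with a project-specific salt removes direct identifiers,
# but it does not by itself prevent re-identification via other data sources.
import hashlib

SALT = "project-specific-secret"  # placeholder; keep out of version control


def pseudonymize(username: str) -> str:
    """Return a salted SHA-256 digest to store instead of the raw username."""
    return hashlib.sha256((SALT + username).encode("utf-8")).hexdigest()


record = {"username": "example_user", "review": "Great product!"}  # scraped record
record["username"] = pseudonymize(record["username"])
print(record)
```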
Organizational Privacy and Trade Secrets
Just as individuals have the right to privacy, organizations also have the right to keep certain aspects of their operations confidential (Mason 1986). Automated Web Scraping can unintentionally reveal trade secrets or simply confidential information about the organization that owns a website. For example, by automatically crawling and counting employment ads on an online recruitment website, one can approximate the website's market share and revenues. It can also reveal some details and, possibly, flaws in the way the data is stored by the website (Ives and Krotov 2006). All this can damage the reputation of the company behind the website and lead to material financial losses.
Diminishing Value for the Organization
If one accesses a website while bypassing the Web interface designed for human users, then one will not be exposed to the advertisements that the website uses to monetize its content. Moreover, a Web Scraping project can lead to the creation of a data product (e.g., a report) that, without infringing on the copyright, makes it less likely for a customer to purchase a data product from the original owner of the data. In other words, the data product created with the help of Web Scraping can directly or indirectly compete with the business of the website's owner (Hirschey 2014). All this may lead to financial losses for the owner of the website or, at a minimum, an unfair distribution of value from data ownership (Mason 1986).
Implications
Based on the literature presented in this paper, we generate a list of questions that need to be addressed in order to make a Web Scraping project legal (Dryer and Stockton 2013; Hirschey 2014; Snell and Menaldo 2016) and ethical (Mason 1986; Ives and Krotov 2006; Buchanan 2017). These questions are as follows:
Is Web Crawling or Web Scraping explicitly prohibited by the website's "terms of use" policy?
Is the website's data explicitly copyrighted?
Does the project involve illegal or fraudulent use of the data?
Can crawling and scraping potentially cause material damage to the website or the Web server hosting the website?
Can the data obtained from the website compromise individual privacy?
Can the data obtained from the website reveal confidential information about the operations of the organizations providing data or the company owning the website?
Can the project requiring the Web data potentially diminish the value of the service provided by the website?
A positive answer to any of these questions may suggest that the Web Scraping project can potentially result in lawsuits or ethical controversies. It may not be necessary to halt a research project involving a potential violation of one or more principles discussed in this paper. For example, copyrighted data can still be used in accordance with the "fair use" principle. Even if the "terms of use" prohibit crawling, permission for automatic collection of data can still be obtained from the website's owner. Still, researchers behind projects that involve a positive answer to any of these questions should reflect on how they will deal with the potential issue so that legal and ethical requirements are still upheld.
Conclusion
The Big Data available on the Web presents researchers and practitioners with numerous opportunities. For researchers, these opportunities include leveraging this data for developing a more granular understanding of various old and new social phenomena in a more timely fashion. Practitioners can leverage Web data for developing a better understanding of their customers.
But leveraging Big Data from the Web presents both researchers and practitioners with big challenges as well. Apart from the need to learn and deploy new tools and technologies capable of accommodating Big Data, researchers and practitioners intending to use Web Scraping in their projects need to comply with a number of legal and ethical requirements. Unfortunately, due to the relative novelty of the Web Scraping phenomenon, the legality and ethics of Web Scraping are still a "grey area". This work in progress is a preliminary attempt to reflect on some of the legal and ethical issues surrounding Web Scraping. A list of specific questions that need to be addressed by researchers employing Web Scraping is formulated in this paper. A negative answer to all of these questions does not necessarily give clearance to proceed with the research project. Rather, this list of questions should be used as a starting point for reflecting on the legality and ethics of a research project relying on Web Scraping for data acquisition. Further research is needed to refine, extend, and integrate the various legal and ethical principles from which these questions are derived. This research will have to adopt a multidisciplinary, socio-technical perspective: Web Scraping is truly a multidisciplinary phenomenon requiring the use of modern Big Data tools and technologies as well as knowledge of the principles of law and ethics.
REFERENCES
Basoglu, K. A., and White, C. E., Jr. 2015. "Inline XBRL versus XBRL for SEC Reporting," Journal of Emerging Technologies in Accounting (12:1), pp. 189-199.
Buchanan, E. 2017. "Internet Research Ethics: Twenty Years Later," in Internet Research Ethics for the Social Age: New Challenges, Cases, and Contexts, M. Zimmer and K. Kinder-Kurlanda (eds.), Bern, Switzerland: Peter Lang International Academic Publishers, pp. xxix-xxxiii.
Cisco Systems. 2016. "Cisco Visual Networking Index: Forecast and Methodology, 2014-2019," White Paper. Retrieved from http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generation-network/white_paper_c11-481360.html
Constantiou, I. D., and Kallinikos, J. 2015. "New Games, New Rules: Big Data and the Changing Context of Strategy," Journal of Information Technology (30:1), pp. 44-57.
Dryer, A. J., and Stockton, J. 2013. "Internet 'Data Scraping': A Primer for Counseling Clients," New York Law Journal. Retrieved from https://www.law.com/newyorklawjournal/almID/1202610687621
Goes, P. B. 2014. "Editor's Comments: Big Data and IS Research," MIS Quarterly (38:3), pp. iii-viii.
Hirschey, J. K. 2014. "Symbiotic Relationships: Pragmatic Acceptance of Data Scraping," Berkeley Technology Law Journal (29), pp. 897-927.
IBM. 2018. "The Four V's of Big Data," Retrieved from http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Ives, B., and Krotov, V. 2006. "Anything You Search Can Be Used Against You in a Court of Law: Data Mining in Search Archives," Communications of the Association for Information Systems (18:1), pp. 593-611.
Ives, B., Palese, B., and Rodriguez, J. A. 2016. "Enhancing Customer Service through the Internet of Things and Digital Data Streams," MIS Quarterly Executive (15:4).
Krotov, V., and Tennyson, M. 2018. "Scraping Financial Data from the Web Using the R Language," Journal of Emerging Technologies in Accounting, forthcoming.
Landers, R. N., Brusso, R. C., Cavanaugh, K. J., and Collmus, A. B. 2016. "A Primer on Theory-Driven Web Scraping: Automatic Extraction of Big Data from the Internet for Use in Psychological Research," Psychological Methods (21:4), pp. 475-492.
Light, B., and McGrath, K. 2010. "Ethics and Social Networking Sites: A Disclosive Analysis of Facebook," Information Technology & People (23:4), pp. 290-311.
Mason, R. O. 1986. "Four Ethical Issues of the Information Age," MIS Quarterly (10:1), pp. 5-12.
Munzert, S., Rubba, C., Meißner, P., and Nyhuis, D. 2015. Automated Data Collection with R: A Practical Guide to Web Scraping and Text Mining, Chichester, UK: John Wiley & Sons, Ltd.
Paul, R., and Elder, L. 2006. The Thinker's Guide to Understanding the Foundations of Ethical Reasoning, Foundation for Critical Thinking.
Snell, J., and Menaldo, N. 2016. "Web Scraping in an Era of Big Data 2.0," Bloomberg BNA. Retrieved from https://www.bna.com/web-scraping-era-n57982073780/
Watson, H. J. 2014. "Tutorial: Big Data Analytics: Concepts, Technologies, and Applications," Communications of the Association for Information Systems (34:1), pp. 1247-1268.