Web Scraping

Bo Zhao
College of Earth, Ocean, and Atmospheric Sciences, Oregon State University, Corvallis, OR, USA
Web scraping, also known as web extraction or harvesting, is a technique to extract data from the World Wide Web (WWW) and save it to a file system or database for later retrieval or analysis. Commonly, web data is scraped utilizing the Hypertext Transfer Protocol (HTTP) or through a web browser. This is accomplished either manually by a user or automatically by a bot or web crawler. Because an enormous amount of heterogeneous data is constantly generated on the WWW, web scraping is widely acknowledged as an efficient and powerful technique for collecting big data (Mooney et al. 2015; Bar-Ilan 2001). To adapt to a variety of scenarios, current web scraping techniques range from small, ad hoc, human-aided procedures to fully automated systems that are able to convert entire websites into well-organized data sets. State-of-the-art web scraping tools are not only capable of parsing markup languages or JSON files but can also integrate computer visual analytics (Butler 2007) and natural language processing to simulate how human users browse web content (Yi et al. 2003).
The process of scraping data from the Internet can be divided into two sequential steps: acquiring web resources and then extracting the desired information from the acquired data. Specifically, a web scraping program starts by composing an HTTP request to acquire resources from a targeted website. This request can be formatted either as a URL containing a GET query or as an HTTP message containing a POST query. Once the request is successfully received and processed by the targeted website, the requested resource is retrieved from the website and sent back to the web scraping program. The resource can be in multiple formats, such as web pages built from HTML, data feeds in XML or JSON format, or multimedia data such as images, audio, or video files. After the web data is downloaded, the extraction process continues to parse, reformat, and organize the data in a structured way. A web scraping program has two essential modules: one for composing an HTTP request, such as Urllib2 or Selenium, and another for parsing and extracting information from raw HTML code, such as Beautiful Soup or Pyquery. Here, the Urllib2 module defines a set of functions for dealing with HTTP requests, such as authentication, redirections, cookies, and so on, while Selenium is a wrapper that launches and controls a web browser, such as Google Chrome or Internet Explorer, and enables users to automate the process of browsing a website programmatically. Regarding data extraction, Beautiful Soup is designed for
scraping HTML and other XML documents. It provides convenient Pythonic functions for navigating, searching, and modifying a parse tree, as well as a toolkit for decomposing an HTML file and extracting the desired information via lxml or html5lib. Beautiful Soup can automatically detect the encoding of the document being parsed and convert it to a client-readable encoding. Similarly, Pyquery provides a set of jQuery-like functions for parsing XML documents. Unlike Beautiful Soup, however, Pyquery only supports lxml for fast XML processing.
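To make the two steps concrete, the following minimal sketch acquires a page with Python 3's urllib.request (the successor to Urllib2) and extracts information from it with Beautiful Soup. The target URL, request header, and extracted elements are illustrative placeholders rather than part of the original chapter.

```python
# Step 1: acquire a web resource with an HTTP GET request.
# Step 2: parse the raw HTML and extract the desired information.
import urllib.request          # Python 3 successor to Urllib2
from bs4 import BeautifulSoup  # pip install beautifulsoup4

url = "https://example.com/"   # hypothetical target page

# Compose and send the HTTP request.
request = urllib.request.Request(url, headers={"User-Agent": "demo-scraper/0.1"})
with urllib.request.urlopen(request) as response:
    html = response.read()

# Parse the response; Beautiful Soup detects the encoding automatically.
soup = BeautifulSoup(html, "html.parser")
title = soup.find("title")                          # navigate the parse tree
links = [a.get("href") for a in soup.find_all("a")] # extract all hyperlinks

print(title.get_text() if title else "no title found")
print(f"{len(links)} links extracted")
```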
Of the various types of web scraping programs, some are created to automatically recognize the data structure of a page, such as Nutch or Scrapy, while others provide a web-based graphic interface that eliminates the need for manually written web scraping code, such as Import.io. Nutch is a robust and scalable web crawler, written in Java. It enables fine-grained configuration, parallel harvesting, robots.txt rule support, and machine learning. Scrapy, written in Python, is a reusable web crawling framework. It speeds up the process of building and scaling large crawling projects. In addition, it provides a web-based shell to simulate the website browsing behaviors of a human user. To enable nonprogrammers to harvest web content, web-based crawlers with graphic interfaces are purposely designed to mitigate the complexity of using a web scraping program. Among them, Import.io is a typical crawler for extracting data from websites without writing any code. It allows users to identify and convert unstructured web pages into a structured format. Import.io's graphic interface for data identification allows the user to train the tool on what to extract. The extracted data are then stored on a dedicated cloud server and can be exported in CSV, JSON, and XML formats. A web-based crawler with a graphic interface can easily harvest and visualize real-time data streams based on an SVG or WebGL engine but falls short in manipulating large data sets.
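As an illustration of the framework approach, the sketch below shows a minimal Scrapy spider; the start URL, CSS selectors, and field names are hypothetical and would need to be adapted to a real site.

```python
# A minimal Scrapy spider: crawl a (hypothetical) listing page, yield one
# structured record per entry, and follow pagination links.
import scrapy


class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://example.com/listing"]  # placeholder start page

    def parse(self, response):
        # Extract one item per entry found on the page.
        for entry in response.css("div.item"):
            yield {
                "title": entry.css("h2::text").get(),
                "price": entry.css("span.price::text").get(),
            }
        # Let Scrapy schedule the next page, if any.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Saved as demo_spider.py, such a spider could be run with `scrapy runspider demo_spider.py -o items.json`, which exports the collected items as structured JSON.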
Web scraping can be used for a wide variety of
scenarios, such as contact scraping, price change
monitoring/comparison, product review collec-
tion, gathering of real estate listings, weather
data monitoring, website change detection, and
web data integration. For example, at a micro-
scale, the price of a stock can be regularly scraped
in order to visualize the price change over time
(Case et al. 2005), and social media feeds can be
collectively scraped to investigate public opinions
and identify opinion leaders (Liu and Zhao 2016).
At a macro-level, the metadata of nearly every
website is constantly scraped to build up Internet
search engines, such as Google Search or Bing
Search (Snyder 2003).
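As a hedged sketch of the price-monitoring scenario, the snippet below polls a hypothetical JSON price feed at a fixed interval and appends each reading to a CSV file for later visualization; the endpoint, response field, and interval are assumptions, not an actual service.

```python
# Periodically scrape a (hypothetical) JSON price feed and log readings to CSV.
import csv
import json
import time
import urllib.request

FEED_URL = "https://example.com/api/quote?symbol=XYZ"  # placeholder endpoint
OUTPUT = "prices.csv"
INTERVAL_SECONDS = 300  # poll every five minutes

def fetch_price(url: str) -> float:
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    return float(payload["price"])  # assumed field name in the feed

for _ in range(12):  # collect a dozen readings for this sketch
    price = fetch_price(FEED_URL)
    with open(OUTPUT, "a", newline="") as f:
        csv.writer(f).writerow([time.strftime("%Y-%m-%dT%H:%M:%S"), price])
    time.sleep(INTERVAL_SECONDS)  # keep the request rate reasonable
```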
Although web scraping is a powerful technique for collecting large data sets, it is controversial and may raise legal questions related to copyright (O'Reilly 2006), terms of service (ToS) (Fisher et al. 2010), and "trespass to chattels" (Hirschey 2014). A web scraper is free to copy a piece of data in figure or table form from a web page without any copyright infringement, because it is difficult to prove a copyright over such data: only a specific arrangement or a particular selection of the data is legally protected. Regarding the ToS, although most web applications include some form of ToS agreement, their enforceability usually lies within a gray area. For instance, the owner of a web scraper that violates the ToS may argue that he or she never saw or officially agreed to the ToS. Moreover, if a web scraper sends data-acquiring requests too frequently, this is functionally equivalent to a denial-of-service attack, in which case the web scraper owner may be refused entry and may be liable for damages under the law of "trespass to chattels," because the owner of the web application has a property interest in the physical web server which hosts the application. An ethical web scraping tool will avoid this issue by maintaining a reasonable requesting frequency.
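The following sketch illustrates, under assumed URLs and an arbitrary delay value, what such an ethical requesting pattern might look like: the scraper consults the site's robots.txt exclusion rules and pauses between requests.

```python
# Polite scraping: honor robots.txt and throttle the request rate.
import time
import urllib.request
import urllib.robotparser

BASE = "https://example.com"
PAGES = ["/page1", "/page2", "/page3"]  # hypothetical paths to scrape
DELAY_SECONDS = 5                       # pause between successive requests

# Fetch and parse the site's robots.txt exclusion rules.
robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
robots.read()

for path in PAGES:
    url = BASE + path
    if not robots.can_fetch("demo-scraper", url):
        continue  # skip resources the site disallows
    with urllib.request.urlopen(url) as response:
        html = response.read()
    # ... parse and store the page here ...
    time.sleep(DELAY_SECONDS)  # throttle to avoid overloading the server
```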
A web application may adopt one of the following measures to stop or interfere with a web scraping tool that collects data from the given website. These measures may identify whether an operation was conducted by a human being or a bot. Some of the major measures include the following: HTML fingerprinting, which investigates the HTML headers to identify whether a visitor is malicious or safe (Acar et al. 2013); IP reputation determination, where IP addresses with a recorded history of use in website assaults are treated with suspicion and are more likely to be heavily scrutinized (Sadan and Schwartz 2012); behavior analysis, which reveals abnormal behavioral patterns, such as placing a suspiciously high rate of requests or adhering to anomalous browsing patterns; and progressive challenges that filter out bots with a set of tasks, such as cookie support, JavaScript execution, and CAPTCHA (Doran and Gokhale 2011).
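As a rough illustration of the behavior-analysis measure, the server-side heuristic sketched below flags clients whose request rate within a sliding time window exceeds a threshold; the window length and threshold are arbitrary assumptions rather than values from the chapter.

```python
# Flag clients that place a suspiciously high rate of requests.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
MAX_REQUESTS_PER_WINDOW = 20

_request_log = defaultdict(deque)  # client IP -> recent request timestamps

def looks_like_bot(client_ip: str) -> bool:
    """Return True if the client's request rate exceeds the threshold."""
    now = time.time()
    history = _request_log[client_ip]
    history.append(now)
    # Drop timestamps that have fallen outside the sliding window.
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()
    return len(history) > MAX_REQUESTS_PER_WINDOW
```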
Further Readings
Acar, G., Juarez, M., Nikiforakis, N., Diaz, C., Gürses, S., Piessens, F., & Preneel, B. (2013). FPDetective: Dusting the web for fingerprinters. In Proceedings of the 2013 ACM SIGSAC conference on computer & communications security. New York: ACM.
Bar-Ilan, J. (2001). Data collection methods on the web for informetric purposes: A review and analysis. Scientometrics, 50(1), 7-32.
Butler, J. (2007). Visual web page analytics. Google
Patents.
Case, K. E., Quigley, J. M., & Shiller, R. J. (2005). Com-
paring wealth effects: The stock market versus the
housing market. The BE Journal of Macroeconomics,
5(1), 1.
Doran, D., & Gokhale, S. S. (2011). Web robot detection techniques: Overview and limitations. Data Mining and Knowledge Discovery, 22(1), 183-210.
Fisher, D., Mcdonald, D. W., Brooks, A. L., & Churchill,
E. F. (2010). Terms of service, ethics, and bias: Tapping
the social web for CSCW research. Computer
Supported Cooperative Work (CSCW), Panel
discussion.
Hirschey, J. K. (2014). Symbiotic relationships: Pragmatic
acceptance of data scraping. Berkeley Technology Law
Journal, 29, 897.
Liu, J. C.-E., & Zhao, B. (2016). Who speaks for climate change in China? Evidence from Weibo. Climatic Change, 140(3), 413-422.
Mooney, S. J., Westreich, D. J., & El-Sayed, A. M. (2015).
Epidemiology in the era of big data. Epidemiology,
26(3), 390.
O'Reilly, S. (2006). Nominative fair use and Internet aggregators: Copyright and trademark challenges posed by bots, web crawlers and screen-scraping technologies. Loyola Consumer Law Review, 19, 273.
Sadan, Z., & Schwartz, D. G. (2012). Social network analysis for cluster-based IP spam reputation. Information Management & Computer Security, 20(4), 281-295.
Snyder, R. (2003). Web search engine with graphic snap-
shots. Google Patents.
Yi, J., Nasukawa, T., Bunescu, R., & Niblack, W. (2003). Sentiment analyzer: Extracting sentiments about a given topic using natural language processing techniques. In Proceedings of the Third IEEE International Conference on Data Mining (ICDM 2003). Melbourne, FL, USA: IEEE.
We examine the link between increases in housing wealth, financial wealth, and consumer spending. We rely upon a panel of 14 countries observed annually for various periods during the past 25 years and a panel of U.S. states observed quarterly during the 1980s and 1990s. We impute the aggregate value of owner-occupied housing, the value of financial assets, and measures of aggregate consumption for each of the geographic units over time. We estimate regression models in levels, first differences and in error-correction form, relating consumption to income and wealth measures. We find a statistically significant and rather large effect of housing wealth upon household consumption.