Web Scraping Techniques and Applications: A Literature
Review
Chaimaa Lotfi1, Swetha Srinivasan1, Myriam Ertz1[0000-0001-9959-2779] and Imen Latrous1
1 LaboNFC, University of Quebec at Chicoutimi, 555 Boulevard de l’Université, Saguenay (QC), Canada
Abstract
Big data analytics gives organizations a way to analyze huge data sets and gather new information. It helps
answer basic questions about business operations and business performance. It also helps discover unknown
patterns in vast datasets or combinations thereof. In the current data-driven world, it becomes increasingly
essential that big data techniques are applied and analyzed for organizational growth. More specifically, with
the large availability of data on the Web, whether from social media, websites, online portals, or platforms, to
name but a few, it is important for organizations to know how to mine that data in order to extract useful
knowledge. Web scraping represents a fundamental approach in this regard. Therefore, this paper aims to
provide an updated literature review about the most advanced Web Scraping techniques to better equip scholars
and managers with helpful knowledge on how to mine online data most effectively. The paper starts by presenting the basic design of a web scraper and the applications of web scraping in diverse sectors and areas.
Next, the different Web scraping methods and Web scraping technologies are presented. Finally, a procedure
to develop Web scraping with various tools is proposed before a conclusion wraps up the paper.
Keywords: Big data, web scraping, business performance, web crawling, web mining
1 Introduction
Data plays a vital role in business, marketing, engineering, the social sciences, and other disciplines, since it serves as the starting point for any activity that exploits information and knowledge. The initial step of research is data collection, the systematic measurement of information about important factors, which allows one to answer inquiries, formulate research questions, test hypotheses, and assess outcomes.
Data collection methods differ depending on the subject or topic of study, the type of data sought, and the
user’s aims. Depending on the goals and conditions, the method’s application methodology can also change
without jeopardizing data integrity, correctness, or reliability [1]. There are numerous data sources on the Internet that might be employed in the design process. The technique of extracting data from websites is often known as web scraping, web extraction, web harvesting, or web crawling.
This research addresses how to build a web scraping tool to extract meaningful information from online sources and reviews recent web scraping methods and techniques. The study further helped us compare the available tools and choose the most suitable one for our purposes.
The basic design of a web scraper is shown in Figure 1: the scraper queries the Web using seed URLs, crawls through the Web from those seed URLs, parses the contents of each website to extract the required content from the crawled URLs, and stores the extracted contents in a database.
Fig. 1. The basic design of a web scraper
This figure depicts the general schema that we will follow in our study.
Table 1 below shows the stages of the literature review process followed for this study.
Table 1. Process of the Study
Stage 1. Determination of research questions: How to collect data from the Internet? What is Web data extraction, and how is it performed? What are the techniques to perform web scraping, and what are its applications?
Stage 2. Establishing criteria for inclusion and exclusion: The research includes documents that were published as of June 2021. Peer-reviewed publications, conference proceedings, book chapters, professional articles, industry data, and research reports were all included in the search. Furthermore, only papers written in English were considered.
Stage 3. Developing a research nomenclature: ABI/Inform, Academic Search Complete, Google Scholar, JSTOR, Scopus, PubMed, Web of Science, and IEEE Xplore were used to conduct a series of systematic retrievals in critical academic databases. The search phrases chosen, all describing the process of extracting information from the Internet, were "web scraping," "web data extraction," "web mining," and "web crawling." The publications were thoroughly examined, with all parts of the text scrutinized.
Stage 4. Independent analysis and mapping of the results: All the selected publications were analyzed individually, mapped based on the domains, tools, and technologies used, and presented as a summary for each publication.
Stage 5. Structure analysis: Based on the findings of Stage 4, we summarized the papers and publications under each heading, namely "Applications of Web Scraping," explaining the different domains where web scraping is used; "Web Scraping Methods," describing the different ways of extracting data from the Internet; "Web Scraping Technology," elucidating the different available technologies; and "Development of Web Scraping Tools," detailing how the web scraping tools used in the publications are built.
2 Applications of Web Scraping
Web scraping is widely utilized for a variety of purposes, including comparing prices online, observing
changes in weather data, website change detection, research, integrating data from multiple sources, extracting
offers and discounts, scraping job postings information from job portals, brand monitoring, and market analysis
[2].
It is also used as a means of collecting data quickly and efficiently and acts as a prerequisite to big data analytics. Web scraping has myriad applications in various domains; a few of them are discussed below.
2.1 In healthcare
Healthcare is no longer a domain that relies wholly on physical contact. Instead, in its unique manner, it
has gone digital. In this data-driven environment, web scraping in healthcare can save many lives by allowing
sensible decisions to be made.
Healthcare workers typically regard collecting data from many patients as a tedious and arduous process. Even though clinical data is needed more than ever, the current patient load makes gathering it nearly impossible. To that end, the author proposes implementing a system that automatically and autonomously collects clinical data from SARS-CoV-2 patients who visit the hospital, for future research [3].
Another application of web data extraction techniques in the healthcare domain is research conducted by
Dascalu et al. [4], where crawlers extract drug leaflets.
2.2 In social media
Extracting data from social media proves to be a great help in improving companies' marketing campaigns. In this fast-paced world, companies can quickly analyze customers' sentiment towards their products, improve public relations, and increase audience engagement.
For this purpose, researchers created a web-based Instagram account data download application that may be used by numerous parties, based on web scraping technology. The web scraping method was chosen because it eliminates the need to use Instagram's Application Programming Interface (API), which has several restrictions for accessing and retrieving data on the platform. The web scraping method successfully produced an Instagram account data grabber application. Application testing was carried out on 15 accounts with a total number of publications ranging from 100 to 11,000. Based on the results of the analysis, the web scraping solution was able to successfully capture Instagram account data for 2412 accounts. This application can help users save Instagram account data to a database manager and export data to several formats, namely Excel, JSON, or CSV [3].
2.3 In finance
The authors of [5] proposed a first approach to develop web-based innovation indicators that could address some of the drawbacks of existing indicators. In particular, they created a strategy for identifying product innovator enterprises on a wide scale at a minimal cost. Utilizing traditional company-level data from a questionnaire-based innovation survey (the German Community Innovation Survey), they trained an ANN classification model on the labeled (product innovator/no product innovator) online texts of the surveyed enterprises. They then used their classification model to predict whether or not hundreds of thousands of German companies are product innovators by analyzing their online texts. Next, they compared their predictions against patent statistics at the firm level, benchmark data derived from survey analysis, and regional innovation indicators. Given its breadth and geographic granularity, the findings show that this method yields solid projections and has the potential to be a valuable and cost-effective addition to the existing set of innovation indicators [5].
The research conducted by Tharaniya et al. [6] uses technical analysis of news articles scraped from the Internet.
The news is extracted from a reliable website, and the contents of the website are summarized to perform analysis
and event modeling.
2.4 In marketing
Boegershausen et al. [7], in their report, discuss the vast amount of customer data available in the form of digital footprints to analyze customer behavior and answer consumer research questions. In their paper, Saranya et al. [8] propose to predict customer purchase intention during online purchases using machine learning models. The data is collected using web scraping, since the information on the Web is in an unstructured format, and is further analyzed to predict purchase intent.
Nguyen et al. [9] analyze the social media engagement of Australian SMEs using web scraping. They collect data from Instagram using the Instagram API and find that tagging, rather than hashtags, garners more engagement, as it is perceived as more trustworthy.
2.5 Others
Deng, in his paper, used web data extraction techniques to extract mineral intelligence information in China [10]. Kotouza et al. [11] have taken advantage of web data extraction techniques to design a system that acts as an assistant to fashion designers, providing information about the newest fashion trends and improving customization. In [12], the authors used the information available on the Internet to extract forestry information features. Based on reviews published on the Web, Seliverstov et al. [13] studied traffic safety in the Northwestern Federal District, using the Python library Scrapy to scrape the reviews from the Internet.
2.6 In research
Authors Suganya et al. [14], in their paper, use web scraping for web citation analysis, which helps researchers find related papers for further analysis. They study and compare three methods: Particle Swarm Optimization,
Hidden Markov Model algorithm, and Firefly Optimization algorithm-based Web scraping to extract information
regarding web citation based on the given query. Based on their experiments, it is found that Firefly Optimization
Algorithm-based web scraping (FOAWS) performs better than the rest of the techniques.
Similarly, Rahmatulloh et al. [15] employ HTML DOM-based web scraping to produce recapitulations of scientific article publications from Google Scholar to aid research studies. The recapitulations are then presented as a report in either a PDF or Excel file.
Kolli et al. [16] present a customized news search engine that focuses on constructing a repository of news stories by efficiently mining textual content from web pages across varied e-news portals.
In Li [17], the author proposes employing web scraping and natural language processing to decrease the time required to detect research gaps. The strategy is tested in several areas: safety awareness, home prices, sentiment, and artificial intelligence. First, the titles of publications are scraped from Google Scholar and parsed using tokenization. By ranking the collocations in descending order of frequency, the set of keywords that are not used in paper titles is obtained, and the research void is determined.
Santos et al. [18] propose a scholarly production dataset focusing on COVID-19 to provide an overview of scientific research activities, making it easier to identify the countries, scientists, and research groups most active in the coronavirus disease task force. A dataset containing 40,212 records of article metadata published between January 2019 and July 2020 was extracted from several databases, namely Scopus, PubMed, arXiv, and bioRxiv, using Python web scraping techniques and pre-processed with Pandas data wrangling in a pipeline versioned with the Data Version Control (DVC) tool, making it easy to replicate and audit. APIs were used to extract data from PubMed and Scopus, while Scrapy was used to scrape data from the arXiv and bioRxiv databases.
3 Web Scraping Methods
Web scraping is the process of autonomous data mining or gathering information from the Internet and other
common databases. Different Web scraping methods have been developed across multiple studies and are presented in the following sub-sections.
3.1 Traditional copy and paste
The copy-and-paste method is simple: access the page in a browser, then manually copy its content and paste it into other media. Even though this method is easy and straightforward, it becomes difficult to use if the website employs a barrier program [1], and it requires a human to select the objects or sentences, which can be long and tedious. At the same time, other methods are more challenging to use and necessitate extra programs.
3.2 HTML parsing
On many websites, extensive collections of pages are produced programmatically from an underlying structured source, such as a database. A common script or template encodes data from the same category into similar pages. In data mining, a wrapper is a program that recognizes such templates in a given source of information, extracts its content, and converts it into a relational form [19]. Wrapper generation techniques presume that the input pages of a wrapper induction system follow a common pattern and can be identified easily through a common URL format. Furthermore, semi-structured data query languages such as XQuery and HTQL can be used to analyze HTML pages and to extract and transform their content [20,21].
3.3 DOM parsing
Programs can obtain dynamic material generated by client-side scripts by embedding a full web browser, such as Internet Explorer or the Mozilla browser control. These applications also parse web pages into a Document Object Model (DOM) tree, from which applications can extract sections of the pages [19,21]. A web page can thus be represented by a tree-structured Document Object Model; for example, the page behind a specified website address can be translated into and saved as a DOM tree.
This method provides a lot of flexibility and agility. For example, if a piece of data is on the page, it can be tracked without waiting for the web development team to expose it through the data layer [21].
3.4 HTML DOM
The HTML DOM (HyperText Markup Language Document Object Model) is a standard for obtaining, altering, and editing HTML elements [22]. It defines objects and properties for all HTML components, as well as ways to access them. For example, JavaScript can access all elements in an HTML document through the DOM; to access these objects, the HTML DOM is used with programming languages, most often JavaScript [22].
Every HTML element is considered an object, and each object's methods and properties make up the programming interface [1,22].
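As a language-neutral illustration of DOM-style access (the standard is most often exercised from JavaScript, as noted above), the following sketch uses Python's standard-library xml.dom.minidom on a small, well-formed markup fragment; the element names and class values are invented for illustration.

```python
from xml.dom.minidom import parseString

# A small, well-formed (XHTML-like) fragment; real pages may need an HTML-aware parser.
markup = """<html>
  <body>
    <h1>Latest offers</h1>
    <ul>
      <li class="offer">10% off laptops</li>
      <li class="offer">Free shipping on books</li>
    </ul>
  </body>
</html>"""

dom = parseString(markup)

# Navigate the DOM tree: every element is an object with methods and properties.
for li in dom.getElementsByTagName("li"):
    text = "".join(n.data for n in li.childNodes if n.nodeType == n.TEXT_NODE)
    print(li.getAttribute("class"), "->", text)
# offer -> 10% off laptops
# offer -> Free shipping on books
```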
3.5 Regular Expression (Regex)
A regular expression (regex) is a formula that describes a set of strings, possibly spanning many characters, that follow a precise pattern. It can be used to match specific character patterns across several strings. Regular expressions are built from two sorts of characters: ordinary characters and metacharacters [1]. Some patterns look rather strange because they contain both the material to match and special characters that modify how the pattern is interpreted. Regular expressions are a must-know tool for parsing string data and should be learned at least at a basic level [23].
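A minimal sketch of regex-based extraction in Python using the standard re module; the HTML snippet and the price pattern are invented for illustration, and in practice regexes are best reserved for simple, regular fragments rather than arbitrary HTML.

```python
import re

html = """<div class="item"><span class="name">Keyboard</span><span class="price">$49.90</span></div>
<div class="item"><span class="name">Monitor</span><span class="price">$179.00</span></div>"""

# Ordinary characters match literally; metacharacters such as (), [], + and \d control the pattern.
pattern = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>'
    r'<span class="price">\$(?P<price>[\d.]+)</span>'
)

for match in pattern.finditer(html):
    print(match.group("name"), float(match.group("price")))
# Keyboard 49.9
# Monitor 179.0
```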
3.6 XPath
XPath is a core component of the XSLT (eXtensible Stylesheet Language Transformations) standard. XPath can navigate the elements and attributes of eXtensible Markup Language (XML) documents [20]. It is a node selection language for XML documents that may also be used with HTML. The most useful XPath expression is the location path. A location path employs at least one location step to determine a group of nodes in a document. The simplest location path is the one that selects the root node of the document: it consists of a single slash "/", the same symbol that denotes the root of a Unix file system.
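A minimal sketch of location paths in practice, assuming the third-party lxml library is installed; the markup and class names are invented for illustration.

```python
from lxml import html

page = html.fromstring("""<html>
  <body>
    <div class="product"><h2>Laptop</h2><span class="price">899</span></div>
    <div class="product"><h2>Tablet</h2><span class="price">329</span></div>
  </body>
</html>""")

# "/" steps down from a node; "//" selects matching nodes anywhere below it.
names = page.xpath('//div[@class="product"]/h2/text()')
prices = page.xpath('//div[@class="product"]/span[@class="price"]/text()')

for name, price in zip(names, prices):
    print(name, price)
# Laptop 899
# Tablet 329
```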
3.7 Vertical aggregation platform
Various companies have created vertical-specific harvesting platforms. With no manual intervention and effort
tied to a single target site, these systems build and monitor a slew of bots for specific verticals. The preparatory
phase entails creating a knowledge base for the whole vertical, after which the platform builds the bots on its own.
The resilience of the platform is determined by the quality of the data it retrieves and its scalability. This scalability
is mainly utilized to choose the Long Tail of sites that are too difficult or time-consuming for traditional
aggregators to extract content from.
3.8 Semantic annotation recognizing
The scraped pages may include metadata, semantic markup, and annotations, which can be used to locate particular data pieces. If the annotations are embedded in the pages, as with Microformats, this technique can be viewed as a special case of DOM parsing. In another case, the annotations are saved and handled independently from the web pages and organized into a semantic layer, so scrapers can acquire the schema and instructions from this layer before scraping the pages.
3.9 Computer Vision Web Page Analyzer
Machine learning and computer vision are being used to recognize and extract information from web pages in a
visual manner, analyzing them as a human would. Based on the image of the rendered page, a computer vision-
based system is used to analyze the semantic structure of web pages, and a rich representation of the page is
produced as a tree of regions labeled according to their semantic role.
3.10 Comparison between web scraping methods
The comparison is conducted by putting each method to the test when extracting data from the required website, then computing and comparing the results. Process time, memory utilization, and data consumption are the experiment's measurement parameters. The findings of the experiment show that web scraping with the Regex method uses the least amount of RAM compared to the HTML DOM approach and XPath. In addition, HTML DOM takes the least amount of time and consumes the least amount of data compared to the Regular Expression and XPath approaches [1].
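A rough sketch of how such a comparison can be instrumented in Python, assuming the bs4 and lxml packages are installed; the sample page and selectors are invented, and [1] follows its own, more thorough protocol.

```python
import re
import time
import tracemalloc

from bs4 import BeautifulSoup
from lxml import html as lxml_html

# A synthetic page with 5,000 repeated template blocks (invented markup).
PAGE = "<html><body>" + '<div class="p"><span class="price">19.99</span></div>' * 5000 + "</body></html>"

def by_regex(page):
    return re.findall(r'<span class="price">([\d.]+)</span>', page)

def by_html_dom(page):
    soup = BeautifulSoup(page, "html.parser")
    return [s.get_text() for s in soup.find_all("span", class_="price")]

def by_xpath(page):
    return lxml_html.fromstring(page).xpath('//span[@class="price"]/text()')

for extract in (by_regex, by_html_dom, by_xpath):
    # tracemalloc tracks Python-level allocations only; C-level buffers are not counted.
    tracemalloc.start()
    start = time.perf_counter()
    results = extract(PAGE)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{extract.__name__}: {len(results)} items, "
          f"{elapsed * 1000:.1f} ms, peak {peak / 1024:.0f} KiB")
```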
4 Web Scraping Technology
4.1 Web crawlers
A web crawler is a bot that visits websites and extracts data from them. According to Mahto and Singh [24], a web crawler works by loading a small initial list of links. The program then looks for further links on those pages and adds them to a new list, called the crawl frontier, for further exploration. The crawler must first determine whether a URL is absolute or relative; in the case of relative URLs, it must determine the URL's base [19]. To extract and store data efficiently, a decent crawler must also be able to recognize circular references and minor modifications of the same page.
There are several types of web crawlers, namely:
1. Focused web crawler: This type of crawler searches for web pages related to certain user fields or subjects.
It makes an effort to find more relevant pages with a greater level of precision. It only downloads pages
relevant to the topic and ignores ones that are not relevant, which is enabled by ranking the Web pages.
2. Incremental crawler: Incremental crawlers are web crawlers that visit and access updated web pages.
These crawlers visit the web pages frequently and update website material by saving the most recent
version of pages.
3. Distributed crawler: These crawlers function by assigning crawling to other crawlers. A central server
manages the communication and synchronization of the nodes.
4. Parallel crawler: Multiple crawler processes are combined to form a parallel crawler, where each process filters and retrieves URLs, and the URLs collected by each process are merged.
5. Hidden crawler: Content that sits behind websites and is not accessible to general users is known as the hidden Web. A crawler that collects this data is known as a hidden crawler [25].
Table 2 below compares the web crawler types in order to choose a convenient one to work with.
Table 2. Comparison between Web crawler types
Criterion | Hidden crawler | Distributed crawler | Incremental crawler | Parallel crawler | Focused crawler
– | No | No | Yes | No | No
Traversal strategy | DFS1 | BFS2 | BFS | BFS | DFS
– | – | No | No | Yes | –
– | Yes | No | No | Yes | Yes
– | – | No | Yes | No | –
– | – | No | No | Yes | –
Starting point | Form analyzer | From seed URLs | From priority queue | From seed URLs | Related to a specific topic
Notes: 1 DFS: Depth First Search, 2 BFS: Breadth First Search.
4.2 Web scraping parsers
Web scrapers must use parsers in order to extract useful information from scraped data. Programmers use them to
format and extract certain details from data, such as a CV parser extracting a person’s name and contact
information from an email’s text. Simple HTML parser functionality is included in most Web Scraping libraries.
Parsers for particular data such as PDF, CSV, QR code, or JSON are also available. Parsers are built into real web
browsers like Firefox and Chrome. Web scraping done with a genuine web browser can also take advantage of the
browser’s built-in parser.
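As a small illustrative sketch (the scraped payload below is invented), structured formats returned by or embedded in web pages can be handled with Python's standard-library parsers, here json for the input and csv for persisting the extracted records:

```python
import csv
import json

# A JSON payload as a site's backend might return it (invented data).
raw = '{"jobs": [{"title": "Data Analyst", "city": "Montreal"}, {"title": "Web Developer", "city": "Quebec"}]}'

records = json.loads(raw)["jobs"]

# Persist the parsed records in a structured, analysis-ready CSV file.
with open("jobs.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "city"])
    writer.writeheader()
    writer.writerows(records)
```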
4.3 Web scraping policies
Selection, re-visit, politeness, and parallelization, according to Asikri et al. [19] and Mahto and Singh [24], are the
four fundamental policies that a crawler must follow in order to act efficiently. The crawler can eliminate most
useless links and considerably reduce its search space by focusing on vital links first. When pages are dynamic,
the crawler must check for updates regularly.
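A minimal sketch of the politeness policy in Python, using the standard-library urllib.robotparser to honor robots.txt and a fixed delay between requests; the user agent, site, paths, and delay are illustrative assumptions.

```python
import time
import urllib.robotparser
from urllib.parse import urljoin

USER_AGENT = "PoliteResearchBot"   # hypothetical user agent string
BASE = "https://example.com/"      # placeholder site
DELAY_SECONDS = 2                  # fixed politeness delay between requests

robots = urllib.robotparser.RobotFileParser()
robots.set_url(urljoin(BASE, "/robots.txt"))
robots.read()

for path in ["/", "/catalog", "/admin"]:
    url = urljoin(BASE, path)
    if robots.can_fetch(USER_AGENT, url):
        print("allowed, fetching:", url)
        # ... issue the HTTP request here ...
        time.sleep(DELAY_SECONDS)  # politeness: do not overload the server
    else:
        print("disallowed by robots.txt:", url)
```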
5 Development of Web Scraping tools
The web data extraction tool can be tailor-made for each specific application. The following sub-sections discuss how web data extraction tools have been built using different techniques.
In Suganya et al. [14], the authors used web scraping and crawling to obtain information from 12,250 web pages in a Google Scholar web citation database. First, information about the authors and the manuscript is extracted and saved in a .csv file. The seed URL for the web crawler is the user's query to Google Scholar. After that, the web crawler crawls the HTML pages and downloads the content that the user requires. The URL is parsed for citation information and links, which are then placed in the database. The web scraper uses a selector gadget to choose the citation material and collects citation information from a particular URL. The information is taken from the web document once it has been parsed. The text is then filtered by keywords or by matching a certain pattern before being saved in a structured fashion as a .csv file. The suggested algorithm combines web scraping with the firefly optimization technique: the web scraping technique scrapes/extracts citation information from the Web, while the firefly algorithm assigns random values, refreshes the light intensity, and evaluates the relevance of the paper's title to the user query at each step. As a result, the suggested approach retrieves information with more precision.
5.1 Web Scraping using PHP
In Melchor et al. [26, 27], the authors designed a daemon in the PHP programming language, connected to a MySQL/MariaDB database, that continuously searches for new patients consulting at the hospital. Medical records of various types were collected by applying web scraping over the HTTP protocol to their hospital web interface. The collected data were further analyzed for medical observations using machine learning.
The paper titled “Web Scraping with HTML DOM Method for Data Collection of Scientific Articles from
Google Scholar” attempts to summarize the scientific articles in Google Scholar. An HTML DOM parser using
the PHP programming language is used to extract the articles from Google Scholar. Data related to the paper,
namely the title, authors, links, citation, and the year it was published, are extracted. The scraped data is then stored
in the MySQL server database for further analysis [15].
5.2 Web Scraping using BeautifulSoup
Beautiful Soup is a Python library for parsing HTML and XML files. It creates a parse tree for the parsed pages, from which data can be extracted. Research carried out by Lunn et al. [28] comprises data extraction from Indeed.com, a job searching website, with specified keywords and locations, using two libraries: Beautifulsoup4, to extract data from HTML and XML files, and lxml, to process XML and HTML in Python. Kasereka [29], in his paper, suggested the use of BeautifulSoup to retrieve particular content from a web page, remove HTML tags, and save the information. In their paper, Nicolas et al. [30] use BeautifulSoup to extract digital notices from government portals regarding smart city strategy.
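A minimal BeautifulSoup sketch, assuming the requests and bs4 packages are installed; quotes.toscrape.com is a public sandbox site intended for scraping practice, and the selectors are assumed to match its markup.

```python
import requests
from bs4 import BeautifulSoup

# quotes.toscrape.com is a public sandbox intended for scraping practice.
response = requests.get("https://quotes.toscrape.com/", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")   # build the parse tree

for quote in soup.select("div.quote"):
    text = quote.select_one("span.text").get_text(strip=True)
    author = quote.select_one("small.author").get_text(strip=True)
    print(f"{author}: {text}")
```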
5.3 Web Scraping using Java libraries
Crawler4j is an open-source Java crawler designed for crawling webpages with a simple interface. One can set up
a multi-threaded crawler using this library in a few minutes. For example, Dascalu et al. [31] implemented two
crawlers to extract relevant drug information from Biofarm and HelpNet websites. The extracted content was then
parsed using the jsoup library to extract the needed information, which was then stored in Elasticsearch. On the other hand, Kolli et al. [16] used the Crawler4j library to extract data from online news websites, and APIs provided by JTidy were used to clean the extracted data for further analysis. The authors also used the DOM hierarchy to parse the contents and filtered them to provide the content required by the user.
Hassanien et al. [32] used a web scraper tool named WebScraper, a Google Chrome extension, to extract information from Google Scholar.
In their paper, Saleh et al. [5] propose a framework that modifies the behavior of a focused crawler using a domain distiller based on an Optimized Naïve Bayes (ONB) classifier. By using the domain distiller, the performance of the focused crawler is improved.
In Arumi and Sukmasetya [33], the depth-first search technique is combined with web scraping. The system implements a keyword-based data searching approach: the user provides an input, based on which the scraper uses depth-first search to fetch the required data, comprising dates, headlines, links to pictures, news links, and categories, based on which grouping is done. The study begins by loading the URL of online news for the keyword "education." After that, the depth-first search starts by taking the start date and the expiration date of the news, the news URL, and a category, and is repeated until news matching the search is found. The resulting URLs are then scraped and crawled according to the keywords. After the scraping and crawling process, the news data is exported to a .csv file and stored in a NoSQL database.
5.4 Web Scraping using Selenium
Selenium is an open-source web-based automation tool that is quite good at scraping websites. Selenium's web
driver has several features that allow users to move across web pages and retrieve different page parts depending
on their needs. As a result, large amounts of data from several web pages related to the user's query can be retrieved and organized [34].
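A minimal Selenium sketch using the Selenium 4 Python API, assuming a Chrome driver is available on the machine; the target page is the JavaScript-rendered version of the quotes.toscrape.com sandbox, and the selectors are assumed to match its markup.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Run Chrome headless so the scrape can be automated on a server.
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

try:
    # JavaScript-rendered version of the scraping sandbox.
    driver.get("https://quotes.toscrape.com/js/")
    # Wait until the client-side script has rendered the quotes.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.quote"))
    )
    for quote in driver.find_elements(By.CSS_SELECTOR, "div.quote"):
        text = quote.find_element(By.CSS_SELECTOR, "span.text").text
        author = quote.find_element(By.CSS_SELECTOR, "small.author").text
        print(f"{author}: {text}")
finally:
    driver.quit()
```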
In Manjari et al. [34], extractive text summarization of Web pages is performed with the help of the Selenium
framework and TF-IDF algorithm. The data is extracted from the Web pages, and the extracted content is then
summarized using the TF-IDF algorithm. The proposed extraction framework comprises the following steps:
- The user enters a query;
- The user query is concatenated with the pre-defined URL, and a user-query-related URL is generated;
- The data is then retrieved from the URLs and saved into a text file.
In Fang et al. [35], the authors propose a web-based platform providing information about pesticides, including scientific information, by integrating data from several public databases. To extract the data, the authors tried several web crawling techniques and, after evaluating their performance, used a combined approach of a Selenium-based crawler and a footprint preservation method to crawl the websites and provide the filtered information.
5.5 Web Scraping using Apache Nutch
Apache Nutch is an open-source, large-scale, distributed web crawler developed in Java that can be extended very easily.
In the study of Shafiq et al. [36], the authors attempt to build a Web crawling tool, NCL-Crawl, for specific languages. NCL-Crawl combines the Apache Nutch crawler with the Compact Language Detector (CLD2).
Barman et al. [37] aimed to develop a Monolingual Information Retrieval (IR) system for the Assamese language. A list of Government and General Assamese URLs was compiled for crawling purposes. The authors utilized Apache Lucene to index the web content crawled by Apache Nutch.
5.6 Web Scraping using Scrapy
Asikri et al. [19], in their paper, employ the Scrapy framework to scrape information from an e-commerce website, "http://www.jumia.ma/". CSS selectors were used to parse and extract the required content from the website. The components of the Scrapy framework are shown in Figure 2.
Fig. 2. The Scrapy framework
Source: Huy Phan [4]
The engine is the centre of the Scrapy framework. It controls the flow of data between Scrapy's components. It is also in charge of listening for and generating events in reaction to request errors, response errors, and exceptions.
The Scheduler controls when a task should be completed and directly links to task queues. In addition, it can
control the amount of time each request takes.
The Downloader is where HTTP requests are made. In the normal case, where no real browser is used, it stores
and returns the HTTP response data to the engine [38]. However, if a genuine browser is used to make requests,
the Downloader will be totally replaced by a middleware that can control the browser.
Spiders are developer-created classes that specify what actions the Scraper should take to obtain and interpret
certain web material [38]. Custom options for Downloader and associated middleware can also be set here. Finally,
the Item Pipeline receives the parsed contents.
The Item Pipeline processes the data returned by Spiders and performs validation, custom transformations, cleaning, and data persistence to Redis, MongoDB, or Postgres.
Downloader middleware intercepts requests and responses sent to and from Downloader and adds custom
metadata to the request and response data [38].
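A minimal Scrapy spider sketch against the quotes.toscrape.com sandbox (selectors assumed to match its markup); the parse callback yields items, which would flow through the Item Pipeline, and follows pagination links, which are queued by the Scheduler.

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Each yielded dict is an item handed to the Item Pipeline.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "next" link; the new request is queued by the Scheduler.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

# Run without a full project, e.g.:  scrapy runspider quotes_spider.py -o quotes.json
```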
5.7 Web Scraping using R
RCrawler is a package developed by Khalil et al. [39] for the R language. It is used for content scraping and domain-based Web crawling. RCrawler can crawl, parse, store, and extract material from online sites, as well as
generate data that may be used directly in web content mining applications. Multi-threaded crawling, content
extraction, and duplicate content detection are the core characteristics of RCrawler.
Marchi et al. [40] use the R language to scrape data from official city websites and official tourism promotion websites of destinations, in order to study how sustainability communication on websites informs and motivates visitors to adopt sustainable practices and behaviors.
5.8 Comparison of web scraping tools
Table 3 and Table 4 summarize and compare the different tools that can be used for web scraping purposes. Table 3 compares open-source scraping libraries and frameworks across languages, while Table 4 focuses more specifically on the Python web scraping libraries and frameworks.
Table 3. Comparison between Open-Source Web scraping techniques and frameworks
Tool | Type1 | API/standalone | Language | Extraction facilities2
Jsoup | C, P | API | Java | H, C
HttpClient | C | API | Java | –
Scrapy | F | Both | Python | R, X, C
BeautifulSoup | P | No | Python | H
Apache Nutch | F | Both | Java | R, X, H, C
Selenium | P | API | Java, Python | R, X, C
Notes: 1 Type: C = HTTP client, P = parser, F = framework. 2 Extraction facilities: R = regular expressions, X = XPath, H = HTML parsed tree, C = CSS selectors.
Table 4. Comparison between Python Web scraping libraries and frameworks
Factors | BeautifulSoup | Scrapy | Selenium
Extensibility | Suitable for low-level complex projects | Best choice for large or complex projects | Best for projects dealing with Core JavaScript
Performance | Pretty slow compared to other libraries while performing a certain task | Rapid processing due to the use of asynchronous system calls | Can handle up to some level but not as much as Scrapy
Ecosystem | It has a lot of dependencies on the ecosystem | It has a flexible ecosystem making it easy to integrate with proxies and VPNs | It has a good ecosystem for the development
6 Conclusion
Building on previous topical work [41,42], this study reviews the recent literature relating to the applications of web scraping in various domains, web scraping techniques, and tools that employ web scraping techniques. We used this review to improve our own web scraping process, and we found that most web scrapers are quite similar and general in nature, designed to carry out generic and easy jobs. By comparing the performance and features of different tools and frameworks, we found that Scrapy provides better results, as it is fast, extensible, and powerful. Since Scrapy handles requests asynchronously, results can be scraped rapidly. Furthermore, Scrapy's architecture is based on a web crawler, which enables easy data extraction, and its CSS and XPath selectors can be employed to extract the required data. Scrapy is the perfect tool for complex projects because of its flexible and extensible capabilities, making integration with VPNs and proxies easier. In addition, ScraperAPI supports browsers, proxies, and CAPTCHAs, allowing raw HTML to be obtained from any website with a single API call.
References
[1] Gunawan, R., Rahmatulloh, A., Darmawan, I., and Firdaus, F. (2019). Comparison of web scraping techniques: regular
expression, HTML DOM and Xpath. In International Conference on Industrial Enterprise and System Engineering
(IcoIESE 2018) Comparison (Vol. 2):283-287.
[2] Sirisuriya, D. S. (2015). A comparative study on web scraping. Proc. 8th Int. Res. Conf. KDU, November:135-140.
[3] Spangher, A., and May, J. (2021). StateCensusLaws.org: A Web Application for Consuming and Annotating Legal Discourse Learning. arXiv preprint arXiv:2104.10263.
[4] Phan, H. (2019). Building Application Powered by Web Scraping. Doctoral thesis.
[5] Saleh, A. I., Abulwafa, A. E., and Al Rahmawy, M. F. (2017). A web page distillation strategy for efficient focused crawling
based on optimized Naïve bayes (ONB) classifier. Applied Soft Computing, 53:181-204.
[6] Tharaniya, B., Liyanapathirana, C., Rupasinghe, L., and Sampath, K. K. Extracting Unstructured Data and Analysis and
Prediction of Financial Event Modeling. In Conference proceedings of the Annual Conference 2018 IET Sri Lanka
Network:6-11.
[7] Boegershausen, J., Borah, A., and Stephen, A. T. (2021). Fields of Gold: Web Scraping for Consumer Research. Marketing
Science Institute Working Paper Series 2021, Report No. 21-101:1-58.
[8] Saranya, G., Gopinath, N., Geetha, G., Meenakshi, K., and Nithya, M. (2020, December). Prediction of Customer Purchase
Intention Using Linear Support Vector Machine in Digital Marketing. In Journal of Physics: Conference Series (Vol. 1712,
No. 1). IOP Publishing:012024.
[9] Nguyen, V. H., Sinnappan, S., and Huynh, M. (2021). Analyzing Australian SME Instagram Engagement via Web Scraping.
Pacific Asia Journal of the Association for Information Systems, 13(2):11-43.
[10] Deng, S. (2020, June). Research on the Focused Crawler of Mineral Intelligence Service Based on Semantic Similarity.
In Journal of Physics: Conference Series (Vol. 1575, No. 1). IOP Publishing:012042.
[11] Kotouza, M. T., Tsarouchis, S. F., Kyprianidis, A. C., Chrysopoulos, A. C., and Mitkas, P. A. (2020, June). Towards
fashion recommendation: an AI system for clothing data retrieval and analysis. In IFIP International Conference on
Artificial Intelligence Applications and Innovations. Springer, Cham:433-444.
[12] Wang, H., and Song, J. (2019). Fast Retrieval Method of Forestry Information Features Based on Symmetry Function in
Communication Network. Symmetry, 11(3):416.
[13] Seliverstov, Y., Seliverstov, S., Malygin, I., and Korolev, O. (2020). Traffic safety evaluation in Northwestern Federal
District using sentiment analysis of Internet users’ reviews. Transportation Research Procedia, 50:626-635.
[14] Suganya, E., and Vijayarani, S. (2021). Firefly Optimization Algorithm Based Web Scraping for Web Citation Extraction.
Wireless Personal Communications, 118(2):1481-1505.
[15] Rahmatulloh, A., and Gunawan, R. (2020). Web Scraping with HTML DOM Method for Data Collection of Scientific
Articles from Google Scholar. Indonesian Journal of Information Systems, 2(2):95-104.
[16] Kolli, S., Krishna, P. R., and Reddy, P. B. (2006). A novel NLP and Machine Learning based text extraction approach
from online news feed. ARPN Journal of Engineering and Applied Sciences, 16(6):679-685.
[17] Li, R. Y. M. (2020). Building updated research agenda by investigating papers indexed on Google scholar: A natural
language processing approach. In International Conference on Applied Human Factors and Ergonomics. Springer,
Cham:298-305.
[18] Santos, B. S., Silva, I., da Câmara Ribeiro-Dantas, M., Alves, G., Endo, P. T., and Lima, L. (2020). COVID-19: A scholarly
production dataset report for research analysis. Data in Brief, 32:106178.
[19] Asikri, M. E., Krit, S., and Chaib, H. (2020). Using Web Scraping In A Knowledge Environment To Build Ontologies
Using Python And Scrapy. European Journal of Molecular and Clinical Medicine, 7(3):433-442.
[20] El Asikri, M., Krit, S., Chaib, H., Kabrane, M., Ouadani, H., Karimi, K., ... and Elbousty, H. (2017, May). Mining the
Web for learning ontologies: State of art and critical review. In 2017 International Conference on Engineering & MIS
(ICEMIS), IEEE:1-7.
[21] Zheng, S., Song, R., Wen, J. R., and Wu, D. (2007, August). Joint optimization of wrapper generation and template
detection. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data
mining:894-902.
[22] W3C (2016). What is the Document Object Model? https://www.w3.org/TR/WD-DOM/introduction.html
[23] Mitchell, R. (2018). Web scraping with Python: Collecting more data from the modern web. O'Reilly Media, Inc.
[24] Mahto, D. K., and Singh, L. (2016, March). A dive into Web Scraper world. In 2016 3rd International Conference on
Computing for Sustainable Global Development (INDIACom), IEEE:689-693.
[25] Chaitra, P. G., Deepthi, V., Vidyashree, K. P., and Rajini, S. (2020). A study on different types of web crawlers. In
Intelligent communication, control and devices. Springer, Singapore:781-789.
[26] Melchor, R. A., Fonseca, M., Rey, B., Hernandez, A., Puertas, B., Gomez, S., ... and Mateos, M. V. (2020). CT-152:
Application of Web-Scraping Techniques for Autonomous Massive Retrieval of Hematologic Patients' Information During
SARS-CoV2 Pandemic. Clinical Lymphoma Myeloma and Leukemia, 20:S214.
[27] Melchor, R. A., Fonseca, M., Rey, B., Hernandez, A., Puertas, B., Gomez, S., ... and Mateos, M. V. (2020). CT-152:
Application of Web-Scraping Techniques for Autonomous Massive Retrieval of Hematologic Patients' Information During
SARS-CoV2 Pandemic. Clinical Lymphoma Myeloma and Leukemia, 20:S214.
[28] Lunn, S., Zhu, J., and Ross, M. (2020, October). Utilizing web scraping and natural language processing to better inform
pedagogical practice. In 2020 IEEE Frontiers in Education Conference (FIE), IEEE:1-9.
[29] Henrys, K. Importance of web scraping in e-commerce and e-marketing:1-10.
[30] Nicolas, C., Kim, J., and Chi, S. (2021). Natural language processing-based characterization of top-down communication
in smart cities for enhancing citizen alignment. Sustainable Cities and Society, 66:102674.
[31] Dascalu, M. D., Paraschiv, I. C., Nicula, B., Dascalu, M., Trausan-Matu, S., and Nuta, A. C. (2019, October). Intelligent
Platform for the Analysis of Drug Leaflets Using NLP Techniques. In 2019 18th RoEduNet Conference: Networking in
Education and Research (RoEduNet), IEEE:1-6.
[32] Hassanien, H. E. D. (2019). Web Scraping Scientific Repositories for Augmented Relevant Literature Search Using
CRISP-DM. Applied System Innovation, 2(4):37.
[33] Arumi, E. R., and Sukmasetya, P. (2020). Exploiting Web Scraping for Education News Analysis Using Depth-First Search
Algorithm. Jurnal Online Informatika, 5(1):19-26.
[34] Manjari, K. U., Rousha, S., Sumanth, D., and Devi, J. S. (2020, June). Extractive Text Summarization from Web pages
using Selenium and TF-IDF algorithm. In 2020 4th International Conference on Trends in Electronics and Informatics
(ICOEI)(48184), IEEE:648-652.
[35] Fang, T., Han, T., Zhang, C., and Yao, Y. J. (2020). Research and construction of the online pesticide information center
and discovery platform based on web crawler. Procedia Computer Science, 166:9-14.
[36] Shafiq, H. M., and Mehmood, M. A. (2020). NCL-Crawl: A large scale language-specific Web crawling system. Language & Technology:79.
[37] Barman, A. K., Sarmah, J., and Sarma, S. K. (2019). Developing Assamese Information Retrieval System Considering
NLP Techniques: an attempt for a low resourced language. ADBU Journal of Engineering Technology, 8(2):1-12.
[38] Scrapy (2021). Scrapy 2.5 documentation. https://docs.scrapy.org/en/latest/
[39] Khalil, S., and Fakir, M. (2017). RCrawler: An R package for parallel web crawling and scraping. SoftwareX, 6:98-106.
[40] Marchi, V., Apicerni, V., and Marasco, A. (2021). Assessing Online Sustainability Communication of Italian Cultural
Destinations: A Web Content Mining Approach. In Information and Communication Technologies in Tourism 2021. Springer, Cham:58-69.
[41] Ertz, M., Sun, S., and Latrous, I. (2021). The Impact of Big Data on Firm Performance. In International Conference on
Advances in Digital Science. Springer, Cham:451-462.
[42] Ertz, M. (2022). Handbook of research on the platform economy and the evolution of e-commerce. Hershey, PA: IGI
Global.
Big data is making crossroads in promoting sustainable consumption behavior and has attracted the attention of researchers worldwide. Despite increased focussed, findings of previous studies are fragmented, and future researchers need a systematic understanding of existential literature for the identification of research scope. This study offers a systematic review of the impact of bigdata in promoting sustainable consumer behavior with the help of bibliometric analysis followed by thematic analysis. Findings suggest that businesses deploy big data to create sustainable consumer experience, predict consumer buying patterns, designing and altering business models and creating nudge for sustainable consumption, while the consumers are forcing the business to develop green operations and supply chains to reduce their carbon footprint. The major research gaps for future researchers are the impact of BDA on consumerism, role of BD in formation of sustainable habits and consumer knowledge creation for sustainable consumption, and prediction of green consumer behavior.