ArticlePDF Available

The Use of Web-scraping Software in Searching for Grey Literature

Authors:

Abstract

Searches for grey literature can require substantial resources to undertake but their inclusion is vital for research activities such as systematic reviews. Web scraping, the extraction of patterned data from web pages on the internet, has been developed in the private sector for business purposes, but it offers substantial benefits to those searching for grey literature. By building and sharing protocols that extract search results and other data from web pages, those looking for grey literature can drastically increase their transparency and resource efficiency. Various options exist in terms of web-scraping software and they are introduced herein.
TGJ Volume 11, Number 3 2015 Haddaway
186
The Use of Web-scraping Software in
Searching for Grey Literature
Neal R. Haddaway (Sweden)
Abstract
Searches for grey literature can require substantial resources to undertake but their inclusion is
vital for research activities such as systematic reviews. Web scraping, the extraction of patterned
data from web pages on the internet, has been developed in the private sector for business
purposes, but it offers substantial benefits to those searching for grey literature. By building and
sharing protocols that extract search results and other data from web pages, those looking for
grey literature can drastically increase their transparency and resource efficiency. Various options
exist in terms of web-scraping software and they are introduced herein.
The Challenge of Searching for Grey Literature
The editorial scrutiny and peer-review that form integral parts of commercial academic
publishing are useful in assuring reliability and standardised reporting in published research.
However, publication bias can cause an overestimation of effect sizes in syntheses of the
(commercially) published literature (Gurevitch and Hedges 1999; Lortie et al. 2007). In a recent
study by Kicinski et al. (2015), the largest analysis of publication bias in meta-analyses to-date,
publication bias was detected across the Cochrane Library of systematic reviews, although there
was evidence that more recent research suffered to a lesser degree, thanks to mitigation
measures applied in medical research in recent decades.
Some applied subject areas, such as conservation biology, are particularly likely to be reported in
sources other than academic journals, so called practitioner-held data (Haddaway and Bayliss in
press); for example reports of the activities of non-governmental organisations. Such grey
literature is vital for a range of research, policy and practical applications, particularly informing
policy decision-making. Documents produced by governments, business, non-governmental
organisation and academics can provide a range of useful information, but are often overlooked
in traditional meta-analyses and literature reviews.
Systematic reviews were established in the medical sciences to collate and synthesise research
on particular clinical interventions in a reliable, transparent and objective manner, and were a
response to the susceptibility to bias common to traditional literature reviews (Allen and
Richmond 2011). In the last decade systematic review methodology has been translated into a
range of other subjects, including social science (Walker et al. 2013) and environmental
management (CEE 2013). A key aspect of systematic review methodology is that searches are
undertaken for grey literature to mitigate possible publication bias and to include practitioner-
held data. These searches may fail to find any research that is ultimately included (e.g. Haddaway
et al. 2014), but it is important for the reliability and transparency of the review to demonstrate
that this is the case: other reviews have demonstrated significant proportions of grey literature in
the synthesised evidence base (Bernes et al. 2015).
Systematic review searches for grey literature can be particularly challenging and time-
consuming. No comprehensive database resources exist in the environmental sciences for grey
literature, as in many other disciplines, and so searches must include web-based search engines,
specialist databases such as repositories for theses, organisational web sites such as non-
governmental organisations, governmental databases and university repositories. Typically
between 30 (Pullin et al. 2013) and 70 (Haddaway et al. 2014) individual web sites are searched.
Systematic reviews often complement these manual searches using web-based search engines;
both general (e.g. Google) and academic (e.g. Google Scholar). Not only are searches of this
number of resources time-consuming, but they are also typically undertaken in a very non-
transparent manner: excluded articles are rarely recorded and searches are not readily updatable
or repeatable. Furthermore, included resources must be listed individually by hand in any
documentation of search activities, whilst search results from academic databases, such as Web
of Science, can be downloaded as full citations.
TGJ Volume 11, Number 3 2015 Haddaway
187
Web Scraping Software: a potential solution
Data scraping is a term used to describe the extraction of data from an electronic file using a
computer program. Web scraping describes the use of a program to extract data from HTML files
on the internet. Typically this data is in the form of patterned data, particularly lists or tables.
Programs that interact with web pages and extract data use sets of commands known as
application programming interfaces (APIs). These APIs can be ‘taught’ to extract patterned data
from single web pages or from all similar pages across an entire web site. Alternatively,
automated interactions with websites can be built into APIs, such that links within a page can be
‘clicked’ and data extracted from subsequent pages. This is particularly useful for extracting data
from multiple pages of search results. Furthermore, this interactivity allows users to automate
the use of websites’ search facilities, extracting data from multiple pages of search results and
only requiring users to input search terms rather than having to navigate to and search each web
site first.
One major current use for web scraping is for businesses to track pricing activities of their
competitors: pricing can be established across an entire site in relatively short time scales and
with minimal manual effort. Various other commercial drivers have caused a large number and
variety of web scraping programs to have been developed in recent years (see Table 1). Some of
these programs are free, whilst others are purely commercial and charge a one off or regular
subscription fee.
These web scraping tools are equally as useful in the research realm. Specifically, they can
provide valuable opportunities in the search for grey literature, by: i) making searches of multiple
websites more resource-efficient; ii) drastically increasing transparency in search activities; and
iii) allowing researches to share trained APIs for specific websites, further increasing resource-
efficiency.
A further benefit of web scraping APIs relates to their use with traditional academic databases,
such as Web of Science. Whilst citations, including abstracts, are readily extractable from most
academic databases, many databases hold more useful information that is not readily exportable,
for example corresponding author information. Web scraping tools can be used to extract this
information from search results, allowing researchers to assemble contact lists that may prove
particularly useful in requests for additional data, calls for submission of evidence, or invitations
to take part in surveys, for example.
Table 1. List of Major web scraping tools
(adapted from http://scraping.pro/software-for-web-scraping/). Prices were accurate at the time of publication.
Platform Description
a
Cost
b
URL
Import.io “Instantly Turn Web Pages into Data. No Plugin, No
Training, No Setup. Create custom APIs or crawl entire
websites using our desktop app - no coding required!“
Free (charge for
premium service)
www.import.io
DataToolBar The Data Toolbar is an intuitive web scraping tool that
automates web data extraction process for your browser.
Simply point to the data fields you want to collect and the
tool does the rest for you.”
Free to try (no
export facility),
$24 for full
version
www.datatoolbar.com
Visual Web
Ripper
“Visual Web Ripper is a powerful visual tool used for
automated web scraping, web harvesting and content
extraction from the web. Our data extraction software can
automatically walk through whole web sites and collect
complete content structures such as product catalogs or
search results.”
$299 (including 1
year of
maintenance)
www.visualwebripper.co
m
Helium Scraper “Extract data from any website. Choose what to extract
with a few clicks. Create your own actions. Export
extracted data to a variery [sic] of file formats.
Basic version $99
to Enterprise
version $699
www.heliumscraper.com
OutWit Hub “OutWit Hub breaks down Web pages into their different
constituents. Navigating from page to page automatically,
it extracts information elements and organizes them into
usable collections.”
Lite version free,
Pro version at
$89.90
www.outwit.com
Screen Scraper “Screen Scraper automates copying text from a web page,
clicking links, entering data into forms and submitting
them, iterating through search results pages, downloading
files (PDF, MS Word, images, etc.).”
Basic edition free,
commercial
versions from
$549 to $2,799
www.screen-scraper.com
Web Content
Extractor
“Web Context Extractor is a professional web data
extraction software designed not only to perform the most
of [sic] dull operations automatically, but also to greatly
$89 www.newprosoft.com/w
eb-content-extractor.htm
TGJ Volume 11, Number 3 2015 Haddaway
188
aDescriptions are taken from product web sites (31/01/2015)
bCosts correct at time of publication (2015)
increase productivity and effectiveness of the web data
scraping process. Web Content Extractor is highly accurate
and efficient for extracting data from websites.”
Kimono “Kimono lets you turn websites into APIs in seconds. You
don’t need to write any code or install any software to
extract data with Kimono. The easiest way to use Kimono
is to add our bookmarklet to your browser’s bookmark
bar. Then go to the website you want to get data from and
click the bookmarklet. Select the data you want and
Kimono does the rest.”
Free with
additional
features costing
up to $180
www.kimonolabs.com
FMiner “FMiner is a software for web scraping, web data
extraction, screen scraping, web harvesting, web crawling
and web macro support for windows and Mac OS X. It is an
easy to use web data extraction tool that combines best-
in-class features with an intuitive visual project design
tool, to make your next data mining project a breeze.”
Free 15 day trial,
$168-£248
www.fminer.com
Data Extractor
by Mozenda
“The Mozenda data scraper tool is very basic; all you have
to do is use the program to scrape up information you
need off of [sic] websites without all the tiring work of
searching websites one by one. Whether you are working
for the government such as a police officer or a detective,
in the medical field, or even a large business or
entrepreneur, website scraping is fast, easy and
affordable, plus it saves you or your employees a ton of
stressful work and time; use Mozenda’s data scraper and
let the program do all the hard work for you.”
Free trial (500
page credits), $99
to $199 per
month
www.mozenda.com/data
-extractor
WebHarvy
Data Extractor
Tool
“WebHarvy is a visual web scraper. There is absolutely no
need to write any scripts or code to scrape data. You will
be using WebHarvy's in-built browser to navigate web
pages. You can select the data to be scraped with mouse
clicks.”
$99 - $399 www.webharvy.com
Web Data
Extractor
“Web Data Extractor [is] a powerful and easy-to-use
application which helps you automatically extract specific
information from web pages which is necessary in your
day-to-day internet / email marketing or SEO activities.
Extract targeted company contact data (email, phone, fax)
from web for responsible b2b communication. Extract url,
meta tag (title, desc, keyword) for website promotion,
search directory creation, web research.”
$89 - $199 www.webextractor.com
Easy Web
Extractor
“An easy-to-use tool for web scrape solutions (web data
extracting, screen scraping) to scrape desired web content
(text, url, image, html) from web pages just by few screen
clicks. No programing required.”
$69.99 (with in-
app upgrades)
www.webextract.net
WebSundew “WebSundew is a powerful web scraping tool that extracts
data from the web pages with high productivity and speed.
WebSundew enables users to automate the whole process
of extracting and storing information from the web sites.
You can capture large quantities of bad-structured data in
minutes at any time in any place and save results in any
format. Our customers use WebSundew to collect and
analyze the wide range of data that exists on the Internet
related to their industry.”
$69 - $2,495 www.websundew.com
Handy Web
Extractor
Handy Web Extractor is a simple tool for everyday web
content monitoring. It will periodically download the web
page, extract the necessary content and display it in the
window on your desktop. One may consider it as the data
extraction software, taking its own nitch [sic] in the
scraping software and plugins.”
Free www.scraping.pro/handy
-web-extractor
TGJ Volume 11, Number 3 2015 Haddaway
189
Figure 1 shows a screen shot of one web scraping program being used to establish an API for an
automated search for grey literature from the website of the US Environmental Protection
Agency. This particular web-scraping platform is in the form of a downloadable, desktop-based
program; a web browser. The browser is then used to visit and train APIs by identifying rows and
columns in the patterned data: in practice rows will typically be search records, whilst columns
will be different aspects of the patterned data, such as titles, authors, sources, publication dates,
descriptions, etc. Detailed methods for the use of web scrapers are available elsewhere
(Haddaway et al. in press). In this way, citation-like information can be extracted for search
results according to the level of detail provided by the website. In addition to extracting search
results, as described above, static lists and individual, similarly patterned pages can also be
extracted. Furthermore, active links can be maintained, allowing the user to examine linked
information directly from the extracted database.
Figure 1. Screenshot of web scraping software being used to train an API for searching for grey
literature on the Environmental Protection Agency website. Program used is Import.io.
Just as search results from organisational websites can be extracted as citations, as described
above, search results from web-based search engines can be extracted and downloaded into
databases of quasi-citations. Microsoft Academic Search
(http://academic.research.microsoft.com) results can be extracted in this way, and in fact a pre-
trained API is available from Microsoft for extracting data from search results automatically
(http://academic.research.microsoft.com/about/Microsoft%20Academic%20Search%20API%20U
ser%20Manual.pdf).
Perhaps a more comprehensive alternative to Microsoft Academic Search is Google Scholar
(http://scholar.google.com). Google Scholar, however, does not support the use of bots
(automated attempts to access the Google Scholar server), and repeated querying of the server
by a single IP address (approximately 180 queries or citation extractions in succession) can results
in a an IP address being blocked for an extended period (approximately 48-72 hours) (personal
observation)1. Whilst it is understandable that automated traffic could be a substantial problem
for Google Scholar, automation of activities that would otherwise be laboriously undertaken by
hand is arguably of great value to researchers with limited resources. Thus, a potential work-
around involves the scraping of locally saved search results HTML pages after they had been
downloaded individually or in bulk (this may still constitute an infringement of the Google Scholar
conditions of use, however). A further cautionary note relates to demands on the servers that
host the web sites being scraped. Scraping a significant volume of pages from one site or scraping
multiple pages in a short period of time can put significant strain on smaller servers. However,
the level of scraping necessary to extract 100s to 1,000s of search results is unlikely to have
detrimental impacts on server functionality.
1Details of Google Scholar’s acceptable use policy are available from the following web page:
https://scholar.google.co.uk/intl/en/scholar/about.html.
TGJ Volume 11, Number 3 2015 Haddaway
190
Systematic reviewers must download hundreds or thousands of search results for later screening
from a suite of different databases. At present, Google Scholar is only cursorily searched in most
reviews (i.e. by examining the first 50 search results). The addition of Google Scholar as a
resource for finding additional academic and grey literature has been demonstrated to be useful
for systematic reviews (Haddaway et al. in press). Automating searches and transparency
documenting the results would increase transparency and comprehensiveness of the reviews
with a highly resource-efficient activity at little additional effort for reviewers. These implications
apply equally to other situations where web-based searching is beneficial but potentially time-
consuming.
Web scrapers are an attractive technological development in the field of grey literature. The
availability of a wide range of free and low-cost web scraping software provides an opportunity
for significant benefits to those with limited resources, particularly researchers working alone or
small organisations. Future developments will make use of the software even easier; for example
the one click, automatic training provided by Import.io (https://magic.import.io). Web scrapers
can increase resource-efficiency and drastically improve transparency, and existing networks can
benefit through readily sharable trained APIs. Furthermore, many programs can be easily used by
those with minimal or no skill or prior knowledge of this form of information technology.
Researchers could benefit substantially by investigating the applicability of web scraping to their
own work.
Acknowledgments
The author wishes to thank MISTRA EviEM for support during the preparation of this manuscript.
References
Allen C, Richmond K (2011) The Cochrane Collaboration: International activity within Cochrane Review Groups in the first decade of
the twenty-first century. Journal of Evidence-Based Medicine 4(1):2–7
Bernes, C., Carpenter, S.R., Gårdmark, A., Larsson, P., Persson, L., Skov, C., Speed, J.D.M., Van Donk, E. (2015). What is the influence
of a reduction of planktivorous and benthivorous fish on water quality in temperate eutrophic lakes? A systematic review.
Environmental Evidence
Collaboration for Environmental Evidence. 2013. Guidelines for systematic review and evidence synthesis in environmental
management. Version 4.2. Environmental Evidence. Available from
www.environmentalevidence.org/Documents/Guidelines/Guidelines4.2.pdf (accessed January 2014).
Gurevitch, J. and Hedges, L.V. 1999. Statistical issues in ecological meta-analyses. Ecol. 80: 1142-1149.
Haddaway, N.R. and Bayliss, H.R. 2015. Shades of grey: two forms of grey literature important for conservation reviews. In press
Haddaway, N. R., Burden, A., Evans, C., Healey, J. R., Jones, D. L., Dalrymple, S. E., Pullin, A. S. (2014) Evaluating effects of land
management on greenhouse gas fluxes and carbon balances in boreo-temperate lowland peatland systems. Environmental
Evidence, 3:5.
Haddaway, N.R., Collins, A.M., Coughlin, D., Kirk, S. (2015) The role of Google Scholar in academic searching and its applicability to
grey literature searching. PLOS ONE, in press.
Kicinski, M., Springate, D. A., Kontopantelis, E. (2015). Publication bias in metaanalyses from the Cochrane Database of
Systematic Reviews. Statistics in Medicine, 34: 2781-2793.
Lortie, C.J. 2014. Formalized synthesis opportunities for ecology: systematic reviews and meta-analyses. Oikos 123: 897-902.
Pullin, A. S., Bangpan, M., Dalrymple, S. E., Dickson, K., Haddaway, N. R., Healey, J. R., Hauari, H., Hockley, N., Jones, J. P. G, Knight,
T., Vigurs, C., Oliver, S. (2013) Human well-being impacts of terrestrial protected areas. Environmental Evidence, 2:19.
Walker, D., G. Bergh, E. Page, and M. Duvendack. 2013. Adapting a Systematic Review for Social Research in International
Development: A Case Study from the Child Protection Sector. London: ODI.
... Extracting and processing information from these websites, commonly known as web crawling, scraping, website data extracting, and vectorization, are key techniques within the broader scope of website processing and web mining [3][4][5][6]. Those methods enable the automated retrieval of data from web pages and the conversion of unstructured web data into structured formats for storage and processing, making them extremely valuable for various applications such as data analysis, market research, and content aggregation [5,7,8]. Organizations can utilize those methods to gain insights, monitor trends, and make data-driven decisions. ...
... Furthermore, services like WHOIS, 5 SimilarWeb, 6 and the Google Search 7 Index do not provide free access through APIs, offering only HTML responses instead. These responses require parsing and analysis, often involving advanced techniques like rendering HTML documents with tools such as Selenium, 8 which adds complexity and demands specialized knowledge for effective data collection. ...
Article
Full-text available
Web2Vec is a Python library designed to simplify website analysis by converting websites into vector representations through feature extraction from their content and structure. Utilizing Scrapy-based web crawlers, it automates data collection and supports both single-page analysis and large-scale crawling. This flexibility allows users to adapt the library to their specific needs, whether for quick, focused analysis or systematic data collection. Integrating over 200 website parameters into a single, easy-to-use framework, Web2Vec simplifies analytical tasks, making it a valuable resource across various fields. By serving as a centralized code repository for researchers, it eliminates the need to repeatedly implement similar code, providing an all-in-one integrator to streamline workflows and save time.
... EE Quantitative Network Analysis Method (adapted fromGueneau, Chabaud and Chalus-sauvannet, 2021) In line with QGT(Dehmer, Emmert-Streib, and Shi, 2017), we used big data from online media(Von Bloh et al., 2019), along with webscraping techniques(Haddaway, 2015;Zhao, 2017;Bradley and James, 2019). To start, we collected a list of 251 actors in Lyon's EE from a local institution (Figure 2,Step 1). ...
... Web scraping was used in the stock market to present a visualization of how prices change over time [31]. ...
Preprint
Full-text available
In this paper, the Customer Reviews Analysis of Products, collected from Tunisian online retail platforms, is performed by collecting the data with the help of the web scraping method. Approaches comprise a language detection process that will handle the reviews in any language, followed by sentiment analysis, which in turn forms the polarity of comments: positive, negative, or neutral. The classification of opinions has been done using some machine learning algorithms: Support Vector Machines, Decision Trees, Random Forests, and Naive Bayes. Besides, in this paper, the Tf-Idf method is applied to represent and classify the texts. This research will give some effective insights into improving customer satisfaction and optimizing business strategy regarding the context of Tunisian e-commerce.
... For high volumes of data, complex networks of literature or long timespans of analysis this method can be tedious and a nonefficient process [2]. The second one is an automated program that extracts patterned data and speeds everything a normal user can do in website: do queries, search for specific keywords or data, request it, parse it and save it [4], [5] but without the time consuming manual iteration. The third one is Application Programming Interfaces (APIs, from now on). ...
... Such grey literatures are not shown in Scopus database, leading to an underestimation of actual research effort. Therefore, further studies should collaborate more with local researchers from different countries and include grey literature as an additional data source to provide a more complete picture of research effort (Haddaway, 2015;Haddaway and Bayliss, 2015;Bickley et al., 2020). Second, we used the current number of publications and IUCN threatened status to represent research effort and extinction risk, respectively, while the number of publications and IUCN threatened status will change over time, which needs further exploration. ...
... Data yang digunakan, diperoleh dari Google Maps Reviews yang memiliki fitur memberikan dan menampilkan rating dan rasio serta teks ulasan pada suatu tempat yang terdeteksi pada Google Map (Khofifah, Rahayu & Yusuf, 2022). Web Scraping mengekstrak data dari file HTML yang ada di internet dimana data berupa patterned data khususnya berupa daftar atau tabel (Haddaway, 2016). Proses web scraping mengambil data pada situs menggunakan bantuan tools dan diekstraksi lalu disimpan dengan format comma-separated values(csv). ...
Article
Full-text available
Peran rumah sakit dalam kehidupan masyarakat sangatlah penting terkait tingkat kepuasan Masyarakat terhadap pelayanan, fasilitas, dan aspek lainnya. Opini dan penilaian masyarakat turut menjadi penilaian terhadap kinerja pelayanan rumah sakit. Pada Google Maps Reviews banyak ulasan dari berbagai rumah sakit.Penilaian yang sangat besar dapat kita lihat pada Google Maps Reviews akan memakan waktu bagi masyarakat. Keluhan-keluhan Masyarakat disekitar penulis terhadap pelayanan rumah sakit di Malang menjadikan penilaian pelayanan rumah sakit di Malang menjadi objek dari penelitian dasar ini. Penulis memanfaatkan algoritma Naïve Bayes Classifier dan Cross Validation untuk mengkategorikan penilaian berdasarkan sentimen positif dan negatif serta aspek agar mempermudah pengkategorian. Aspek yang dipergunakan tersebut adalah aspek penanganan, fasilitas, administrasi, dan biaya. Penulis juga menggunakan analisis Root Cause untuk mempermudah masyarakat dan pihak terkait dalam menemukan masalah dan rekomendasi pemecahan masalah. Awalnya data di proses dengan text preprocessing lalu pembobotan kata TF-IDF, pelabelan data, penerapan algoritma Naïve Bayes Classifier dan mengambil sentimen negatif untuk menentukan Root Cause. Hasil pengujian dengan menggunakan Cross Validation dengan fold k-9 memiliki nilai accuracy 82,97% , precision sebesar 83,13%, recall 82,93%, dan f-measure sebesar 82,92%. Hasil uji dengan menggunakan 20% data tes diperoleh akurasi 90%. Abstract The role of hospitals in society is crucial in terms of the level of satisfaction that the community derives from their services, facilities, and other aspects. Public opinions and assessments also contribute to evaluating hospital service performance. On Google Maps Reviews, there are numerous reviews from various hospitals. A significant evaluation can be observed on Google Maps Reviews, which might take time for the community. The complaints of the community around the writer regarding the hospital services in Malang make the assessment of hospital services in Malang the subject of this basic research. The author utilizes the Naive Bayes Classifier algorithm and Cross Validation to categorize assessments based on positive and negative sentiments, as well as 4 aspects to facilitate categorization. The author also employs Root Cause analysis to aid the public and relevant parties in identifying issues and providing problem-solving recommendations. After processing the data through text preprocessing and TF-IDF word weighting, data labeling, applying the Naive Bayes Classifier algorithm, and extracting negative sentiments to determine the Root Cause in negative hospital sentiments. Based on this process, applying Cross Validation with k-9 folds yields the highest values: an accuracy of 82.97%, precision of 83.13%, recall of 82.93%, and an f-measure of 82.92%. Through the sentiment classification and Cross Validation process, the accuracy results in 90% for hospital reviews with the highest number of assessments divided into 2 sentiments and 4 aspects: positive and negative sentiments, as well as treatment, facilities, administration, and costs.
... The basis of these methods is web crawling and web scraping techniques. Web crawling and web scraping are the primary techniques that can be used as a basis for developing systems that will perform literature searches autonomously (Haddaway, 2015). Bots that are coded to scan the entire World Wide Web (WWW) environment to search the web and obtain the associated site addresses are defined as web crawler bots and bots that are used to extract meaningful information from the obtained site addresses are defined as web scraper bots. ...
Article
Full-text available
Detailed literature search and writing is very important for the success of long research projects, publications and theses. Search engines provide significant convenience in research processes. However, conducting a comprehensive and systematic research on the web requires a long working process. In order to make literature searches effective, simple and comprehensive, various libraries and development tools have been created and made available. By using these development tools, research processes that may take days can be reduced to hours or even minutes. Literature review is not only necessary for academic studies, but it is a process that should be used and performed in every field where new approaches are adopted. Literature review is a process that gives us important ideas about whether similar studies have been conducted before, which methods have been used before and what has not been addressed in previous studies. It is also of great importance in terms of preventing possible copyright problems in future studies. The main purpose of this study is to propose an application that will facilitate, speed up and increase the efficiency of literature searches. In existing systems, literature searches are performed by browsing search sites or various article sites one by one and using the search tools provided by these sites. It is simple to use, allows the entire World Wide Web environment to be searched, and provides the user with the search findings. In this study, we have implemented an application that allows the crawling of the entire World Wide Web environment, is very simple to use, and quickly presents the crawl findings to the user.
... Then, data scraping was used to extract the 'Quantitative Elements' details from the downloaded websites. Data scraping involves the extraction of useful data from a given electronic file using a software program [55]. Methods currently used for data extraction mostly involve heuristics to extract certain features of the document, for example, the number of 'Hyperlinks' present, the page's text 'Density', etc. [56]. ...
Article
Full-text available
This paper addresses the challenge of objectively determining a website’s personality by developing a methodology based on automated quantitative analysis, thus avoiding the biases inherent in human surveys. Utilizing a database of 3000 websites, data extraction tools gather relevant data, which are then analyzed using Artificial Intelligence (AI) techniques, including machine learning (ML) and natural language processing. Four ML algorithms—K-means, Expectation Maximization, Hierarchical Agglomerative Clustering, and DBSCAN—are implemented to assess and classify website personality traits. Each algorithm’s strengths and weaknesses are evaluated in terms of data organization, cluster flexibility, and handling of outliers. A software tool is developed to facilitate the research process, from database creation and data extraction to ML application and results analysis. Experimental validation, conducted with identical training and testing datasets, achieves a success rate of up to 94% (with an Error of ≤50%) in accurately identifying website personality, which is validated by subsequent surveys. The research highlights significant relationships between website attributes and personality traits, offering practical applications for website developers. For instance, developers can use these insights to design websites that align with business goals, enhance customer engagement, and foster brand loyalty. Additionally, the methodology can be applied to creating culturally resonant websites, thus supporting New Zealand’s cultural initiatives and promoting cross-cultural understanding. This research lays the groundwork for future studies and has broad applicability across various domains, demonstrating the potential for automated, unbiased website personality classification.
Article
Full-text available
The strong communication strategies for inclusive education policies are intended to offer equality across the board for access to education for all students, including those with special needs. This systematic review used ROSES (Reporting Standards for Systematic Evidence Syntheses) methodology to investigate the use of communication strategies for inclusive education by countries in Asia. The articles were based exclusively on these three databases in conjunction with Google Scholar: Scopus, ProQuest, and ERIC, subjected to a series of screening, data collection, and evaluation of quality processes as guided by ROSES protocols. Criteria for inclusion included research on events on communication strategies regarding policy concerning inclusive education in Asia. Themes that emerged included the negotiation of partnerships, channels for communication, and cultural sensitivity. The paper identifies communication strategies that are culturally and contextually relevant towards the successful implementation of inclusive education policies. This will add to the body of knowledge in practice and provide directions for future research, policy development, and practical implementation.
Article
Full-text available
Google Scholar (GS), a commonly used web-based academic search engine, catalogues between 2 and 100 million records of both academic and grey literature (articles not formally published by commercial academic publishers). Google Scholar collates results from across the internet and is free to use. As a result it has received considerable attention as a method for searching for literature, particularly in searches for grey literature, as required by systematic reviews. The reliance on GS as a standalone resource has been greatly debated, however, and its efficacy in grey literature searching has not yet been investigated. Using systematic review case studies from environmental science, we investigated the utility of GS in systematic reviews and in searches for grey literature. Our findings show that GS results contain moderate amounts of grey literature, with the majority found on average at page 80. We also found that, when searched for specifically, the majority of literature identified using Web of Science was also found using GS. However, our findings showed moderate/poor overlap in results when similar search strings were used in Web of Science and GS (10–67%), and that GS missed some important literature in five of six case studies. Furthermore, a general GS search failed to find any grey literature from a case study that involved manual searching of organisations' websites. If used in systematic reviews for grey literature, we recommend that searches of article titles focus on the first 200 to 300 results. We conclude that whilst Google Scholar can find much grey literature and specific, known studies, it should not be used alone for systematic review searches. Rather, it forms a powerful addition to other traditional search methods. In addition, we advocate the use of tools to transparently document and catalogue GS search results to maintain high levels of transparency and the ability to be updated, critical to systematic reviews.
Article
Full-text available
We used a Bayesian hierarchical selection model to study publication bias in 1106 meta-analyses from the Cochrane Database of Systematic Reviews comparing treatment with either placebo or no treatment. For meta-analyses of efficacy, we estimated the ratio of the probability of including statistically significant outcomes favoring treatment to the probability of including other outcomes. For meta-analyses of safety, we estimated the ratio of the probability of including results showing no evidence of adverse effects to the probability of including results demonstrating the presence of adverse effects. Results: in the meta-analyses of efficacy, outcomes favoring treatment had on average a 27% (95% Credible Interval (CI): 18% to 36%) higher probability to be included than other outcomes. In the meta-analyses of safety, results showing no evidence of adverse effects were on average 78% (95% CI: 51% to 113%) more likely to be included than results demonstrating that adverse effects existed. In general, the amount of over-representation of findings favorable to treatment was larger in meta-analyses including older studies. Conclusions: in the largest study on publication bias in meta-analyses to date, we found evidence of publication bias in Cochrane systematic reviews. In general, publication bias is smaller in meta-analyses of more recent studies, indicating their better reliability and supporting the effectiveness of the measures used to reduce publication bias in clinical trials. Our results indicate the need to apply currently underutilized meta-analysis tools handling publication bias based on the statistical significance, especially when studies included in a meta-analysis are not recent. Copyright © 2015 John Wiley & Sons, Ltd.
Article
Full-text available
Background Peatlands cover 2 to 5 percent of the global land area, while storing 30 and 50 percent of all global soil carbon (C). Peatlands constitute a substantial sink of atmospheric carbon dioxide (CO2) via photosynthesis and organic matter accumulation, but also release methane (CH4),nitrous oxide (N2O), and CO2 through respiration, all of which are powerful greenhouse gases (GHGs). Lowland peats in boreo-temperate regions may store substantial amounts of C and are subject to disproportionately high land-use pressure. Whilst evidence on the impacts of different land management practices on C cycling and GHG fluxes in lowland peats does exist, these data have yet to be synthesised. Here we report on the results of a Collaboration for Environmental Evidence (CEE) systematic review of this evidence. Methods Evidence was collated through searches of literature databases, search engines, and organisational websites using tested search strings. Screening was performed on titles, abstracts and full texts using established inclusion criteria for population, intervention/exposure, comparator, and outcome key elements. Remaining relevant full texts were critically appraised and data extracted according to pre-defined strategies. Meta-analysis was performed where sufficient data were reported. Results Over 26,000 articles were identified from searches, and screening of obtainable full texts resulted in the inclusion of 93 relevant articles (110 independent studies). Critical appraisal excluded 39 studies, leaving 71 to proceed to synthesis. Results indicate that drainage increases the N2O emission and the ecosystem respiration of CO2, but decreases CH4 emission. Secondly, naturally drier peats release more N2O than wetter soils. Finally, restoration increases the CH4 release. Insufficient studies reported C cycling, preventing quantitative synthesis. No significant effect was identified in meta-analyses of the impact of drainage and restoration on DOC concentration. Conclusions Consistent patterns in C concentration and GHG release across the evidence-base may exist for certain land management practices: drainage increases N2O production and CO2 from respiration; drier peats release more N2O than wetter counterparts; and restoration increases CH4 emission. We identify several problems with the evidence-base; experimental design is often inconsistent between intervention and control samples, pseudoreplication is extremely common, and variability measures are often unreported.
Article
Full-text available
Background Establishing Protected Areas (PAs) is among the most common conservation interventions. Protecting areas from the threats posed by human activity will by definition inhibit some human actions. However, adverse impacts could be balanced by maintaining ecosystem services or introducing new livelihood options. Consequently there is an ongoing debate on whether the net impact of PAs on human well-being at local or regional scales is positive or negative. We report here on a systematic review of evidence for impacts on human well-being arising from the establishment and maintenance of terrestrial PAs. Methods Following an a priori protocol, systematic searches were conducted for evidence of impacts of PAs post 1992. After article title screening, the review was divided into two separate processes; a qualitative synthesis of explanations and meaning of impact and a review of quantitative evidence of impact. Abstracts and full texts were assessed using inclusion criteria and conceptual models of potential impacts. Relevant studies were critically appraised and data extracted and sorted according to type of impact reported. No quantitative synthesis was possible with the evidence available. Two narrative syntheses were produced and their outputs compared in a metasynthesis. Results The qualitative evidence review mapped 306 articles and synthesised 34 that were scored as high quality. The quantitative evidence review critically appraised 79 studies and included 14 of low/medium susceptibility to bias. The meta-synthesis reveals that a range of factors can lead to reports of positive and negative impacts of PA establishment, and therefore might enable hypothesis generation regarding cause and effect relationships, but resulting hypotheses cannot be tested with the current available evidence. Conclusions The evidence base provides a range of possible pathways of impact, both positive and negative, of PAs on human well-being but provides very little support for decision making on how to maximise positive impacts. The nature of the research reported to date forms a diverse and fragmented body of evidence unsuitable for the purpose of informing policy formation on how to achieve win-win outcomes for biodiversity and human well-being. To better assess the impacts of PAs on human well-being we make recommendations for improving research study design and reporting.
Data
Full-text available
Background In lakes that have become eutrophic due to sewage discharges or nutrient runoff from land, problems such as algal blooms and oxygen deficiency often persist even when nutrient supplies have been reduced. One reason is that phosphorus stored in the sediments can exchange with the water. There are indications that the high abundance of phytoplankton, turbid water and lack of submerged vegetation seen in many eutrophic lakes may represent a semi-stable state. For that reason, a shift back to more natural clear-water conditions could be difficult to achieve. In some cases, though, temporary mitigation of eutrophication-related problems has been accomplished through biomanipulation: stocks of zooplanktivorous fish have been reduced by intensive fishing, leading to increased populations of phytoplankton-feeding zooplankton. Moreover, reduction of benthivorous fish may result in lower phosphorus fluxes from the sediments. An alternative to reducing the dominance of planktivores and benthivores by fishing is to stock lakes with piscivorous fish. These two approaches have often been used in combination. The implementation of the EU Water Framework Directive has recently led to more stringent demands for measures against eutrophication, and a systematic review could clarify whether biomanipulation is efficient as a measure of that kind. Methods The review will examine primary field studies of how large-scale biomanipulation has affected water quality and community structure in eutrophic lakes or reservoirs in temperate regions. Such studies can be based on comparison between conditions before and after manipulation, on comparison between treated and non-treated water bodies, or both. Relevant outcomes include Secchi depth, concentrations of oxygen, nutrients, suspended solids and chlorophyll, abundance and composition of phytoplankton, zooplankton and fish, and coverage of submerged macrophytes.
Article
Full-text available
Meta-analysis is the use of statistical methods to summarize research findings across studies. Special statistical methods are usually needed for meta-analysis, both because effect-size indexes are typically highly heteroscedastic and because it is desirable to be able to distinguish between-study variance from within-study sampling-error variance. We outline a number of considerations related to choosing methods for the meta-analysis of ecological data, including the choice of parametric vs. resampling methods, reasons for conducting weighted analyses where possible, and comparisons fixed vs. mixed models in categorical and regression-type analyses.
Article
Narrative reviews are dead. Long live systematic reviews (and meta-analyses). Synthesis in many forms is now a driving force in ecology. Advances in open data for ecology and new tools provide vastly improved capacity for novel, emergent knowledge synthesis in our discipline. Systematic reviews and meta-analyses are two formal synthesis opportunities for ecologists that are now accepted as traditional publications, but the scope of validated syntheses will continue to expand. To date, systematic reviews are rarely used whilst the rate of meta-analyses published in ecological journals is increasing exponentially. Systematic reviews provide an overview of the literature landscape for a topic, and meta-analyses examine the strength of evidence integrated across different studies. Effective synthesis benefits from both approaches, but better data reporting and additional advances in the culture of sharing data, code, analytics, workflows, methods and also ideas will further energize these efforts. At this junction, synthetic efforts that include systematic reviews and meta-analyses should continue as stand-alone publications. This is a necessary step in the evolution of synthesis in our discipline. Nonetheless, they are still evolving tools, and meta-analyses in particular are simply an extended set of statistical tests. Admittedly, understanding the statistics and assumptions influence how we conduct synthesis much as statistical choices often shape experimental design, i.e. ANOVA versus regression-based experiments, but statistics do not make the paper. Current steps – primary research articles need to more effectively report evidence, sharing scientific products should expand, systematic reviews should be used to identify research gaps/delineate literature landscapes, and meta-analyses should be used to examine evidence patterns to further predictive ecology.
Article
The Cochrane Collaboration (www.cochrane.org) is the world's largest organisation dedicated to preparing, maintaining and promoting the accessibility of systematic reviews of the effects of healthcare interventions. It is an international organisation with participants in more than 100 countries, principally focused around the Cochrane Review Groups that are responsible for the preparation and maintenance of Cochrane reviews. Since 2000, a periodic audit has been done to count the number of active members in the Cochrane Review Groups, subdivided by the countries in which these people are based. At the beginning of 2010, there were almost 28,000 people involved, an increase from about 5500 in 2000. The growth of activity has been dramatic, and especially large for authors of Cochrane reviews and protocols. In the year 2000, 2840 people were listed as authors by the Cochrane Review Groups. At the beginning of 2010, this had risen to over 21,000 people.
Adapting a Systematic Review for Social Research in International Development: A Case Study from the Child Protection Sector
  • D Walker
  • G Bergh
  • E Page
  • M Duvendack
Walker, D., G. Bergh, E. Page, and M. Duvendack. 2013. Adapting a Systematic Review for Social Research in International Development: A Case Study from the Child Protection Sector. London: ODI.