The Use of Web-scraping Software in
Searching for Grey Literature
Neal R. Haddaway (Sweden)
Abstract
Searches for grey literature can require substantial resources to undertake but their inclusion is
vital for research activities such as systematic reviews. Web scraping, the extraction of patterned
data from web pages on the internet, has been developed in the private sector for business
purposes, but it offers substantial benefits to those searching for grey literature. By building and
sharing protocols that extract search results and other data from web pages, those looking for
grey literature can drastically increase their transparency and resource efficiency. Various options
exist in terms of web-scraping software and they are introduced herein.
The Challenge of Searching for Grey Literature
The editorial scrutiny and peer-review that form integral parts of commercial academic
publishing are useful in assuring reliability and standardised reporting in published research.
However, publication bias can cause an overestimation of effect sizes in syntheses of the
(commercially) published literature (Gurevitch and Hedges 1999; Lortie et al. 2007). In a recent
study by Kicinski et al. (2015), the largest analysis of publication bias in meta-analyses to date,
publication bias was detected across the Cochrane Library of systematic reviews, although there
was evidence that more recent research suffered to a lesser degree, thanks to mitigation
measures applied in medical research in recent decades.
Some applied subject areas, such as conservation biology, are particularly likely to be reported in sources other than academic journals, so-called practitioner-held data (Haddaway and Bayliss in press); for example, reports of the activities of non-governmental organisations. Such grey literature is vital for a range of research, policy and practical applications, particularly in informing policy decision-making. Documents produced by governments, businesses, non-governmental organisations and academics can provide a range of useful information, but are often overlooked in traditional meta-analyses and literature reviews.
Systematic reviews were established in the medical sciences to collate and synthesise research
on particular clinical interventions in a reliable, transparent and objective manner, and were a
response to the susceptibility to bias common to traditional literature reviews (Allen and
Richmond 2011). In the last decade systematic review methodology has been translated into a
range of other subjects, including social science (Walker et al. 2013) and environmental
management (CEE 2013). A key aspect of systematic review methodology is that searches are
undertaken for grey literature to mitigate possible publication bias and to include practitioner-
held data. These searches may fail to find any research that is ultimately included (e.g. Haddaway
et al. 2014), but it is important for the reliability and transparency of the review to demonstrate
that this is the case: other reviews have demonstrated significant proportions of grey literature in
the synthesised evidence base (Bernes et al. 2015).
Systematic review searches for grey literature can be particularly challenging and time-
consuming. No comprehensive database resources for grey literature exist in the environmental sciences, as is the case in many other disciplines, and so searches must include web-based search engines, specialist databases such as repositories for theses, organisational web sites such as those of non-governmental organisations, governmental databases and university repositories. Typically
between 30 (Pullin et al. 2013) and 70 (Haddaway et al. 2014) individual web sites are searched.
Systematic reviews often complement these manual searches with web-based search engines,
both general (e.g. Google) and academic (e.g. Google Scholar). Not only are searches of this
number of resources time-consuming, but they are also typically undertaken in a very non-
transparent manner: excluded articles are rarely recorded and searches are not readily updatable
or repeatable. Furthermore, included resources must be listed individually by hand in any
documentation of search activities, whilst search results from academic databases, such as Web
of Science, can be downloaded as full citations.
Web Scraping Software: A Potential Solution
Data scraping is a term used to describe the extraction of data from an electronic file using a
computer program. Web scraping describes the use of a program to extract data from HTML files
on the internet. Typically these data are patterned, taking the form of lists or tables.
Programs that interact with web pages and extract data use sets of commands known as
application programming interfaces (APIs). These APIs can be ‘taught’ to extract patterned data
from single web pages or from all similar pages across an entire web site. Alternatively,
automated interactions with websites can be built into APIs, such that links within a page can be
‘clicked’ and data extracted from subsequent pages. This is particularly useful for extracting data
from multiple pages of search results. Furthermore, this interactivity allows users to automate
the use of websites’ search facilities, extracting data from multiple pages of search results and
only requiring users to input search terms rather than having to navigate to and search each web
site first.
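To make this concrete, the short Python sketch below shows the kind of patterned extraction and automated pagination described above. It is an illustration only, not part of any particular web-scraping product: the URL, CSS selectors and 'next page' link are hypothetical placeholders that would need to be adapted to the structure of a real target site (and to its terms of use).

```python
# Minimal sketch of patterned data extraction from paginated search results.
# The URL and CSS selectors below are hypothetical placeholders; real sites
# require inspecting their HTML and respecting their terms of use.
import requests
from bs4 import BeautifulSoup

def scrape_search_results(start_url, max_pages=5):
    records = []
    url = start_url
    for _ in range(max_pages):
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Each 'row' of patterned data is assumed to be one search record.
        for result in soup.select("div.search-result"):
            title_tag = result.select_one("a.title")
            date_tag = result.select_one("span.date")
            records.append({
                "title": title_tag.get_text(strip=True) if title_tag else None,
                "link": title_tag["href"] if title_tag else None,
                "date": date_tag.get_text(strip=True) if date_tag else None,
            })
        # Follow the 'next page' link if one exists, mimicking a clicked link.
        next_link = soup.select_one("a.next-page")
        if next_link is None:
            break
        url = requests.compat.urljoin(url, next_link["href"])
    return records

if __name__ == "__main__":
    results = scrape_search_results("https://www.example.org/search?q=peatland")
    print(f"Extracted {len(results)} records")
```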
One major current use for web scraping is for businesses to track pricing activities of their
competitors: pricing can be established across an entire site in relatively short time scales and
with minimal manual effort. Various other commercial drivers have caused a large number and
variety of web scraping programs to have been developed in recent years (see Table 1). Some of
these programs are free, whilst others are purely commercial and charge a one off or regular
subscription fee.
These web scraping tools are equally useful in the research realm. Specifically, they can provide valuable opportunities in the search for grey literature by: i) making searches of multiple websites more resource-efficient; ii) drastically increasing transparency in search activities; and iii) allowing researchers to share trained APIs for specific websites, further increasing resource-efficiency.
A further benefit of web scraping APIs relates to their use with traditional academic databases,
such as Web of Science. Whilst citations, including abstracts, are readily extractable from most
academic databases, many databases hold more useful information that is not readily exportable,
for example corresponding author information. Web scraping tools can be used to extract this
information from search results, allowing researchers to assemble contact lists that may prove
particularly useful in requests for additional data, calls for submission of evidence, or invitations
to take part in surveys, for example.
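As a hedged illustration of this use, the following Python sketch harvests e-mail addresses from locally saved HTML record pages using a simple regular expression; the folder name and the assumption that addresses appear in the visible page text are illustrative only.

```python
# Sketch: collecting contact e-mail addresses from locally saved HTML record
# pages, e.g. detailed records exported or saved from a bibliographic database.
# The folder name and the presence of addresses in the page text are assumptions.
import re
from pathlib import Path
from bs4 import BeautifulSoup

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_emails(folder="saved_records"):
    contacts = set()
    for html_file in Path(folder).glob("*.html"):
        soup = BeautifulSoup(html_file.read_text(encoding="utf-8", errors="ignore"),
                             "html.parser")
        # Search the visible page text for anything that looks like an address.
        contacts.update(EMAIL_PATTERN.findall(soup.get_text(" ")))
    return sorted(contacts)

if __name__ == "__main__":
    for address in extract_emails():
        print(address)
```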
Table 1. Major web scraping tools (adapted from http://scraping.pro/software-for-web-scraping/). Descriptions are quoted from product web sites (31/01/2015); costs were correct at the time of publication (2015).

Import.io (www.import.io): "Instantly Turn Web Pages into Data. No Plugin, No Training, No Setup. Create custom APIs or crawl entire websites using our desktop app - no coding required!" Cost: free (charge for premium service).

DataToolBar (www.datatoolbar.com): "The Data Toolbar is an intuitive web scraping tool that automates web data extraction process for your browser. Simply point to the data fields you want to collect and the tool does the rest for you." Cost: free to try (no export facility); $24 for the full version.

Visual Web Ripper (www.visualwebripper.com): "Visual Web Ripper is a powerful visual tool used for automated web scraping, web harvesting and content extraction from the web. Our data extraction software can automatically walk through whole web sites and collect complete content structures such as product catalogs or search results." Cost: $299 (including 1 year of maintenance).

Helium Scraper (www.heliumscraper.com): "Extract data from any website. Choose what to extract with a few clicks. Create your own actions. Export extracted data to a variery [sic] of file formats." Cost: basic version $99 to Enterprise version $699.

OutWit Hub (www.outwit.com): "OutWit Hub breaks down Web pages into their different constituents. Navigating from page to page automatically, it extracts information elements and organizes them into usable collections." Cost: Lite version free; Pro version $89.90.

Screen Scraper (www.screen-scraper.com): "Screen Scraper automates copying text from a web page, clicking links, entering data into forms and submitting them, iterating through search results pages, downloading files (PDF, MS Word, images, etc.)." Cost: basic edition free; commercial versions from $549 to $2,799.

Web Content Extractor (www.newprosoft.com/web-content-extractor.htm): "Web Context Extractor is a professional web data extraction software designed not only to perform the most of [sic] dull operations automatically, but also to greatly increase productivity and effectiveness of the web data scraping process. Web Content Extractor is highly accurate and efficient for extracting data from websites." Cost: $89.

Kimono (www.kimonolabs.com): "Kimono lets you turn websites into APIs in seconds. You don't need to write any code or install any software to extract data with Kimono. The easiest way to use Kimono is to add our bookmarklet to your browser's bookmark bar. Then go to the website you want to get data from and click the bookmarklet. Select the data you want and Kimono does the rest." Cost: free, with additional features costing up to $180.

FMiner (www.fminer.com): "FMiner is a software for web scraping, web data extraction, screen scraping, web harvesting, web crawling and web macro support for windows and Mac OS X. It is an easy to use web data extraction tool that combines best-in-class features with an intuitive visual project design tool, to make your next data mining project a breeze." Cost: free 15 day trial; $168 to $248.

Data Extractor by Mozenda (www.mozenda.com/data-extractor): "The Mozenda data scraper tool is very basic; all you have to do is use the program to scrape up information you need off of [sic] websites without all the tiring work of searching websites one by one. Whether you are working for the government such as a police officer or a detective, in the medical field, or even a large business or entrepreneur, website scraping is fast, easy and affordable, plus it saves you or your employees a ton of stressful work and time; use Mozenda's data scraper and let the program do all the hard work for you." Cost: free trial (500 page credits); $99 to $199 per month.

WebHarvy Data Extractor Tool (www.webharvy.com): "WebHarvy is a visual web scraper. There is absolutely no need to write any scripts or code to scrape data. You will be using WebHarvy's in-built browser to navigate web pages. You can select the data to be scraped with mouse clicks." Cost: $99 to $399.

Web Data Extractor (www.webextractor.com): "Web Data Extractor [is] a powerful and easy-to-use application which helps you automatically extract specific information from web pages which is necessary in your day-to-day internet / email marketing or SEO activities. Extract targeted company contact data (email, phone, fax) from web for responsible b2b communication. Extract url, meta tag (title, desc, keyword) for website promotion, search directory creation, web research." Cost: $89 to $199.

Easy Web Extractor (www.webextract.net): "An easy-to-use tool for web scrape solutions (web data extracting, screen scraping) to scrape desired web content (text, url, image, html) from web pages just by few screen clicks. No programing required." Cost: $69.99 (with in-app upgrades).

WebSundew (www.websundew.com): "WebSundew is a powerful web scraping tool that extracts data from the web pages with high productivity and speed. WebSundew enables users to automate the whole process of extracting and storing information from the web sites. You can capture large quantities of bad-structured data in minutes at any time in any place and save results in any format. Our customers use WebSundew to collect and analyze the wide range of data that exists on the Internet related to their industry." Cost: $69 to $2,495.

Handy Web Extractor (www.scraping.pro/handy-web-extractor): "Handy Web Extractor is a simple tool for everyday web content monitoring. It will periodically download the web page, extract the necessary content and display it in the window on your desktop. One may consider it as the data extraction software, taking its own nitch [sic] in the scraping software and plugins." Cost: free.
Figure 1 shows a screen shot of one web scraping program being used to establish an API for an
automated search for grey literature from the website of the US Environmental Protection
Agency. This particular web-scraping platform takes the form of a downloadable, desktop-based program that acts as a web browser. This browser is then used to visit web pages and train APIs by identifying rows and
columns in the patterned data: in practice rows will typically be search records, whilst columns
will be different aspects of the patterned data, such as titles, authors, sources, publication dates,
descriptions, etc. Detailed methods for the use of web scrapers are available elsewhere
(Haddaway et al. in press). In this way, citation-like information can be extracted for search
results according to the level of detail provided by the website. In addition to extracting search
results, as described above, static lists and individual, similarly patterned pages can also be
extracted. Furthermore, active links can be maintained, allowing the user to examine linked
information directly from the extracted database.
Figure 1. Screenshot of web scraping software being used to train an API for searching for grey
literature on the Environmental Protection Agency website. Program used is Import.io.
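For readers who prefer code to a point-and-click interface, the sketch below is a rough Python analogue of the row-and-column 'training' shown in Figure 1: each search record becomes a row, and each patterned element (title, authors, source, date, description, link) becomes a column of a citation-like CSV file. The selectors and input are hypothetical and would differ for every website.

```python
# Code analogue of the row-and-column 'training' described above: each search
# record becomes a row and each patterned element becomes a column of a
# citation-like CSV file. The CSS selectors and input are hypothetical.
import csv
from bs4 import BeautifulSoup

COLUMNS = ["title", "authors", "source", "date", "description", "link"]

def _text(node):
    # Return stripped text for a matched element, or None if it is missing.
    return node.get_text(strip=True) if node is not None else None

def records_from_html(html_text):
    soup = BeautifulSoup(html_text, "html.parser")
    for row in soup.select("div.result"):            # one row per search record
        link_tag = row.select_one("h3 a")
        yield {
            "title": _text(row.select_one("h3")),
            "authors": _text(row.select_one(".authors")),
            "source": _text(row.select_one(".source")),
            "date": _text(row.select_one(".date")),
            "description": _text(row.select_one(".summary")),
            "link": link_tag["href"] if link_tag else None,   # keep the active link
        }

def export_citations(html_text, path="extracted_citations.csv"):
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(records_from_html(html_text))
```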
Just as search results from organisational websites can be extracted as citations, as described
above, search results from web-based search engines can be extracted and downloaded into
databases of quasi-citations. Microsoft Academic Search
(http://academic.research.microsoft.com) results can be extracted in this way, and in fact a pre-
trained API is available from Microsoft for extracting data from search results automatically
(http://academic.research.microsoft.com/about/Microsoft%20Academic%20Search%20API%20User%20Manual.pdf).
Perhaps a more comprehensive alternative to Microsoft Academic Search is Google Scholar
(http://scholar.google.com). Google Scholar, however, does not support the use of bots
(automated attempts to access the Google Scholar server), and repeated querying of the server
by a single IP address (approximately 180 queries or citation extractions in succession) can result in an IP address being blocked for an extended period (approximately 48-72 hours) (personal
observation)1. Whilst it is understandable that automated traffic could be a substantial problem
for Google Scholar, automation of activities that would otherwise be laboriously undertaken by
hand is arguably of great value to researchers with limited resources. Thus, a potential work-
around involves scraping locally saved HTML pages of search results after they have been
downloaded individually or in bulk (this may still constitute an infringement of the Google Scholar
conditions of use, however). A further cautionary note relates to demands on the servers that
host the web sites being scraped. Scraping a significant volume of pages from one site or scraping
multiple pages in a short period of time can put significant strain on smaller servers. However,
the level of scraping necessary to extract 100s to 1,000s of search results is unlikely to have
detrimental impacts on server functionality.
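A minimal Python sketch of this work-around is given below. It parses locally saved Google Scholar result pages rather than querying the live server; the CSS class names used (gs_ri, gs_rt, gs_a) reflect commonly reported Google Scholar markup at the time of writing and may have changed, and such scraping may still conflict with the service's conditions of use.

```python
# Sketch: parsing locally saved Google Scholar result pages (saved manually
# or in bulk via the browser) rather than querying the live server.
# The class names below (gs_ri, gs_rt, gs_a) reflect commonly reported
# Google Scholar markup and may have changed; inspect a saved page and
# adjust. This may still fall foul of the service's conditions of use.
from pathlib import Path
from bs4 import BeautifulSoup

def parse_saved_scholar_pages(folder="scholar_pages"):
    records = []
    for page in sorted(Path(folder).glob("*.html")):
        soup = BeautifulSoup(page.read_text(encoding="utf-8", errors="ignore"),
                             "html.parser")
        for hit in soup.select("div.gs_ri"):
            title = hit.select_one("h3.gs_rt")
            byline = hit.select_one("div.gs_a")
            records.append({
                "title": title.get_text(" ", strip=True) if title else None,
                "authors_source": byline.get_text(" ", strip=True) if byline else None,
                "file": page.name,     # record which saved page the hit came from
            })
    return records
```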
1Details of Google Scholar’s acceptable use policy are available from the following web page:
https://scholar.google.co.uk/intl/en/scholar/about.html.
Systematic reviewers must download hundreds or thousands of search results for later screening
from a suite of different databases. At present, Google Scholar is only cursorily searched in most
reviews (i.e. by examining the first 50 search results). The addition of Google Scholar as a
resource for finding additional academic and grey literature has been demonstrated to be useful
for systematic reviews (Haddaway et al. in press). Automating searches and transparently documenting the results would increase the transparency and comprehensiveness of reviews, a highly resource-efficient activity requiring little additional effort from reviewers. These implications
apply equally to other situations where web-based searching is beneficial but potentially time-
consuming.
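One simple way to document such automated searches transparently is to log the search metadata alongside the extracted records, as in the illustrative Python sketch below; the file layout and field names are an assumption, not a prescribed systematic review format.

```python
# Illustrative sketch: logging an automated search so that it can be
# documented, repeated and updated later. The CSV layout and field names
# are assumptions for illustration only.
import csv
from datetime import date

def document_search(records, search_string, resource, path="search_log.csv"):
    """Append one search event (metadata plus extracted records) to a log file."""
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # One metadata row per search, followed by one row per extracted record.
        writer.writerow(["resource", resource, "search string", search_string,
                         "date", date.today().isoformat(), "hits", len(records)])
        for record in records:
            writer.writerow(["", record.get("title"), record.get("link")])
```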
Web scrapers are an attractive technological development in the field of grey literature. The
availability of a wide range of free and low-cost web scraping software provides an opportunity
for significant benefits to those with limited resources, particularly researchers working alone or
small organisations. Future developments will make using the software even easier; for example, the one-click, automatic training provided by Import.io (https://magic.import.io). Web scrapers
can increase resource-efficiency and drastically improve transparency, and existing networks can
benefit through readily sharable trained APIs. Furthermore, many programs can be easily used by
those with minimal or no prior knowledge of this form of information technology.
Researchers could benefit substantially by investigating the applicability of web scraping to their
own work.
Acknowledgments
The author wishes to thank MISTRA EviEM for support during the preparation of this manuscript.
References
Allen C, Richmond K (2011) The Cochrane Collaboration: International activity within Cochrane Review Groups in the first decade of the twenty-first century. Journal of Evidence-Based Medicine 4(1): 2-7.
Bernes C, Carpenter SR, Gårdmark A, Larsson P, Persson L, Skov C, Speed JDM, Van Donk E (2015) What is the influence of a reduction of planktivorous and benthivorous fish on water quality in temperate eutrophic lakes? A systematic review. Environmental Evidence.
Collaboration for Environmental Evidence (2013) Guidelines for systematic review and evidence synthesis in environmental management. Version 4.2. Environmental Evidence. Available from www.environmentalevidence.org/Documents/Guidelines/Guidelines4.2.pdf (accessed January 2014).
Gurevitch J, Hedges LV (1999) Statistical issues in ecological meta-analyses. Ecology 80: 1142-1149.
Haddaway NR, Bayliss HR (2015) Shades of grey: two forms of grey literature important for conservation reviews. In press.
Haddaway NR, Burden A, Evans C, Healey JR, Jones DL, Dalrymple SE, Pullin AS (2014) Evaluating effects of land management on greenhouse gas fluxes and carbon balances in boreo-temperate lowland peatland systems. Environmental Evidence 3:5.
Haddaway NR, Collins AM, Coughlin D, Kirk S (2015) The role of Google Scholar in academic searching and its applicability to grey literature searching. PLOS ONE, in press.
Kicinski M, Springate DA, Kontopantelis E (2015) Publication bias in meta-analyses from the Cochrane Database of Systematic Reviews. Statistics in Medicine 34: 2781-2793.
Lortie CJ (2014) Formalized synthesis opportunities for ecology: systematic reviews and meta-analyses. Oikos 123: 897-902.
Pullin AS, Bangpan M, Dalrymple SE, Dickson K, Haddaway NR, Healey JR, Hauari H, Hockley N, Jones JPG, Knight T, Vigurs C, Oliver S (2013) Human well-being impacts of terrestrial protected areas. Environmental Evidence 2:19.
Walker D, Bergh G, Page E, Duvendack M (2013) Adapting a Systematic Review for Social Research in International Development: A Case Study from the Child Protection Sector. London: ODI.