The Use of Web-scraping Software in Searching for Grey Literature

TGJ Volume 11, Number 3 2015 Haddaway
Neal R. Haddaway (Sweden)
Abstract
Searches for grey literature can require substantial resources to undertake but their inclusion is
vital for research activities such as systematic reviews. Web scraping, the extraction of patterned
data from web pages on the internet, has been developed in the private sector for business
purposes, but it offers substantial benefits to those searching for grey literature. By building and
sharing protocols that extract search results and other data from web pages, those looking for
grey literature can drastically increase their transparency and resource efficiency. Various options
exist in terms of web-scraping software and they are introduced herein.
The Challenge of Searching for Grey Literature
The editorial scrutiny and peer-review that form integral parts of commercial academic
publishing are useful in assuring reliability and standardised reporting in published research.
However, publication bias can cause an overestimation of effect sizes in syntheses of the
(commercially) published literature (Gurevitch and Hedges 1999; Lortie et al. 2007). In a recent
study by Kicinski et al. (2015), the largest analysis of publication bias in meta-analyses to date,
publication bias was detected across the Cochrane Library of systematic reviews, although there
was evidence that more recent research suffered to a lesser degree, thanks to mitigation
measures applied in medical research in recent decades.
Some applied subject areas, such as conservation biology, are particularly likely to be reported in
sources other than academic journals, so-called practitioner-held data (Haddaway and Bayliss in
press); for example, reports of the activities of non-governmental organisations. Such grey
literature is vital for a range of research, policy and practical applications, particularly informing
policy decision-making. Documents produced by governments, businesses, non-governmental
organisations and academics can provide a range of useful information, but are often overlooked
in traditional meta-analyses and literature reviews.
Systematic reviews were established in the medical sciences to collate and synthesise research
on particular clinical interventions in a reliable, transparent and objective manner, and were a
response to the susceptibility to bias common to traditional literature reviews (Allen and
Richmond 2011). In the last decade systematic review methodology has been translated into a
range of other subjects, including social science (Walker et al. 2013) and environmental
management (CEE 2013). A key aspect of systematic review methodology is that searches are
undertaken for grey literature to mitigate possible publication bias and to include practitioner-
held data. These searches may fail to find any research that is ultimately included (e.g. Haddaway
et al. 2014), but it is important for the reliability and transparency of the review to demonstrate
that this is the case: other reviews have demonstrated significant proportions of grey literature in
the synthesised evidence base (Bernes et al. 2015).
Systematic review searches for grey literature can be particularly challenging and
time-consuming. No comprehensive database resources exist in the environmental sciences for grey
literature, as in many other disciplines, and so searches must include web-based search engines,
specialist databases such as repositories for theses, organisational web sites such as those of
non-governmental organisations, governmental databases and university repositories. Typically
between 30 (Pullin et al. 2013) and 70 (Haddaway et al. 2014) individual web sites are searched.
Systematic reviews often complement these manual searches using web-based search engines;
both general (e.g. Google) and academic (e.g. Google Scholar). Not only are searches of this
number of resources time-consuming, but they are also typically undertaken in a very non-
transparent manner: excluded articles are rarely recorded and searches are not readily updatable
or repeatable. Furthermore, included resources must be listed individually by hand in any
documentation of search activities, whilst search results from academic databases, such as Web
of Science, can be downloaded as full citations.
Web Scraping Software: a potential solution
Data scraping is a term used to describe the extraction of data from an electronic file using a
computer program. Web scraping describes the use of a program to extract data from HTML files
on the internet. Typically these data are patterned, particularly in the form of lists or tables.
Programs that interact with web pages and extract data use sets of commands known as
application programming interfaces (APIs). These APIs can be ‘taught’ to extract patterned data
from single web pages or from all similar pages across an entire web site. Alternatively,
automated interactions with websites can be built into APIs, such that links within a page can be
‘clicked’ and data extracted from subsequent pages. This is particularly useful for extracting data
from multiple pages of search results. Furthermore, this interactivity allows users to automate
the use of websites’ search facilities, extracting data from multiple pages of search results and
only requiring users to input search terms rather than having to navigate to and search each web
site first.
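The extraction of patterned data described above can be illustrated with a minimal sketch in Python, using only the standard library. It assumes the data of interest sit in an HTML table; the class name `TableScraper`, the function `scrape_table` and the sample page are invented for illustration and do not correspond to any of the tools discussed here.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table rows found in an HTML string."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Invented sample page standing in for a real web page.
SAMPLE_PAGE = """
<table>
  <tr><th>Title</th><th>Year</th></tr>
  <tr><td>Wetland restoration report</td><td>2012</td></tr>
  <tr><td>Peatland management review</td><td>2014</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in scrape_table(SAMPLE_PAGE):
        print(row)
```

The same pattern could in principle be applied to each page of a set of search results in turn, which is the step that the point-and-click tools automate.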
One major current use for web scraping is for businesses to track pricing activities of their
competitors: pricing can be established across an entire site in relatively short time scales and
with minimal manual effort. Various other commercial drivers have caused a large number and
variety of web scraping programs to have been developed in recent years (see Table 1). Some of
these programs are free, whilst others are purely commercial and charge a one off or regular
subscription fee.
These web scraping tools are equally useful in the research realm. Specifically, they can
provide valuable opportunities in the search for grey literature, by: i) making searches of multiple
websites more resource-efficient; ii) drastically increasing transparency in search activities; and
iii) allowing researchers to share trained APIs for specific websites, further increasing
resource-efficiency.
A further benefit of web scraping APIs relates to their use with traditional academic databases,
such as Web of Science. Whilst citations, including abstracts, are readily extractable from most
academic databases, many databases hold more useful information that is not readily exportable,
for example corresponding author information. Web scraping tools can be used to extract this
information from search results, allowing researchers to assemble contact lists that may prove
particularly useful in requests for additional data, calls for submission of evidence, or invitations
to take part in surveys, for example.
Table 1. List of major web scraping tools (adapted from http://scraping.pro/software-for-web-scraping/). Prices were accurate at the time of publication.
Platform | Description (a) | Cost (b) | URL
Import.io | “Instantly Turn Web Pages into Data. No Plugin, No Training, No Setup. Create custom APIs or crawl entire websites using our desktop app - no coding required!” | Free (charge for premium service) | www.import.io
DataToolBar | “The Data Toolbar is an intuitive web scraping tool that automates web data extraction process for your browser. Simply point to the data fields you want to collect and the tool does the rest for you.” | Free to try (no export facility), $24 for full version | www.datatoolbar.com
Visual Web Ripper | “Visual Web Ripper is a powerful visual tool used for automated web scraping, web harvesting and content extraction from the web. Our data extraction software can automatically walk through whole web sites and collect complete content structures such as product catalogs or search results.” | $299 (including 1 year of maintenance) | www.visualwebripper.com
Helium Scraper | “Extract data from any website. Choose what to extract with a few clicks. Create your own actions. Export extracted data to a variery [sic] of file formats.” | Basic version $99 to Enterprise version $699 | www.heliumscraper.com
OutWit Hub | “OutWit Hub breaks down Web pages into their different constituents. Navigating from page to page automatically, it extracts information elements and organizes them into usable collections.” | Lite version free, Pro version at $89.90 | www.outwit.com
Screen Scraper | “Screen Scraper automates copying text from a web page, clicking links, entering data into forms and submitting them, iterating through search results pages, downloading files (PDF, MS Word, images, etc.).” | Basic edition free, commercial versions from $549 to $2,799 | www.screen-scraper.com
Web Content Extractor | “Web Context Extractor is a professional web data extraction software designed not only to perform the most of [sic] dull operations automatically, but also to greatly increase productivity and effectiveness of the web data scraping process. Web Content Extractor is highly accurate and efficient for extracting data from websites.” | $89 | www.newprosoft.com/web-content-extractor.htm
Kimono | “Kimono lets you turn websites into APIs in seconds. You don’t need to write any code or install any software to extract data with Kimono. The easiest way to use Kimono is to add our bookmarklet to your browser’s bookmark bar. Then go to the website you want to get data from and click the bookmarklet. Select the data you want and Kimono does the rest.” | Free with additional features costing up to $180 | www.kimonolabs.com
FMiner | “FMiner is a software for web scraping, web data extraction, screen scraping, web harvesting, web crawling and web macro support for windows and Mac OS X. It is an easy to use web data extraction tool that combines best-in-class features with an intuitive visual project design tool, to make your next data mining project a breeze.” | Free 15 day trial, $168-£248 | www.fminer.com
Data Extractor by Mozenda | “The Mozenda data scraper tool is very basic; all you have to do is use the program to scrape up information you need off of [sic] websites without all the tiring work of searching websites one by one. Whether you are working for the government such as a police officer or a detective, in the medical field, or even a large business or entrepreneur, website scraping is fast, easy and affordable, plus it saves you or your employees a ton of stressful work and time; use Mozenda’s data scraper and let the program do all the hard work for you.” | Free trial (500 page credits), $99 to $199 per month | www.mozenda.com/data-extractor
WebHarvy Data Extractor Tool | “WebHarvy is a visual web scraper. There is absolutely no need to write any scripts or code to scrape data. You will be using WebHarvy's in-built browser to navigate web pages. You can select the data to be scraped with mouse clicks.” | $99 - $399 | www.webharvy.com
Web Data Extractor | “Web Data Extractor [is] a powerful and easy-to-use application which helps you automatically extract specific information from web pages which is necessary in your day-to-day internet / email marketing or SEO activities. Extract targeted company contact data (email, phone, fax) from web for responsible b2b communication. Extract url, meta tag (title, desc, keyword) for website promotion, search directory creation, web research.” | $89 - $199 | www.webextractor.com
Easy Web Extractor | “An easy-to-use tool for web scrape solutions (web data extracting, screen scraping) to scrape desired web content (text, url, image, html) from web pages just by few screen clicks. No programing required.” | $69.99 (with in-app upgrades) | www.webextract.net
WebSundew | “WebSundew is a powerful web scraping tool that extracts data from the web pages with high productivity and speed. WebSundew enables users to automate the whole process of extracting and storing information from the web sites. You can capture large quantities of bad-structured data in minutes at any time in any place and save results in any format. Our customers use WebSundew to collect and analyze the wide range of data that exists on the Internet related to their industry.” | $69 - $2,495 | www.websundew.com
Handy Web Extractor | “Handy Web Extractor is a simple tool for everyday web content monitoring. It will periodically download the web page, extract the necessary content and display it in the window on your desktop. One may consider it as the data extraction software, taking its own nitch [sic] in the scraping software and plugins.” | Free | www.scraping.pro/handy-web-extractor
(a) Descriptions are taken from product web sites (31/01/2015)
(b) Costs correct at time of publication (2015)
Figure 1 shows a screen shot of one web scraping program being used to establish an API for an
automated search for grey literature from the website of the US Environmental Protection
Agency. This particular web-scraping platform takes the form of a downloadable, desktop-based
web browser. The browser is used to visit web pages and train APIs by identifying rows and
columns in the patterned data: in practice rows will typically be search records, whilst columns
will be different aspects of the patterned data, such as titles, authors, sources, publication dates,
descriptions, etc. Detailed methods for the use of web scrapers are available elsewhere
(Haddaway et al. in press). In this way, citation-like information can be extracted for search
results according to the level of detail provided by the website. In addition to extracting search
results, as described above, static lists and individual, similarly patterned pages can also be
extracted. Furthermore, active links can be maintained, allowing the user to examine linked
information directly from the extracted database.
Figure 1. Screenshot of web scraping software being used to train an API for searching for grey
literature on the Environmental Protection Agency website. Program used is Import.io.
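The row-and-column structure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the method used by Import.io: it assumes each search record is an `<li class="result">` element containing a link (giving the title and URL columns) and a `<span class="date">` element (giving the date column). The markup, the class `ResultListScraper` and the function `records_to_csv` are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class ResultListScraper(HTMLParser):
    """Turn each <li class="result"> into a record (row) with named fields (columns)."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._record = None   # record being built, or None outside a result
        self._field = None    # name of the field currently receiving text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "result":
            self._record = {"title": "", "url": "", "date": ""}
        elif self._record is None:
            return
        elif tag == "a":
            # Keep the link target so the extracted database retains active links.
            self._record["url"] = attrs.get("href") or ""
            self._field = "title"
        elif tag == "span" and attrs.get("class") == "date":
            self._field = "date"

    def handle_data(self, data):
        if self._record is not None and self._field is not None:
            self._record[self._field] += data

    def handle_endtag(self, tag):
        if tag in ("a", "span"):
            self._field = None
        elif tag == "li" and self._record is not None:
            self.records.append({k: v.strip() for k, v in self._record.items()})
            self._record = None


def scrape_results(html):
    """Return one dictionary per search record found in an HTML string."""
    parser = ResultListScraper()
    parser.feed(html)
    parser.close()
    return parser.records


def records_to_csv(records):
    """Serialise the records as CSV text, one row per search record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "url", "date"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Invented search results page; example.org is a placeholder domain.
SAMPLE_RESULTS = """
<ul>
  <li class="result"><a href="https://example.org/reports/1">Wetland report</a>
      <span class="date">2012</span></li>
  <li class="result"><a href="https://example.org/reports/2">Peatland review</a>
      <span class="date">2014</span></li>
</ul>
"""

if __name__ == "__main__":
    print(records_to_csv(scrape_results(SAMPLE_RESULTS)))
```

The resulting CSV text can be opened in a spreadsheet, where each row is a quasi-citation and the URL column preserves the link to the original document.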
Just as search results from organisational websites can be extracted as citations, as described
above, search results from web-based search engines can be extracted and downloaded into
databases of quasi-citations. Microsoft Academic Search
(http://academic.research.microsoft.com) results can be extracted in this way, and in fact a
pre-trained API is available from Microsoft for extracting data from search results automatically
(http://academic.research.microsoft.com/about/Microsoft%20Academic%20Search%20API%20User%20Manual.pdf).
Perhaps a more comprehensive alternative to Microsoft Academic Search is Google Scholar
(http://scholar.google.com). Google Scholar, however, does not support the use of bots
(automated attempts to access the Google Scholar server), and repeated querying of the server
by a single IP address (approximately 180 queries or citation extractions in succession) can result
in an IP address being blocked for an extended period (approximately 48-72 hours) (personal
observation)1. Whilst it is understandable that automated traffic could be a substantial problem
for Google Scholar, automation of activities that would otherwise be laboriously undertaken by
hand is arguably of great value to researchers with limited resources. Thus, a potential
work-around involves the scraping of locally saved search results HTML pages after they have been
downloaded individually or in bulk (this may still constitute an infringement of the Google Scholar
conditions of use, however). A further cautionary note relates to demands on the servers that
host the web sites being scraped. Scraping a significant volume of pages from one site or scraping
multiple pages in a short period of time can put significant strain on smaller servers. However,
the level of scraping necessary to extract 100s to 1,000s of search results is unlikely to have
detrimental impacts on server functionality.
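The work-around of scraping locally saved pages, which places no load on the host server at the time of extraction, might look like the following Python sketch. It assumes, purely for illustration, that each saved page marks result titles with `<h3>` elements; real pages may be structured differently, and the names `HeadingScraper` and `scrape_saved_pages` are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class HeadingScraper(HTMLParser):
    """Collect the text of every <h3> element (assumed here to hold a result title)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._parts = None   # text fragments of the heading being read

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h3" and self._parts is not None:
            self.titles.append("".join(self._parts).strip())
            self._parts = None


def scrape_saved_pages(folder):
    """Return (file name, title) pairs from every .html file in a folder.

    Only files already saved to disk are read; no web requests are made.
    """
    results = []
    for path in sorted(Path(folder).glob("*.html")):
        parser = HeadingScraper()
        parser.feed(path.read_text(encoding="utf-8"))
        parser.close()
        results.extend((path.name, title) for title in parser.titles)
    return results
```

Calling `scrape_saved_pages` on a folder of saved search result pages returns the titles in file-name order, so the provenance of each record remains traceable to the page from which it was extracted.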
1Details of Google Scholar’s acceptable use policy are available from the following web page:
https://scholar.google.co.uk/intl/en/scholar/about.html.
Systematic reviewers must download hundreds or thousands of search results for later screening
from a suite of different databases. At present, Google Scholar is only cursorily searched in most
reviews (i.e. by examining the first 50 search results). The addition of Google Scholar as a
resource for finding additional academic and grey literature has been demonstrated to be useful
for systematic reviews (Haddaway et al. in press). Automating searches and transparently
documenting the results would increase the transparency and comprehensiveness of reviews
in a highly resource-efficient way, at little additional effort for reviewers. These implications
apply equally to other situations where web-based searching is beneficial but potentially time-
consuming.
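One way in which automated searches could be documented transparently is to keep a machine-readable log of each search alongside the extracted results. The Python sketch below is one possible layout rather than an established reporting standard; the field names and the example entries are invented for illustration.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class SearchRecord:
    """One search of one resource, as it might be reported in a review."""
    source: str              # web site or search engine searched
    search_string: str       # exact terms entered
    date_searched: str       # ISO date, e.g. "2015-01-31"
    results_found: int       # number of hits reported by the site
    results_extracted: int   # number of hits scraped for later screening


def search_log_to_csv(log):
    """Serialise a list of SearchRecord objects as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(SearchRecord)])
    writer.writeheader()
    for record in log:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Invented entries; the sources, dates and counts are placeholders.
EXAMPLE_LOG = [
    SearchRecord("Example agency web site", "peatland AND restoration", "2015-01-31", 240, 240),
    SearchRecord("Example thesis repository", "peatland restoration", "2015-02-02", 57, 50),
]

if __name__ == "__main__":
    print(search_log_to_csv(EXAMPLE_LOG))
```

A log of this kind records what was searched, when, and how many results were retained, which is the information needed to repeat or update a search at a later date.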
Web scrapers are an attractive technological development in the field of grey literature. The
availability of a wide range of free and low-cost web scraping software provides an opportunity
for significant benefits to those with limited resources, particularly researchers working alone or
small organisations. Future developments will make use of the software even easier; for example
the one click, automatic training provided by Import.io (https://magic.import.io). Web scrapers
can increase resource-efficiency and drastically improve transparency, and existing networks can
benefit through readily sharable trained APIs. Furthermore, many programs can be easily used by
those with minimal or no skill or prior knowledge of this form of information technology.
Researchers could benefit substantially by investigating the applicability of web scraping to their
own work.
Acknowledgments
The author wishes to thank MISTRA EviEM for support during the preparation of this manuscript.
References
Allen, C. and Richmond, K. (2011). The Cochrane Collaboration: International activity within Cochrane Review Groups in the first decade of the twenty-first century. Journal of Evidence-Based Medicine, 4(1): 2-7.
Bernes, C., Carpenter, S.R., Gårdmark, A., Larsson, P., Persson, L., Skov, C., Speed, J.D.M. and Van Donk, E. (2015). What is the influence of a reduction of planktivorous and benthivorous fish on water quality in temperate eutrophic lakes? A systematic review. Environmental Evidence.
Collaboration for Environmental Evidence (2013). Guidelines for systematic review and evidence synthesis in environmental management. Version 4.2. Environmental Evidence. Available from www.environmentalevidence.org/Documents/Guidelines/Guidelines4.2.pdf (accessed January 2014).
Gurevitch, J. and Hedges, L.V. (1999). Statistical issues in ecological meta-analyses. Ecology, 80: 1142-1149.
Haddaway, N.R. and Bayliss, H.R. (2015). Shades of grey: two forms of grey literature important for conservation reviews. In press.
Haddaway, N.R., Burden, A., Evans, C., Healey, J.R., Jones, D.L., Dalrymple, S.E. and Pullin, A.S. (2014). Evaluating effects of land management on greenhouse gas fluxes and carbon balances in boreo-temperate lowland peatland systems. Environmental Evidence, 3: 5.
Haddaway, N.R., Collins, A.M., Coughlin, D. and Kirk, S. (2015). The role of Google Scholar in academic searching and its applicability to grey literature searching. PLOS ONE, in press.
Kicinski, M., Springate, D.A. and Kontopantelis, E. (2015). Publication bias in meta-analyses from the Cochrane Database of Systematic Reviews. Statistics in Medicine, 34: 2781-2793.
Lortie, C.J. (2014). Formalized synthesis opportunities for ecology: systematic reviews and meta-analyses. Oikos, 123: 897-902.
Pullin, A.S., Bangpan, M., Dalrymple, S.E., Dickson, K., Haddaway, N.R., Healey, J.R., Hauari, H., Hockley, N., Jones, J.P.G., Knight, T., Vigurs, C. and Oliver, S. (2013). Human well-being impacts of terrestrial protected areas. Environmental Evidence, 2: 19.
Walker, D., Bergh, G., Page, E. and Duvendack, M. (2013). Adapting a Systematic Review for Social Research in International Development: A Case Study from the Child Protection Sector. London: ODI.
... Web scraping is a technology that allows the taking of resources from the web and the results can be utilized again by other systems. The process of retrieving data or information from sites on the internet is called web scraping [3], [4], [5]; web extraction [6], [7], [8]; web mining [9], [10]; and web harvesting [11], [12]. ...
... Several studies related to the implementation of web scraping of scientific article or literature from the internet have been carried out beforehand including: web scraping for Indonesian -English parallel corpus using HTML DOM method [4], web-scraping software in searching for gray literature [5], application of web scraping techniques in scientific article search engines [13], the application of web scraping and winnowing web for the detection of plagiarism in the final project title [14], [15]. There are several algorithms that can be used in web scraping such as: regular expressions, HTML DOM, and Xpath [16]. ...
... as well as Indonesian news collection documents as a source and English as the translation. The research conducted by [5], suggests a variety of tools that can be used to search for references and scraping gray literature. There are about 15 platforms that can be used for scraping data are presented and are equipped with descriptions, prices, and URLs to access them. ...
Article
Full-text available
Google Scholar is a web-based service for searching a broad academic literature. Various types of references can be accessed such as: peer-reviewed papers, theses, books, abstracts and articles from academic publishers, professional communities, pre-printed data centers, universities and other academic organizations. Google Scholar provides the profile creation feature of every researcher, expert and lecturer. Quantity of publication from an academic institution along with detailed data on the publication of scientific articles can be accessed through Google Scholar. A recap of the publication of scientific articles of each researcher in an institution or organization is needed to determine the research performance collectively. But the problems that occur, the unavailability of recap services for publishing scientific articles for each researcher in an institution or organization. So that the scientific article publication data can be utilized by academic institutions or organizations, this research will take data from Google Scholar to make a recap of scientific article publication data by applying web scraping technology. Implementation of web scraping can help to take the available resources on the web and the results can be utilized by other applications. By doing web scraping on Google Scholar, collective scientific article publication data can be obtained. So that the process of making scientific publications data recap can be done quickly. Experiments in this study have succeeded in taking 236 researchers data from Google Scholar, with 9 attributes, and 2,420 articles.
... This section focuses on simplifying the stage of Data Preparation for an automated academic literature content extraction [14][15][16]. Earlier literature had explained how readily available web content could be re-cycled as part of various research processes at different stages as well as the top corresponding challenges. ...
... The value of the development tools is definitely very high for web developers. However, when it comes back to the case of extracting web content from scientific repositories the process remains complex due to the hierarchical xPath structures depicted in Figure 2a Over time, many of the web scraping tools identified in literature [15] had been simplified in order not to only provide free utilities, but also to code free solutions. In our case an extension to the google chrome browser (known as Web Scraper-please see Appendix A3) had been used to overcome the complexities with HTML parsing and content extraction. ...
Article
Full-text available
Scientific web repositories are central cyber locations where academic papers are stored and maintained. With the nature of the unstructured and semi-structured information/metadata within these repositories, literature analysis for scholar writing becomes a challenge. Correspondingly, applying CRISP-DM poses a stance to address this challenge through formulating a rather augmented process for a relevant literature search. However, almost all repositories do not have a straight forward method where metadata could be extracted for preliminary data processing being applied as part of the CRISP-DM process. Additionally, most repositories do not follow open access standards. Until the time this paper was published, the topic of the augmented, relevant literature search had seen a methodological progress only, with the inability to apply the underlying methods on a larger scale, given data access constraints to open access repositories. The aim of this paper is to propose CRISP-DM as an augmented research methodology with a focus on web scraping as part of the data processing step. To substantiate the proposed methodology, a play role case study is conducted. This then works on alleviating these restrictions, as well as encouraging the wider adoption of the augmented analysis process for a relevant literature search within the research community.
... Considerando que la tecnología utilizada por el metabuscador es el lenguaje Java versión 1.8 y que almacena sus datos mediante el sistema gestor de bases de datos PostgreSQL en su versión 9.4. Se listan a continuación las herramientas de extracción de datos remotos más utilizadas en el ámbito académico [106] y en la industria [107,108], que son compatibles con las tecnologías del SRI y cuentan con la capacidad de operar con técnicas basadas en web scrapping y API's. ...
... Debido a estos motivos y evaluando los requerimientos planteados en secciones anteriores, se optó por el uso de una herramienta que pueda realizar las extracciones sin depender de aplicaciones de terceros evitando costos económicos que podrían limitar el uso del metabuscador. Es por ello que se opta por el uso de Scrapy dado que es un proyecto de código abierto con una amplia comunidad que lo mantiene, fue lanzado bajo licencia de uso BSD, cuenta con una estructura de trabajo predefinida lista para su uso, y ha sido utilizado en trabajos similares a los objetivos perseguidos en el presente trabajo [106,118,119]. ...
Thesis
Full-text available
La recuperación de producción científico-tecnológica relevante a través de la Web es uno de los mayores desafíos que enfrentan los investigadores, debido al gran volumen, variedad y velocidad de actualización de los mismos. En la actualidad existen distintas alternativas que permiten a los usuarios acceder a estos contenidos y son presentados como herramientas de búsqueda, entre estas se encuentra un metabuscador de artículos científicos desarrollado por el Instituto de Investigación Desarrollo e Innovación en Informática de la Universidad Nacional de Misiones. Esta herramienta permite la búsqueda, recuperación y presentación de publicaciones científicas, sin embargo, para su evolución se pretende incorporar la sugerencia de autores científicos que guarden relación con las consultas de sus usuarios, con el fin de proporcionarles una cuota extra de información relevante para el desarrollo de sus proyectos. La generación de las recomendaciones requiere de la sistematización del proceso de recuperación de datos científicos, como así también de modelos capaces de evaluar los aspectos relevantes en autores científicos y la aplicación de técnicas apropiadas que se adapten a los requerimientos de la comunidad académica. El presente trabajo desarrolla y propone un proceso de recomendación que cumple con dichos requisitos y es capaz de ser integrado dentro del metabuscador mencionado ut supra.
... Esta biblioteca crea un árbol con todos los elementos del documento y puede ser utilizado para extraer información. Por lo tanto, esta biblioteca es útil para realizar web scraping -extraer información de sitios web.2 (Haddaway, 2015). ...
Conference Paper
Full-text available
El Factor de impacto es una medida de la importancia de una publicación científica. Los indicadores en la evaluación de impacto proporcionan información sobre cómo trabajan conjuntamente para producir un efecto global. Para cada indicador debe existir una definición, fórmula de cálculo y metadatos necesarios para su mejor entendimiento y socialización. El Instituto Nacional de Geografía, Estadística e Información cuenta con una metodología de control acerca de cómo, quien y cuando usan su información contenida en los artículos de sus investigación, sin embargo, tiene el interés de mejorar la forma de evaluar su impacto que estos tiene en otras investigaciones. Por lo que fue necesario desarrollar una herramienta que nos permitiera explorar dentro de esos conjuntos de información, que se incrementan continuamente. Dicha herramienta nos permitió buscar y obtener información, con la cual se podrá medir de manera precisa, el impacto que causan los resultados de las investigaciones del INEGI.
... In order to build multi-catalog ecosystem [22,23] where the records are linked asa structured linked data ecosystem,web harvesting process [33][34][35][36][37][38][39][40][41][42]from different data sources, with their own unstructured data model, remains a challenge. To present the related works, we focus on two research axes about semantic metadata harvesting from several sources:Data integration and record linkage [11][12][13][14][15][16][17][18][19][20]43] and Name Entity Resolution (NER) [1][2][3][4][5][6][7][8][9][10][11][44][45][46][47][48][49][50]. ...
Article
Full-text available
ARTICLE INFO ABSTRACT Crowd sourced and Entity Resolution has recently attracted significant attentions because it can harness the wisdom of crowd to improve the quality of Entity Resolution. Entity Resolution can be defined as the process of identifying, matching, verifying accuracy and merging metadata that correspond to the same entities from several databases. Two main issues have been identified for crowd sourced Entity Resolution: data, relation harvesting and integration, and named Entity Resolution. In this paper, we address the issue of data and metadata integration from multi-sources. We propose a new semantic approach of data integration, called SMESE Trusted Smart Harvesting Algorithm based on Semantic Relationship and Social Network (SMESE-TSHA). SMESE-TSHA is based on efficient Semantic Harvesting Strategies (SHS)addresses the problem of performing Entity Resolution (MLM-TSHA) using trusted and ranked sources.SHS addresses the problem of semantic harvesting based on authority file sources, sources classification model and the data graph model nodes exploration patterns while MLM-TSHA addresses the problem of performing Entity Resolution on RDF graphs containing multiple types of nodes. We experimentally evaluate our SMES-TSHA approach on large real datasets and compare the performance results with existing approaches. Our experimental results show our proposed models perform well on the Entity Resolution compared to the existing approaches, while also satisfying the running time restrictions.
... Related work: Data integration and record linkage (Chen, 2013; Jagadish, 2014; Dong, 2013; Philip, 2014; Chen et al., 2014; Hashem et al., 2015; Assunção, 2015; Kacfah Emani, 2015; Cai, 2015) and Entity Resolution (ER) (1-10, 12, 33-39) are two related research domains that aim to build a multi-catalog ecosystem where the records are linked as a structured linked data ecosystem, web harvesting process (Vargiu, 2013; Teli et al., 2015; Shi et al., 2015; Haddaway, 2015; Kadam, 2014; Glez-Peña et al., 2014; Dastidar, 2016; Casali et al., 2016; Gupta, 2017) from different data sources, with their own unstructured data model, remains a challenge. ...
Entity Resolution and the unstructured Web have recently attracted significant attention, and usage of Block Chain is increasing to try to solve different problems such as traceability. Entity Resolution can be defined as the process of definition of the Entity from several sources including the unstructured web and structured databases. A new issue has been identified for Entity Resolution: What is informatio... an instant later, so we talk now of the value of the Entity Resolution in function of the time. In this paper, we address the issue of timely data and metadata integration from unstructured, structured and multi-sources. We propos... aims to build a unified and trusted traceable repository (UTTR), called SMESE Traceable Trusted Smart Harvesting Algorithm from Unstructured and Structured Web (SMESE ... TTSHA is based on Traceable Smart Harvesting Strategies (TSHS) performing Traceable Entity Resolution (MLM ... care of the value of the information at an instant ... approach on large real datasets and compare the performance results with those of existing approaches. Our experimental results show our proposed models perform well on the Traceable Entity Resolution compared to the existing approaches, ... Copyright © 2019, Ronald Brisebois et al. This is an open ... use, distribution, and reproduction in any medium, provided ...
... peer-reviewed and grey literature) from the fields of sciences, social sciences, and arts and humanities. We decided to employ all of them to obtain reliable, robust, and cross-checked data; including a reasonable amount of 'grey literature' via Google Scholar, so that we could include in our search technical reports and government-funded research studies which are not usually published by commercial publishers [Haddaway, 2015]. The searches took place in June (i.e. ...
Although hundreds of citizen science applications exist, there is a lack of detailed analysis of volunteers' needs and requirements, common usability mistakes and the kinds of user experiences that citizen science applications generate. Due to the limited number of studies that reflect on these issues, it is not always possible to develop interactions that are beneficial and enjoyable. In this paper we perform a systematic literature review to identify relevant articles which discuss user issues in environmental digital citizen science and we develop a set of design guidelines, which we evaluate using cooperative evaluation. The proposed research can assist scientists and practitioners with the design and development of easy-to-use citizen science applications and sets the basis to inform future Human-Computer Interaction research in the context of citizen science.
This article is about the creation of a class in Java for creating a LinkedList with HTML tags from HTML code. The main idea for this article is to share a class that returns a String list with all tags.
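The idea summarised above, turning HTML code into a list of its tags, can be sketched in a few lines. The cited article uses Java and its class is not reproduced here; what follows is only a minimal Python illustration using the standard-library html.parser module, and the names TagCollector, list_tags and sample are hypothetical.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect every opening tag name, in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names already lower-cased.
        self.tags.append(tag)


def list_tags(html):
    """Return the opening tag names found in an HTML string as a list."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


# Illustrative input only; not taken from the cited article.
sample = "<html><body><h1>Report</h1><p>See <a href='r.pdf'>PDF</a></p></body></html>"
print(list_tags(sample))
```

Closing tags and text are ignored in this sketch; a scraper that needs link targets or text content would also have to read the attrs argument or override handle_data.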
Walker, D., G. Bergh, E. Page, and M. Duvendack. 2013. Adapting a Systematic Review for Social Research in International Development: A Case Study from the Child Protection Sector. London: ODI.
Google Scholar (GS), a commonly used web-based academic search engine, catalogues between 2 and 100 million records of both academic and grey literature (articles not formally published by commercial academic publishers). Google Scholar collates results from across the internet and is free to use. As a result it has received considerable attention as a method for searching for literature, particularly in searches for grey literature, as required by systematic reviews. The reliance on GS as a standalone resource has been greatly debated, however, and its efficacy in grey literature searching has not yet been investigated. Using systematic review case studies from environmental science, we investigated the utility of GS in systematic reviews and in searches for grey literature. Our findings show that GS results contain moderate amounts of grey literature, with the majority found on average at page 80. We also found that, when searched for specifically, the majority of literature identified using Web of Science was also found using GS. However, our findings showed moderate/poor overlap in results when similar search strings were used in Web of Science and GS (10–67%), and that GS missed some important literature in five of six case studies. Furthermore, a general GS search failed to find any grey literature from a case study that involved manual searching of organisations' websites. If used in systematic reviews for grey literature, we recommend that searches of article titles focus on the first 200 to 300 results. We conclude that whilst Google Scholar can find much grey literature and specific, known studies, it should not be used alone for systematic review searches. Rather, it forms a powerful addition to other traditional search methods. In addition, we advocate the use of tools to transparently document and catalogue GS search results to maintain high levels of transparency and the ability to be updated, critical to systematic reviews.
We used a Bayesian hierarchical selection model to study publication bias in 1106 meta-analyses from the Cochrane Database of Systematic Reviews comparing treatment with either placebo or no treatment. For meta-analyses of efficacy, we estimated the ratio of the probability of including statistically significant outcomes favoring treatment to the probability of including other outcomes. For meta-analyses of safety, we estimated the ratio of the probability of including results showing no evidence of adverse effects to the probability of including results demonstrating the presence of adverse effects. Results: in the meta-analyses of efficacy, outcomes favoring treatment had on average a 27% (95% Credible Interval (CI): 18% to 36%) higher probability to be included than other outcomes. In the meta-analyses of safety, results showing no evidence of adverse effects were on average 78% (95% CI: 51% to 113%) more likely to be included than results demonstrating that adverse effects existed. In general, the amount of over-representation of findings favorable to treatment was larger in meta-analyses including older studies. Conclusions: in the largest study on publication bias in meta-analyses to date, we found evidence of publication bias in Cochrane systematic reviews. In general, publication bias is smaller in meta-analyses of more recent studies, indicating their better reliability and supporting the effectiveness of the measures used to reduce publication bias in clinical trials. Our results indicate the need to apply currently underutilized meta-analysis tools handling publication bias based on the statistical significance, especially when studies included in a meta-analysis are not recent. Copyright © 2015 John Wiley & Sons, Ltd.
Background Peatlands cover 2 to 5 percent of the global land area, while storing between 30 and 50 percent of all global soil carbon (C). Peatlands constitute a substantial sink of atmospheric carbon dioxide (CO2) via photosynthesis and organic matter accumulation, but also release methane (CH4), nitrous oxide (N2O), and CO2 through respiration, all of which are powerful greenhouse gases (GHGs). Lowland peats in boreo-temperate regions may store substantial amounts of C and are subject to disproportionately high land-use pressure. Whilst evidence on the impacts of different land management practices on C cycling and GHG fluxes in lowland peats does exist, these data have yet to be synthesised. Here we report on the results of a Collaboration for Environmental Evidence (CEE) systematic review of this evidence. Methods Evidence was collated through searches of literature databases, search engines, and organisational websites using tested search strings. Screening was performed on titles, abstracts and full texts using established inclusion criteria for population, intervention/exposure, comparator, and outcome key elements. Remaining relevant full texts were critically appraised and data extracted according to pre-defined strategies. Meta-analysis was performed where sufficient data were reported. Results Over 26,000 articles were identified from searches, and screening of obtainable full texts resulted in the inclusion of 93 relevant articles (110 independent studies). Critical appraisal excluded 39 studies, leaving 71 to proceed to synthesis. Results indicate that drainage increases the N2O emission and the ecosystem respiration of CO2, but decreases CH4 emission. Secondly, naturally drier peats release more N2O than wetter soils. Finally, restoration increases the CH4 release. Insufficient studies reported C cycling, preventing quantitative synthesis. No significant effect was identified in meta-analyses of the impact of drainage and restoration on DOC concentration.
Conclusions Consistent patterns in C concentration and GHG release across the evidence-base may exist for certain land management practices: drainage increases N2O production and CO2 from respiration; drier peats release more N2O than wetter counterparts; and restoration increases CH4 emission. We identify several problems with the evidence-base; experimental design is often inconsistent between intervention and control samples, pseudoreplication is extremely common, and variability measures are often unreported.
Background Establishing Protected Areas (PAs) is among the most common conservation interventions. Protecting areas from the threats posed by human activity will by definition inhibit some human actions. However, adverse impacts could be balanced by maintaining ecosystem services or introducing new livelihood options. Consequently there is an ongoing debate on whether the net impact of PAs on human well-being at local or regional scales is positive or negative. We report here on a systematic review of evidence for impacts on human well-being arising from the establishment and maintenance of terrestrial PAs. Methods Following an a priori protocol, systematic searches were conducted for evidence of impacts of PAs post 1992. After article title screening, the review was divided into two separate processes; a qualitative synthesis of explanations and meaning of impact and a review of quantitative evidence of impact. Abstracts and full texts were assessed using inclusion criteria and conceptual models of potential impacts. Relevant studies were critically appraised and data extracted and sorted according to type of impact reported. No quantitative synthesis was possible with the evidence available. Two narrative syntheses were produced and their outputs compared in a metasynthesis. Results The qualitative evidence review mapped 306 articles and synthesised 34 that were scored as high quality. The quantitative evidence review critically appraised 79 studies and included 14 of low/medium susceptibility to bias. The meta-synthesis reveals that a range of factors can lead to reports of positive and negative impacts of PA establishment, and therefore might enable hypothesis generation regarding cause and effect relationships, but resulting hypotheses cannot be tested with the current available evidence. 
Conclusions The evidence base provides a range of possible pathways of impact, both positive and negative, of PAs on human well-being but provides very little support for decision making on how to maximise positive impacts. The nature of the research reported to date forms a diverse and fragmented body of evidence unsuitable for the purpose of informing policy formation on how to achieve win-win outcomes for biodiversity and human well-being. To better assess the impacts of PAs on human well-being we make recommendations for improving research study design and reporting.
Background In lakes that have become eutrophic due to sewage discharges or nutrient runoff from land, problems such as algal blooms and oxygen deficiency often persist even when nutrient supplies have been reduced. One reason is that phosphorus stored in the sediments can exchange with the water. There are indications that the high abundance of phytoplankton, turbid water and lack of submerged vegetation seen in many eutrophic lakes may represent a semi-stable state. For that reason, a shift back to more natural clear-water conditions could be difficult to achieve. In some cases, though, temporary mitigation of eutrophication-related problems has been accomplished through biomanipulation: stocks of zooplanktivorous fish have been reduced by intensive fishing, leading to increased populations of phytoplankton-feeding zooplankton. Moreover, reduction of benthivorous fish may result in lower phosphorus fluxes from the sediments. An alternative to reducing the dominance of planktivores and benthivores by fishing is to stock lakes with piscivorous fish. These two approaches have often been used in combination. The implementation of the EU Water Framework Directive has recently led to more stringent demands for measures against eutrophication, and a systematic review could clarify whether biomanipulation is efficient as a measure of that kind. Methods The review will examine primary field studies of how large-scale biomanipulation has affected water quality and community structure in eutrophic lakes or reservoirs in temperate regions. Such studies can be based on comparison between conditions before and after manipulation, on comparison between treated and non-treated water bodies, or both. Relevant outcomes include Secchi depth, concentrations of oxygen, nutrients, suspended solids and chlorophyll, abundance and composition of phytoplankton, zooplankton and fish, and coverage of submerged macrophytes.
Meta-analysis is the use of statistical methods to summarize research findings across studies. Special statistical methods are usually needed for meta-analysis, both because effect-size indexes are typically highly heteroscedastic and because it is desirable to be able to distinguish between-study variance from within-study sampling-error variance. We outline a number of considerations related to choosing methods for the meta-analysis of ecological data, including the choice of parametric vs. resampling methods, reasons for conducting weighted analyses where possible, and comparisons of fixed vs. mixed models in categorical and regression-type analyses.
Narrative reviews are dead. Long live systematic reviews (and meta-analyses). Synthesis in many forms is now a driving force in ecology. Advances in open data for ecology and new tools provide vastly improved capacity for novel, emergent knowledge synthesis in our discipline. Systematic reviews and meta-analyses are two formal synthesis opportunities for ecologists that are now accepted as traditional publications, but the scope of validated syntheses will continue to expand. To date, systematic reviews are rarely used whilst the rate of meta-analyses published in ecological journals is increasing exponentially. Systematic reviews provide an overview of the literature landscape for a topic, and meta-analyses examine the strength of evidence integrated across different studies. Effective synthesis benefits from both approaches, but better data reporting and additional advances in the culture of sharing data, code, analytics, workflows, methods and also ideas will further energize these efforts. At this junction, synthetic efforts that include systematic reviews and meta-analyses should continue as stand-alone publications. This is a necessary step in the evolution of synthesis in our discipline. Nonetheless, they are still evolving tools, and meta-analyses in particular are simply an extended set of statistical tests. Admittedly, understanding the statistics and assumptions influence how we conduct synthesis much as statistical choices often shape experimental design, i.e. ANOVA versus regression-based experiments, but statistics do not make the paper. Current steps – primary research articles need to more effectively report evidence, sharing scientific products should expand, systematic reviews should be used to identify research gaps/delineate literature landscapes, and meta-analyses should be used to examine evidence patterns to further predictive ecology.
The Cochrane Collaboration (www.cochrane.org) is the world's largest organisation dedicated to preparing, maintaining and promoting the accessibility of systematic reviews of the effects of healthcare interventions. It is an international organisation with participants in more than 100 countries, principally focused around the Cochrane Review Groups that are responsible for the preparation and maintenance of Cochrane reviews. Since 2000, a periodic audit has been done to count the number of active members in the Cochrane Review Groups, subdivided by the countries in which these people are based. At the beginning of 2010, there were almost 28,000 people involved, an increase from about 5500 in 2000. The growth of activity has been dramatic, and especially large for authors of Cochrane reviews and protocols. In the year 2000, 2840 people were listed as authors by the Cochrane Review Groups. At the beginning of 2010, this had risen to over 21,000 people.