Bioinformatics, YYYY, 0–0
doi: 10.1093/bioinformatics/xxxxx
Advance Access Publication Date: DD Month YYYY
Manuscript Category
Databases and Ontologies
COVoc and COVTriage: novel resources to support literature triage
Déborah Caucheteur1,2,*, Zoë May Pendlington3, Paola Roncaglia3, Julien
Gobeill1,2, Luc Mottin1,2,4, Nicolas Matentzoglu3,5, Donat Agosti1,6, David
Osumi-Sutherland3, Helen Parkinson3 and Patrick Ruch1,2
1SIB Text Mining Group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland, 2BiTeM Group,
Information Sciences, HES-SO/HEG Genève, 1227 Carouge, Switzerland, 3European Bioinformatics
Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Genome Campus, Hinxton,
Cambridge CB10 1SD, UK, 4Department of Microbiology and Molecular Medicine, Faculty of Medicine,
University of Geneva, 1205 Geneva, Switzerland, 5Semanticly Ltd, London, UK, 6Plazi, 3007 Bern,
Switzerland.
*To whom correspondence should be addressed.
Associate Editor: XXXXXXX
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Abstract
Motivation: Since early 2020, the COVID-19 pandemic has confronted the biomedical community with
an unprecedented challenge. The rapid worldwide spread of COVID-19 and its ease of transmission
are due to increased population flow and international trade. Front-line medical care, treatment research
and vaccine development also require rapid and informative interpretation of the literature and
COVID-19 data produced around the world, with 177,500 papers published between January 2020 and
November 2021, i.e., almost 8,500 papers per month.
To extract knowledge and enable interoperability across resources, we developed the COVID-19
Vocabulary (COVoc), an application ontology related to research on this pandemic. The main objective
of COVoc development was to enable seamless navigation from biomedical literature to core
databases and tools of ELIXIR, a European-wide intergovernmental organisation for life sciences.
© The Author(s) 2022. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium,
provided the original work is properly cited.
Downloaded from https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac800/6895097 by guest on 13 December 2022
D.Caucheteur et al.
Results: This collaborative work provided data integration into SIB Literature Services (SIBiLS), an
application ontology (COVoc), and a triage service named COVTriage, which relies on annotation
processing to search for COVID-related information across pre-defined aspects with daily updates.
Thanks to its interoperability potential, COVoc lends itself to wider applications, hopefully through
further connections with other novel COVID-19 ontologies, as has been established with CIDO.
Availability:
https://github.com/EBISPOT/covoc
https://candy.hesge.ch/COVTriage
Contact: deborah.caucheteur@hesge.ch
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
At the end of 2019, a severe form of viral pneumonia was detected in
China. The majority of the initial cases had visited a local market where
the sale of live animals was allowed. However, it was subsequently found
that people who had not visited the market were also carrying the virus,
indicating that human-to-human transmission was possible. The first death
officially reported by the authorities occurred on January 11, 2020
[WHO(1)]. The flow of people and merchandise led to a rapid spread of
the virus around the globe, sparing no continent. Although many measures
(incl. barrier gestures, curfews, confinement) may have slowed the spread
of the disease, the threshold of 1 million reported deaths was
exceeded during the year 2020.
The prompt response of the scientific community to the pandemic led to
the publication of a large number of scientific articles [Chen et al., 2020],
including preprints. Nevertheless, it is not always easy for researchers to
find the information relevant to them in such a large collection. At the
beginning of the pandemic, COVID-19 itself was only described as
“aggravated pneumonia” and the virus as “novel coronavirus”;
no official name or definition was declared until early 2020. In January
2020 [WHO(2)], “2019-nCoV” was the first name proposed by the WHO,
before the final decision was announced in February: “COVID-19” for
“coronavirus disease 2019” [WHO(3)].
This new disease has led to the use of new terms, making the retrieval of
information more complex. An ontology dedicated to COVID-19 would
therefore help researchers to run efficient and quick queries.
By definition, an ontology is a controlled vocabulary that
structures defined terms through hierarchical relationships and is a
mathematical model based on a subset of first-order logic. Ontologies
additionally contain synonyms and cross-references to other ontologies,
vocabularies, and other resources to enrich each term and improve the
harmonisation and interoperability of annotated data and literature. As a
result, ontologies are both human- and computer-readable and can aid in
analysis. There are two main types of ontologies: domain and application.
A domain ontology consists of knowledge centred on a specific entity or
entities within the same scope (e.g., cells, anatomical structures, or
phenotypes) and the relations between them. Many domain ontologies in
the biomedical space abide by the standards set out by, and are part of, the
Open Biological and Biomedical Ontologies (OBO) Foundry [Jackson et
al., 2021]. An application ontology can be built to define a wider set of
entities than a domain ontology. For example, an ensemble of cells,
anatomical structures, phenotypes, diseases, and assays can be combined
in an application ontology to detail the interactions and relationships
between the individual domains while using the expertly curated terms
and relationships from the domain ontologies. Application ontologies are
regularly created, driven by data or application needs, as an alternative
to duplicating work and creating new domain ontologies. The benefit of
an application ontology comes from relating domain-specific ontology
terms to each other, allowing new relationships to be visualised (e.g., cells
linked to anatomy, anatomy linked to phenotypes, and phenotypes linked
to disease).
To serve scientific research, data sharing and information dissemination,
we worked in collaboration with committed research partners to develop:
- a controlled vocabulary dedicated to COVID-19 named COVoc,
- an application ontology from the controlled vocabulary,
- associated text analytics services (e.g., COVTriage).
We participated in the TREC-COVID competition [NIST(1)] to assess the
performance of these tools.
Established in 1992, the Text REtrieval Conference (TREC) [TREC(1)],
co-sponsored by the United States National Institute of Standards and
Technology (NIST) and the Intelligence Advanced Research Projects
Activity, is a yearly workshop organised around a list of information
retrieval (IR) research areas, called tracks. The goal is to accelerate
research in this domain. With the creation of the first large test collections
of full-text documents, standardised retrieval evaluation, and a
considerable number of participants and editions, the TREC competition has
contributed to the development of new retrieval techniques.
The BiTeM group at HES-SO and the SIB Text Mining group at the Swiss
Institute of Bioinformatics (SIB) in Geneva have participated in
several TREC tracks, such as TREC Clinical Decision Support [Gobeill et al.,
2014; Gobeill et al., 2015], TREC Chemical IR [Gobeill et al., 2011],
TREC Genomics [Gobeill et al., 2007], TREC Medical Records [Gobeill
et al., 2011], TREC Deep Learning [Knafou et al., 2019; Knafou et al.,
2020] and TREC Precision Medicine [Pasche et al., 2017; Pasche et al.,
2018; Caucheteur et al., 2019; Pasche et al., 2020].
On April 15th, 2020, NIST announced the launch of TREC-COVID, a
community evaluation that created a test collection around the
literature specific to COVID-19, provided by the Allen Institute for
Artificial Intelligence (AI2). In the context of the pandemic, the topics of
interest, the number of publications and the diversity of the subjects
covered change rapidly. To remain effective in supporting biomedical research
through information retrieval (in the context of the TREC competition, for
example), it is necessary to provide useful data for the evaluation of
algorithms and to identify methods for discovering and managing scientific
information in future biomedical crises.
2 Methods
2.1 COVoc - Ontology development
In March 2020, we started collaborating on a public spreadsheet with a
wide range of research communities to collect vocabulary from scientific
publications or press articles related to the COVID-19 pandemic. To this
end, several hackathons were held. In addition, a list of the most frequent
terms used in abstracts of the CORD-19 collection [Wang et al., 2020], a
collection dedicated to COVID-19 research provided by the Allen Institute
for AI, was generated automatically with a TF-IDF (term frequency-inverse
document frequency) calculation script. The top vocabulary terms were
manually curated and added to the spreadsheet in order to organise the axes
according to individual concepts and to offer the scientific community an
easy, user-friendly way to contribute collaboratively without requiring
ontology knowledge.
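The automatic term list can be illustrated with a minimal TF-IDF sketch. This is not the authors' script: the tokenisation, the aggregation (summing per-abstract TF-IDF weights) and the function name top_tfidf_terms are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter


def top_tfidf_terms(abstracts, k=10):
    """Rank terms by their summed TF-IDF weight over a list of abstracts."""
    # Assumption: lowercase alphanumeric tokens; a real pipeline would
    # tokenise more carefully and remove stop words.
    docs = [re.findall(r"[a-z0-9]+", text.lower()) for text in abstracts]
    n_docs = len(docs)

    # Document frequency: number of abstracts containing each term.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    scores = Counter()
    for tokens in docs:
        if not tokens:
            continue
        term_counts = Counter(tokens)
        for term, count in term_counts.items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += tf * idf

    # Highest-scoring terms first; these would then go to manual curation.
    return [term for term, _ in scores.most_common(k)]
```

Terms that occur in every abstract receive a weight of zero, so frequent but uninformative words drop to the bottom of the list before manual curation.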
COVoc: a COVID-19 ontology to support literature triage
An ontology building pipeline was designed to compile the resource
directly from the spreadsheet and create an application ontology with the
distinct purpose of aiding curation of COVID-19-related literature. Classes
from existing OBO ontologies were imported where applicable, along
with their cross-references to public resources, making COVoc highly
interoperable with existing domain ontologies and with the growing
community of COVID-related ontologies created in response to the pandemic.
Manual and automated curation was carried out to create the template files
from the original spreadsheet, using the available mapping tools Zooma
[Zooma] and the Ontology Lookup Service [OLS]. The objective was to
include as many terms as possible from widely used OBO ontologies, so as to
connect annotations made with COVoc directly to other useful resources such
as the COVID-19 data platform [Covid Data Portal] (Table 1), and to draw on
domain ontology expertise alongside minting novel COVID-19-related
terms. Both the terms novel to COVoc and those imported from other
ontologies carry additional annotation properties to aid users of COVoc,
including a preferred label, synonyms, internal identifiers, and
cross-references to OBO ontologies and to other vocabularies and resources.
The Ontology Development Kit [ODK] and ROBOT [ROBOT] were used
to convert the templates from the original spreadsheet. Protégé (version
5.5.0) [Protégé] was used to check the structure of COVoc during the
ontology building process.
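As an illustration of the spreadsheet-to-template step, the sketch below writes curated rows to a tab-separated ROBOT template, whose second row tells ROBOT how to interpret each column. The column selection, the row dictionary keys and the EX: identifiers are hypothetical; the actual COVoc templates are those published in the GitHub repository and may be organised differently.

```python
import csv
import io

# Human-readable column names (first row of the template).
HEADER = ["ID", "Label", "Synonyms", "Cross-references", "Parent"]

# ROBOT template strings (second row): class identifier, rdfs:label,
# two multi-valued annotations split on "|", and a named superclass.
TEMPLATE_STRINGS = [
    "ID",
    "LABEL",
    "A oboInOwl:hasExactSynonym SPLIT=|",
    "A oboInOwl:hasDbXref SPLIT=|",
    "SC %",
]


def rows_to_robot_template(rows):
    """Serialise curated spreadsheet rows as a ROBOT template (TSV text)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerow(TEMPLATE_STRINGS)
    for row in rows:
        writer.writerow([
            row["id"],
            row["label"],
            "|".join(row.get("synonyms", [])),
            "|".join(row.get("xrefs", [])),
            row.get("parent", ""),
        ])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical term; identifiers are placeholders, not COVoc IDs.
    example = [{
        "id": "EX:0000001",
        "label": "example term",
        "synonyms": ["example synonym"],
        "parent": "EX:0000000",
    }]
    print(rows_to_robot_template(example))
```

A file produced this way could then be compiled with a command along the lines of `robot template --template terms.tsv --output terms.owl`, with prefixes declared as required by the identifiers used.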
Table 1. Mapping among 20 existing ontologies for each COVoc axis.
Axis                     Cross-references with:
BioMedical Vocabulary    BAO, CHEBI, CHIMO, CL, EFO, GO, IDO, MAXO, Mondo, NCBITaxon, NCIt, OBI, OMT, PR, UBERON
Biotic interactions      ENVO, GO, INO, NBO, OMIT, RO
Cell lines               CLO, EFO
Chemicals                CHEBI, NCIt
Clinical Trials          -
Conceptual Entities      CHEBI, EFO, OBI, PATO
Diseases & Syndromes     CHEBI, EFO, HP, Mondo
Geographic Locations     DBPedia, HANCESTRO
Organisms                NCBITaxon, CIDO
Proteins and Genomes     CHEBI, NCIt, PR
2.2 Integration into SIBiLS
The first step is the import of literature collections into SIBiLS (Swiss
Institute of Bioinformatics Literature Services) [Gobeill et al., 2020]. We
work with 3 collections: MEDLINE [MEDLINE], free full-text from PMC
[PMC] and CORD-19 [Wang et al., 2020], consisting of 34,664,562,
4,722,601 and 389,830 documents, respectively (October 2022). The first
two collections are updated daily, the third occasionally.
Additional collections gathering supplementary data files from PMC as
well as taxonomic treatments from Plazi [Naderi et al. 2022, Penev et al.
2022] are planned.
A parsing task splits each document into distinct fields, for example
title, abstract, keywords, which are then pushed into a MongoDB database.
Document annotation is a once-only process. First, it involves string
pre-processing and tokenization. A dash or a slash can sometimes
prevent a match. In our pipeline, this risk is removed
by deleting these symbols and creating additional words:
the two parts of the word are kept separately but also fused to create a new
word (“covid-19” becomes “covid” plus “covid19”, while “19” is not kept
because it is shorter than 3 characters). Such processing enables, for
example, the retrieval of papers in which the word occurs only with the
dash. Applied to documents as well as to ontology
concepts, this processing makes it possible to annotate terms no matter
how the author has spelled them. The second step is the use of the COVoc
ontology to annotate the collections: identifiers of COVoc concepts found
in the text are attached to it. Going through each document in a corpus,
we check whether its terms are present in one of the COVoc axes. When a term is
matched, an annotation is constructed which includes the term, its id, the
associated preferred term, the axis in which it is found and its position in
the document. To access these COVoc annotations in a dedicated way, we
have developed the COVTriage tool (previously named COVID Triage)
described in the next section. To allow a filtering step by users, COVoc is
split into axes: BioMedical Vocabulary (BMV), Cell lines (CL), Clinical
Trials (CT), Conceptual entities (CE), Diseases and Syndromes (DIS),
Chemicals (CHEM), Geographic locations (GL), Organisms (ORG),
Proteins and Genomes (PG). If users are mainly interested in the cell lines
cited in publications, they can display only the “Cell lines” axis
annotations.
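The normalisation and annotation steps described above can be sketched as follows. This is a simplified illustration rather than the SIBiLS implementation: it only handles single-token terms, and the function names, the lexicon structure and the EX: identifier are hypothetical.

```python
import re

MIN_PART_LENGTH = 3  # parts shorter than this are dropped ("19" in "covid-19")


def normalise_token(token):
    """Return the parts of a dashed/slashed token plus their fused form."""
    parts = [p for p in re.split(r"[-/]", token.lower()) if p]
    if len(parts) <= 1:
        return parts
    kept = [p for p in parts if len(p) >= MIN_PART_LENGTH]
    return kept + ["".join(parts)]


def build_lexicon(concepts):
    """Index every normalised form of each concept label."""
    lexicon = {}
    for concept in concepts:
        for form in normalise_token(concept["label"]):
            lexicon[form] = concept
    return lexicon


def annotate(text, lexicon):
    """Attach concept identifiers to the tokens of a text."""
    annotations = []
    for match in re.finditer(r"[\w/-]+", text):
        surface = match.group()
        for form in normalise_token(surface):
            concept = lexicon.get(form)
            if concept is not None:
                annotations.append({
                    "term": surface,
                    "id": concept["id"],
                    "preferred_term": concept["label"],
                    "axis": concept["axis"],
                    "position": match.start(),
                })
                break  # one annotation per token is enough
    return annotations


if __name__ == "__main__":
    # Hypothetical concept; the identifier is a placeholder.
    lexicon = build_lexicon([{"id": "EX:0001", "label": "COVID-19", "axis": "DIS"}])
    for annotation in annotate("Patients with COVID19 or covid-19.", lexicon):
        print(annotation)
```

Because the same normalisation is applied to the concept labels and to the text, both spellings in the example resolve to the same concept, and a user-side filter only needs to compare the "axis" value of each annotation.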
The benefits of this pipeline are multiple:
- time saving when answering queries: even for repeated queries,
annotation has been carried out beforehand, only once. In addition,
paragraphs are reduced to lists of IDs, in which one or more IDs can be
looked up faster than in the text;
- a better recall thanks to the association of a unique identifier
with each occurrence of a concept and all its synonyms.
For each document loaded in the initial MongoDB database, a list of
related annotations is created and exported as a unique entry in a new
MongoDB database, dedicated to annotations. Finally, both original fields
and annotations are indexed in an ElasticSearch index (v7.2.0).
2.3 COVTriage - Service
COVTriage is a prioritisation system designed to facilitate the literature
investigation process for clinicians and researchers involved in the fight
against COVID-19.
COVTriage was initially built as a prototype to demonstrate a typical use
case of the COVoc ontology; we then decided to continue the development and
maintenance of this application, which responds concretely to these research
needs. COVTriage aims to provide experts with relevance-ranked
articles from three different collections: MEDLINE, PubMed Central
(PMC), and COVID-19 Open Research Dataset (CORD-19) [Wang et al.,
2020].
Developed with Python/Java/JavaScript technologies, COVTriage
implements a search engine based on the SIB Literature Services (SIBiLS)
[Gobeill et al., 2020], a graphical user interface (GUI), and a set of
APIs enabling interactions with other systems.
On the back end of COVTriage, MongoDB hosts a mirror of the
literature and the full set of annotations resulting from the
above-mentioned work with COVoc. Different ElasticSearch indexes provide
quick access to these elements depending on the source selected at query
time by the user. The user's request is processed on the server side and must
include a “focus” in addition to the keywords. This focus is selected
among the nine COVoc axes and affects the ranking of the retrieved
documents: for a specific search, a score is calculated for each
returned article according to its proximity to the initial query and its
relevance to the selected focus. Details about this re-ranking (RR) function
and its evaluation are presented in sections 2.4 and 3.3.
On the front end, a graphical user interface (GUI) has been implemented
by adapting a layout that has already proven effective for biomedical
literature curation [Britan et al, 2018]. As a complement to the website,
public REST APIs were prepared to ensure the interoperability of the two
major functionalities (literature prioritisation and annotation extraction)
with independent systems.
2.4 Evaluation (TREC-COVID)
The TREC-COVID task was a classic ad hoc search task: given a
pandemic-related topic (i.e., a query such as “what evidence is there related to
COVID-19 super spreaders”), the competing system had to return a ranked
list of 1,000 relevant documents. The document set used was the CORD-19
collection and the relevance judgements were made by human experts.
As five successive rounds were organised, the collection and
topic lists were updated for each round. For round 3, the CORD-19
collection (May 19 version) contained 54,842 documents and there were
40 topics; for round 4, the collection (June 19 version) contained 73,858
documents and there were 45 topics. TREC-COVID brought together
hundreds of participants for five rounds of evaluation, in which each team
could submit five runs, i.e., five lists of ranked documents per topic from
distinct systems [TREC(2)].
Our system combines robust baseline strategies with strategies linked to the
COVoc ontology. First, we parsed the collection and indexed documents
and metadata in a Lucene-based Elasticsearch engine. For each topic, we
generated a weighted set of keywords based on the terms’ document
frequencies observed in PubMed Central rather than in CORD-19; this
strategy aimed at favouring rare and informative words. The engine was
then queried with the Okapi BM25 weighting scheme in order to generate
baseline runs. Beyond this, we also exploited the COVoc ontology to
produce supplementary runs. The COVoc terms were mapped in the
CORD-19 collection, and documents enriched with mapped COVoc
concept ids were indexed in a second Elasticsearch index; in the same way,
the mapping was applied to topics to expand the queries with relevant
COVoc concept ids. The engine was then queried to generate the so-called
“query expansion” runs. Finally, the search engine implemented in
COVTriage was applied to both baseline and query expansion runs.
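For readers unfamiliar with the baseline weighting, the sketch below reproduces the standard Okapi BM25 formula with a Lucene-style inverse document frequency. It is a didactic re-implementation, not the configuration used for the runs: the parameter values k1=1.2 and b=0.75 are common defaults, and the document-frequency table is a stand-in for statistics that the authors derived from PubMed Central.

```python
import math


def idf(n_docs, doc_freq):
    """Lucene-style BM25 inverse document frequency (always positive)."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_score(query_terms, doc_tokens, doc_freqs, n_docs, avg_len,
               k1=1.2, b=0.75):
    """Score one tokenised document against a list of query terms.

    doc_freqs maps a term to the number of documents containing it in the
    reference collection (assumed here to be external statistics).
    """
    # Longer-than-average documents are penalised through this factor.
    length_norm = 1 - b + b * len(doc_tokens) / avg_len
    score = 0.0
    for term in query_terms:
        tf = doc_tokens.count(term)
        if tf == 0:
            continue
        weight = idf(n_docs, doc_freqs.get(term, 0))
        score += weight * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score


if __name__ == "__main__":
    doc = ["covid", "super", "spreaders", "evidence"]
    freqs = {"covid": 900, "spreaders": 10}  # hypothetical counts
    print(bm25_score(["spreaders"], doc, freqs, n_docs=1000, avg_len=4.0))
    print(bm25_score(["covid"], doc, freqs, n_docs=1000, avg_len=4.0))
```

In this toy example the rare term contributes far more to the score than the ubiquitous one, which is the effect sought when keywords are weighted by document frequency.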
To refine the process, we developed a prioritisation component that takes
further advantage of COVoc to contextualise the search. At least one of
the ontology axes is associated a priori with each of the TREC queries
according to their proximity. At query time, the system gathers the list of
publications returned by the search engine together with all the concepts
stored in SIBiLS for the selected axis. Then, the list is re-ranked
according to a new score calculated as a linear combination of 3
parameters: a) the original score from ElasticSearch; b) a measurement of
the document specificity (the number of distinct concepts related to the
selected axis); and c) a measurement of COVoc density in the document
(the TF-IDF). The final linear combination was optimised with
weighting factors set empirically at round 3 of TREC-COVID to generate
re-ranking runs (RR) for the evaluation.
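The re-ranking step can be sketched as a weighted sum over the three parameters listed above. The weights below are placeholders, since the empirically tuned values are not reported here, and the field names of each hit are hypothetical.

```python
def rerank(hits, w_score=1.0, w_specificity=0.1, w_density=0.1):
    """Re-rank search hits with a linear combination of three signals.

    Each hit is a dict with (hypothetical) keys:
      es_score            original search-engine score
      n_distinct_concepts number of distinct concepts of the selected axis
      covoc_tfidf         TF-IDF-based density of COVoc concepts
    The default weights are illustrative, not the tuned values.
    """
    def combined(hit):
        return (w_score * hit["es_score"]
                + w_specificity * hit["n_distinct_concepts"]
                + w_density * hit["covoc_tfidf"])

    return sorted(hits, key=combined, reverse=True)


if __name__ == "__main__":
    hits = [
        {"id": "a", "es_score": 10.0, "n_distinct_concepts": 0, "covoc_tfidf": 0.0},
        {"id": "b", "es_score": 9.5, "n_distinct_concepts": 8, "covoc_tfidf": 2.0},
    ]
    print([hit["id"] for hit in rerank(hits)])
```

With these placeholder weights, a document with a slightly lower engine score but many concepts from the selected axis can overtake a document that matches the keywords only.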
3 Results
3.1 Ontology
COVoc contains 563 terms with 2,481 synonyms, required to enhance
annotation and search functions, and 5,751 cross-references, enhancing the
harmonisation and interoperability of data annotated with COVoc.
Over 400 COVoc terms have been mapped to and imported from one of
20 existing ontologies (Table 1), mostly in the biological/medical
vocabulary, diseases, and proteins namespaces. Cell lines were enriched
with synonyms from the Cellosaurus (Bairoch 2018), which is now an
ELIXIR Core Data Resource. Additional synonyms and cross-references
added to these terms will be sent to the domain ontologies in question in
order to improve the terms at their core. In early 2023, the biotic
interaction descriptors, which benefited from the contribution of the
biodiversity community via the CETAF/DISSCo Task Force [Poelen et al. 2020],
will be added to COVoc.
Around 100 novel COVoc terms were created for concepts that did not
exist in other ontologies or for entirely new concepts that fit the COVoc
ontology space (e.g., terms for clinical trials). These terms are currently
under review to be included in CIDO (Coronavirus Infectious Disease
Ontology [He et al., 2020; He et al., 2022]), Cell Ontology (CL
[Diehl et al., 2016]) or other domain ontologies where possible.
The ontology can be browsed using the Ontology Lookup Service (OLS)
[Jupp et al., 2015] and is available with the full ontology building pipeline
on GitHub [COVoc]. The curation-support literature triage demonstrator
is available online [COVTriage (1)].
3.2 SIBiLS - Annotation process
These COVoc concepts and synonyms were used to fully annotate
biomedical literature from MEDLINE, PubMed Central and the
COVID-19 Open Research Dataset (CORD-19) preprint collection
provided by the Allen Institute for AI, as described above.
By October 2022, we had more than a billion COVoc annotations
(N=1,089,914,909), which can be accessed via SIBiLS [SIBiLS].
Statistics have been calculated on our different collections, including the
number of annotations per axis per collection, the total number of
annotations per collection and the average number of annotations per
document (Table 2). As expected, this average is highest for the CORD-19
collection (n=219), because these documents, like the COVoc ontology, are
dedicated to COVID-19. The CORD-19 annotations are uploaded to the
Europe PMC archive using the SciLite services (Venkatesan 2017).
Table 2. Statistics for COVoc annotations in SIBiLS (Oct. 2022).
COVoc axes           MEDLINE        ePMC           CORD-19
BMV                  175,386,868    441,873,718    36,199,118
CE                   53,959,676     150,125,656    19,585,865
CL                   173,861        872,857        203,926
CT                   0              3              1,406
CHEM                 2,194,487      5,253,987      676,882
DIS                  27,828,170     67,832,153     11,350,128
GL                   3,981,568      16,417,448     1,845,734
ORG                  12,385,979     32,341,398     13,315,914
PG                   3,624,982      10,261,638     2,221,487
Total annotations    279,535,591    724,978,858    85,400,460
Nb of documents      34,664,562     4,722,601      389,830
Average anns/doc     8              153            219
3.3 Evaluation (TREC-COVID)
Evaluation metrics for our two strategies (query expansion (QE) and
re-ranking (RR)) are given in Table 3. Experiments #1 and #2
correspond to Round 3 and Round 4 of TREC-COVID, respectively. As
shown, the evolutionary rate of P@10 in Experiment #1 is positive
(+19.17%) whereas that of MAP is fairly stable (+3.97%). In Experiment #2,
it is the evolutionary rate of P@10 which is rather stable (-2.88%) and that
of MAP which is strongly positive (+54.77%). COVoc, exploited via the QE
strategy and/or the re-ranking strategy, has a statistically significant
positive effect compared to the baseline.
Table 3. Metrics evaluations in Experiments #1 and #2.
Experiment    P@10      Ev. rate    MAP       Ev. rate
#1            0.4825                0.1561
#1            0.555                 0.158
#1            0.41                  0.1436
#1            0.575     +19.17%     0.1623    +3.97%
#2            0.6178                0.1006
#2            0.6022                0.1519
#2            0.6244                0.1708
#2            0.6       -2.88%      0.1557    +54.77%
Ev. rate = Evolutionary rate (compared to the baseline results).
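The evolutionary rates can be recomputed from the reported scores as the relative change with respect to the baseline. The sketch below assumes that the first pair of values of each experiment (P@10 0.4825 and MAP 0.1561, then P@10 0.6178 and MAP 0.1006) is the baseline, which is consistent with the percentages given; the function name is ours.

```python
def evolution_rate(baseline, value):
    """Relative change of a metric with respect to the baseline, in percent."""
    return round((value - baseline) / baseline * 100, 2)


if __name__ == "__main__":
    # Experiment #1 (Round 3)
    print(evolution_rate(0.4825, 0.575))   # P@10
    print(evolution_rate(0.1561, 0.1623))  # MAP
    # Experiment #2 (Round 4)
    print(evolution_rate(0.6178, 0.6))     # P@10
    print(evolution_rate(0.1006, 0.1557))  # MAP
```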
3.4 COVTriage
COVTriage is a web application that has been publicly available since April
2020. Besides the mandatory input {keywords+focus} on the first panel, the user
can fill in some optional fields such as the source (otherwise MEDLINE
is selected by default), a date range for the articles’ publication
dates, or keywords to filter out some results. Then, once the documents
have been processed, the ranked result of the retrieval function is
displayed on the second panel. At this step, the user can browse through
the articles of interest and take advantage of the highlighting of COVoc
concepts as presented in Fig. 1. Supplementary concepts from other
biomedical ontologies (MeSH, ATC, ICD-10, …) can also be highlighted
at this step if the user checks the corresponding boxes in the right frame.
Finally, if needed, the user can export selected publications with their
integrated annotations as JSON output.
Fig. 1. GUI of COVTriage. COVTriage display of search results for the query
“remdesivir”. Articles are retrieved in PMC in the range 2000-2020.
All the data generated and visualised in COVTriage may represent a
substantial resource for research groups. Our tool accumulated around
1,754 visits, with an average of 3.1 queries per day, from individuals
around the world between June 2020 and December 2021.
In order to enable programmatic searches and to make the data directly
integrable into external processing pipelines, we have developed public
web services. Accessible through REST APIs, they ensure
interoperability with independent systems by providing JSON outputs for
the two possible requirements: literature prioritisation and annotation
extraction. Details on how to use the services are given in the
documentation [SIBiLS; COVTriage(2)].
4 Discussion
COVoc, via COVTriage, which provides a dimension-specific
prioritisation of the literature based on the context of documents, is a
text-mining tool for research scientists looking for publications about precise
axes around the COVID-19 pandemic. Nevertheless, at least one axis
needs to be improved. As observed in Table 2, there are almost
no annotations for the Clinical Trials (CT) axis, which means that almost no
NCTid (National Clinical Trial identifier) is matched. More work needs to be
done to understand why they are not detected by our system. To detect
clinical trials in publications more efficiently, it could
be useful to add specific terms related to clinical trials to the CT axis. This
should reveal the papers in which a clinical trial is discussed, but would not
systematically point to the linked clinical trial ID. Another approach could
be to look for the PMID or PMCID of each paper from the CORD-19
collection in the Clinical Trials collection available in SIBiLS, specifically
in the “PMID (PMCID) reference” field available from CT.gov, to
retrieve the clinical trials linked (through the NCTid) to these publications
and then create the corresponding annotations. Furthermore, with
the research on COVID-19, the number of clinical trials and their citations
have risen significantly (around 4,000 in December 2020, 7,154 one year
later) [CT]. With the COVID-19 vaccine campaigns, the Clinical Trials
axis should prove to be more consistent, and the number of annotations
should increase.
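Both routes discussed above can be sketched in a few lines: direct detection of NCT identifiers in text (the letters NCT followed by eight digits) and a reverse lookup from the publication references stored in trial records. The record structure, field names and identifiers below are hypothetical and do not reflect the actual SIBiLS or CT.gov schema.

```python
import re

# ClinicalTrials.gov identifiers: "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def find_nct_ids(text):
    """Return the distinct NCT identifiers mentioned in a text, sorted."""
    return sorted(set(NCT_PATTERN.findall(text)))


def trials_by_pmid(trial_records):
    """Build a PMID -> set of NCT ids index from trial reference lists."""
    index = {}
    for record in trial_records:
        for pmid in record.get("reference_pmids", []):
            index.setdefault(pmid, set()).add(record["nct_id"])
    return index


def clinical_trial_annotations(paper_pmids, trial_records):
    """Link each paper to the trials that cite it as a reference."""
    index = trials_by_pmid(trial_records)
    return {pmid: sorted(index[pmid]) for pmid in paper_pmids if pmid in index}


if __name__ == "__main__":
    # Illustrative identifiers only.
    records = [
        {"nct_id": "NCT01234567", "reference_pmids": ["111", "222"]},
        {"nct_id": "NCT07654321", "reference_pmids": ["222"]},
    ]
    print(find_nct_ids("Registered as NCT01234567."))
    print(clinical_trial_annotations(["222", "333"], records))
```

The reverse lookup finds papers that never spell out a trial identifier, whereas the pattern-based detection only covers explicit mentions; the two would therefore be complementary.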
COVoc annotations are also available to the scientific community via
Europe PMC. To obtain an overview of the integration of these annotations,
execute the query ((ANNOTATION_TYPE:"COVoc") AND
(ANNOTATION_PROVIDER:"HES-SO_SIB")), which retrieves all papers with at
least one COVoc annotation.
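For programmatic use, the same query string can be URL-encoded and sent to the Europe PMC RESTful search service. The sketch below only builds the request URL. That the annotation fields behave in the web service exactly as on the website is an assumption that should be checked against the Europe PMC documentation; the function name and page size are ours.

```python
from urllib.parse import urlencode

EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"
COVOC_QUERY = '((ANNOTATION_TYPE:"COVoc") AND (ANNOTATION_PROVIDER:"HES-SO_SIB"))'


def build_search_url(query=COVOC_QUERY, page_size=25):
    """Return a Europe PMC search URL asking for JSON results."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # The URL can be opened with any HTTP client, e.g. urllib.request.urlopen.
    print(build_search_url())
```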
During 2021, we engaged in an effort to harmonise the COVID-19
ontologies created in response to the pandemic, contributing to the 12th
International Conference on Biomedical Ontologies (ICBO 2021) [ICBO
2021] with the flash talk ‘A community effort for COVID-19 Ontology
Harmonization’. Through this, we began collaborating with CIDO
[He et al., 2020; He et al., 2022], a COVID-19 domain ontology which
partly overlaps with COVoc. As COVoc is an application ontology, it is
preferable to import from domain ontologies which have been developed
and curated by field experts and will continue to be updated to contain the
latest information. This will enhance COVoc in the long term, as we will
benefit from expert curation and from continuous updates through dynamic
import of domain ontologies like CIDO, Mondo and the Cell Line
Ontology (CLO) in response to the COVID-19 pandemic and continuing
research. This is an ongoing project in which we hope to reduce the number
of terms minted in the COVoc namespace by importing from domain
ontologies where possible, and to improve the annotation of unique COVoc
terms that have no exactly matching domain ontology, such as the clinical
trial terms.
In addition to financial resources, it is worth observing that the long-term
maintenance of COVTriage may be affected by copyright regulations.
Indeed, the pandemic imposed Open Access as the de facto standard for
scientific publications and, although Open Access is benefiting from a
clear impetus, much Open Access content may return to a paywalled
status once the pandemic is declared over. Further, the WHO-promoted
“One Health” paradigm needed to respond to the health crisis, and in
particular to zoonoses, demands more holistic/inclusive perspectives
on life sciences libraries. The current topical boundaries of MEDLINE
and PMC, which exclude relevant contributions from non-medical
sciences, and in particular biodiversity, should be questioned.
These questions have emerged on the agenda of both SIB and ELIXIR and will
require innovative responses at a global level, with the potential
involvement of leading publishers in the field. The BICIKL project,
coordinated by Pensoft, could play a pioneering role with the development of a
Biodiversity Knowledge Hub [Penev et al. 2022], i.e., a one-stop entry
point to search publications and data from all life sciences.
5 Conclusion
Thanks to the work of collaborators since the first months of the
pandemic, a controlled vocabulary about COVID-19 was developed to
help scientific communities. COVoc has now been converted into an
application ontology entirely dedicated to COVID-19, to meet needs such
as interoperability. The first release is available on GitHub [COVoc] and
ontology comments and requests can be submitted via
https://github.com/EBISPOT/covoc/issues. It is also used as a controlled
vocabulary for the triage tool named COVTriage [COVTriage(1)]. This
service ranks the literature in response to a query, based on COVoc
concepts found in text corpora and stored as annotations. This
helps researchers to deal with the large amount of new information about
COVID-19. These annotations are uploaded to, and available via, the Europe
PMC website [EuropePMC], directly in the publication viewer interface.
The presentation of COVoc at conferences has aroused interest and should
lead to new collaborations in the near future. A first collaboration is
already underway with CIDO.
Acknowledgements
We would like to thank all the members of the scientific community, mostly
anonymous but also including the CETAF Covid-19 Task Force, who contributed to the
creation of COVoc, both by adding relevant terms at the beginning of the
project and by helping to build the usable version (ontology, COVTriage service).
Funding
The design of the vocabulary has been partially supported by a HES-SO internal
funding scheme (Covid-19 response program, #CuSToC grant), and by the BiCIKL
Research Project (H2020-EU.1.4, Grant agreement ID: 101007492). The SIB
Literature Services (SIBiLS) are maintained thanks to the support of the ELIXIR
Data Platform. The work of DC has been partially supported by the e-BioDiv grant
from swissuniversities. The work of ZMP and PR (Paola Roncaglia) was supported
by Open Targets [OTAR005]. The work of HP was supported by EMBL-EBI core
funds. The work of DOS and NM was supported by a grant from National Institutes
of Health (NIH) Office of the Director (OD) - The Monarch Initiative
[1R24OD011883].
Conflict of Interest: none declared.
References
Bairoch A. The Cellosaurus, a Cell-Line Knowledge Resource. J Biomol Tech. 2018
Jul;29(2):25-38. doi: 10.7171/jbt.18-2902-002.
Britan A, Cusin I, Hinard V, et al. (2018). Accelerating annotation of articles via
automated approaches: evaluation of the neXtA5 curation-support tool by
neXtProt. Database: the journal of biological databases and curation, 2018,
bay129. DOI: 10.1093/database/bay129.
Caucheteur D, Pasche E, Gobeill J, Mottaz A, Mottin L, and Ruch P. Designing
retrieval models to contrast precision-driven ad hoc search vs. recall-driven
treatment extraction in Precision Medicine. In TREC 2019.
Chen Q, Allot A, Lu Z. LitCovid: an open database of COVID-19 literature. Nucleic
Acids Research. 2020.
Diehl A, Meehan T, Bradford Y, Brush M, Dahdul W, Dougall D, He Y, Osumi-
Sutherland D, Ruttenberg A, Sarntivijai S, Van Slyke C, Vasilevsky N, Haendel
M, Blake J, Mungall C. The Cell Ontology 2016: enhanced content,
modularization, and ontology interoperability. Journal of Biomedical Semantics
7, article number 44 (2016).
Gobeill J, Ehrler F, Tbahriti I, and Ruch P. Vocabulary-driven Passage Retrieval for
Question-Answering in Genomics. In TREC. 2007.
Gobeill J, Gaudinat A, Pasche E, Teodoro D, Vishnyakova D, and Ruch P. BiTeM
group report for TREC Chemical IR Track 2011. In TREC. 2011.
Gobeill J, Gaudinat A, Pasche E, Teodoro D, Vishnyakova D, and Ruch P. BiTeM
Group Report for TREC Medical Records Track 2011. In TREC. 2011.
Gobeill J, Gaudinat A, Pasche E, and Ruch P. Full-texts representation with Medical
Subject Headings and co-citations network reranking strategies for TREC 2014
Clinical Decision Support Track. In TREC. 2014.
Gobeill J, Gaudinat A, and Ruch P. Exploiting incoming and outgoing citations for
improving Information Retrieval in the TREC 2015 Clinical Decision Support
Track. In TREC. 2015.
Gobeill J, Caucheteur D, Michel P-A, Mottin L, Pasche E, Ruch P. SIB Literature
Services: RESTful customizable search engines in biomedical literature,
enriched with automatically mapped biomedical concepts, Nucleic Acids
Research, Volume 48, Issue W1, 02 July 2020, Pages W12–W16,
https://doi.org/10.1093/nar/gkaa328
He Y, Yu H, Ong E, et al. CIDO, a community-based ontology for coronavirus
disease knowledge and data integration, sharing, and analysis. Sci Data.
2020;7(1):181. Published 2020 Jun 12. doi:10.1038/s41597-020-0523-6
He Y, Yu H, Huffman An, et al. A comprehensive update on CIDO: the
community-based coronavirus infectious disease ontology. J Biomed Semantics.
2022 Oct 21; 13(1):25.
Jackson R, Matentzoglu N, Overton J.A, Vita R, Balhoff J.P, Buttigieg P.L, Carbon
S, Courtot M, Diehl A.D, Dooley D.M, Duncan W.D, Harris N.L, Haendel M.A,
Lewis S.E, Natale D.A, Osumi-Sutherland D, Ruttenberg A, Schriml L.M, Smith
B, Stoeckert Jr. C.J, Vasilevsky N.A, Walls R.L, Zheng J, Mungall C.J, Peters
B. OBO Foundry in 2021: operationalizing open data principles to evaluate
ontologies, Database, Volume 2021.
Jupp S et al. (2015) A new Ontology Lookup Service at EMBL-EBI. In: Malone, J.
et al. (eds.) Proceedings of SWAT4LS International Conference 2015
Knafou J, Jeffreyes M, Mottin L, Teodoro D, and Ruch P. SIB Text Mining at TREC
2019 Deep Learning Track: Working Note. In TREC 2019.
Knafou J, Jeffreyes M, Ferdowsi S, and Ruch P. SIB Text Mining at TREC 2020
Deep Learning Track. In TREC 2020.
Naderi N, Mottaz A, Teodoro D, and Ruch P. Analyzing the Information Content of
Text-Based Files in Supplementary Materials of Biomedical Literature. Studies
in Health Technology and Informatics. 2022 May 01; 294:876-877.
Pasche E, Gobeill J, Mottin L, Mottaz A, Teodoro D, Van Rijen P, and Ruch P.
Customizing a Variant Annotation-Support Tool: an Inquiry into Probability
Ranking Principles for TREC Precision Medicine. In TREC. 2017.
Pasche E, Gobeill J, Mottin L, Mottaz A, Teodoro D, Van Rijen P, and Ruch P. SIB
Text Mining at TREC 2018 Precision Medicine Track. In TREC. 2018.
Pasche E, Caucheteur D, Mottin L, Mottaz A, Gobeill J, and Ruch P. SIB Text
Mining at TREC Precision Medicine 2020. In TREC 2020.
Penev L, Koureas D, Groom Q, Lanfear J, Agosti D, Casino A, Miller J, Arvanitidis
C, Cochrane G, Hobern D, Banki O, Addink W, Kõljalg U, Copas K, Mergen P,
Güntsch A, Benichou L, Benito Gonzalez Lopez J, Ruch P, Martin CS, Barov B,
Demirova I, Hristova K (2022) Biodiversity Community Integrated Knowledge
Library (BiCIKL). Research Ideas and Outcomes 8: e81136.
https://doi.org/10.3897/rio.8.e81136
Poelen J, Upham N, Agosti D. CETAF-DiSSCo/COVID19-TAF biodiversity-related
knowledge hub working group: indexed biotic interactions and review summary.
Zenodo. Oct 6, 2020.
Venkatesan A, Kim JH, Talo F, Ide-Smith M, Gobeill J, Carter J, Batista-Navarro R,
Ananiadou S, Ruch P, McEntyre J. SciLite: a platform for displaying text-mined
annotations as a means to link research articles with biological data. Wellcome
Open Res. 2017 Jul 10;1:25. doi: 10.12688/wellcomeopenres.10210.2.
Wang L.L, Lo K, Chandrasekhar Y, Reas R, Yang J, Eide D, Funk K, Kinney R.M,
Liu Z, Merrill W, Mooney P, Murdick D.A, Rishi D, Sheehan J, Shen Z, Stilson
B, Wade A.D, Wang K, Wilhelm C, Xie B, Raymond D.A, Weld D.S, Etzioni
O, & Kohlmeier S. (2020). CORD-19: The COVID-19 Open Research Dataset.
ArXiv.
Websites:
CIDO
https://github.com/CIDO-ontology/cido
COVTriage(1)
https://candy.hesge.ch/COVTriage
COVTriage(2)
https://candy.hesge.ch/COVTriage/documentation/
Covid Data Portal
https://www.covid19dataportal.org/
COVoc
https://github.com/EBISPOT/covoc
CT
https://clinicaltrials.gov/ct2/results?cond=COVID-19
ePMC
https://europepmc.org/About
EuropePMC
https://europepmc.org
ICBO 2021
https://icbo2021.inf.unibz.it/
MEDLINE
https://www.nlm.nih.gov/bsd/medline.html
NIST(1)
https://ir.nist.gov/covidSubmit/index.html
ODK
https://github.com/INCATools/ontology-development-kit
OLS
https://www.ebi.ac.uk/ols/index
Protégé
https://protege.stanford.edu
ROBOT
http://robot.obolibrary.org/
SIBiLS
https://sibils.github.io
TREC(1)
https://trec.nist.gov
TREC(2)
https://ir.nist.gov/covidSubmit/papers/Forum_TRECCOVID1.pdf
WHO(1)
https://www.who.int/csr/don/12-january-2020-novel-coronavirus-china/en/
WHO(2)
https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200130-sitrep-10-ncov.pdf
WHO(3)
https://www.who.int/docs/default-source/coronaviruse/situation-reports/20200211-sitrep-22-ncov.pdf
Zooma
http://wwwdev.ebi.ac.uk/spot/zooma/