Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1–7,
Berlin, Germany, August 7-12, 2016. © 2016 Association for Computational Linguistics
Results of the 4th edition of BioASQ Challenge
Anastasia Krithara1,Anastasios Nentidis1,George Paliouras1, and Ioannis Kakadiaris2
1National Center for Scientific Research “Demokritos”, Athens, Greece
2University of Houston, Texas, USA
Abstract

The goal of the BioASQ challenge is to push
the research frontier towards hybrid information
systems. We aim to promote systems and
approaches that are able to deal with the
whole diversity of the Web, especially for,
but not restricted to, the context of
biomedicine. This goal is pursued by the
organization of challenges. The fourth
challenge, like the previous ones, consisted
of two tasks: semantic indexing and
question answering. 16 systems from 7
different teams participated in the semantic
indexing task. The question answering task
was tackled by 37 different systems,
developed by 11 different teams. 25 of the
systems participated in phase A of the task,
and 12 in phase B; 3 of the teams
participated in both phases of the question
answering task. Overall, as in previous
years, the best systems were able to out-
perform the strong baselines. This sug-
gests that advances over the state of the art
were achieved through the BioASQ challenge,
but also that the benchmark itself is
very challenging. In this paper, we
present the data used during the challenge
as well as the technologies which were at
the core of the participants’ frameworks.
1 Introduction
The aim of this paper is twofold. First, we aim
to give an overview of the data issued during the
BioASQ challenge in 2016. In addition, we aim to
present the systems that participated in the chal-
lenge and for which we received system descrip-
tions, as well as evaluate their performance. To
achieve these goals, we begin by giving a brief
overview of the tasks, including the timing of the
different tasks and the challenge data. Thereafter,
we give an overview of the systems which par-
ticipated in the challenge and provided us with
an overview of the technologies they relied upon.
Detailed descriptions of some of the systems are
given in lab proceedings. The evaluation of the
systems, which was carried out by using state-of-
the-art measures or manual assessment, is the last
focal point of this paper. The conclusion sums up
the results of this challenge.
2 Overview of the Tasks
The challenge comprised two tasks: (1) a large-
scale semantic indexing task (Task 4a) and (2) a
question answering task (Task 4b).
Large-scale semantic indexing. In Task 4a the
goal is to classify documents from the PubMed
digital library into concepts of the MeSH
hierarchy. Here, new PubMed articles that are not yet
annotated are collected on a weekly basis. These
articles are used as test sets for the evaluation of
the participating systems. As soon as the anno-
tations are available from the PubMed curators,
the performance of each system is calculated by
using standard information retrieval measures as
well as hierarchical ones. The winners of each
batch were decided based on their performance in
the Micro F-measure (MiF) from the family of flat
measures (Tsoumakas et al., 2010), and the Low-
est Common Ancestor F-measure (LCA-F) from
the family of hierarchical measures (Kosmopou-
los et al., 2013). For completeness several other
flat and hierarchical measures were reported (Ba-
likas et al., 2013). In order to provide an on-line
and large-scale scenario, the task was divided into
three independent batches. In each batch 5 test
sets of biomedical articles were released consecu-
tively. Each of these test sets was released on a
weekly basis and the participants had 21 hours to
provide their answers. Figure 1 gives an overview
of the time plan of Task 4a.
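As a concrete illustration, the flat Micro F-measure used to rank systems pools the label counts over all documents before computing a single F1. The following is a minimal sketch (not the official evaluation code, which additionally reports hierarchical measures such as LCA-F that require the MeSH hierarchy):

```python
def micro_f1(gold, pred):
    """Micro F-measure over a multi-label corpus: pool true positives,
    false positives and false negatives across all documents, then
    compute a single F1 from the pooled counts."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g, p = set(g), set(p)
        tp += len(g & p)   # labels both predicted and correct
        fp += len(p - g)   # predicted but wrong
        fn += len(g - p)   # missed labels
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: two articles with MeSH-like label sets.
gold = [["D001", "D002"], ["D003"]]
pred = [["D001"], ["D003", "D004"]]
score = micro_f1(gold, pred)  # pooled tp=2, fp=1, fn=1
```

Because counts are pooled before the F1 is taken, frequent labels dominate the measure, which is why the hierarchical LCA-F is reported alongside it.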
Biomedical semantic QA. The goal of task 4b
was to provide a large-scale question answering
challenge where the systems should be able to
cope with all the stages of a question answer-
ing task, including the retrieval of relevant con-
cepts and articles, as well as the provision of
natural-language answers. Task 4b comprised
two phases: In phase A, BioASQ released questions
in English from benchmark datasets created
by a group of biomedical experts. There were
four types of questions: “yes/no” questions, “fac-
toid” questions,“list” questions and “summary”
questions (Balikas et al., 2013). Participants
had to respond with relevant concepts (from spe-
cific terminologies and ontologies), relevant arti-
cles (PubMed articles), relevant snippets extracted
from the relevant articles and relevant RDF triples
(from specific ontologies). In phase B, the re-
leased questions contained the correct answers for
the required elements (articles and snippets) of
the first phase. The participants had to answer
with exact answers as well as with paragraph-sized
summaries in natural language (dubbed ideal answers).
The task was split into five independent batches.
The two phases for each batch were run with a
time gap of 24 hours. For each phase, the partic-
ipants had 24 hours to submit their answers. We
used well-known measures such as mean preci-
sion, mean recall, mean F-measure, mean average
precision (MAP) and geometric MAP (GMAP)
to evaluate the performance of the participants
in Phase A. The winners were selected based on
MAP. The evaluation in phase B for the ideal an-
swers was carried out manually by biomedical ex-
perts on the answers provided by the systems. For
the sake of completeness, ROUGE (Lin, 2004) is
also reported. For the exact answers, we used ac-
curacy for the yes/no questions, mean reciprocal
rank (MRR) for the factoids and mean F-measure
for the list questions.
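For instance, the mean reciprocal rank used for factoid questions rewards placing a correct answer near the top of the returned candidate list. A minimal sketch (an illustration, not the official BioASQ evaluation code):

```python
def mean_reciprocal_rank(gold_answers, ranked_candidates):
    """MRR over a set of factoid questions: for each question, take the
    reciprocal of the 1-based rank of the first correct candidate
    (0 if none is correct), then average over questions."""
    total = 0.0
    for gold, candidates in zip(gold_answers, ranked_candidates):
        for rank, cand in enumerate(candidates, start=1):
            if cand in gold:
                total += 1.0 / rank
                break
    return total / len(gold_answers)

# Toy example: two questions with ranked candidate answers.
gold = [{"aspirin"}, {"tp53"}]
candidates = [["ibuprofen", "aspirin"], ["tp53", "brca1"]]
score = mean_reciprocal_rank(gold, candidates)  # (1/2 + 1/1) / 2 = 0.75
```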
3 Overview of Participants
3.1 Task 4a
In this subsection we describe the participating
systems for which we received a description, and
stress their key characteristics.
In (Papagiannopoulou et al., 2016) flat classifi-
cation processes were employed for the semantic
indexing task. In particular, they used the last
1 million articles as a training set and kept the last
50 thousand as a validation set. Pre-processing of
the articles was carried out by concatenating the
abstract and the title. One-grams and bi-grams
were used as features, removing stop-words and
features with less than five occurrences in the
corpus. The tf-idf representation was used
for the features. The proposed system includes
several multi-label classifiers (MLC) that are
combined in ensembles. In particular, they used
the Meta-Labeler, a set of Binary Relevance
(BR) models with Linear SVMs and a Labeled
LDA variant, Prior LDA. All the above models
were combined in an ensemble, using the MULE
framework, a statistical significance multi-label
ensemble that performs classifier selection.
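The feature-extraction step described above (uni-grams and bi-grams over the concatenated title and abstract, a minimum-frequency cutoff, and tf-idf weighting) can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code: it omits stop-word removal, and the cutoff is lowered from five to two so the toy corpus is not emptied.

```python
import math
from collections import Counter

def tfidf_features(docs, min_count=2):
    """Uni-gram and bi-gram tf-idf vectors; features occurring fewer
    than min_count times in the whole corpus are dropped (the paper
    uses a threshold of five on a far larger corpus)."""
    tokenised = [d.lower().split() for d in docs]
    # Each document's features: its tokens plus adjacent-token bi-grams.
    grams = [t + [" ".join(p) for p in zip(t, t[1:])] for t in tokenised]
    corpus_counts = Counter(g for doc in grams for g in doc)
    vocab = {g for g, c in corpus_counts.items() if c >= min_count}
    df = Counter(g for doc in grams for g in set(doc) & vocab)
    n = len(docs)
    vectors = []
    for doc in grams:
        tf = Counter(g for g in doc if g in vocab)
        vectors.append({g: c * math.log(n / df[g]) for g, c in tf.items()})
    return vectors

# Toy corpus standing in for title+abstract concatenations.
docs = ["gene expression in mice",
        "gene expression in humans",
        "protein folding"]
vectors = tfidf_features(docs)
```

Note that the third document yields an empty vector here: all of its features fall below the frequency cutoff, mirroring how rare features are discarded in the described pipeline.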
The approach proposed by (Segura-Bedmar et
al., 2016) is based on ElasticSearch, which they
use to index the training set provided by
BioASQ. Then, each document in the test set
is translated into a query that is fired against
the index built from the training set,
returning the most relevant documents and their
MeSH categories. Finally, each MeSH category
is ranked using a scoring system based on the
frequency of the category and the similarity of
relevant documents, which contain the category,
with the test document to classify.
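The final scoring step of this retrieval-based approach can be sketched as follows. This is a simplified illustration, not the authors' implementation: the retrieved neighbours and their similarity scores are given directly here, whereas the actual system obtains them from an ElasticSearch query over the indexed training set.

```python
from collections import defaultdict

def score_mesh_categories(neighbours):
    """Given retrieved training documents as (similarity, mesh_labels)
    pairs, score each MeSH category by the summed similarity of the
    neighbours that carry it, and return the categories best-first."""
    scores = defaultdict(float)
    for similarity, labels in neighbours:
        for label in labels:
            scores[label] += similarity
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: three retrieved neighbours with their similarities.
neighbours = [
    (0.9, ["Humans", "Neoplasms"]),
    (0.7, ["Humans"]),
    (0.4, ["Neoplasms", "Mice"]),
]
ranking = score_mesh_categories(neighbours)  # ['Humans', 'Neoplasms', 'Mice']
```

Frequent categories among highly similar neighbours thus rise to the top, combining the frequency and similarity signals the paragraph describes.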
Baselines. During the challenge three systems
served as baselines. The first base-
line is a state-of-the-art method called Medical
Text Indexer (MTI) (Mork et al., 2014) which is
developed by the National Library of Medicine
and serves as a classification system for articles of
MEDLINE. MTI is used by curators in order to
assist them in the annotation process. The second
baseline is an extension of the system MTI with
the approaches of the first BioASQ challenge’s
winner (Tsoumakas et al., 2013). The third one,
dubbed BioASQ Filtering (Zavorin et al., 2016), is
a new extension of the MTI system. In particular,
Learning to Rank methodology is used as a boosting
component of the MTI system. The improved
system shows significant gains in both precision
and recall for some specific classes of MeSH headings.

Figure 1: The time plan of Task 4a (weekly test-set releases from February 08 to May 16, grouped into three batches).

Figure 2: The time plan of Task 4b (five batches between March 09 and May 5; the two phases for each batch run on consecutive days).
3.2 Task 4b
As mentioned above, the second task of the
challenge is split into two phases. In the first
phase, where the goal is to annotate questions
with relevant concepts, documents, snippets and
RDF triples, 9 teams participated with 25 systems.
In the second phase, where teams are requested
to submit exact and paragraph-sized answers for
the questions, 5 teams participated with 12 different
systems.
The system presented in (Papagiannopoulou et
al., 2016) is based on the Indri search engine, and
they use MetaMap and LingPipe to detect the
biomedical concepts in local ontology files. For
the relevant snippets, they calculate the semantic
similarity between each one of the sentences
and the query (expanded with synonyms) using a
semantic similarity measure. Concerning phase B,
they provided exact answers only for the factoid
questions. Their system is based on their previous
participation in BioASQ challenge (Papanikolaou
et al., 2014). The system tries to extract the
lexical answer type by manipulating the words
of the question. Then, the relevant snippets of
the question, which are provided as input for
this task, are processed with the 2013 release of
MetaMap in order to extract candidate answers.
This year, they have extended their approach by
expanding both the scoring mechanism, as well as
the set of candidate answers.
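The snippet-selection idea used in phase A (scoring each candidate sentence by its similarity to the synonym-expanded query) can be illustrated with a simple token-overlap measure. This is only a stand-in: the actual system computes a proper semantic similarity over concepts detected with MetaMap and LingPipe.

```python
def jaccard(a, b):
    """Token-overlap (Jaccard) similarity between two texts, used here
    as a stand-in for the semantic similarity measure."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def top_snippets(expanded_query, sentences, k=2):
    """Rank candidate sentences against the (synonym-expanded) query
    and keep the k best as snippets."""
    return sorted(sentences,
                  key=lambda s: jaccard(expanded_query, s),
                  reverse=True)[:k]

# Toy example: one hypothetical question and three candidate sentences.
sentences = ["CFTR mutations cause cystic fibrosis",
             "the weather is nice today",
             "mice models of fibrosis"]
best = top_snippets("which gene causes cystic fibrosis", sentences, k=1)
```

Expanding the query with synonyms before scoring, as the authors do, increases the chance that a relevant sentence shares surface tokens with the question.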
The system presented in (Yang et al., 2016),
extends the system in (Yang et al., 2015). In
particular, they used TmTool (CH et al., 2016),
in addition to MetaMap, to identify possible
biomedical named entities, especially out-of-
vocabulary concepts. In addition, they also
extract frequent multi-word terms from relevant
snippets to further improve the recall of concept
and candidate answer text extraction. They also
introduced a unified classification interface for
judging the relevance of each retrieved concept,
document, and snippet, which can combine the
relevant scores evidenced by various sources. A
supervised learning method is used to rerank the
answer candidates for factoid and list questions
based on the relation between each candidate
answer and other candidate answers.
The system presented in (Schulze et al., 2016)
relies on the Hana Database for text processing.
It uses the Stanford CoreNLP package for tok-
enizing the questions. Each of the tokens is then
sent to the BioPortal and to the Hana database
for concept retrieval. The concepts retrieved from
the two stores are finally merged to a single list
that is used to retrieve relevant text passages
from the documents at hand. The second system
relies on existing NLP functionality in the IMDB.
They have extended it with new functions tailored
specifically to QA.
The approach presented in (gu Lee et al., 2016)
participated in phase A of task 4b. The main
focus was the retrieval of relevant documents and
snippets. The proposed system uses a cluster-based
language model. Then, it reranks the retrieved
top-n sentences using five independent similarity
models based on shallow semantic analysis.
4 Results
4.1 Task 4a
During the evaluation phase of the Task 4a, the
participants submitted their results on a weekly ba-
sis to the online evaluation platform of the
challenge. The evaluation period was divided into
three batches containing 5 test sets each. 7 teams
participated in the task with a total of 16
systems. For measuring the classification
performance of the systems, several evaluation
measures were used, both flat and hierarchical
(Balikas et al., 2013). The micro F-measure (MiF) and the
Lowest Common Ancestor F-measure (LCA-F)
were used to assess the systems and choose the
winners for each batch (Kosmopoulos et al., 2013).
12,208,342 articles with 27,301 labels (19.4GB)
were provided as training data to the participants.
Table 1 shows the number of articles in each test
set of each batch of the challenge.
Table 2 presents the correspondence of the sys-
tems for which a description was available and the
submitted systems in Task 4a. The systems MTI
First Line Index, Default MTI, BioASQ Filtering
were the baseline systems used throughout the
challenge. Systems that participated in less than
4 test sets in each batch are not reported in the
According to (Demsar, 2006) the appropriate way
to compare multiple classification systems over
multiple datasets is based on their average rank
across all the datasets. On each dataset the system
with the best performance gets rank 1.0, the second
best rank 2.0, and so on. In case two or more
systems tie, they all receive the average rank.
(According to the rules of BioASQ, each system had to
participate in at least 4 test sets of a batch in order to
be eligible for the prizes.)
Table 3 presents the average rank (according to
MiF and LCA-F) of each system over all the test
sets for the corresponding batches. Note that the
average ranks are calculated for the 4 best results
of each system in the batch, according to the rules
of the challenge. The best ranked system is
highlighted with bold typeface.
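The tie-aware average-rank computation can be sketched as follows (an illustration of the Demsar-style comparison described above, not the official evaluation script):

```python
def average_ranks(scores_per_dataset):
    """Demsar-style comparison: on each dataset rank systems by score
    (best = rank 1.0), giving tied systems the average rank of their
    tie group, then average each system's rank across all datasets."""
    systems = list(scores_per_dataset[0])
    totals = {s: 0.0 for s in systems}
    for scores in scores_per_dataset:
        ordered = sorted(systems, key=lambda s: scores[s], reverse=True)
        i = 0
        while i < len(ordered):
            j = i
            # Extend the tie group while scores are equal.
            while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
                j += 1
            tied_rank = (i + 1 + j + 1) / 2  # average rank of the tie group
            for s in ordered[i:j + 1]:
                totals[s] += tied_rank
            i = j + 1
    return {s: t / len(scores_per_dataset) for s, t in totals.items()}

# Toy example: three systems on two test sets; A and B tie on set 2.
ranks = average_ranks([
    {"A": 0.9, "B": 0.8, "C": 0.7},
    {"A": 0.6, "B": 0.6, "C": 0.5},
])  # {'A': 1.25, 'B': 1.75, 'C': 3.0}
```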
Table 4: Statistics on the training and test datasets
of Task 4b. All the numbers for the documents,
snippets, concepts and triples refer to averages.
Batch Size # of documents # of snippets
training 1307 13.00 17.86
1 100 4.56 6.41
2 100 5.25 6.98
3 100 4.79 6.46
4 100 4.90 7.25
5 97 3.93 6.10
total 1804 10.71 14.77
4.2 Task 4b
Phase A. Table 4 presents the statistics of the
training and test data provided to the participants.
The evaluation included five test batches. For the
phase A of Task 4b the systems were allowed
to submit responses to any of the correspond-
ing types of annotations, that is documents, con-
cepts, snippets and RDF triples. For each of the
categories we rank the systems according to the
Mean Average Precision (MAP) measure (Balikas
et al., 2013). The final ranking for each batch is
calculated as the average of the individual rank-
ings in the different categories. In Tables 5 and 6
some indicative results from batch 1 are presented.
The detailed results for Task 4b phase A can
be found in http://participants-area.
Phase B. In the phase B of Task 4b the systems
were asked to report exact and ideal answers. The
systems were ranked according to the manual
evaluation of ideal answers by the BioASQ
experts (Balikas et al., 2013), and according to
automatic measures for the exact answers.
Table 7 shows the results for the exact answers
for the third batch of Task 4b. In cases where systems
Table 1: Statistics on the test datasets of Task 4a.
Batch Articles Annotated Articles Labels per article
1 3,740 569 11.25
2,872 714 12.01
2,599 275 11.09
3,294 520 13.72
3,210 418 11.23
Subtotal 15,715 2,496 11.96
2 3,212 443 10.57
3,213 371 11.37
2,831 534 11.78
3,111 541 10.67
2,470 268 9.82
Subtotal 14,837 2,157 10.94
3 2,994 89 12.08
3,044 353 11.79
3,351 241 10.81
2,630 93 9.77
3,130 50 12.56
Subtotal 15,149 826 11.35
Total 45,701 5,479 11.42
Table 2: Correspondence of reference and submitted systems for Task 4a.
Reference Systems
(Papagiannopoulou et al., 2016) Auth1, Auth2
(Segura-Bedmar et al., 2016) LABDA ElasticSearch, LargeElasticLABDA, LABDA baseline
Baselines ((Mork et al., 2013),(Zavorin et al., 2016)) MTI First Line Index, Default MTI, BioASQ Filtering
Table 3: Average ranks for each system across the batches of the task 4a for the measures MiF and
LCA-F. A hyphen (-) is used whenever the system participated fewer than 4 times in the batch.
System Batch 1 Batch 2 Batch 3
iria-1 - - 9.0 9.0 - -
LABDA ElasticSearch - - - - - -
d33p - - - - - -
auth1 2.75 3.25 3.75 3.75 - -
Default MTI 4.0 3.0 5.0 4.5 - -
auth2 - - 6.0 6.25 - -
MeSHLabeler 1.25 1.25 1.25 1.25 - -
LargeElasticLABDA - - - - - -
LABDA baseline - - - - - -
BioASQ Filtering 4.5 4.75 5.75 5.5 - -
MeSHLabeler-2 - - 2.0 2.0 - -
MeSHLabeler-1 1.75 1.75 - - - -
MeSHLabeler-3 - - 3.5 3.25 - -
CSX-1 - - - - - -
MTI First Line Index 5.5 5.75 5.75 6.25 - -
UCSDLogReg - - - - - -
did not provide exact answers for a particular
kind of question, we used the symbol “-”. The
results of the other batches are available at
org/results/4b/phaseB/. From those
results we can see that the systems achieve
very high performance (>90% accuracy) on the
yes/no questions. The performance on factoid and
list questions is not as good, indicating that there
is room for improvement.
5 Conclusion
In this paper, an overview of the fourth BioASQ
challenge is presented. As in the previous
challenges, it consisted of two tasks: semantic
indexing and question answering. Overall,
as in previous years, the best systems were
able to outperform the strong baselines provided
by the organizers. This suggests that advances
over the state of the art were achieved through the
BioASQ challenge, but also that the benchmark in
Table 5: Results for batch 1 for documents in phase A of Task 4b.
System Mean Mean Mean MAP GMAP
Precision Recall F-measure
testtext 0.169 0.5331 0.2276 0.0981 0.0128
ustb prir2 0.158 0.5277 0.2164 0.0973 0.0119
ustb prir4 0.165 0.5254 0.2224 0.0967 0.0109
fdu2 0.147 0.5011 0.2012 0.0885 0.0087
ustb prir3 0.156 0.497 0.2114 0.0869 0.0095
fdu 0.153 0.5086 0.2081 0.0866 0.0095
ustb prir1 0.155 0.4936 0.2097 0.0865 0.0088
fdu4 0.15 0.5057 0.205 0.0859 0.012
fdu3 0.154 0.5184 0.2112 0.0849 0.0109
fdu5 0.149 0.4971 0.2036 0.0823 0.01
KNU-SG Team Korea 0.084 0.2258 0.1065 0.0486 0.0008
HPI-S1 0.1209 0.3266 0.1547 0.0474 0.0012
Auth001 0.069 0.1983 0.0914 0.0375 0.0004
WS4A 0.01 0.0134 0.011 0.0038 0
HPI-S2 0.005 0.0062 0.0054 0.0028 0
Table 6: Results for batch 1 for snippets in phase A of Task 4b.
System Mean Mean Mean MAP GMAP
Precision Recall F-measure
HPI-S1 0.0822 0.1706 0.0917 0.0481 0.0005
KNU-SG Team Korea 0.0482 0.0952 0.0534 0.0266 0.0002
ustb prir2 0.0469 0.1135 0.0503 0.0216 0.0002
ustb prir3 0.0452 0.1070 0.0482 0.0212 0.0002
ustb prir1 0.0409 0.1080 0.0491 0.0211 0.0002
ustb prir4 0.0449 0.1108 0.0477 0.0201 0.0002
testtext 0.0433 0.1098 0.0460 0.0188 0.0002
Table 7: Results for batch 3 for exact answers in phase B of Task 4b.
System Yes/no Factoid List
Accuracy Strict Acc. Lenient Acc. MRR Precision Recall F-measure
fa1 0.9600 0.1154 0.1923 0.1442 0.2500 0.3000 0.2641
Lab Zhu ,Fdan Univer 0.9600 0.1923 0.2692 0.2192 0.1450 0.5929 0.2181
LabZhu,FDU 0.9600 0.1923 0.2692 0.2192 0.1444 0.6214 0.2176
LabZhu FDU 0.9600 0.1923 0.2692 0.2192 0.1420 0.5929 0.2132
Lab Zhu,Fudan Univer 0.9600 0.1923 0.2692 0.2192 0.1455 0.5770 0.2185
oaqa-3b-3 0.5200 0.2308 0.2692 0.2436 0.5396 0.5008 0.4828
WS4A 0.2400 0.0385 0.0385 0.0385 0.1172 0.2817 0.1609
LabZhu-FDU 0.0400 0.1923 0.2692 0.2192 0.1420 0.5929 0.2132
itself is very challenging. Consequently, we regard
the outcome of the challenge as a success towards
pushing the research on bio-medical information
systems a step further. In future editions of the
challenge, we aim to provide even more bench-
mark data derived from a community-driven ac-
quisition process.
Acknowledgments

The fourth edition of BioASQ is supported by
a conference grant from the NIH/NLM (number
1R13LM012214-01) and sponsored by the Atypon
References

Georgios Balikas, Ioannis Partalas, Aris Kosmopoulos,
Sergios Petridis, Prodromos Malakasiotis, Ioannis
Pavlopoulos, Ion Androutsopoulos, Nicolas Baskio-
tis, Eric Gaussier, Thierry Artieres, and Patrick Gal-
linari. 2013. Evaluation Framework Specifications.
Project deliverable D4.1, 05/2013.
Chih-Hsuan Wei, Robert Leaman, and Zhiyong Lu.
2016. Beyond accuracy: creating interoperable and
scalable text-mining web services. Bioinformatics.
Janez Demsar. 2006. Statistical Comparisons of Clas-
sifiers over Multiple Data Sets. Journal of Machine
Learning Research, 7:1–30.
Hyeon-gu Lee, Minkyoung Kim, Juae Kim,
Maengsik Choi, Sunjae Kwon, Youngjoong Ko,
Yi-Reun Kim, Jung-Kyu Choi, Harksoo Kim,
and Jungyun Seo. 2016. KSAnswer: Question-
answering System of Kangwon National University
and Sogang University in the 2016 BioASQ
Challenge. In Proceedings of the BioASQ Workshop,
in ACL.
Aris Kosmopoulos, Ioannis Partalas, Eric Gaussier,
Georgios Paliouras, and Ion Androutsopoulos.
2013. Evaluation Measures for Hierarchical Clas-
sification: a unified view and novel approaches.
CoRR, abs/1306.6802.
Chin-Yew Lin. 2004. ROUGE: A package for auto-
matic evaluation of summaries. In Proceedings of
the ACL workshop ‘Text Summarization Branches
Out’, pages 74–81, Barcelona, Spain.
James Mork, Antonio Jimeno-Yepes, and Alan Aron-
son. 2013. The NLM Medical Text Indexer System
for Indexing Biomedical Literature. In 1st BioASQ
Workshop: A challenge on large-scale biomedical
semantic indexing and question answering.
James G. Mork, Dina Demner-Fushman, Susan C.
Schmidt, and Alan R. Aronson. 2014. Recent en-
hancements to the NLM Medical Text Indexer. In
Proceedings of Question Answering Lab at CLEF.
Eirini Papagiannopoulou, Yiannis Papanikolaou, Dim-
itris Dimitriadis, Sakis Lagopoulos, Grigorios
Tsoumakas, Manos Laliotis, Nikos Markantonatos,
and Ioannis Vlahavas. 2016. Large-Scale Semantic
Indexing and Question Answering in Biomedicine.
In Proceedings of the BioASQ Workshop, in ACL.
Yannis Papanikolaou, Dimitrios Dimitriadis, Grigo-
rios Tsoumakas, Manos Laliotis, Nikos Markanto-
natos, and Ioannis Vlahavas. 2014. Ensemble
Approaches for Large-Scale Multi-Label Classifica-
tion and Question Answering in Biomedicine. In
2nd BioASQ Workshop: A challenge on large-scale
biomedical semantic indexing and question answering.
Frederik Schulze, Ricarda Schuler, Tim Draeger,
Daniel Dummer, Alexander Ernst, Pedro Flemming,
Cindy Perscheid, and Mariana Neves. 2016. HPI
Question Answering System in BioASQ 2016. In
Proceedings of the BioASQ Workshop, in ACL.
Isabel Segura-Bedmar, Adrian Carruana, and Paloma
Martínez. 2016. LABDA at the 2016 BioASQ chal-
lenge task 4a: Semantic Indexing by using Elastic-
Search. In Proceedings of the BioASQ Workshop,
in ACL.
Grigorios Tsoumakas, Ioannis Katakis, and Ioannis
Vlahavas. 2010. Mining Multi-label Data. In Oded
Maimon and Lior Rokach, editors, Data Mining and
Knowledge Discovery Handbook, pages 667–685.
Springer US.
Grigorios Tsoumakas, Manos Laliotis, Nikos Markan-
tonatos, and Ioannis Vlahavas. 2013. Large-Scale
Semantic Indexing of Biomedical Publications. In
1st BioASQ Workshop: A challenge on large-scale
biomedical semantic indexing and question answering.
Zi Yang, Niloy Gupta, Xiangyu Sun, Di Xu, Chi
Zhang, and Eric Nyberg. 2015. Learning to answer
biomedical factoid and list questions: Oaqa at bioasq
3b. In CLEF.
Zi Yang, Yue Zhou, and Eric Nyberg. 2016. Learning
to answer biomedical questions: Oaqa at bioasq 4b.
In Proceedings of the BioASQ Workshop, in ACL.
Ilya Zavorin, James Mork, and Dina Demner-Fushman.
2016. Using Learning-To-Rank to Enhance NLM
Medical Text Indexer Results. In In Proceedings of
the BioASQ Workshop, in ACL.
... In this experiment, we present a systematic evaluation on biomedical questions provided by the BioASQ challenge so as to compare with BioASQ participant systems. As we previously noted, the BioASQ challenges in phase B (i.e., exact an ideal answers) of Task b provide the test set of biomedical questions along with their golden documents, golden snippets, and questions types [61,62,56] and participant systems [29,31,30,28,32] were asked to answer with exact answers and ideal answers using the golden documents, golden snippets, and golden questions types. For each question, each participating system may return an ideal answer, i.e., a paragraph-sized summary of relevant information. ...
Background and objective Question answering (QA), the identification of short accurate answers to users questions written in natural language expressions, is a longstanding issue widely studied over the last decades in the open-domain. However, it still remains a real challenge in the biomedical domain as the most of the existing systems support a limited amount of question and answer types as well as still require further efforts in order to improve their performance in terms of precision for the supported questions. Here, we present a semantic biomedical QA system named SemBioNLQA which has the ability to handle the kinds of yes/no, factoid, list, and summary natural language questions. Methods This paper describes the system architecture and an evaluation of the developed end-to-end biomedical QA system named SemBioNLQA, which consists of question classification, document retrieval, passage retrieval and answer extraction modules. It takes natural language questions as input, and outputs both short precise answers and summaries as results. The SemBioNLQA system, dealing with four types of questions, is based on (1) handcrafted lexico-syntactic patterns and a machine learning algorithm for question classification, (2) PubMed search engine and UMLS similarity for document retrieval, (3) the BM25 model, stemmed words and UMLS concepts for passage retrieval, and (4) UMLS metathesaurus, BioPortal synonyms, sentiment analysis and term frequency metric for answer extraction. Results and conclusion Compared with the current state-of-the-art biomedical QA systems, SemBioNLQA, a fully automated system, has the potential to deal with a large amount of question and answer types. SemBioNLQA retrieves quickly users’ information needs by returning exact answers (e.g., “yes”, “no”, a biomedical entity name, etc.) 
and ideal answers (i.e., paragraph-sized summaries of relevant information) for yes/no, factoid and list questions, whereas it provides only the ideal answers for summary questions. Moreover, experimental evaluations performed on biomedical questions and answers provided by the BioASQ challenge especially in 2015, 2016 and 2017 (as part of our participation), show that SemBioNLQA achieves good performances compared with the most current state-of-the-art systems and allows a practical and competitive alternative to help information seekers find exact and ideal answers to their biomedical questions. The SemBioNLQA source code is publicly available at
... An exception is the framework proposed by NCBI (Mao et al., 2014), which directly computes the cosine similarities between the questions and the sentences. Another team (Yang et al., 2016) introduced a unified classification interface for judging the relevance of each retrieved concept, document and snippet, which can combine the relevant scores evidenced by various sources (Krithara et al., 2016). ...
Motivation: With the abundant medical resources, especially literature available online, it is possible for people to understand their own health status and relevant problems autonomously. However, how to obtain the most appropriate answer from the increasingly large-scale database, remains a great challenge. Here, we present a biomedical question answering framework and implement a system, Health Assistant, to enable the search process. Methods: In Health Assistant, a search engine is firstly designed to rank biomedical documents based on contents. Then various query processing and search techniques are utilized to find the relevant documents. Afterwards, the titles and abstracts of top-N documents are extracted to generate candidate snippets. Finally, our own designed query processing and retrieval approaches for short text are applied to locate the relevant snippets to answer the questions. Results: Our system is evaluated on the BioASQ benchmark datasets, and experimental results demonstrate the effectiveness and robustness of our system, compared to BioASQ participant systems and some state-of-the-art methods on both document retrieval and snippet retrieval tasks. Availability and implementation: A demo of our system is available at
... In this paper, we investigate the effectiveness of BioBERT in biomedical question answering and report our results from the 7th BioASQ Challenge [7,10,11,21]. Biomedical question answering has its own unique challenges. First, the size of datasets is often very small (e.g., few thousands of samples in BioASQ) as the creation of biomedical question answering datasets is very expensive. ...
Full-text available
The recent success of question answering systems is largely attributed to pre-trained language models. However, as language models are mostly pre-trained on general domain corpora such as Wikipedia, they often have difficulty in understanding biomedical questions. In this paper, we investigate the performance of BioBERT, a pre-trained biomedical language model, in answering biomedical questions including factoid, list, and yes/no type questions. BioBERT uses almost the same structure across various question types and achieved the best performance in the 7th BioASQ Challenge (Task 7b, Phase B). BioBERT pre-trained on SQuAD or SQuAD 2.0 easily outperformed previous state-of-the-art models. BioBERT obtains the best performance when it uses the appropriate pre-/post-processing strategies for questions, passages, and answers.
... The BioASQ challenge was continuously conducted each year until today with many participants and a variety of approaches in both tasks A and B [8,5,19,25]. In Task A, MeSHLabeler won the challenge in 2014, 2015 and 2016 [21] using an ensemble approach of k-NN, the MTI itself as well as further MeSH classification approaches. ...
Full-text available
Official MeSH annotations are provided from curators at the National Library of Medicine (NLM). Efforts to automatically assign such headings to Medline abstracts have proven difficult. Trained solutions , i.e. machine learning solutions, achieve promising results, however even these successes leave the open question, which features from the text best support the identification of MeSH terms from a given Medline abstract. This manuscript lays out specific approaches for the identification and use of contextual features for the Multi-Label Classification (BioASQ Task6a). In particular, the use of different approaches for the identification of compound terms have been tested. Furthermore, the used system has been extended to better rank selected labels for the BioASQ Task7a challenge. The tested solutions improved recall measures (see Task6a) whereas the second system did boost both performance for both precision-measures and recall-measures. Our presented work gives insights into the use of contextual features from text that would reduce the performance gap given to purely trained solutions in the respective tasks. Nevertheless, we still recognize that lexical features based on the MeSH thesaurus have a high discrepancy towards the actual annotation of MeSH Heading to Medline citations by human curators, another gap that requires explanations to improve the automatic annotation of Medline abstracts with MeSH Headings.
... BioNLP-ST has organized various biomedical IE tasks, usually focused on a specific biological system such as seed development [24], epigenetics and post-translational modifications [80], and cancer genetics [81]. Other community challenges relevant to biomedical text mining include JNLPBA [82], BioASQ [83], i2b2 [84], and ShARe/CLEF eHealth [85]. ...
... Additionally, the CNN is trained with different pre-trained word embedding models and compared with the random initialization. First, the different word embedding models using the toolkit Word2vec (Mikolov, Sutskever, Chen, Corrado and Dean, 2013) are trained on the BioASQ 2016 dataset (Krithara et al., 2016), which contains more than 12 million MedLine abstracts. Skip-gram and continuous bag-of-words (CBOW) architectures of Word2vec are applied with the default parameters used in the C version of the Word2vec toolkit (i.e. ...
The main hypothesis of this PhD dissertation is that novel Deep Learning algorithms can outperform classical Machine Learning methods for the task of Information Extraction in the Biomedical Domain. Contrary to classical systems, Deep Learning models can learn the representation of the data automatically, without expert domain knowledge, and avoid the tedious and time-consuming task of defining relevant features. A Drug-Drug Interaction (DDI), which is an essential subset of Adverse Drug Reaction (ADR), represents the alterations in the effects of drugs that were taken simultaneously. The early recognition of interacting drugs is a vital process that prevents serious health problems, which in the worst cases can cause death. Health-care professionals and researchers in this domain find the task of discovering information about these incidents very challenging due to the vast number of pharmacovigilance documents. For this reason, several shared tasks and datasets have been developed in order to address this issue with automated annotation systems capable of extracting this information. In the present document, the DDI corpus, which is an annotated dataset of DDIs, is used with Deep Learning architectures without any external information for the tasks of Named Entity Recognition and Relation Extraction in order to validate the hypothesis. Furthermore, other datasets are tested to demonstrate the performance of these systems. To sum up, the results suggest that the most common Deep Learning methods, such as Convolutional Neural Networks and Recurrent Neural Networks, outperform the traditional algorithms, supporting the conclusion that Deep Learning is a real alternative for a specific and complex scenario like Information Extraction in the Biomedical domain. As a final goal, a complete architecture that covers the two tasks is developed to structure the named entities and their relationships from raw pharmacological texts.
... Here, we briefly introduce other participants' methods for document retrieval employed in the 2016 [37] and 2017 [38] BioASQ challenges. Papagiannopoulou et al. [39] built their system on the Indri search engine and used a variety of libraries, such as the StAX Parser, the Stanford Parser and the GSON library. ...
The conventional Sequential Dependence Model (SDM) has been shown to perform better than the Bag-of-Words (BoW) model for biomedical article search because it pays attention to the sequence information within queries. Meanwhile, introducing lexical semantic relations into query expansion has become a hot topic in IR research. However, little research has been conducted on combining semantic and sequence information. Hence, we propose the Semantic Sequential Dependence Model (SSDM) in this paper, which provides an innovative combination of semantic information and the conventional SDM. Specifically, synonyms are obtained automatically through word embeddings trained on a domain-specific corpus by selecting an appropriate language model. These synonyms are then used to generate possible sequences with the same semantics as the original query, and these sequences are fed into SDM to obtain the final retrieval results. The proposed approach is evaluated on the 2016 and 2017 BioASQ benchmark test sets, and the experimental results show that our query expansion approach outperforms the baseline and other participants in the BioASQ competitions.
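The expansion step the SSDM abstract describes can be approximated as: for each query term, take its nearest neighbours in the embedding space as synonyms, then enumerate all synonym-substituted sequences before handing them to the SDM ranker. A minimal sketch with a toy embedding table (the vectors and similarity threshold are illustrative, not from the paper, which trains its embeddings on a domain-specific corpus):

```python
from itertools import product
from math import sqrt

# Toy embeddings; a real system would load vectors trained on MEDLINE.
EMB = {
    "tumor":  [1.0, 0.1],
    "tumour": [0.98, 0.12],
    "growth": [0.2, 1.0],
    "cell":   [0.5, 0.5],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def synonyms(term, threshold=0.99):
    """Terms whose embedding is close enough to `term` (term itself included)."""
    vec = EMB[term]
    return [w for w, v in EMB.items() if cosine(vec, v) >= threshold]

def expanded_sequences(query):
    """All candidate sequences obtained by swapping in near-synonyms."""
    options = [synonyms(t) for t in query]
    return [list(seq) for seq in product(*options)]

print(expanded_sequences(["tumor", "growth"]))
# "tumour" is close enough to "tumor" to yield a second candidate sequence.
```

Each generated sequence would then be scored by the SDM, which is where the sequence information re-enters the model.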
Answer selection with deep neural networks has been studied extensively owing to its strong capability of encoding semantic features. Most previous work has treated the candidates as independent individuals for ranking but ignored the relations between them. In this study, we propose a ranking method via partial ordering for answer selection, which is applicable to various tasks, such as question answering and reading comprehension. First, we propose a comparative network, namely Candidate vs. Candidate, which aims to discover possible partial-order relations between (candidate, candidate) pairs. Thereafter, a multi-task learning framework is constructed for the answer selection problem, in which the main task is representing the relevance between (question, answer) pairs, while the auxiliary sub-task is learning the partial-order relations between (answer, answer) pairs. By jointly training the networks with abundant supervision information, a reasonable relevance function and a comparison function can be approximated for these tasks. The experimental results on four benchmarks indicate that ranking candidate answers via partial ordering can significantly improve answer selection performance.
The recent success of question answering systems is largely attributed to pre-trained language models. However, as language models are mostly pre-trained on general domain corpora such as Wikipedia, they often have difficulty in understanding biomedical questions. In this paper, we investigate the performance of BioBERT, a pre-trained biomedical language model, in answering biomedical questions including factoid, list, and yes/no type questions. BioBERT uses almost the same structure across various question types and achieved the best performance in the 7th BioASQ Challenge (Task 7b, Phase B). BioBERT pre-trained on SQuAD or SQuAD 2.0 easily outperformed previous state-of-the-art models. BioBERT obtains the best performance when it uses the appropriate pre-/post-processing strategies for questions, passages, and answers.
MeSH annotations are attached to Medline abstracts to improve retrieval, a service provided by curators at the National Library of Medicine (NLM). Efforts to automatically assign such headings to Medline abstracts have proven difficult; on the other hand, such approaches would increase throughput and efficiency. Trained solutions, i.e. machine-learning solutions, achieve promising results; however, these advancements do not fully explain which features from the text best suit the identification of MeSH Headings from the abstracts. This manuscript describes new approaches for the identification of contextual features for automatic MeSH annotation, a Multi-Label Classification problem (BioASQ Task6a): more specifically, different approaches for the identification of compound terms have been tested and evaluated. The described system has then been extended to better rank selected labels and has been tested in the BioASQ Task7a challenge. The tests show that our recall measures (see Task6a) improved, and in the second challenge both precision and recall were boosted. Our work improves our understanding of how contextual features from the text help reduce the performance gap between purely trained solutions and feature-based solutions (possibly including trained solutions). In addition, we have to point out that the lexical features given by the MeSH thesaurus come with a significant discrepancy from the actual annotations of MeSH Headings attributed by human curators, which also hinders improvements to the automatic annotation of Medline abstracts with MeSH Headings.
In this paper we present the methods and approaches employed as part of our participation in the 2016 edition of the BioASQ challenge. For the semantic indexing task, we extended our successful ensemble approach of last year with additional models. The official results obtained so far demonstrate a continuing, consistent advantage of our approaches over the National Library of Medicine (NLM) baselines. For the question answering task, we extended our approach to factoid questions, while we also developed approaches for the document, concept and snippet retrieval sub-tasks.
Question answering (QA) systems are crucial when searching for exact answers to natural language questions in the biomedical domain. Answers to many such questions can be extracted from the 26 million biomedical publications currently included in MEDLINE when relying on appropriate natural language processing (NLP) tools. In this work we describe our participation in Task 4b of the BioASQ challenge using two QA systems that we developed for biomedicine. Preliminary results show that our systems achieved first and second positions in the snippet retrieval sub-task and in the generation of ideal answers.
Hierarchical classification addresses the problem of classifying items into a hierarchy of classes. An important issue in hierarchical classification is the evaluation of different classification algorithms, which is complicated by the hierarchical relations among the classes. Several evaluation measures have been proposed for hierarchical classification, using the hierarchy in different ways. This paper studies the problem of evaluation in hierarchical classification by analyzing and abstracting the key components of the existing performance measures. It also proposes two alternative generic views of hierarchical evaluation and introduces two corresponding novel measures. The proposed measures, along with the state-of-the-art ones, are empirically tested on three large datasets from the domain of text classification. The empirical results illustrate the undesirable behavior of existing approaches and how the proposed measures overcome most of these problems across a range of cases.
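A common family of such measures augments both predicted and true label sets with their ancestors before computing precision and recall, so that near-misses within the hierarchy receive partial credit. A minimal sketch of this idea (the toy hierarchy is illustrative; the measures proposed in the cited paper differ in their details):

```python
# Parent relation for a toy class hierarchy: child -> parent.
PARENTS = {"cat": "mammal", "dog": "mammal", "mammal": "animal"}

def with_ancestors(labels):
    """Augment a label set with all of its ancestors in the hierarchy."""
    out = set(labels)
    for label in labels:
        node = label
        while node in PARENTS:
            node = PARENTS[node]
            out.add(node)
    return out

def hier_precision_recall(predicted, true):
    """Hierarchical precision and recall over ancestor-augmented sets."""
    p, t = with_ancestors(predicted), with_ancestors(true)
    inter = len(p & t)
    return inter / len(p), inter / len(t)

# Predicting "dog" for a true "cat" still shares the ancestors mammal/animal,
# so the error is penalized less than a prediction outside the subtree.
print(hier_precision_recall({"dog"}, {"cat"}))
```

Flat precision/recall would score the dog-for-cat prediction as a total miss; the ancestor-augmented variant gives it 2/3 on both measures, which is exactly the kind of hierarchy-aware behavior these evaluation measures are designed to capture.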
For almost 15 years, the NLM Medical Text Indexer (MTI) system has been providing assistance to NLM Indexers, Catalogers, and the History of Medicine Division (HMD) in the task of indexing the ever-increasing number of MEDLINE citations, with MTI's role continuously expanding to provide more extensive and specialized coverage of the MEDLINE collection. The BioASQ Challenge has been of tremendous benefit, expanding knowledge of leading-edge indexing research. In this paper we present an indexing approach based on the Learning to Rank methodology, which was successfully applied to the indexing task by several participants of recent Challenges. The proposed solution is designed to enhance the results that come from MTI by combining the strengths of MTI with additional sources of evidence, producing a more accurate list of top MeSH Heading candidates for a MEDLINE citation being indexed. It incorporates novel Learning to Rank features and other enhancements to achieve performance superior to that of MTI, both overall and for two specific classes of MeSH Headings on which MTI has shown poor performance.
The biomedical literature is a knowledge-rich resource and an important foundation for future research. With over 24 million articles in PubMed and an increasing growth rate, research in automated text processing is becoming increasingly important. We report here our recently developed web-based text mining services for biomedical concept recognition and normalization. Unlike most text-mining software tools, our web services integrate several state-of-the-art entity tagging systems (DNorm, GNormPlus, SR4GN, tmChem, and tmVar) and offer a batch-processing mode able to process arbitrary text input (e.g., scholarly publications, patents and medical records) in multiple formats (e.g. BioC). We support multiple standards to make our service interoperable and allow simpler integration with other text-processing pipelines. To maximize scalability, we have pre-processed all PubMed articles, and use a computer cluster for processing large requests of arbitrary text. Availability: Our text-mining web service is freely available at Contacts: Zhiyong.Lu{at}
In the face of a growing workload and dwindling resources, the US National Library of Medicine (NLM) created the Indexing Initiative project in the mid-1990s. This cross-library team's mission is to explore indexing methodologies that can help ensure that MEDLINE and other NLM document collections maintain their quality and currency and thereby contribute to NLM's mission of maintaining quality access to the biomedical literature. The NLM Medical Text Indexer (MTI) is the main product of this project and has been providing indexing recommendations based on the Medical Subject Headings (MeSH) vocabulary since 2002. In 2011, NLM expanded MTI's role by designating it as the first-line indexer (MTIFL) for a few journals; today the MTIFL workflow includes about 100 journals and continues to increase. Due to a close collaboration with the Index Section at NLM, MTI continues to grow and expand its ability to provide assistance to the indexers. This paper provides an overview of MTI's functionality, performance, and its evolution over the years.