How Users Search and What They Search for in the
Medical Domain
Understanding Laypeople and Experts Through Query Logs
João Palotti · Allan Hanbury · Henning Müller · Charles E. Kahn Jr.
Abstract The internet is an important source of med-
ical knowledge for everyone, from laypeople to medical
professionals. We investigate how these two extremes,
in terms of user groups, have distinct needs and ex-
hibit significantly different search behaviour. We make
use of query logs in order to study various aspects of
these two kinds of users. The logs from America On-
line (AOL), Health on the Net (HON), Turning Re-
search Into Practice (TRIP) and American Roentgen
Ray Society (ARRS) GoldMiner were divided into three
sets: (1) laypeople, (2) medical professionals (such as
physicians or nurses) searching for health content and
(3) users not seeking health advice. Several analyses
are made focusing on discovering how users search and
what they are most interested in. As one outcome of our
analysis, we built a classifier to infer user expertise.
We show the results and analyse the
feature set used to infer expertise. We conclude that
medical experts are more persistent, interacting more
with the search engine. Our study also reveals that,
contrary to what is stated in much of the literature,
the main focus of users, both laypeople and professionals,
is on disease rather than symptoms. The results of
this article, especially through the classifier built, could
be used to detect specific user groups and then adapt
search results to the user group.
J. Palotti · A. Hanbury
Vienna University of Technology
Austria
E-mail: palotti,hanbury@ifs.tuwien.ac.at
H. Müller
University of Applied Sciences and Arts Western Switzerland (HES–SO)
Switzerland
E-mail: henning.mueller@hevs.ch
C. E. Kahn Jr.
University of Pennsylvania
United States
E-mail: charles.kahn@uphs.upenn.edu
1 Introduction
Among all topics available on the internet, medicine is
one of the most important in terms of impact on the
user and one of the most frequently searched. A recent
report states that one in three American adult Internet
users have sought out health advice online to diagnose
a medical condition [18]. This tendency is the same in
Europe, where a recent report from the European Com-
mission estimates that 60% of the population have used
the Internet to search for health-related information in
2014 [16], with numbers even higher in several mem-
ber states. Both reports show that the most common
tasks performed are either searching for general infor-
mation on health-related topics, such as diet, pregnancy
and exercise, or searching for information on specific
injuries or diseases. They also found that most searches
start in a search engine and that young users are
more likely to search for this kind of information.
Physicians are also very active Internet users [31].
PubMed, which indexes the biomedical literature, reports
more than one hundred million users [25], of whom
two-thirds are experts [32]. Nevertheless, studies on how
experts search on the Internet for medical content are
relatively rare [57].
We divide the users of medical search engines into
laypeople and experts, where laypeople are considered to
be searchers that do not have a deep knowledge about
the medical topic being searched, and experts do have a
deep knowledge about the medical topic being searched.
Our assumption is that laypeople wish to see more in-
troductory material returned as search results, whereas
experts wish to see detailed scientific material returned
as search results. At first glance, this could easily be
interpreted as a division into patients and medically-
trained professionals. Nevertheless, it often occurs that
a patient or patient’s relatives become experts on a dis-
ease or condition affecting themselves or a family mem-
ber, sometimes becoming more knowledgeable in a nar-
row domain than medically-trained professionals. There
is also the case of a medical professional searching in a
medical topic outside of his/her main expertise (e.g. a
cardiac surgeon looking for information on a skin dis-
ease), where the information need may be initially sat-
isfied by less scientific documents, although likely not
very basic documents due to the medical background.
For these reasons, we specifically avoid defining medical
professional and patient classes.
Distinguishing laypeople and experts can significantly
improve their interactions with the search engine [52,
38]. Currently, users may get different results for their
queries if they are in different locations, but not if they
have different levels of expertise. We make the assump-
tion that it is possible to distinguish the level of ex-
pertise of the searcher based on the vocabulary used
and the search style. While it would be realistic to rep-
resent a continuum of expertise levels, we define two
classes (laypeople and experts) in this study, allowing
us to investigate the most relevant differences between
the classes.
Recently, many studies have reported success in
exploiting the user’s expertise, in particular for general
search engines [52,44,11]. Schwarz et al. [44] show that
the popularity of a webpage among experts is a crucial
feature to help laypeople identify credible websites.
Collins-Thompson et al. [11] argue that re-ranking
general search engine results to match the user’s reading
level can provide significant gains; however, estimating
user profiles is a non-trivial task and needs to
be further explored.
This study investigates how users search for medi-
cal content, building profiles for experts and laypeople.
Understanding the needs of these two distinct groups
is important for designing search engines, whether this
knowledge is used for boosting easy-to-read documents or for
suggesting queries that match the searcher’s expertise. Additionally,
whenever possible, we also compare search for
medical information with regular search for other topics.
This work is conducted through the analysis of user
interactions logged by search engines. Log analysis is
unobtrusive and captures the user behaviour in a natural
setting [27]. We used MetaMap¹, which is the state-of-the-art
tool to recognise and map biomedical text to
its corresponding medical concepts, to provide a richer
set of information for each query. Little is known in the
literature on how to identify medical concepts in short
Web queries; therefore, we also evaluated MetaMap for
this task.
¹ http://metamap.nlm.nih.gov/
In particular, this work addresses the following ques-
tions:
1. How suitable is MetaMap for analysing short queries?
2. Which characteristics allow laypeople and experts
to be distinguished based on
(a) How they search in medical content?
(b) What they search for in medical content?
3. To what extent do these characteristics match or
disagree with those identified in other published stud-
ies?
4. What are the most useful features to automatically
infer user expertise through the query logs?
In our analysis, we use health-related queries from the
America Online (AOL) query log, as well as the Health
on the Net (HON) search engine log, to represent
logs generated to a significant extent by laypeople. Medical
professionals also use general search engines to seek
health content; however, their queries are drowned out by
the laypeople queries. White et al. [52], for example,
hypothesise that searches leading to PubMed were made
by experts. Under this hypothesis, only 0.004% of the
whole AOL log was issued by medical professionals (also
referred to as experts).
Besides being used more frequently
in a research environment than in a clinical
environment [31], PubMed is also frequently visited by laypeople [32].
Therefore, we use the logs from the evidence-based
search engine TRIP Database and the radiology
image search engine American Roentgen Ray Society
(ARRS) GoldMiner to represent queries entered by
physicians, usually when facing a practical problem.
Several analyses are presented: from general statis-
tics of the logs to complex inference on what is the
search focus in each individual search session. We con-
trast our results with others from the literature and
provide our interpretation for each phenomenon found.
The remainder of this paper is organised as follows.
Section 2 presents a literature review and positions our
work with respect to other articles in the literature. In
Section 3, we describe the datasets used and the pre-
processing steps applied. In Section 4 we present and
evaluate MetaMap, the tool used to enrich the informa-
tion contained in the query logs. We start our analysis
in Section 5, where we examine the general user be-
haviour and the most popular queries, terms and top-
ics searched. In Section 6, we introduce the concept of
search session into our analysis. In Section 7, we present
a Random Forest classifier to infer user expertise and
analyse the importance of its features. Section 8 presents our
findings and limitations. Finally, conclusions and future
work are presented in Section 9.
2 Related Work
As soon as modern search engines appeared, the first
studies on query logs started [29,45]. Jansen et al. [29]
and Silverstein et al. [45] analysed the logs from Ex-
cite and Altavista respectively, popular search engines
at that time. Both articles point out some important
results such as the fact that the vast majority of users
issue only one single query and rarely access any result
page beyond the first one.
The most recent general search engine to disclose
query logs to researchers was America Online (AOL) in
2006 [41]. The AOL data were afterwards used in var-
ious studies, such as Brenes et al. [7], which provides
methods to group users and their intents, and Torres
et al. [14], who analyse queries targeting children’s con-
tent. In this work, we compare the analysis made in the
literature for general search engines [29,45] with medi-
cal domain search engines, and we adopt a method sim-
ilar to [14] to divide the AOL logs into queries related or
not to health. It is important to mention that the AOL
log had known privacy problems in the past, resulting
in some users being identified even though the logs were
supposedly anonymised. Despite this problem, we opt
to use this dataset for several reasons. One reason is
that it can be freely downloaded, as can the code
used for all the experiments of this paper, making the
experiments reproducible². Another reason is that little is known
about how medical annotation tools such as MetaMap
perform in the wild. Finally, in the
absence of a more recent large search engine query log,
we consider that the AOL logs are still the best choice
for researchers in academia. A comprehensive review of the
previous 20 years of research on log analysis and its
applications is given by Silvestri [46].
² https://github.com/joaopalotti/logAnalysisJournal
There are a number of studies analysing query logs
in the medical domain. We highlight here some important
work for this research, including work based
on general search engines [47,52,8,54], as well as
specialised ones [22,25,48,34,59]. We also highlight some
important work on user expertise and behaviour. Figure 1
depicts each one of these areas, including relevant
work on general search for non-health-related content.
For organisational purposes, we divide the rest of this
section into three parts, one for each topic. As shown in
Figure 1, some papers may be relevant to more than
one topic.
2.1 General Search Engines
We describe here studies on health-related query logs in
general search engines, starting with Spink et al. [47],
who studied medical queries issued in 2001 in Excite
and AlltheWeb.com. They showed that medical web
search had been decreasing since 1999, suggesting that users
were gradually shifting from general-purpose search en-
gines to specialised sites for health-related queries. Also,
they found that health-related queries were equivalent
in length, complexity and lack of reformulation to gen-
eral web searching.
More recently, White and Horvitz [53,54] studied
how users start looking for a simple symptom and end
up searching for a serious disease, a phenomenon they
named cyberchondria. They used the logs of the Windows
Live Toolbar to obtain their data and a list of keywords
to annotate symptoms and diseases in queries,
while we used the US National Library of Medicine
MetaMap to do the same. Similar to our work, they
defined user sessions as a series of queries followed by a
period of user inactivity of more than 30 minutes, and
they made use of the Open Directory Project (ODP)
hierarchy to identify medical sessions.
Another important work is Cartright et al. [8]. The
authors presented a log-based study of user behaviour
when searching for health information online. The authors
classified user queries into three classes: symptoms,
causes and remedy. They analysed the change of
search focus along a session, and showed that it is possible
to build a classifier to predict the next
focus of a user in a session. We decided to use the same
classes in order to make our study comparable; however,
we used the semantic annotator of MetaMap instead of
hand-coded rules.
Studying not the query logs but the ranking lists
of major general search engines, Wang et al. [50] compared
the results of Google, Yahoo!, Bing, and Ask.com
for a single query, “breast cancer”. Among their conclusions
is that the results provided rich information
and overlapped considerably between the search engines: the
overlap between any two search engines was about half
or more. Another work that compares a large number
of search engines is Jansen and Spink [28], in which nine
search engines with logs ranging from 1997 to 2002 were
used. Nevertheless, the latter did not focus on medical
queries.
2.2 Expertise and Search Engines
One of the first studies to report how expertise influences
the search process dates from the 1990s. In
this work, Hsieh-Yee [24] reported that experienced library
science students used more thesauri, synonymous
terms and combinations of search terms, and spent
less time monitoring their searches, than novices. Later,
Bhavnani [5] studied search expertise in the medical
and shopping domains. He reported that experts in a
topic could easily solve the given task even without using
a search engine, because they already knew which
website was best suited to their needs. Bhavnani
also reported that experts started their search by
using websites such as MedlinePlus³, instead of a major
search engine, while laypeople started with Google.
White et al. [52] presented a log-based analysis of expertise
in four different domains (medicine, finance, law,
and computer science), developing an expertise classifier
based on their analysis. Apart from showing that
it is possible to predict user expertise based on user
behaviour, they showed that experts have a higher success
rate only in their domain of expertise, with success
in a session being defined as a clicked URL as the final
event in a session. Therefore, an expert in finance would
have a comparable or worse success rate in medicine
than a non-expert. An important difference between
our work and that of White et al. is the approach used to
separate experts from non-experts. They assume that
searches leading to PubMed were made by medical experts
and searches leading to the ACM Digital Library (ACM-DL)⁴
were made by computer science experts. In the medical
domain this is a weak premise for two reasons: (1) it is
estimated that one-third of PubMed users are laypeople
[32], and (2) PubMed is more important for medical researchers
than for practitioners [31]. Drawing a parallel between
medicine and computer science, a general practitioner
would be like a software developer who does
not necessarily need to consult the ACM-DL (the counterpart
of PubMed) to perform his/her work. One
could manually expand the list of expert sites to include,
for example, StackOverflow⁵ or an API website
for experts in computer science, and treatment guidelines or drug
information sites for medicine, but this would be a laborious
task and the list would be unstable over time. Hence, to cope with this
challenge, we use the logs of different search engines
made for distinct audiences.
³ MedlinePlus is a web-based consumer health information
system developed by the American National Library of
Medicine (NLM): http://www.medlineplus.gov/
⁴ http://dl.acm.org/
⁵ http://stackoverflow.com/
An important user study was conducted by Wildemuth
[55], who evaluated how the search tactics of microbiology
students changed over an academic year, while
the students’ topic knowledge was increasing. The stu-
dents were asked questions about the topic at 3 dif-
ferent times: before starting the course, when finishing
the course, and 6 months after the course. As their ex-
pertise increased, the users were able to perform a bet-
ter term selection for search, being more effective. The
most common pattern used across all three occasions
was the narrowing of the retrieved result set through
the addition of search concepts, while at the beginning
users were less effective in the selection of concepts to
include in the search and more errors were made in the
reformulation of a query. Later, Duggan and Payne [15]
explored the domains of music and football to evalu-
ate how the user knowledge of a topic can influence the
probability of a user answering factual questions, find-
ing that experts detect unfruitful search paths faster
than non-experts.
Recently, there have been a few user studies in user
expertise prediction. For example, Zhang et al. [58] and
Cole et al. [10] are based on TREC Genomics data.
The former employed a regression model to match user
self-rated expertise and high level user behaviour fea-
tures such as the mean time analysing a document and
the number of documents viewed. They found that the
user’s domain knowledge could be indicated by the num-
ber of documents saved, the user’s average query length,
and the average rank position of opened documents.
Their model, however, needs to be further investigated
because the data was limited, collected in a controlled
experiment, and from only one domain. Similarly, but
using only eye movement patterns as features, the latter
conducted a user study instead of log analysis and
employed a linear model and random forests to infer the
user expertise level. Their main contribution is demonstrating
that it is possible to infer a user’s level of domain
knowledge without processing the content of queries or
documents; however, they performed only a
single experiment in a single domain.
2.3 Medical-Specialised Search Engines
For specialised medical search engines, Herskovic et al. [22]
analysed an arbitrary day in PubMed, the largest biomed-
ical database in the world. They concluded that PubMed
may have a different usage profile than general web
search engines. Their work showed that PubMed queries
had a median of three terms, one more than what is
reported for Excite and Altavista. Subsequently, Dogan
et al. [25] studied an entire month of PubMed
log data. Their main findings comparing PubMed and
general search engines were that PubMed users are less
likely to select results when the result sets increase in
size, are more likely to reformulate queries and
are more persistent in seeking information. Whenever
possible, our analysis is compared with the statements
made for PubMed.
Meats et al. [34] conducted an analysis on the 2004
and 2005 logs of the TRIP Database, together with a
usability study with nine users. Their work concluded
that most users used a single term and only 12% of
the search sessions utilised a Boolean operator, under-
utilising the search engine features. Tsikrika et al. [48]
examined query logs from ARRS GoldMiner, a profes-
sional search engine for radiology images. They studied
the process of query modification during a user session,
aiming to guide the creation of realistic search tasks for
the ImageCLEFmed benchmark. Meats used 620,000
queries and Tsikrika only 25,000, while we use nearly 3
and 9 times more queries from TRIP and GoldMiner,
respectively, allowing us to perform a deeper analysis.
Zhang [59] analysed how 19 students solved 12 tasks
using MedlinePlus. The tasks were created based on
questions from the health section of Yahoo! Answers.
Although the log analysis made is very limited due to
the artificial scenario created and the small number of
users, Zhang could investigate browsing strategies used
by users (amount of time searching and/or browsing
MedlinePlus) and the users’ experience with Medline-
Plus (usability, usefulness of the content, interface de-
sign) through questionnaires and recording the users
performing the tasks. Our study is limited to
query logs; however, it analyses a large volume of queries from different
websites and captures user behaviour in a
natural setting.
2.4 This Work
As illustrated in Figure 1, this work closes a gap. It
studies both general and specialised search engines, as
well as taking into consideration different user expertise
levels. Throughout the rest of this work, we compare
our methodology and results with the studies cited in
this section.
3 Datasets and Pre-processing Steps
In this section, we describe the datasets used in this
study and the preprocessing steps applied to them.
Fig. 1: Our work studies both general and specialised search
engines and investigates how users with different expertise
levels search for health content
3.1 Sorting the Data by Expertise Level
We make the assumption that experts and laypeople
are more likely to use different search engines to sat-
isfy their information needs. Therefore we assume that
almost all queries entered into a particular search en-
gine are entered by only one of the two classes of users
under consideration. This assumption is justified as we
are using search logs from search engines clearly aimed
at users of specific expertise. This assumption is also
more inclusive than another assumption that has been
used to separate medical experts from laypeople: that
only searches leading to PubMed were made by med-
ical experts [52]. As discussed in [38], this assumption
would only tend to detect medical researchers, as med-
ical practitioners make less use of PubMed [31]. We do
not take into account that many users are in between
laypeople and experts as levels can vary.
At one extreme, we have the AOL users, who are predominantly laypeople. There
might be a few medical experts using AOL, but their
queries are drowned out by the laypeople queries. Also focused
on patients, HON is a search engine for laypeople
searching for reliable health information. Its main
target audience is laypeople concerned about the reliability
of the information they access. At the other extreme,
the TRIP database mainly targets physicians looking for medical
evidence; it can also be accessed by
patients, but these few patients might already be considered
specialists on their diseases. Finally, the GoldMiner
search engine is made by radiologists and for radiologists;
patients have practically no use for this kind
of information, but a variety of physicians might access
the system. We position each dataset on an expertise
axis in Figure 2, to show how the
datasets relate to one another.
[Figure 2: expertise scale running from laypeople-focused datasets (AOL, HON) on the left to expert-focused datasets (TRIP, GoldMiner) on the right]
Fig. 2: The datasets used here are plotted on an expertise
scale. The expertise level increases as a dataset is placed more
to the right-hand side of the scale.
3.2 Data
Four query logs from search engines taking free-text
queries were divided into five datasets in our analysis:
two focused on laypeople queries, two made up of
queries from medical professionals and one consisting of
queries not related to health or medical information.
The query logs that are assumed to consist almost
completely of queries submitted by laypeople were obtained
from medical-related searches in America Online’s
search service [41]⁶ and from the Health on the Net
Foundation website (HON⁷).
The AOL logs were obtained from March to May
2006. We divided them into two non-overlapping sets:
AOL-Medical and AOL-NotMedical. For this purpose,
the click-through information available in the AOL
data was used. A common approach to determine
the topic of a URL is to check whether it is listed in the
Open Directory Project (ODP)⁸ [8,11,14,52,54]. For
the clicked URLs that are not present in ODP, some researchers
use supervised learning to automatically classify
them [11,52,54]. However, it is important to
note that this approach cannot be used here, as 47% of
the AOL log entries lack the clicked URL information.
Alternative approaches can be designed. One is to
keep only queries for which the clicked URL is found in
ODP, excluding all the rest. Although valid, this approach
results in removing 73% of all queries, as only
27% of the queries had a clicked URL found in ODP.
This has a strong impact on the behaviour analysis,
such as a strong reduction in the number of queries per
session. Another possibility is to follow Cartright et
al. [8], who used a list of symptoms to filter
sessions on health information. However, this approach
creates a strong bias when analysing what users are
searching for, as it certainly results in a dataset in which
everyone searches for symptoms.
⁶ Obtained from http://www.gregsadetsky.com/aol-data/
⁷ http://www.hon.ch/HONsearch/Patients/index.html
⁸ http://www.dmoz.org/
Our solution is based on user sessions – this ap-
proach is not as restricted as when analysing single
queries and does not suffer from the bias of filtering by
keywords. First, we divide the query log into user sessions:
consecutive queries from the same user, ended
by an inactivity period exceeding 30 minutes. After
this, we assign one of the following labels to each
clicked URL, if any: (1) Medical, (2) Not Medical, or
(3) Not Found. This depends on whether the URL is
(1) found in any Medical category listed in Table 1;
(2) found in any other category: News, Arts, Games,
Health/Animals, Health/Beauty, etc; or (3) not found
in either of these. Last, we assign to the whole session
the Medical label only if the proportion of URLs on
Medical information is greater than a threshold t. Med-
ical search sessions classified this way are attributed
to the set AOL-Medical, while the rest goes to the
AOL-NotMedical set. Figure 3 illustrates the session
assignment procedure. For the experiments performed
in this work, we use t = 0.5. This value is a fair trade-off
between two extremes: considering an entire session as
being on medical information because one single URL
on medical information was clicked (see second part of
Figure 3), and considering an entire session as being
on medical information only if all the known clicked
links are on medical information (see the first part of
Figure 3).
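The session labelling rule described above can be illustrated with a minimal Python sketch. It is not the code used for the experiments: the function names (label_url, medical_proportion, session_is_medical) and the slash-separated category strings are assumptions made for illustration. We further assume that a session without any clicked URL found in ODP is left unlabelled (None). The two label lists reproduce the ODP classifications shown for the sessions S_i and S_j in Figure 3.

```python
# Medical ODP categories from Table 1, written with "/" separators
# (the separator and exact spelling are assumptions of this sketch).
MEDICAL_PREFIXES = (
    "Top/Health/Medicine",
    "Top/Health/Alternative",
    "Top/Health/Dentistry",
    "Top/Health/Conditions and Diseases",
    "Top/Health/Organisations/Medicine",
    "Top/Health/Resources",
)


def label_url(odp_category):
    """Label one clicked URL from its ODP category (None if not in ODP)."""
    if odp_category is None:
        return "Not Found"
    for prefix in MEDICAL_PREFIXES:
        if odp_category == prefix or odp_category.startswith(prefix + "/"):
            return "Medical"
    return "Not Medical"


def medical_proportion(labels):
    """Share of Medical labels among the URLs found in ODP.

    Returns None when no clicked URL of the session was found in ODP.
    """
    medical = labels.count("Medical")
    known = medical + labels.count("Not Medical")
    return medical / known if known else None


def session_is_medical(labels, t=0.5):
    """A session is Medical only if the proportion is greater than t."""
    proportion = medical_proportion(labels)
    return proportion is not None and proportion > t


# Labels of the two sessions shown in Figure 3.
s_i = ["Not Medical", "Not Found", "Not Found", "Medical", "Medical", "Medical"]
s_j = ["Not Medical", "Not Medical", "Not Found", "Not Found", "Medical", "Not Medical"]

print(medical_proportion(s_i), session_is_medical(s_i))  # 0.75 True
print(medical_proportion(s_j), session_is_medical(s_j))  # 0.25 False
```

Entries labelled Not Found are ignored in the proportion, so a session is judged only on the clicked URLs that ODP covers.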
For the first part of Figure 3, it is important to note
that the first query could be considered to belong to an-
other session, as the user intent might be different from
the rest of the session. The second and third queries,
drug names that are clearly for medical content, were
not used to calculate whether the session was on medi-
cal information or not, as their clicked URLs were not
found in ODP. After the label estimation is done, all the
queries of a session are assigned to the same class; therefore,
all six queries in S_i are assigned to AOL-Medical.
While only 27% of the queries have their URLs found
in ODP, using the session approach described above al-
lows us to have 50% of all sessions with at least one
URL found in ODP. Altogether, 68% of all AOL queries
were evaluated, as they belong to sessions that had at
least one clicked URL in ODP. Defining sessions more accurately
is a field of research in itself [21,30,19]
and is beyond the scope of this work.
Table 1: ODP categories used to filter the AOL-Medical set. These categories are the most relevant ones related to Medicine in the
ODP hierarchy (see http://www.dmoz.org/Health/Medicine/)
ODP Category                          URL Examples
\Top\Health\Medicine                  http://www.nlm.nih.gov
                                      http://www.webmd.gov
\Top\Health\Alternative               http://www.acupuncturetoday.com
                                      http://www.homeopathyhome.com
\Top\Health\Dentistry                 http://www.dental--health.com
                                      http://www.animated-teeth.com
\Top\Health\Conditions and Diseases   http://www.cancer.gov
                                      http://www.cancer.org
\Top\Health\Organisations\Medicine    http://www.ama-assn.org
                                      http://www.aafp.org
\Top\Health\Resources                 http://health.nih.gov
                                      http://www.eyeglassretailerreviews.com
Session S_i
Query           Time      Clicked URL                    ODP Classification
"google"        19:12:50  google.com                     Not Medical
"zetia"         19:21:50  None                           Not Found
"triamterene"   19:22:25  http://www.triamterene.com     Not Found
"benicar"       19:24:00  http://www.benicar.com         Medical
"toprol"        19:25:15  http://www.rxlist.com          Medical
"toprol"        19:25:15  http://www.toprol-xl.com       Medical
Final decision: P(Session=Medical) = 3 Medical / (3 Medical + 1 Not Medical) = 0.75.
The session is considered to be on medical information and goes to AOL-Medical.
Session S_j
Query           Time      Clicked URL                    ODP Classification
"fiber carpet"  00:49:01  http://www.floorfacts.com      Not Medical
"fiber carpet"  00:49:01  http://www.shawfloors.com      Not Medical
"carpet"        00:52:41  None                           Not Found
"carpet"        00:53:35  None                           Not Found
"new carpet"    00:55:20  http://www.cpsc.gov            Medical
"new carpet"    00:55:20  http://www.servicemagic.com    Not Medical
Final decision: P(Session=Medical) = 1 Medical / (1 Medical + 3 Not Medical) = 0.25.
The session is not on medical information and goes to AOL-NotMedical.
Fig. 3: Two real user sessions extracted from the AOL logs: S_i is classified as a search for medical content, while S_j is not.
The HON dataset is composed of anonymous logs
ranging from December 2011 to August 2013. This non-governmental
organisation is responsible for the HONcode,
a certification of quality given to websites fulfilling
a pre-defined list of criteria [6]. HON provides
a search engine to facilitate access to the certified
sites. Although the majority of the queries are issued in
English, the use of French or Spanish is frequent. To
reduce noise, every query in the HON dataset
was re-issued to a commercial search engine and the
snippets of the top 10 results were used as input for an
automatic language detection tool [33], which achieved
a precision of 94% in filtering English queries.
As expert datasets, we use the logs from the Turning
Research Into Practice (TRIP) database⁹ and ARRS
GoldMiner¹⁰. The former is a search engine indexing
more than 80,000 documents and covering 150 manually
selected health resources such as MEDLINE and
the Cochrane Library. Its intent is to allow easy access
to online evidence-based material for physicians [34].
The logs contain queries of 279,280 anonymous users
from January 2011 to August 2012. The GoldMiner dataset consists
of logs from an image search engine that provides access
to more than 300,000 radiology images based on
queries over the text associated with the images. Although
the usage of an image search engine is slightly different
from document search, previous work in the literature
[48,23] showed that the user search behaviour is
similar. We had access to more than 200,000 queries,
with the last logged query being issued in January 2012.
Due to a confidentiality agreement, we cannot reveal
the start date of this collection. The GoldMiner search
engine is interesting because its users are so specialised;
it therefore represents the particular case of catering
to experts in a narrower domain inside medicine.
As GoldMiner is so specialised, the number of laypeople
using it is likely small. It is therefore a good example
of the extreme specialisation end of the expert
continuum, allowing the effects of this specialisation on
the vocabulary and search behaviour of the users to be
found.
⁹ http://www.tripdatabase.com/
¹⁰ http://goldminer.arrs.org
3.3 Pre-processing Log Files
The first challenge dealing with different sources of logs
is normalising them. Unfortunately, there is clickthrough
URL information available only for the AOL and HON
datasets, limiting a detailed click analysis. Therefore,
we focus on a query content analysis, using only the
intersection of all possible fields: (1) timestamp, (2)
anonymous user identification, and (3) keywords. Nei-
ther stop word removal nor stemming were used.
Sessions were defined as follows. They begin with a query and continue
with the subsequent queries from the same user until a period of
inactivity of over 30 minutes is found. This approach to sessions, as
well as the 30-minute threshold, is widely used in the literature
[8,54,30]. We excluded extremely prolific users (over 100 queries in a
single session), since they could represent “bots” rather than
individuals.
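The session definition above can be sketched as follows. This is a minimal illustration rather than the code used in the study: the tuple layout and the function names are assumptions, and we assume that every session of a flagged user is dropped, which the text does not state explicitly.

```python
from datetime import datetime, timedelta
from itertools import groupby

SESSION_GAP = timedelta(minutes=30)   # inactivity threshold stated in the text
MAX_QUERIES_PER_SESSION = 100         # above this, the user is treated as a likely bot


def split_sessions(records):
    """Group (user_id, timestamp, query) tuples into per-user sessions."""
    sessions = []
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    for _, user_records in groupby(ordered, key=lambda r: r[0]):
        current = []
        for record in user_records:
            # A gap of more than 30 minutes closes the current session.
            if current and record[1] - current[-1][1] > SESSION_GAP:
                sessions.append(current)
                current = []
            current.append(record)
        sessions.append(current)
    return sessions


def drop_prolific_users(sessions):
    """Assumption: remove all sessions of users with any session over 100 queries."""
    flagged = {s[0][0] for s in sessions if len(s) > MAX_QUERIES_PER_SESSION}
    return [s for s in sessions if s[0][0] not in flagged]
```

Only the three fields shared by all logs (timestamp, anonymous user identification and keywords) are needed, which is why the sketch works on plain tuples.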
4 Enriching the Query Logs with MetaMap
The US National Library of Medicine MetaMap was
intensively used in this work to enrich the informa-
tion contained in the query logs, adding annotations re-
garding the concepts searched in the queries. MetaMap
is widely used to map biomedical text to the Unified
Medical Language System (UMLS) Metathesaurus, a
compendium of many controlled vocabularies in the
biomedical sciences [1]. This mapping can serve different tasks, such
as query expansion [4,20], concept identification and indexing [2,36],
question answering [12], knowledge discovery [51] and, more closely
related to this work, enriching query logs to understand user goals
[22,25]. To explain how mapping queries to UMLS can give us some
insight into the user intent, we first have to explain what UMLS is
and how MetaMap maps text to UMLS. We explain how the mapping works in
the next section and we evaluate the mapping in Section 4.2.
4.1 MetaMap
A Metathesaurus can be defined as a very large, multi-
purpose, and multi-lingual vocabulary resource that con-
tains information about biomedical and health related
concepts, their various names, and the relationships
among them [37]. In its 2013 version, the UMLS Metathesaurus has more
than one hundred different controlled vocabulary sources and a large
number of internal links, such as alternative names and views of the
same concept.
The white row of Table 2 is the original version of
the classical UMLS example from [37]. It illustrates how
different atoms can have the same meaning. Atoms are
the basic building blocks from which the Metathesaurus
is constructed, containing the concept names or strings
from each of the source vocabularies. The atoms shown are part of two
vocabularies, PSY (Psychological Index Terms) and MSH (Medical Subject
Headings, MeSH), and map different strings and terms to the same
concept, C0004238, which states that atrial fibrillation is a
pathological function. The other row of this table shows
another concept, C1963067, mapped from the vocabu-
lary NCI (National Cancer Institute), which states that
atrial fibrillation can be an adverse event associated
with the use of a medical treatment or procedure, al-
though we do not know which medical treatment or
procedure.
The job of MetaMap is to map a biomedical text to
its corresponding concept(s). MetaMap generates a can-
didate set for a piece of text, based on its internal parser
and variant generation algorithm, which takes into ac-
count acronyms, synonyms, inflections and spelling vari-
ants of the text. Then, based on metrics such as cen-
trality, variation, coverage and cohesiveness, MetaMap
ranks each candidate [1]. Occasionally, more than one candidate may
have the same score. We collect all the top-scoring candidates and
their associated semantic types, shown in square brackets below the
CUIs in Table 2. In the running example, a text containing only
‘atrial fibrillation’ is mapped to both C0004238 and C1963067 with the
same top score, and the types ‘Pathologic Function’ and ‘Finding’ are
assigned to the query. To the best of our knowledge, MetaMap is the
state of the art for mapping biomedical text to UMLS concepts.
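The tie handling described above can be sketched as follows. The `Candidate` container, its field names and the score values are hypothetical and do not reflect MetaMap's actual output format; only the two CUIs and their semantic types come from the running example.

```python
from collections import namedtuple

# Hypothetical container for one ranked candidate (not MetaMap's own format).
Candidate = namedtuple("Candidate", ["cui", "score", "semantic_types"])


def top_candidates(candidates):
    """Return every candidate sharing the best score, plus the union of their types."""
    if not candidates:
        return [], set()
    best = max(c.score for c in candidates)
    top = [c for c in candidates if c.score == best]
    types = set().union(*(c.semantic_types for c in top))
    return top, types


# Running example: both concepts tie, so both semantic types are kept.
example = [
    Candidate("C0004238", 1000, ("Pathologic Function",)),
    Candidate("C1963067", 1000, ("Finding",)),
]
top, types = top_candidates(example)
```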
Fig. 4: Two different user queries are enriched with information extracted with MetaMap. In the top part, the same example used in Table 2 is processed by MetaMap. In the bottom part, the query “lung cancer treatment” is more ambiguous and results in different mappings, such as Lung (Entire lung) / Cancer Treatment (Cancer Therapeutic Procedure) and Lung Cancer (Malignant neoplasm of lung) / Treatment (Therapeutic procedure).
[Figure, top: the question “What is atrial fibrillation?” is logged as <userID, timestamp, "atrial fibrillation">, which NLM MetaMap turns into <userID, timestamp, "atrial fibrillation", [Finding, Path...], [C14.280.067.198, ...]>. Concepts: C0004238, C1963067. Semantic types: Finding; Pathologic Function. MeSH hierarchy: C14.280.067.198, C23.550.073.198.]
[Figure, bottom: the question “What are the treatments for lung cancer?” is logged as <userID, timestamp, "lung cancer treatment">, which NLM MetaMap turns into <userID, timestamp, "lung cancer treatment", [Tissue, Body...], [A04.411, E02, ...]>. Concepts: C0024109, C1522236, C1705169, C1278908, C0684249, C0087111, C0242379, C1533734, C0819757, C0039798, C092025. Semantic types: Tissue; Body Part, Organ, or Organ Component; Neoplastic Process; Therapeutic or Preventive Procedure. MeSH hierarchy: A04.411, E02, C08.381.540, C04.588.894.797.520.]
Table 2: A concept is potentially linked to various AUIs (atoms), SUIs (strings) and LUIs (terms). We used MetaMap to map a user query, e.g. “Atrial Fibrillation”, to the different existing concepts (C0004238, C1963067). Note that each concept is associated with one single semantic meaning.

Levels: Concept (CUI) > Terms (LUIs) > Strings (SUIs) > Atoms (AUIs)

C0004238 [Pathologic Function]: Atrial Fibrillation (preferred), Atrial Fibrillations, Auricular Fibrillation, Auricular Fibrillations
    L0004238: Atrial Fibrillation (preferred), Atrial Fibrillations
        S0016668: Atrial Fibrillation (preferred)
            A0027665: Atrial Fibrillation (from MSH)
            A0027667: Atrial Fibrillation (from PSY)
        S0016669: Atrial Fibrillations (plural variant)
            A0027668: Atrial Fibrillations (from MSH)
    L0004327 (synonym): Auricular Fibrillation, Auricular Fibrillations
        S0016899: Auricular Fibrillation (preferred)
            A0027930: Auricular Fibrillation (from PSY)
        S0016900: Auricular Fibrillations (plural variant)
            A0027932: Auricular Fibrillations (from MSH)
C1963067 [Finding]: Atrial fibrillation (Atrial Fibrillation Adverse Event)
    .......
            Auricular Fibrillations (from NCI)
An interesting way to capture the user intent is to map the queries to
a well-known domain corpus. In this work we use the Medical Subject
Headings, MeSH, as it is a rich and well structured hierarchy that has
already been used to examine user query logs [22], allowing us to
compare the behaviour of the users studied here with PubMed users. The
whole MeSH hierarchy contains more than 25,000 subject headings in the
2013 version, the one used in this work, organised under 16 top
categories such as ‘Anatomy’ and ‘Diseases’. Figure 5 shows an example
of the MeSH hierarchy with the first level of the disease branch
expanded.
Fig. 5: MeSH hierarchy with the Disease branch expanded
We use the approach of Herskovic et al. [22] in this paper, mapping
each query onto one or more MeSH terms with MetaMap. As shown in
Figure 4, one query can be mapped to multiple MeSH identifiers. For
example, the query ‘atrial fibrillation’ is mapped to both MeSH ids
C14.280.067.198 and C23.550.073.198, both in the topmost Disease
category (represented by the starting letter ‘C’, as shown in
Figure 5). After the mapping to MeSH is done, we can easily obtain an
overview of the subjects the users are most interested in. In this
case we would conclude that this user is interested in diseases, as
her/his only query maps only to category ‘C’, more specifically to
cardiovascular diseases, C14, and pathological conditions, C23.
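As a small illustration of how the top category and the first-level branch can be read off a MeSH identifier, consider the sketch below. The function names are our own, and the category dictionary is deliberately partial, listing only three of the 16 top categories.

```python
from collections import Counter

# Partial, illustrative mapping from the leading letter to the top category.
TOP_CATEGORIES = {"A": "Anatomy", "C": "Diseases", "D": "Chemicals and Drugs"}


def top_category(tree_number):
    """Leading letter of a MeSH identifier, e.g. 'C' for C14.280.067.198."""
    return tree_number[0]


def first_level_branch(tree_number):
    """First component of a MeSH identifier, e.g. 'C14' (cardiovascular diseases)."""
    return tree_number.split(".")[0]


def category_counts(tree_numbers):
    """Count how many identifiers of a query fall under each top category."""
    return Counter(top_category(t) for t in tree_numbers)


mesh_ids = ["C14.280.067.198", "C23.550.073.198"]
branches = [first_level_branch(t) for t in mesh_ids]   # C14 and C23
counts = category_counts(mesh_ids)                     # both under 'C'
```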
After preprocessing, each query is converted into the following
format: <timestamp, userID, query, semanticTypes, meshIDs>, where
timestamp, userID and query are original query log fields, while
semanticTypes and meshIDs are the sets of semantic types and MeSH
identifiers generated by MetaMap. These two fields are examined in
detail in Sections 5.2.2 and 6.2. Figure 4 illustrates how the queries
were enriched with the information provided by MetaMap and the final
format.
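One possible in-memory representation of this enriched format is sketched below. The class and attribute names are illustrative choices, and the timestamp and user identifier in the example are made up; only the query, semantic types and MeSH identifiers come from the running example.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EnrichedQuery:
    # Original query log fields.
    timestamp: datetime
    user_id: str
    query: str
    # Fields added by the MetaMap enrichment step.
    semantic_types: frozenset = frozenset()
    mesh_ids: frozenset = frozenset()


record = EnrichedQuery(
    timestamp=datetime(2006, 3, 1, 12, 0),   # hypothetical value
    user_id="user-0001",                     # hypothetical value
    query="atrial fibrillation",
    semantic_types=frozenset({"Finding", "Pathologic Function"}),
    mesh_ids=frozenset({"C14.280.067.198", "C23.550.073.198"}),
)
```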
Finally, it is important to mention that the queries were mapped to
concepts in the UMLS 2013AA USAbase Strict data and no special
behaviour parameter was used. We manually examined the behaviour of
two important parameters, allowing acronyms/abbreviations (-a) and
using the word sense disambiguation module (-y), and decided not to
activate them. Our experiments show that activating the former
parameter decreases the precision significantly for the sake of a
small increase in recall, as MetaMap is already capable of matching
some of the most frequently used abbreviations (HIV, HPV, AIDS, COPD).
For the latter, we have the inverse scenario: a small gain in
precision but a larger loss in recall, as MetaMap always picks only
one possibility when more than one concept is possible. It means that
MetaMap would be forced to choose between concepts C0004238 and
C1963067 of Table 2, even when both are equally likely. The last
important reason for not using any other parameter is that we want to
compare our results with [22], in which no special option was used
either. For the experiments shown in Section 5.2.2 we used the
parameter (-R) to restrict MetaMap to use only MeSH as vocabulary
source.
4.2 Evaluation of the Mapping
As recently reported by MetaMap’s authors [3], a direct
evaluation of MetaMap against a manually constructed
gold standard mapping to UMLS concepts has almost
never been performed. Usually, indirect evaluations are made, where
the effectiveness of a task is measured with and without MetaMap, for
example, query expansion with and without the related concepts of a
concept identified by MetaMap. Here we are interested
in the few articles that evaluate the effectiveness of
MetaMap, especially the ones focused on mapping user
queries.
In 2003, Pratt and Yetisgen-Yildiz [42] compared
MetaMap mappings to UMLS with mappings made by
6 physicians and nurses. For the 151 concepts in their ground truth,
MetaMap matched 81 concepts exactly and 60 partially, and failed to
match only 10 concepts, of which 6 were not available in UMLS. In a
scenario considering partial matches (e.g., mapping to ‘angiomatosis’
instead of ‘leptomeningeal angiomatosis’),
MetaMap had an F1 of 76%. In another experiment in
the same year, Denny et al. [13] built a bigger gold stan-
dard dataset of 4,281 concepts to evaluate MetaMap,
reaching a precision of 78%, recall of 85% and F1 of
81%.
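As a quick consistency check on the figures reported for Denny et al., and assuming that F1 denotes the usual harmonic mean of precision P and recall R (the text does not spell out the definition), the reported values fit together:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.78 \times 0.85}{0.78 + 0.85}
    = \frac{1.326}{1.63}
    \approx 0.81
```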
More recently, Névéol et al. [36] reported results on using MetaMap to
detect disease concepts in both a literature corpus and a query
corpus. The results showed that MetaMap was more effective for long
sentences (F1 of 76%) than for short queries (F1 of 70%), but they
also pointed out that the average inter-annotator agreement of the 3
assessors for the query corpus was 73%, showing that MetaMap results
are not far from those of humans performing the same task. Using 1,000
queries drawn from some of the datasets used here (AOL, HON and TRIP),
Palotti et al. [39] also showed an F1 of 70% for query mappings.
Névéol et al. [35] created an annotated set of 10,000 queries that
were mapped to 16 categories, in a similar way to what is done in
Section 6.2, where the semantic types produced by MetaMap are used to
define our own categories. We used Névéol’s dataset to calibrate our
mappings for our ‘Cause’ and ‘Remedy’ categories (see Section 6.2), as
well as to make decisions regarding MetaMap’s parameters. We used the
‘Disorder’ category of Névéol as an equivalent of our ‘Cause’
category, and we combined ‘Chemical and Drugs’ (antibiotic, drug or
any chemical substance), ‘Gene, Proteins and Molecular Sequences’
(name of a molecular sequence) and ‘Medical Procedures’ (activity
involving diagnosis, or treatment) as the closest possible equivalent
of our ‘Remedy’ class. We reached an F1 of 78% for the ‘Cause’
category (P=75%, R=81%) and 72% for ‘Remedy’ (P=70%, R=73%). These
figures are in line with what is known for MetaMap when mapping
medical abstracts to concepts, encouraging us to use it for mapping
short queries to concepts as well.
4.3 Using the Mappings
In the following sections, we show how we exploit the mappings made by
MetaMap to enrich query logs. In Section 5, we use the mappings to
analyse individual queries, following an approach very similar to that
of Herskovic et al. [22], which allows us to compare our results for
individual queries. Later, in Section 6, the focus is on the session
level. An interesting work which we took as a basis for comparison is
Cartright et al. [8], which defines 3 classes: symptoms, diseases and
remedies. Note that we could group the MeSH hierarchy into these three
classes, but we prefer to use the semantic types provided by MetaMap,
as this is more intuitive and has already been done in the literature
(for example [26,35,38,39]).
5 Individual Query Analysis
One goal of this section is to study how users search,
based first on simple statistics to model their behaviour.
Also, we start exploring the content of their queries, but
considering all the queries without dividing into user
sessions.
5.1 How Users Search
We start by showing a few simple but important statistics about the
logs. The aim of this section is to understand the user behaviour
through general statistics, as well as to show how each log is
composed. In Table 3 we depict several metrics that are used to
characterise user interactions, and compare their values to
those in related studies. Torres et al. [14] use AOL logs
to study queries performed by kids. White et al. [52]
use a keyword-based method to filter domain specific
queries and divide them into those issued by laypeople
and those issued by experts. Their work also considers
other types of queries, such as queries on computer sci-
ence or financial information. We show only the data
for the medical domain. Herskovic et al. [22] and Do-
gan et al. [25] analyse different periods of PubMed logs.
For all datasets, “N/A” is used when the information
is not available.
The query logs from the related work shown in Ta-
ble 3 belong to the same time period as the AOL logs.
Query logs from HON, TRIP and GM are considerably
newer than the others. Nevertheless, Table 3 shows that
AOL-Medical and HON are very similar in many as-
pects, such as the average number of terms per query
and the average time per session. The biggest difference between these
two query logs was found for the average number of queries per
session; however, the difference is small compared to that with any
other dataset.
The average number of terms or characters per query can be an
indicator of the complexity of the queries and of the difficulty users
have in expressing their information needs. We note that
AOL-Medical and HON queries are shorter than TRIP queries,
and that TRIP logs are similar to PubMed logs in terms
of query length. White’s work also found that expert
queries are more complex than layperson queries.
The average number of queries per session and time per session,
although considerably smaller than what was found in White’s work,
follow the same pattern, with TRIP data having longer sessions than
HON and AOL-Medical. We could not find an explanation for such long
sessions in the dataset of White et al. We show only the sessions made
by experts and laypeople in the medical domain from White’s work, but
in their original paper they report that sessions are considerably
shorter when the same set of people query in other domains, with a
mean session length of less than 5 queries and a mean time per session
that is never longer than 800 seconds.
We aggregate the log into 2 groups in Table 4: laypeo-
ple and experts, making the comparison of our datasets
with the literature possible. As done by White et al. [52],
we use Cohen’s d to determine the effect size of each
variable between each pair of groups. We randomly
sampled 45,000 users from TRIP and merged them with
the 45,090 users from GoldMiner, making all datasets
have a comparable number of users. Cohen’s d is a useful metric for
meta-analysis [9] that uses the means and standard deviations of each
measurement to quantify how large a difference is. Although there are
controversies about what is a “small”, “medium” or “large” effect
size, a recommended procedure is to define a Cohen’s d effect size of
0.2 or 0.3 as a “small” effect, around 0.5 as a “medium” effect and
greater than 0.8 as a “large”
effect [9]. White et al. built a classifier to detect user
expertise based on a superset of the features shown in
Table 4. They argued that these are valuable features
based on Cohen’s d value, as well as feature importance
calculated by their regression classifier. Although con-
sidered to have a “small” effect, this was big enough to
help separate experts from laypeople. We reached very
similar Cohen’s d values to White’s paper, hypothesis-
ing that the behaviour could be used to predict exper-
tise in other logs as well. In particular, we found the
same ranking that White et al. found, among the four
features presented in Table 4.
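To make the effect-size computation concrete, the sketch below applies one common form of Cohen's d, which pools the two standard deviations with equal weight. The text does not state which variant was used, so this choice is an assumption; with the means and standard deviations listed in Table 4 it yields the rounded values 0.13, 0.22, 0.28 and 0.06 reported there.

```python
from math import sqrt


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Effect size of group a relative to group b (equal-weight pooled SD)."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# (expert mean, expert SD, laypeople mean, laypeople SD), taken from Table 4.
TABLE_4 = {
    "terms per query": (2.91, 2.09, 2.65, 1.85),
    "chars per query": (19.18, 10.16, 16.97, 10.16),
    "queries per session": (3.24, 4.29, 2.26, 2.53),
    "time per session (s)": (271, 629, 233, 562),
}

effect_sizes = {name: round(cohens_d(*row), 2) for name, row in TABLE_4.items()}
for name, d in effect_sizes.items():
    print(f"{name}: d = {d}")
```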
5.2 What Users Search for
In order to understand what users are looking for, we
investigate popular terms and queries issued. Also, we
use MetaMap to map queries to the MeSH hierarchy,
finding the high level topics associated with the user
queries.
5.2.1 Terms and Queries
We depict the most popular queries, terms (here ex-
cluding the stop words), and abbreviations used in all
logs, as well as their frequency among the queries in
Table 5. As expected, AOL-NotMedical contains navi-
gational queries and diverse terms related to entertain-
ment. Similarly, some of the most popular queries in
AOL-Medical are navigational, with the website ‘webmd.com’
appearing twice in the top 10 queries, and the Mayo
Clinic also being a common query. Both of these navigational
queries also appear in the HON search log. The analysis
of AOL-Medical terms shows common medicine-related
concepts, with people searching for information about
different cancer types in more than 3% of the cases.
Most of the top queries in the TRIP log are related
to disease. In TRIP logs, we found ‘area:’ in 3% of the queries,
‘title:’ in 2.2%, ‘to:’ in 1.5% and ‘from:’ in 1.8%. In total, these
keywords were used in 6.7% of the queries. However, we do not show
these terms in Table 5, as they do not reveal what the users search
for, but how they search. These patterns were not found in the other
datasets. The use of more advanced terms is also found in PubMed logs
[22]; we hypothesise that some users might just copy and paste their
queries from PubMed into the TRIP search engine, resulting in queries
such
Table 3: General statistics describing the query logs. ‘*’ means that the median is used, as the mean was not reported, and ‘N/A’ means that the data was not available.

(a) This work
Metric                   | AOL-M          | HON            | TRIP           | GM             | AOL-NM
Group                    | Laypeople      | Laypeople      | Experts        | Experts        | Non-Medical
Logs Initial Date        | Mar 2006       | Dec 2011       | Jan 2011       | N/A            | Mar 2006
Logs Final Date          | May 2006       | Aug 2013       | Aug 2012       | Jan 2012       | May 2006
# Users                  | 47,532         | 47,280         | 279,280        | 45,090         | 655,292
# Queries                | 215,691        | 140,109        | 1,788,968      | 219,407        | 34,427,132
# Unique Queries         | 69,407         | 85,824         | 486,431        | 90,766         | 9,695,882
# Sessions               | 79,711         | 77,977         | 344,038        | 100,848        | 10,555,562
Avg Terms Per Query      | 2.61 (±1.71)   | 2.72 (±2.05)   | 3.40 (±2.33)   | 2.28 (±2.54)   | 2.46 (±1.87)
Avg Char Per Query       | 16.22 (±9.11)  | 18.11 (±11.48) | 21.22 (±9.69)  | 16.64 (±10.20) | 15.98 (±9.67)
Avg Queries Per Session  | 2.71 (±2.50)   | 1.80 (±2.48)   | 5.20 (±5.95)   | 2.18 (±2.57)   | 3.26 (±4.65)
Avg Time Per Session (s) | 258 (±531)     | 208 (±592)     | 471 (±758)     | 163 (±520)     | 384 (±809)

(b) Literature
Metric                   | AOL-Kids           | AOL-NKids          | Laypeople         | Experts           | PubMed                | PubMed
Source                   | Torres et al. [14] | Torres et al. [14] | White et al. [52] | White et al. [52] | Herskovic et al. [22] | Dogan et al. [25]
Logs Initial Date        | Mar 2006           | Mar 2006           | May 2007          | May 2007          | Jan 2006              | Mar 2008
Logs Final Date          | May 2006           | May 2006           | Jul 2007          | Jul 2007          | Jan 2006              | Mar 2008
# Users                  | N/A                | N/A                | 37,243            | 7,971             | 624,514               | N/A
# Queries                | 485,561            | N/A                | 673,882           | 362,283           | 2,689,166             | 58,026,098
# Unique Queries         | 10,252             | N/A                | N/A               | N/A               | N/A                   | N/A
# Sessions               | 21,009             | N/A                | 68,036            | 26,000            | 740,215               | 23,017,461
Avg Terms Per Query      | 3.23               | 2.5                | 2.92              | 3.30              | 3*                    | 3.54
Avg Char Per Query       | N/A                | N/A                | 20.76             | 24.05             | N/A                   | N/A
Avg Queries Per Session  | 8.76               | 2.8                | 9.90              | 13.93             | N/A                   | 4.05
Avg Time Per Session (s) | 1238               | N/A                | 1549.74           | 1776.45           | N/A                   | N/A
Table 4: General statistics, stratified by expertise. L stands for laypeople and E for experts.

Metric                          | Laypeople       | Experts         | Cohen's d (E - L) | Cohen's d (E - L) from [52]
Total Number of Users           | 94,812          | 90,090          |                   |
Total Number of Queries         | 355,800         | 504,745         |                   |
Total Number of Unique Queries  | 149,648         | 181,051         |                   |
Total Number of Sessions        | 157,688         | 155,965         |                   |
Mean Terms Per Query            | 2.65 (±1.85)    | 2.91 (±2.09)    | 0.13              | 0.20
Mean Chars Per Query            | 16.97 (±10.16)  | 19.18 (±10.16)  | 0.22              | 0.30
Mean Queries Per Session        | 2.26 (±2.53)    | 3.24 (±4.29)    | 0.28              | 0.38
Mean Time Per Session (sec)     | 233 (±562)      | 271 (±629)      | 0.06              | 0.11
as ‘palliative care (area:oncology)’, indicating that the
user wants material about palliative care specifically for
the area of oncology. ‘Title’ is used in PubMed for performing a
search only in the title of the indexed articles, while ‘from:’ and
‘to:’ specify periods of time in which a document was published.
The topmost query in the HON log and its top 3
terms are ‘trustworthy health sites’. It shows that many
of the queries are from users that do not know which
are the medical websites that they can trust, and also
demonstrates a misunderstanding by the end users of
the nature of the content indexed by the HON search
engine (only HONcode-certified websites are indexed).
For the GoldMiner queries and terms, we clearly see
the increase in the terminological specificity of the most
popular keywords used.
5.2.2 Mapping to MeSH
MeSH is a hierarchical vocabulary used by the US National Library of
Medicine for indexing journal articles in the life sciences field. A
query log analysis using MeSH was also carried out by Herskovic et
al. [22] for the PubMed logs in order to understand what the most
popular topics searched by the users are. We use the same weighting
scheme used in Herskovic’s work: if n categories are detected in one
query, we give a weight of 1/n to each of these categories.
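The weighting scheme can be sketched as follows. We assume here that "categories" refers to the distinct top-level MeSH letters found in a query; the function name and input layout are illustrative.

```python
from collections import defaultdict


def category_weights(queries_mesh_ids):
    """Sum the 1/n weights over queries, where n is the number of distinct
    top-level categories detected in each query."""
    totals = defaultdict(float)
    for mesh_ids in queries_mesh_ids:
        categories = {mesh_id[0] for mesh_id in mesh_ids if mesh_id}
        for category in categories:
            totals[category] += 1 / len(categories)
    return dict(totals)


# The two example queries of Figure 4.
weights = category_weights([
    ["C14.280.067.198", "C23.550.073.198"],                          # only 'C'
    ["A04.411", "E02", "C08.381.540", "C04.588.894.797.520"],        # 'A', 'E', 'C'
])
```

In this example the first query contributes a full weight of 1 to category ‘C’, while the second spreads a weight of 1/3 over each of ‘A’, ‘E’ and ‘C’.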
General statistics calculated for the mapping of user
queries to MeSH terms are shown in Table 6. Here, we
are testing MetaMap for the annotation of non-medical queries as well,
which to the best of our knowledge has never been studied.
An interesting result is the fact that around 50%
of AOL-NotMedical queries were successfully mapped
to a MeSH concept. To investigate this, we collected a
large random sample of mapped queries and analysed
them. We found that MetaMap is able to find many
concepts not directly linked to medicine, such as geographic
locations, animals and plants, food and objects. For example, ‘www’
(L01.224.230.110.500), used in 10% of all AOL queries, is recognised
and annotated as Manufactured Object. Also, locations are very
commonly found and help to explain the high mean MeSH depth found for
this dataset (second row of Table 6, column AOL-NM); California, for
instance, is mapped to both Z01.107.567.875.760.200 and
Z01.107.567.875.580.200. It is important to keep this in mind when
building systems like that of Yan et al. [56], in which the MeSH depth
is used to model document scope and cohesion. When looking at false
positive mappings, especially the ones mapping to diseases and
symptoms, we found that MetaMap’s errors fall into two main
categories: (1) common English words: tattoo (tattoo disorder),
pokemon (ZBTB7A gene), and (2) abbreviations: park (Parkinson
disease), dvd (Dissociated Vertical Deviation). For both types of
errors, MetaMap, or a system using it, would have to use the context
(words around the mapping) to detect that Pokemon is used as a cartoon or
Table 5: Top queries and terms and their relative frequency (%) among all queries. AOL-Medical and HON are laypeople logs; TRIP and GoldMiner are expert logs.

QUERIES
Rank | AOL-Medical  | Freq | HON                      | Freq | TRIP          | Freq | GoldMiner           | Freq | AOL-NotMedical | Freq
1    | webmd        | 0.98 | trustworthy health sites | 4.24 | skin          | 0.29 | mega cisterna magna | 0.44 | google         | 0.95
2    | web md       | 0.41 | cancer                   | 0.51 | diabetes      | 0.22 | baastrup disease    | 0.40 | ebay           | 0.40
3    | shingles     | 0.27 | webmd                    | 0.47 | asthma        | 0.17 | toxic               | 0.23 | yahoo          | 0.37
4    | mayo clinic  | 0.26 | sleep apnea syndromes    | 0.27 | hypertension  | 0.14 | limbus vertebra     | 0.22 | yahoo.com      | 0.28
5    | lupus        | 0.25 | lymphoma                 | 0.22 | stroke        | 0.13 | cystitis cystica    | 0.20 | mapquest       | 0.25
6    | herpes       | 0.20 | breast cancer            | 0.21 | osteoporosis  | 0.11 | thornwaldt cyst     | 0.14 | google.com     | 0.23
7    | diabetes     | 0.19 | hypertension             | 0.18 | low back pain | 0.10 | buford complex      | 0.13 | myspace.com    | 0.22
8    | fibromyalgia | 0.18 | mayoclinic.com           | 0.16 | copd          | 0.10 | splenic hemangioma  | 0.13 | myspace        | 0.21
9    | pregnancy    | 0.16 | obesity                  | 0.16 | breast cancer | 0.09 | throckmorton sign   | 0.12 | www.yahoo.com  | 0.12
10   | hernia       | 0.16 | drweil.com               | 0.14 | pneumonia     | 0.09 | double duct sign    | 0.12 | www.google.com | 0.12

TERMS
Rank | AOL-Medical | Freq | HON         | Freq | TRIP       | Freq | GoldMiner | Freq | AOL-NotMedical | Freq
1    | cancer      | 3.40 | health      | 6.39 | treatment  | 3.03 | cyst      | 3.17 | free           | 1.24
2    | hospital    | 3.00 | sites       | 4.37 | cancer     | 2.56 | mri       | 1.89 | google         | 1.04
3    | pain        | 2.25 | trustworthy | 4.28 | pain       | 2.13 | disease   | 1.80 | county         | 0.65
4    | symptoms    | 2.14 | cancer      | 2.74 | care       | 2.10 | ct        | 1.75 | yahoo          | 0.62
5    | disease     | 2.03 | disease     | 1.53 | children   | 1.98 | fracture  | 1.68 | pictures       | 0.60
6    | blood       | 1.87 | diabetes    | 1.17 | therapy    | 1.81 | tumor     | 1.65 | lyrics         | 0.52
7    | medical     | 1.62 | treatment   | 0.96 | diabetes   | 1.80 | syndrome  | 1.47 | school         | 0.51
8    | webmd       | 1.21 | syndrome    | 0.87 | disease    | 1.78 | liver     | 1.26 | myspace        | 0.49
9    | surgery     | 1.14 | heart       | 0.83 | pregnancy  | 1.70 | pulmonary | 1.22 | ebay           | 0.46
10   | syndrome    | 1.13 | pain        | 0.80 | acute      | 1.41 | bone      | 1.16 | sex            | 0.44
11   | breast      | 1.11 | care        | 0.77 | syndrome   | 1.39 | renal     | 1.13 | florida        | 0.45
12   | center      | 1.09 | effects     | 0.75 | management | 1.14 | sign      | 1.12 | sale           | 0.41
13   | health      | 1.04 | medical     | 0.67 | stroke     | 1.07 | lung      | 1.11 | city           | 0.40
14   | heart       | 0.90 | blood       | 0.65 | surgery    | 1.06 | brain     | 1.08 | home           | 0.39
15   | diabetes    | 0.86 | pregnancy   | 0.61 | prevention | 1.05 | cell      | 1.00 | state          | 0.39
a game, and not as a gene name. Specifically for the second case, it
would be desirable if MetaMap allowed the use of a pre-defined list of
acronyms to increase its precision. In the current implementation,
MetaMap has a parameter for user-defined acronyms (-UDA), but it is
only used to expand more acronyms instead of overwriting its
pre-defined ones. Also for AOL-NM, the third and fourth rows of
Table 6 indicate the suitability of using mappings to MeSH for
distinguishing between medical and non-medical queries. Queries from
the medical logs have a larger number of MeSH terms and disease terms
than AOL-NM. If the errors analysed above could be amended using the
query context or session, for example, then a mapping to MeSH could be
helpful to detect queries or sessions on medical information.
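The per-query quantities summarised in Table 6 could be derived along the following lines. This is only a sketch: we assume that the depth of a MeSH identifier is its number of dot-separated levels, which the text does not define, and the feature names are our own.

```python
def mesh_depth(mesh_id):
    """Assumed definition: number of dot-separated levels in the identifier."""
    return len(mesh_id.split("."))


def query_features(mesh_ids):
    """Simple per-query features in the spirit of Table 6."""
    depths = [mesh_depth(m) for m in mesh_ids]
    return {
        "n_mesh_terms": len(mesh_ids),
        # Identifiers starting with 'C' belong to the Diseases category.
        "n_disease_terms": sum(m.startswith("C") for m in mesh_ids),
        "mean_depth": sum(depths) / len(depths) if depths else 0.0,
    }


# A location such as California sits deep in the hierarchy under this definition.
california_depth = mesh_depth("Z01.107.567.875.760.200")
features = query_features(["C14.280.067.198", "C23.550.073.198"])
```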
Going further, we present in Figure 6 the most pop-
ular categories for the first level of the MeSH hierar-
chy. We also show the results obtained by Herskovic et
al. [22] for PubMed, in order to compare our findings.
We show only the categories to which more than 5% of the queries
containing MeSH terms are mapped. When Herskovic and colleagues did
this experiment, they found that PubMed users were more interested in
the category Chemicals and Drugs. In general, the
distributions over the categories for the AOL-Medical,
Table 6: General MeSH statistics. AOL-M and HON are laypeople logs; TRIP and GM are expert logs.

Metric                             | AOL-M | HON   | TRIP  | GM    | AOL-NM
% of queries containing MeSH terms | 77.87 | 77.81 | 85.96 | 79.02 | 50.51
Mean MeSH depth                    | 3.99  | 3.83  | 3.86  | 4.01  | 4.37
Mean MeSH terms per query          | 2.14  | 2.19  | 2.78  | 2.07  | 1.12
Mean Disease terms per query       | 0.81  | 0.60  | 0.99  | 1.17  | 0.05
HON and TRIP search logs are similar. However, unlike for PubMed, we
found that the users are generally most interested in Diseases, and
then in Chemicals and Drugs. The results for GoldMiner show another
trend for the second most popular category, focused on anatomy rather
than on drugs, likely because radiologists often have to append to
their query the part of the body that they are interested in. In its
current version, GoldMiner has a filter for age, sex and modality
(e.g. CT, X-ray), but it has no filter for body parts. This analysis
suggests that it could be interesting to add a filter for body regions
as well.
Fig. 6: Popular categories according to MeSH mappings
Last, the four classes to the right of Figure 6 partly
explain the high percentage of AOL-NotMedical terms
mapped to MeSH terms. Also, the high percentage of
these least medical categories, together with low per-
centage of relevant medical categories, the four classes
to the left of Figure 6, can be used as a discriminative
feature to distinguish between medical and non-medical
logs.
6 Analysing Sessions
From now on, we consider user sessions instead of sep-
arate queries. Once more, we study first the user be-
haviour, then the content of each session.
6.1 Session Characteristics
A series of queries, part of an information seeking ac-
tivity, is defined as a session. We consider that, after
issuing the first query, a user may act in four different
ways: (1) repeat exactly the same query, (2) repeat the
query adding one or more terms to increase precision,
(3) reduce the number of terms to increase recall, or (4)
reformulate the query changing some or all the terms
used. We ignore the first case because we cannot be
sure if a user is really repeating the same query or just
changing the result page, as some search engines record
the same query as a result of a page change.
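The four cases above can be told apart by comparing the term sets of consecutive queries. The sketch below assumes whitespace tokenisation and lowercasing, which are our own simplifications; the source does not describe how terms were compared.

```python
def classify_action(previous, current):
    """Label the change from one query to the next in a session."""
    prev_terms = previous.lower().split()
    cur_terms = current.lower().split()
    if prev_terms == cur_terms:
        return "repetition"        # case (1), ignored in the analysis
    prev_set, cur_set = set(prev_terms), set(cur_terms)
    if prev_set < cur_set:
        return "expansion"         # case (2): terms added
    if cur_set < prev_set:
        return "reduction"         # case (3): terms removed
    return "reformulation"         # case (4): some or all terms changed


def session_actions(queries):
    """Actions between each pair of consecutive queries in one session."""
    return [classify_action(a, b) for a, b in zip(queries, queries[1:])]
```

For example, moving from ‘lung cancer’ to ‘lung cancer treatment’ counts as an expansion, while moving from ‘lung cancer’ to ‘breast cancer’ counts as a reformulation.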
Table 7 depicts the changes made by users during
the sessions. If during one single session a user adds
a term to the previous query and then changes a few
words, we count one action in the row Exp.Ref (for
expansion and reformulation – the order is not impor-
tant). At the end, we divide the number of actions of
each row by the total actions in the query log. Hence, Table 7 shows
that the most frequent user action is reformulation alone, but it is
more likely to happen in search engines targeting laypeople, e.g., 84%
of the sessions in the AOL-Medical logs and 63% of those in HON had
only reformulations. The last row of Table 7 shows that expert users
might be more persistent than laypeople, as more than 10% of the
sessions in the professional search engines are composed of every type
of action, while in laypeople logs this number is less than a third of
this. In the literature, White et al. [52] also hypothesise that
expert users are more persistent than laypeople.
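The aggregation behind Table 7 can be sketched as follows, taking as input one list of action labels per session. The label strings and function names are illustrative, and we assume that sessions containing only repetitions, or no action at all, are left out of the total.

```python
from collections import Counter

# Row labels of Table 7, keyed by the set of action types seen in a session.
ROWS = {
    frozenset({"expansion"}): "Expansion",
    frozenset({"reduction"}): "Reduction",
    frozenset({"reformulation"}): "Reformulation",
    frozenset({"expansion", "reduction"}): "Exp. and Red.",
    frozenset({"expansion", "reformulation"}): "Exp. and Ref.",
    frozenset({"reduction", "reformulation"}): "Red. and Ref.",
    frozenset({"expansion", "reduction", "reformulation"}): "Exp. Red. Ref.",
}


def session_row(actions):
    """Map the actions of one session to a Table 7 row; order does not matter."""
    kinds = frozenset(a for a in actions if a != "repetition")
    return ROWS.get(kinds)   # None when the session has no counted action


def action_percentages(sessions_actions):
    """Percentage of counted sessions falling into each row."""
    counts = Counter(row for row in map(session_row, sessions_actions) if row)
    total = sum(counts.values())
    return {row: 100 * n / total for row, n in counts.items()} if total else {}
```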
To better understand the last row of Table 7, we
plot the two graphs in Figure 7. The first graph is the
cumulative distribution of session length, showing that
TRIP has clearly longer sessions, with 20% of the ses-
sions being longer than 5 queries. The second graph
shows the last row of Table 7 distributed over different session
lengths. In this graph we can see how TRIP and GoldMiner users tend to
perform more actions even for short sessions, as 20% of sessions of
length 4 already contain all 3 actions. We also studied the user
behaviour when query repetition is allowed and we found a very similar
situation.
Table 7: Aggregated percentages for query modifications along the sessions. AOL-M and HON are laypeople logs; TRIP and GM are expert logs.

Action         | AOL-M | HON   | TRIP  | GM    | AOL-NM
Expansion      |  6.66 | 13.83 | 14.85 |  5.96 |  3.71
Reduction      |  1.23 |  2.23 |  4.35 |  9.61 |  0.84
Reformulation  | 84.74 | 63.56 | 43.96 | 49.56 | 80.27
Exp. and Red.  |  0.37 |  1.29 |  5.09 |  3.54 |  0.57
Exp. and Ref.  |  5.43 | 13.90 | 15.27 |  8.28 |  9.66
Red. and Ref.  |  1.01 |  2.21 |  5.63 | 12.01 |  2.09
Exp. Red. Ref. |  0.56 |  2.98 | 10.85 | 11.04 |  2.86
Fig. 7: The top graph shows the cumulative distribution of session length in terms of number of queries. The users of HON and GoldMiner tend to have shorter sessions, while the users of TRIP have longer sessions. The bottom graph shows the percentage of users who perform all three actions (expand, reduce and reformulate the previous query) in a single session, for sessions of different lengths. As expected, this percentage increases as the session length increases, but it is much higher for the expert users.
6.2 What are the Sessions About?
In this section, we attribute meaning to the users’ queries in order to better understand their behaviour in a medical search context. We use the same classes defined by Cartright, White and Horvitz [8] (symptom, cause and remedy) so that a direct comparison can be made. One difference between their method and ours is that we classify the queries into semantic types using MetaMap, as done in [26,35,38,39], instead of using hand-made rules.
In Figure 8, we show all semantic types that have a frequency of at least 5% in any query log. Additionally, we show the type ‘Sign or Symptom’ because it is important in our further analysis. We show only these 10 semantic types for the sake of readability, as MetaMap currently recognises 133 semantic types and it is not possible to visualise them all (see footnote 11). The single most common type in all the medical logs is ‘Disease or Syndrome’. As expected, the top types in AOL-NotMedical are not really related to the medical domain, and the second most common semantic type for GoldMiner is related to parts of the body, as one might expect for radiology queries.
After a careful analysis of the semantic types assigned to the queries and of the experiments described in Section 4.2, we defined the following classification based on the three classes created in [8] (example queries for each type are given in parentheses):
– Symptom: Sign or Symptom (cough; sore; headache; red eyes), Findings (stress; testicular cyst)
– Cause: Anatomical Abnormality (hiatial hernia), Cell or Molecular Dysfunction (macrocytos), Congenital Abnormality (scoliosis), Disease or Syndrome (diabetes; heart failure), Experimental Model of Disease (cancer model), Injury or Poisoning (achilles tendon rupture), Mental or Behavioural Dysfunction (bipolar disorder), Neoplastic Process (lung cancer; tumor), Pathologic Function (atypical hyperplasia)
– Remedy: all 28 types belonging to the high-level group Chemicals & Drugs, which includes Clinical Drug (cough syrup), Antibiotic (penicillin), Pharmacologic Substance (tylenol; mietamizol), Amino Acid, Peptide, or Protein (vectibix; degarilex), Immunologic Factor (vaccine; acc antibody), Vitamin (quercetin; vitamin B12), Therapeutic or Preventive Procedure (treatment; physiotherapy), etc.
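The classification above amounts to a lookup from semantic type to class, followed by a union over the queries of a session. The sketch below assumes that each query has already been annotated with a list of semantic type names written as in the list above; real MetaMap output may use different identifiers. The remedy set is deliberately partial (only the types named above), and `query_classes` and `session_focus` are hypothetical helper names.

```python
SYMPTOM_TYPES = {"Sign or Symptom", "Findings"}
CAUSE_TYPES = {
    "Anatomical Abnormality", "Cell or Molecular Dysfunction",
    "Congenital Abnormality", "Disease or Syndrome",
    "Experimental Model of Disease", "Injury or Poisoning",
    "Mental or Behavioural Dysfunction", "Neoplastic Process",
    "Pathologic Function",
}
# Partial: only the remedy types explicitly named in the text.
REMEDY_TYPES = {
    "Clinical Drug", "Antibiotic", "Pharmacologic Substance",
    "Amino Acid, Peptide, or Protein", "Immunologic Factor", "Vitamin",
    "Therapeutic or Preventive Procedure",
}

CLASS_OF = {}
for types, label in ((SYMPTOM_TYPES, "Symptom"),
                     (CAUSE_TYPES, "Cause"),
                     (REMEDY_TYPES, "Remedy")):
    CLASS_OF.update({t: label for t in types})


def query_classes(semantic_types):
    """Classes (Symptom/Cause/Remedy) present in one annotated query."""
    return {CLASS_OF[t] for t in semantic_types if t in CLASS_OF}


def session_focus(session):
    """Focus label of a session, given one list of types per query."""
    found = set().union(*(query_classes(q) for q in session))
    ordered = [c for c in ("Symptom", "Cause", "Remedy") if c in found]
    if not ordered:
        return "None"
    if len(ordered) == 3:
        return "All three"
    return " and ".join(ordered)


# Toy session: a symptom query followed by a disease query.
focus = session_focus([["Sign or Symptom"], ["Disease or Syndrome"]])
```

The labels returned by `session_focus` match the row names used for the user-focus comparison (None, Symptom, Cause, Remedy, the three pairs, and All three).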
We analyse the most popular semantic types found in the queries and show them in Table 8, together with a direct comparison to Cartright et al. [8]. The largest difference between the four medical logs analysed in this paper and the results of Cartright et al. is in the symptom category. In their study, 63.8% of the sessions are focused on symptoms, while in our analysis between 5.5% and 9.1% are. The main reason for the result of Cartright et al. is linked to the way in which they created their dataset: they kept only sessions that had at least one query containing a term from a wordlist extracted from a list of symptoms in the Merck medical dictionary. Their preprocessing step therefore explains why most of their sessions concentrated on searching for symptoms. In contrast, our analysis reveals that the most common user focus is on causes rather than on symptoms, and that the second most common focus is on a way to cure a disease. It is important to note that the logs of Cartright et al. date from 2009, which makes them 3 years more recent than the AOL logs but roughly 3 years older than the HON logs. This also suggests that the large divergence is due to the preprocessing steps and not to an evolution in how users search.

11 A complete list of all semantic types can be found online: http://metamap.nlm.nih.gov/SemanticTypesAndGroups.shtml
Once more, GoldMiner presents a different behaviour. We hypothesise that the low number of sessions on remedies is explained by the fact that radiologists searching for images are in the diagnosis phase and are therefore not interested in remedies. It is interesting to note that searching for causes and remedies in the same session is a very frequent task for medical professionals in the TRIP logs, with 16% of the sessions searching for both remedies and causes.
In Table 9, we show the behaviour modifications along a session. One oscillation is characterised by a transition from one focus type to another and then back to the original type. Originally, this analysis was made to support the hypothetico-deductive searching process, in which a user cyclically searches for a symptom, then a cause, and then returns to the symptom [8]. The symptom-cause pattern was also found in our experiments, but with a more balanced distribution in relation to the other patterns. Again, the large number of behaviours involving symptoms found in [8] is likely an artefact of how the dataset was constructed. We see that the cause-remedy pattern plays a very important role, especially in the TRIP log, in which it is the most common pattern. Finally, the least frequent pattern in all four datasets is the symptom-remedy one. The study of behaviour modification was used in [8] to build a classifier that predicts the next user action, allowing a search system to support medical searchers by pre-fetching results of possible interest or suggesting useful search strategies.
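The oscillation definition above (one focus type, then another, then back to the first) can be sketched as a scan over the sequence of per-query focus labels. This is an illustrative sketch only: it assumes each query carries a single focus label or `None`, drops unlabelled queries, and collapses consecutive repeats of the same focus before looking for A, B, A triples. The function name `oscillations` is hypothetical.

```python
from itertools import groupby


def oscillations(foci):
    """Return all (A, B, A) focus triples found along one session."""
    # Assumption: unlabelled queries are skipped and runs of the same
    # focus count as a single step.
    steps = [f for f, _ in groupby(f for f in foci if f is not None)]
    return [(a, b, c)
            for a, b, c in zip(steps, steps[1:], steps[2:])
            if a == c]


# Toy session: symptom, symptom, cause, symptom, remedy.
session = ["Symptom", None, "Symptom", "Cause", "Symptom", "Remedy"]
found = oscillations(session)
```

Under these assumptions, a session counts as having oscillations when the returned list is non-empty, and each triple identifies the interaction pattern it belongs to.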
7 User Classification
We have seen in Sections 5 and 6 that experts and laypeople use different search strategies. In this section, we take advantage of these differences to build an automatic classifier that can assist search systems by exploiting the user's domain knowledge.
Expertise inference can be directly applied by a search engine to tailor the results shown, e.g. by boosting easy-to-read documents for laypeople [49,40,11], or to tailor search aids, such as query suggestions, to the searcher's expertise. Also, the search strategies employed by experts could be used to support non-experts in learning more about domain resources and vocabulary [52].
In order to take advantage of user domain expertise, it is necessary to be able to identify whether a user is an expert or not. We employed a Random Forest classifier (see footnote 12) to solve this binary classification problem, since it is a well-known machine learning technique and has been shown to be suitable for this task before [10,38].
The classifier relies upon a set of features to decide between the two modelled classes: expert or layperson. We list the 14 features proposed in this work in Table 10, and group them into two sets: (1) user behaviour features, and (2) medicine-related features. The first set is derived from the analysis of Sections 5.1 and 6.1, while the second one covers Sections 5.2 and 6.2.
To form our dataset, we merged the users from the AOL-Medical and HON logs into the layperson class, and the users from the TRIP and GoldMiner logs into the expert class. As noted in Section 5.1, the number of users in the TRIP logs is considerably larger than in the other logs. We therefore repeat the sampling used to generate Table 4, which yields 94,812 users for the layperson class and 90,090 for the expert class.
We performed a ten-fold cross-validation experiment and present the results in Table 11. As the baseline, we employed a simple classifier that always outputs the positive class, which reaches an F1 of 67.8% when laypeople are the positive class. The next two rows of Table 11 show the classification performance when using only a single group of features. Clearly, in our experiments the user behaviour features were more important than the medical ones: while the medical features only marginally improved the F1 score for each class, the user behaviour features reached an improvement of 14% over the baseline for detecting experts. The last row of Table 11 shows the performance of the classifier using all features. We highlight an improvement of more than 20% over the baseline for both classes.
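The baseline figures can be checked by hand. A classifier that always outputs the positive class has a recall of 1 and a precision equal to the share of that class in the data. The sketch below recomputes the baseline F1 from the class sizes given above. The reading of the reported improvements as relative gains in F1 is an assumption, although it is consistent with the F1 values in Table 11.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_gain(score, baseline):
    """Relative improvement of `score` over `baseline`."""
    return (score - baseline) / baseline


laypeople, experts = 94_812, 90_090
total = laypeople + experts

# Always predicting the positive class: recall = 1, precision = class share.
baseline_layp = f1(laypeople / total, 1.0)  # about 0.678
baseline_exp = f1(experts / total, 1.0)     # about 0.655

# F1 values (in %) as reported in Table 11.
gain_behaviour_exp = relative_gain(75.0, 65.5)  # user behaviour features
gain_all_layp = relative_gain(83.9, 67.8)       # all features, laypeople
gain_all_exp = relative_gain(83.1, 65.5)        # all features, experts
```

Under this reading, the three gains come out at roughly 14%, 24% and 27%, in line with the improvements quoted in the text and in the caption of Table 11.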
The Random Forest classifier also allows us to compute the Gini importance score for each feature. This value (from 0.0 to 1.0) is higher when the feature is more important: it reflects how often a particular feature was selected for a split in the random forest, and how large its overall discriminative value was for the classification problem under study. We show in Figure 9 all the features ranked according to the Gini importance score.
12 The Random Forest classifier is based on the Python machine learning module scikit-learn (http://scikit-learn.org/). Hyper-parameters were optimised using a grid-search approach.
Fig. 8: The top most frequently used semantic types (frequency in percentage). Many of the most used types are aggregated
to study the user focus described in Table 8
Table 8: User focus when searching for medical content in a single session.

                        Laypeople             Experts
Intent                  AOL-Medical    HON    TRIP    GoldMiner    AOL-NotMedical    Cartright et al. [8]
None                           34.0   40.4    16.8         21.2              82.9                     3.9
Symptom                         9.1    6.3     5.5          6.4               3.9                    63.8
Cause                          24.3   20.9    26.0         58.2               3.3                     5.3
Remedy                         14.7   16.2    17.4          3.3               7.5                     1.1
Symptom and Cause               6.8    6.1     7.2          6.4               0.5                    22.6
Symptom and Remedy              2.1    2.6     4.5          0.5               0.9                     2.0
Cause and Remedy                7.1    5.0    15.9          3.0               0.8                     0.4
All three                       1.9    2.5     6.7          1.0               0.2                     0.8
In line with the results of Table 11, the most important features were predominantly user behaviour features.
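To make the notion of Gini importance more concrete, the sketch below computes the Gini impurity of a set of labels and the impurity decrease obtained by a single threshold split on one feature. A random forest accumulates such (weighted) decreases over all splits and trees and normalises them into per-feature scores. This pure-Python example shows only one split, with toy values and hypothetical function names, and is not the scikit-learn implementation.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def impurity_decrease(labels, feature_values, threshold):
    """Impurity decrease of splitting on feature_value <= threshold."""
    left = [y for x, y in zip(feature_values, labels) if x <= threshold]
    right = [y for x, y in zip(feature_values, labels) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Toy data, invented for illustration only.
labels = ["expert", "expert", "layperson", "layperson"]
queries_per_session = [6.0, 5.0, 2.0, 1.5]  # separates the classes
uninformative = [1.0, 2.0, 1.0, 2.0]        # does not separate them

good = impurity_decrease(labels, queries_per_session, threshold=3.0)
bad = impurity_decrease(labels, uninformative, threshold=1.5)
```

A feature that separates the classes cleanly removes all impurity at its split, while an uninformative one removes none, which is why discriminative features end up with higher importance scores.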
8 Discussion
We presented the analysis of four different query logs divided into five datasets. We discuss each of the initial research questions in the next subsections. The third research question (relation to previously published results) is covered within each subsection.
8.1 MetaMap and Short Queries
This study relies on the accuracy of MetaMap to identify the intent of the searchers. As MetaMap was designed for the annotation of documents and not queries, we evaluated its performance on short queries. Using an existing dataset of 10,000 manually annotated queries [35], we evaluated MetaMap on two of the categories used in this paper: cause and remedy. The symptom category was not evaluated as it is not included in the dataset used. We found that MetaMap can annotate the cause category with an F1 of 78% and the remedy category with an F1 of 72%. While these values are not directly comparable to other published results, they correspond to the level of accuracy measured for related tasks: MetaMap was shown to map disease concepts in queries with an F1 of 70% [36], and a mapping into five classes on 1000 queries in [39] was done with an F1 of 70%. Most importantly, inter-annotator agreement for the manual annotation of the query corpus in [36] was 73%. This indicates that the results obtained by annotating the queries with MetaMap are at the same level as those obtained by
Table 9: Cycle sequence along a single session

                                             Laypeople             Experts
Pattern            Interaction               AOL-Medical    HON    TRIP     GoldMiner    Cartright et al. [8]
Sessions with oscillations (%)                     23.07  13.48    64.61         8.60                    16.2
Symptom-Cause      Symptom→Cause→Symptom            19.2   15.6     13.2         22.7                    51.4
                   Cause→Symptom→Cause              19.9   18.8     14.5         35.3                    38.4
Symptom-Remedy     Symptom→Remedy→Symptom            8.2   11.8     10.8          4.1                     5.1
                   Remedy→Symptom→Remedy             8.1   14.2     11.6          3.8                     2.7
Cause-Remedy       Cause→Remedy→Cause               18.2   18.4     24.8         20.3                     1.5
                   Remedy→Cause→Remedy              26.4   21.2     25.1         13.8                     0.9
Table 10: Features used in the expertise classification task. Two groups were created using the features discussed in the previous sections of this work.

Group                       Feature                  Explanation
User Behaviour Features     AvgCharPerQuery          Average number of characters and terms
                            AvgTermsPerQuery         used by the user in each query
                            AvgQueryPerSession       The average number of queries and
                            AvgTimePerSession        time per session
                            AvgExpansions            Compares the i-th query to the (i-1)-th
                            AvgReductions            query and counts the expansions,
                            AvgReformulation         reductions and reformulations made
Medical Related Features    AvgSymptomsPerQuery      The average number of
                            AvgCausesPerQuery        symptoms/causes/remedies/none
                            AvgRemediesPerQuery      of them per query
                            AvgOtherTypePerQuery
                            PercQueriesWithMeSH      Percentage of queries that could be
                                                     mapped to any MeSH concept
                            AvgMeSHPerQuery          Average number of MeSH concepts
                                                     identified in all queries
                            AvgMeSHDepth             The average depth of all identified
                                                     concepts
Table 11: Classification results: compared to the baseline, the Random Forest classifier using all the features can reach an improvement of 26% when detecting experts. For the Random Forest rows, accuracy is shared by both classes and is listed once.

Classifier                               Pos. Class    Acc.    Prec.    Rec.     F1
Baseline (Positive Class)                Layp.         51.3    51.3    100.0    67.8
                                         Exp.          48.7    48.7    100.0    65.5
Random Forest, User Behaviour Feat.      Layp.         75.7    76.3     76.4    76.3
                                         Exp.                  75.1     75.0    75.0
Random Forest, Medical Features          Layp.         67.1    67.4     69.5    68.5
                                         Exp.                  66.8     64.6    65.7
Random Forest, All Features              Layp.         83.5    84.1     83.6    83.9
                                         Exp.                  82.8     83.4    83.1
manual annotation, implying that the MetaMap anno-
tations are sufficiently accurate for this study.
8.2 How is Search Conducted for Medical Content?
This section covers the behaviour of the users when
searching for medical information. Analyses were done
both at the level of individual queries and of sessions.
Fig. 9: Feature importance according to the Gini importance
score generated by the Random Forest classifier. The error
bars represent the standard deviation from the mean value
for each feature.
We found that the mean number of terms per query and the mean number of characters per query were higher for experts than for laypeople, with a small effect size (measured by Cohen’s d). This agrees with the small effect detected for these characteristics by White et al. [52]. At the session level, we also found that expert sessions were longer, in terms of both mean queries per session and mean time spent per session, again with a small effect, as in White et al. Although White et al. studied search logs from a general purpose commercial search engine, for which assumptions had to be made about user behaviour in order to detect experts and laypeople, we found very similar effect sizes, including the same importance ranking for the four characteristics measured. These small effects were sufficient to train a classifier to predict the expert and layperson classes, both in [52] and in this paper.
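For readers unfamiliar with the effect size used here, Cohen's d is the difference between two group means divided by their pooled standard deviation. By Cohen's conventions, values around 0.2 are read as small, 0.5 as medium and 0.8 as large. The sketch below uses the pooled sample standard deviation. The input numbers are toy values and do not come from the query logs, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d of samples `a` and `b` using the pooled sample variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def effect_label(d):
    """Conventional reading of |d| (thresholds 0.2, 0.5 and 0.8)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Toy example: terms per query for two invented groups of users.
expert_terms = [2, 4, 6]
layperson_terms = [1, 3, 5]
d = cohens_d(expert_terms, layperson_terms)
```

A small d can still coexist with a statistically significant difference in large logs, which is why the effect size is reported alongside the comparison of means.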
When analysing user behaviour at the session level, we conclude that experts are more persistent than laypeople, as more than 10% of the sessions in the professional search engines were composed of all possible query modification actions (expansion, reduction, reformulation). This was also found in the work of White et al. [52], who noted that sessions conducted by domain experts were generally longer than non-expert sessions and that domain experts consistently visited more pages in a session. Alternatively, longer sessions could mean that experts are struggling to find relevant information, an interpretation that supports the current efforts of the information retrieval community to help experts find scientific material to improve their clinical decisions [43]. It would be interesting to study whether increasing expertise changes the search behaviour of laypeople over time, as suggested by Wildemuth [55], but this would likely require years of search engine logs.
8.3 What are the Users Searching for?
The investigation of what users search for led us to conclusions that differ markedly from results published in the literature. In both of our analyses, the one based on the MeSH hierarchy and the one based on semantic types, we observed that users are more concerned with diseases than with symptoms, contrary to what Cartright et al. [8] found. This difference is large: Cartright et al. found that searches for symptoms occur in 63.8% of the sessions, while in our results symptoms were the focus of only 5.5% to 9.1% of the sessions, depending on the search engine. In our analysis, the cause category appeared most often, in 20.9% to 58.2% of the sessions, depending on the search engine. This large difference is likely due to the fact that Cartright et al. had to make assumptions about the characteristics of a medical query in order to extract medical queries from the logs of a general purpose search engine, whereas we used search logs from domain-specific medical search engines in three of the four cases. This allowed us to make the very strong assumption that users always enter medical queries into these search engines. Understanding what users are searching for is an essential step towards providing more relevant search results.
We also identified patterns supporting the hypothetico-deductive searching process, especially for the cause-remedy component, in which both laypeople and experts cycle through searching for causes and remedies within a session so as to discover potential treatments for a disease. Finally, we found that TRIP users, who mainly fall into our expert class, use the hypothetico-deductive method very often: in more than 60% of their sessions, versus less than 25% for AOL and HON. This supports the hypothesis that experts have much more complex information needs, which are not well addressed by current search systems [43].
An interesting kind of search in the medical domain is the one for self-diagnosis purposes [17], which often arises before consulting a medical professional (or to help decide whether to consult one). Previous research has shown that exposing people with little or no medical knowledge to complex medical language may lead to erroneous self-diagnosis and self-treatment, and that access to medical information on the Web can lead to the escalation of concerns about common symptoms (i.e., cyberchondria) [53]. Also, current commercial search engines are far from effective in answering such queries [60], presenting on average only 3 highly relevant results in the top 10. In the same manner that experts can assist non-experts in detecting credible content on the Web [44], a search system capable of inferring user expertise can learn from the decisions taken by experts to better support non-experts. In the case of self-diagnosis, the symptom-cause cycle in the expert search logs can be exploited to provide query suggestions for non-experts.
After consulting a medical professional, non-experts often query about a disease or about a treatment that was recommended to them [17]. When they literally copy and paste the complex terms into a search box, they are presented with documents that are potentially as complex as their queries [20]. Inferring the user's medical knowledge can help match non-experts with documents suitable for them even for complex queries, potentially reducing harmful situations and misunderstandings.
8.4 What are the Most Useful Features to Infer User Expertise?
We grouped the features collected in this work into two distinct sets: user behaviour features and medicine-related features. Judging by the experimental results, the user behaviour features are indispensable. The medicine-related features were not very effective on their own, but they provided large gains in all metrics when combined with the user behaviour ones.
When analysing the features through the Gini importance score, the average MeSH depth was considered the best medical feature by the classifier, a feature that was also highly ranked in [38]. Among the user behaviour features, the four main metrics analysed here were also important in [52], while features based on query modification within a session were rarely exploited by the classifier.
9 Conclusion
In this paper, we conducted a detailed study of medical information search behaviour through query logs. We studied how users search for medical documents, as well as what they search for. Results were compared to those of published studies analysing search logs in the medical domain. Almost all recent studies of the behaviour of searchers looking for medical information have been based on the search logs of a large commercial general purpose search engine. This paper performs the important task of reproducing these studies, as far as possible, on search logs from other search engines, to find out to what extent their results can be supported. An important difference between this study and published studies is the use, in three of the four cases, of domain-specific medical search engines targeted at either experts or laypeople, meaning that we have very strong priors about who is using the search engines and what they are searching for. This avoids the assumptions that have to be made in order to extract medical queries, or expert or layperson queries, from the search log of a general purpose search engine.
Our results support those published in the literature for the following outcomes: (1) it is possible to distinguish between medical experts and laypeople based on search behaviour characteristics; (2) experts issue more queries and modify their queries more often, meaning either that they are more persistent than laypeople or that their information need is more complex and more difficult to satisfy.
A large difference with respect to the literature was found in what users are searching for. Our analysis showed that diseases were the focus of the largest number of sessions (20.9%–58.2%), as opposed to symptoms (63.8% in [8]). We suggest that this difference is mainly due to the criteria used to extract medical queries from the search logs of a general purpose search engine, which skewed the results toward symptoms. This result suggests that cyberchondria [53] may be less prevalent than previously reported, especially on domain-specific medical search engines. A further result of this study that is potentially useful for search systems is the analysis of features for distinguishing experts and laypeople, which shows that, although the behavioural features were the most discriminative ones, the combination of behavioural features with medicine-related features achieved the best results.
One of the limitations of this study is the lack of clickthrough information, which would have allowed us to perform a more detailed analysis of search behaviour. A further limitation is that MetaMap can only annotate English text. Laypeople in particular prefer to query in their own language, as is clear from the high number of non-English queries that were removed from the HON search logs for this study. There is certainly a vast amount of work to be done to support such a query analysis for languages other than English, in particular due to the lack of equally detailed language resources for many languages. MeSH, on the other hand, exists for many languages, and mapping tools do exist. Still, detecting the language of very short queries is not easy, so a multilingual scenario poses many additional challenges.
The results of our analysis can be used to better understand users by building detailed user profiles based on user behaviour, in order to provide users with documents and query suggestions suited to their level of expertise. We can also identify new features for improving a search engine, such as the suggestion arising from this analysis to add a filter or facets for body regions to the GoldMiner search engine.
By using log files of several domain-specific medical search engines, this paper explores information complementary to most analyses of medical log files, which use either general search engine logs or PubMed log files. This allows us to obtain information on user groups in a different way from studies of general search engines, where assumptions have to be made that can influence the analysis of group behaviour.
Acknowledgements This research was partly funded by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 257528 (KHRESMOI), partly funded by the Horizon 2020 programme (H2020-ICT-2014-1) under grant agreement no. 644753 (KCONNECT), and partly funded by the Austrian Science Fund (FWF) project no. I1094-N23 (MUCKE).
References
1. A. R. Aronson. Effective mapping of biomedical text to
the UMLS Metathesaurus: the MetaMap program. pages
17–21, 2001.
2. A. R. Aronson, O. Bodenreider, H. F. Chang, S. M.
Humphrey, J. G. Mork, S. J. Nelson, T. C. Rindflesch,
and W. J. Wilbur. The NLM Indexing Initiative. pages
17–21, Lister Hill National Center for Biomedical Com-
munications (LHNCBC), National Library of Medicine,
Bethesda, MD 20894, USA., 2000.
3. A. R. Aronson and F. Lang. An overview of metamap:
historical perspective and recent advances. JAMIA,
17(3):229–236, 2010.
4. A. R. Aronson and T. C. Rindflesch. Query expansion
using the UMLS Metathesaurus. Proceedings of the AMIA
Annual Symposium, pages 485–489, 1997.
5. S. K. Bhavnani. Domain-specific search strategies for
the effective retrieval of healthcare and shopping infor-
mation. In CHI ’02 Extended Abstracts on Human Factors
in Computing Systems, CHI EA ’02, pages 610–611. ACM,
2002.
6. C. Boyer, V. Baujard, and A. Geissbuhler. Evolution of
Health Web certification through the HONcode experi-
ence. Stud Health Tech Inform, 169:53–7, 2011.
7. D. J. Brenes and D. Gayo-Avello. Stratified analysis of
AOL query log. Inf. Sci., 179(12):1844–1858, 2009.
8. M.-A. Cartright, R. W. White, and E. Horvitz. Intentions
and attention in exploratory health search. In Proceed-
ings of the 34th international ACM SIGIR conference on
Research and development in Information Retrieval, SIGIR
’11, pages 65–74, New York, NY, USA, 2011. ACM.
9. J. Cohen. Statistical Power Analysis for the Behavioral
Sciences (2nd Edition). Routledge, 2 edition, July 1988.
10. M. J. Cole, J. Gwizdka, C. Liu, N. J. Belkin, and
X. Zhang. Inferring user knowledge level from eye move-
ment patterns. Information Processing & Management,
49(5):1075 – 1091, 2013.
11. K. Collins-Thompson, P. N. Bennett, R. W. White,
S. de la Chica, and D. Sontag. Personalizing web search
results by reading level. In Proceedings of the 20th ACM
International Conference on Information and Knowledge
Management, CIKM ’11, pages 403–412, New York, NY,
USA, 2011. ACM.
12. D. Demner-Fushman, S. M. Humphrey, N. C. Ide, R. F.
Loane, J. G. Mork, P. Ruch, M. E. Ruiz, L. H. Smith,
W. J. Wilbur, and A. R. Aronson. Combining resources
to find answers to biomedical questions. In Proceedings
of The Sixteenth Text REtrieval Conference, TREC 2007,
Gaithersburg, Maryland, USA, November 5-9, 2007, 2007.
13. J. C. Denny, J. D. Smithers, R. A. Miller, and
A. Spickard. “Understanding” medical school curriculum
content using KnowledgeMap. Journal of the American
Medical Informatics Association, 10(4):351–362, 2003.
14. S. Duarte Torres, D. Hiemstra, and P. Serdyukov. Query
log analysis in the context of information retrieval for
children. In Proceeding of the 33rd International ACM
SIGIR Conference on Research and Development in Infor-
mation Retrieval, pages 847–848, New York, July 2010.
ACM.
15. G. B. Duggan and S. J. Payne. Knowledge in the head
and on the web: using topic expertise to aid search. In
Proceedings of the SIGCHI Conference on Human Factors
in Computing Systems, CHI ’08, pages 39–48, 2008.
16. Eurobarometer. European citizens’ digital health literacy.
Technical report, European Commission, November
2014.
17. S. Fox. Health topics. Technical report, The Pew Internet
& American Life Project, February 2011.
18. S. Fox and M. Duggan. Health online 2013. Technical
report, The Pew Internet & American Life Project, Jan-
uary 2013.
19. D. Gayo-Avello. A survey on session detection methods
in query logs and a proposal for future evaluation. Inf.
Sci., 179(12):1822–1843, May 2009.
20. L. Goeuriot, L. Kelly, W. Li, J. Palotti, P. Pecina,
G. Zuccon, A. Hanbury, G. J. F. Jones, and H. Müller.
ShARe/CLEF eHealth Evaluation Lab 2014, Task 3:
User-centred Health Information Retrieval. In Working
Notes for CLEF 2014 Conference, Sheffield, UK, September
15-18, 2014., pages 43–61, 2014.
21. D. He and A. Göker. Detecting session boundaries from
web user logs. In Proceedings of the BCS-IRSG 22nd annual
colloquium on information retrieval research, pages 57–66,
2000.
22. J. Herskovic, L. Tanaka, W. Hersh, and E. Bernstam. A
Day in the Life of PubMed: Analysis of a Typical Day’s
Query Log. Journal of the American Medical Informatics
Association, 14(2):212–220, 2007.
23. V. Hollink, T. Tsikrika, and A. P. de Vries. Semantic
search log analysis: A method and a study on profes-
sional image search. Journal of the American Society for
Information Science and Technology, 62(4):691–713, 2011.
24. I. Hsieh-Yee. Effects of search experience and subject
knowledge on the search tactics of novice and experienced
searchers. Journal of the Association for Information Sci-
ence and Technology, 1993.
25. R. Islamaj Dogan, G. C. Murray, A. Névéol, and Z. Lu.
Understanding PubMed user search behavior through
log analysis. Database, 2009, Jan. 2009.
26. A. S. Jadhav, A. P. Sheth, and J. Pathak. Online infor-
mation searching for cardiovascular diseases: An analysis
of mayo clinic search query logs. Studies in Health Tech-
nology and Informatics, pages 702–706, 2014.
27. B. Jansen, A. Spink, and I. Taksai. Handbook of research
on Web log analysis. Information Science Reference - IGI
Global Publishing, Hershey, PA, 2008.
28. B. J. Jansen and A. Spink. How are we searching the
world wide web?: a comparison of nine search engine
transaction logs. Information Processing & Management,
42(1):248–263, Jan. 2006.
29. B. J. Jansen, A. Spink, J. Bateman, and T. Saracevic.
Real life information retrieval: a study of user queries on
the web. SIGIR Forum, 32(1):5–17, Apr. 1998.
30. R. Jones and K. L. Klinkner. Beyond the session timeout:
Automatic hierarchical segmentation of search topics in
query logs. In Proceedings of the 17th ACM Conference
on Information and Knowledge Management, CIKM ’08,
pages 699–708, New York, NY, USA, 2008. ACM.
31. M. Kritz, M. Gschwandtner, V. Stefanov, A. Hanbury,
and M. Samwald. Utilization and perceived problems of
online medical resources and search tools among different
groups of european physicians. Journal of Medical Internet
Research, Jun 2013.
32. E.-M. Lacroix and R. Mehnert. The US National Library
of Medicine in the 21st century: expanding collections,
nontraditional formats, new audiences. Health Informa-
tion and Libraries Journal, 19(3):126–132, 2002.
33. M. Lui and T. Baldwin. Langid.py: An off-the-shelf lan-
guage identification tool. In Proceedings of the ACL 2012
System Demonstrations, ACL ’12, pages 25–30, Strouds-
burg, PA, USA, 2012. Association for Computational Lin-
guistics.
34. E. Meats, J. Brassey, C. Heneghan, and P. Glasziou. Us-
ing the Turning Research Into Practice (TRIP) database:
how do clinicians really search? Journal of the Medical Li-
brary Association, 95(2):156–63, 2007.
35. A. Névéol, R. I. Dogan, and Z. Lu. Semi-automatic semantic
annotation of pubmed queries: A study on quality,
efficiency, satisfaction. Journal of Biomedical Informatics,
44(2):310–318, 2011.
36. A. Névéol, W. Kim, W. J. Wilbur, and Z. Lu. Exploring
two biomedical text genres for disease recognition. In Pro-
ceedings of the Workshop on Current Trends in Biomedical
Natural Language Processing, BioNLP ’09, pages 144–152,
Stroudsburg, PA, USA, 2009. Association for Computa-
tional Linguistics.
37. NLM. UMLS Reference Manual. Bethesda (MD): National
Library of Medicine (US), Sept. 2009.
38. J. Palotti, A. Hanbury, and H. Müller. Exploiting health
related features to infer user expertise in the medical domain.
In Proceedings of WSCD Workshop on Web Search
and Data Mining. John Wiley & Sons, Inc., 2014.
39. J. Palotti, V. Stefanov, and A. Hanbury. User intent
behind medical queries: An evaluation of entity mapping
approaches with MetaMap and Freebase. In Proceedings
of the 5th Information Interaction in Context Symposium,
IIiX ’14, pages 283–286. ACM, 2014.
40. J. Palotti, G. Zuccon, L. Goeuriot, L. Kelly, A. Hanbury,
G. J. F. Jones, M. Lupu, and P. Pecina. ShARe/CLEF
eHealth Evaluation Lab 2015, Task 2: User-centred
Health Information Retrieval. In Working Notes for CLEF
2015 Conference, Toulouse, France, September 8–11, 2015.
41. G. Pass, A. Chowdhury, and C. Torgeson. A picture of
search. In Proceedings of the 1st international conference
on Scalable information systems, InfoScale ’06, New York,
NY, USA, 2006. ACM.
42. W. Pratt and M. Yetisgen-Yildiz. A study of biomedical
concept identification: MetaMap vs. people. In AMIA
Annual Symposium Proceedings, volume 2003, pages 529–
533. American Medical Informatics Association, 2003.
43. K. Roberts, M. Simpson, D. Demner-Fushman,
E. Voorhees, and W. Hersh. State-of-the-art in
biomedical literature retrieval for clinical cases: A survey
of the TREC 2014 CDS Track.
44. J. Schwarz and M. Morris. Augmenting web pages and
search results to support credibility assessment. In Proceedings
of the SIGCHI Conference on Human Factors in
Computing Systems, CHI ’11, pages 1245–1254, New York,
NY, USA, 2011. ACM.
45. C. Silverstein, H. Marais, M. Henzinger, and M. Moricz.
Analysis of a very large web search engine query log.
SIGIR Forum, 33(1):6–12, Sept. 1999.
46. F. Silvestri. Mining query logs: Turning search usage data
into knowledge. Foundations and Trends in Information
Retrieval, 4(1–2):1–174, Jan. 2010.
47. A. Spink, Y. Yang, J. Jansen, P. Nykanen, D. P. Lorence,
S. Ozmutlu, and H. C. Ozmutlu. A study of medical and
health queries to web search engines. Health Information
& Libraries Journal, 21(1):44–51, Mar. 2004.
48. T. Tsikrika, H. Müller, and C. Kahn Jr. Log analysis
to understand medical professionals’ image searching behaviour.
In Medical Informatics Europe, 2012.
49. T. M. Walsh and T. A. Volsko. Readability assessment of
internet-based consumer health information. Respiratory
Care, 53(10):1310–1315, 2008.
50. L. Wang, J. Wang, M. Wang, Y. Li, Y. Liang, and D. Xu.
Using Internet Search Engines to Obtain Medical Information:
A Comparative Study. Journal of Medical Internet
Research, 14(3):e74, May 2012.
51. M. Weeber, H. Klein, A. R. Aronson, J. G. Mork,
L. T. W. de Jong-van den Berg, and R. Vos. Text-based
discovery in biomedicine: the architecture of the DAD-system.
In Proceedings of the AMIA Symposium, pages
903–907, 2000.
52. R. W. White, S. T. Dumais, and J. Teevan. Characterizing
the influence of domain expertise on web search
behavior. In Proceedings of the Second ACM International
Conference on Web Search and Data Mining, WSDM ’09,
pages 132–141, New York, NY, USA, 2009. ACM.
53. R. W. White and E. Horvitz. Cyberchondria: Studies
of the escalation of medical concerns in web search.
ACM Transactions on Information Systems, 27(4):23:1–23:37,
Nov. 2009.
54. R. W. White and E. Horvitz. Studies of the onset and
persistence of medical concerns in search logs. In Proceedings
of the 35th international ACM SIGIR conference on
Research and development in information retrieval, SIGIR
’12, pages 265–274, New York, NY, USA, 2012. ACM.
55. B. M. Wildemuth. The effects of domain knowledge on
search tactic formulation. Journal of the Association for
Information Science and Technology, 55(3):246–258, Feb.
2004.
56. X. Yan, R. Y. Lau, D. Song, X. Li, and J. Ma. Toward a
semantic granularity model for domain-specific information
retrieval. ACM Transactions on Information Systems,
29(3):15:1–15:46, July 2011.
57. P. Younger. Internet-based information-seeking behaviour
amongst doctors and nurses: a short review of
the literature. Health Information and Libraries Journal,
27(1):2–10, Mar. 2010.
58. X. Zhang, M. Cole, and N. Belkin. Predicting users’ domain
knowledge from search behaviors. In Proceedings
of the 34th International ACM SIGIR Conference on Research
and Development in Information Retrieval, SIGIR
’11, pages 1225–1226. ACM, 2011.
59. Y. Zhang. Searching for specific health-related information
in MedlinePlus: Behavioral patterns and user experience.
Journal of the Association for Information Science
and Technology, 65(1):53–68, 2014.
60. G. Zuccon, B. Koopman, and J. Palotti. Diagnose this if
you can: On the effectiveness of search engines in finding
medical self-diagnosis information. In Advances in Information
Retrieval, pages 562–567. Springer, 2015.