Medical Case Retrieval from a Committee of Decision Trees
Gwénolé Quellec, Mathieu Lamard, Lynda Bekri, Guy Cazuguel, Member, IEEE, Christian Roux, Fellow, IEEE, Béatrice Cochener
Abstract—A novel content-based information retrieval framework, designed to cover several medical applications, is presented in this paper. The presented framework allows the retrieval of possibly incomplete medical cases consisting of several images together with semantic information. It relies on a committee of decision trees, decision support tools well suited to process this type of information. In our proposed framework, images are characterized by their digital content. It was applied to two heterogeneous medical datasets for computer-aided diagnosis: a diabetic retinopathy follow-up dataset (DRD) and a mammography screening dataset (DDSM). A precision among the top five retrieved results of 0.788 ± 0.137 and 0.869 ± 0.161 was obtained on DRD and DDSM, respectively. On DRD, for instance, it improves precision by half compared with the retrieval of single images.
Index Terms—information retrieval, decision trees, CBIR, CAD

Gwénolé Quellec, Guy Cazuguel and Christian Roux are with INSTITUT TELECOM/TELECOM Bretagne, Dpt ITI, Brest, F-29200 France. Mathieu Lamard and Béatrice Cochener are with University of Bretagne Occidentale, Brest, F-29200 France. Gwénolé Quellec, Mathieu Lamard, Lynda Bekri, Guy Cazuguel, Christian Roux and Béatrice Cochener are with Inserm, U650, Brest, F-29200 France. Lynda Bekri and Béatrice Cochener are with CHU Brest, Service d'Ophtalmologie, Brest, F-29200 France.

I. INTRODUCTION

MEDICAL experts base their diagnoses on a mixture of textbook knowledge and experience acquired through real-life clinical cases, hence the growing interest in Case-Based Reasoning (CBR) for computer-aided diagnosis systems. CBR assumes that analogous problems have similar solutions: interpreting a new situation involves retrieving similar cases in a case database. Relevance is usually modeled via a similarity measure between a query (a new medical case analyzed by a medical expert) and each case in a reference database. The retrieved cases are then used to help interpret the new case.

CBR was originally designed to process structured cases such as regular feature vectors. However, the information required by physicians to diagnose some pathologies is more complex. To diagnose Diabetic Retinopathy (DR), for instance, physicians analyze series of images together with — usually structured — contextual information, such as the patient's age, sex and medical history. Consequently, medical CBR systems should be able to manage both symbolic information, such as clinical annotations, and numerical information, such as images. Some existing systems were designed to manage symbolic information. Some others, relying on Content-Based Image Retrieval (CBIR), were designed to manage digital images. However, there have been only few attempts to merge these two approaches. One existing system linearly combines a text-based
and an image-based similarity measure into a common similarity measure; however, this approach does not apply to structured textual information. Another system lets the user restrict a CBIR search to images acquired from the same localization and/or with the same device. More generally, another system lets the user restrict a CBIR search to images whose contextual information matches an SQL query specified by the user; however, the user is assumed to know which queries are relevant, which is likely not the case if such a system is needed for diagnosis aid. As a
consequence, we believe heterogeneous information retrieval —
i.e. information retrieval based on both clinical descriptors and
digital image features — is still an open issue. A novel CBR
approach that fuses these two types of information is presented
in this paper.
In the proposed framework, heterogeneous attributes (digital
images, nominal and continuous variables) have to be aggregated
and the value of some of these attributes is possibly unknown.
To solve this generalized CBR problem, the use of decision trees (DTs) is proposed. A novel indexing scheme based on DTs is introduced; for improved retrieval efficiency, several DTs can be used. To that purpose, a randomized decision tree learning algorithm is applied so that several distinct DTs can be generated. Finally, a boosting strategy is proposed to handle unbalanced classes.
The proposed framework has another advantage: the time re-
quired for a user (e.g. a medical expert) to query the reference
database can be reduced. A procedure is proposed to update the retrieval list as new attributes are entered by the user. As a consequence, the user can stop entering attributes once he/she is satisfied with the results. Moreover, each time he/she enters an attribute, a second procedure identifies the remaining attributes likely to be the most discriminant; in other words, a fast path towards satisfactory results is suggested.
The paper is organized as follows. Section II presents deci-
sion trees and their advantages for heterogeneous information
retrieval. Section III explains how images can be included
in a decision tree. The proposed decision tree based retrieval
framework is presented in section IV. Section V describes the
medical datasets used for evaluation. Results are given in section
VI and we end with a discussion and conclusions in section VII.
II. DECISION TREES
A decision tree (DT) is a decision support tool relying on a set of rules dividing a population of cases into homogeneous groups. Each rule associates a conjunction of tests
on some attributes with a group (for instance “if sex=male and age<40 then the case belongs to group 3”).

[hal-00515356, version 1 - 6 Sep 2010. Author manuscript, published in “IEEE Transactions on Information Technology in Biomedicine 14, 5 (2010) 1227-1235”.]

In our case, attributes
are either images or contextual information. These rules are
organized as a tree; the structure of this tree can be interpreted
as follows (see Fig. 1):
• each non-terminal node represents a test on a single attribute (e.g. what is the patient's sex?)
• each edge represents a test outcome (e.g. male)
• each leaf represents a cluster of cases that provided a similar answer to each test (e.g. males younger than 40)
Fig. 1. Toy example of a decision tree. Late angiographs are images obtained from one modality (late angiography, see section V-A); in this example, these images are clustered into 2 groups.
DTs were first designed to segment a population of nominal
attribute vectors (each test outcome corresponds to an attribute
value or group of values). Quinlan extended them to continuous attributes (training cases are grouped by attribute value
ranges). More generally, DTs can process any attribute, so long
as we can provide a way to cluster cases with respect to that
attribute. Since each test is performed on a single attribute, DTs
are well suited to process heterogeneous cases.
DTs are generally used as classifiers: an unlabeled case is first
associated to a group, and then it is assigned to the most frequent
class in that group. In the presented application, we do not use DTs as classifiers; we use them to define a similarity measure between two cases.
To build a DT in an automatic fashion, we search for the most discriminant attributes and divide the population into homogeneous groups according to the value of those attributes (see Fig. 2). This process is supervised and therefore requires labeled cases.
In the medical datasets we considered, the disease severity level
was used as a class label. Before learning the tree, the dataset
has to be divided into three subsets:
• a learning set L, actually used to learn the DT (at the end
of the learning process, each case in this set is assigned to
the leaves of the tree),
• a validation set V , used to assess the performance of the
DT with different parameter settings,
• a test set T, used to assess the final performance of the DT, using the optimal parameter setting.
Note that cases assigned to V and T are not used to learn the
DT, and T is not used to tune the system at all.
At the beginning of the learning process, the tree consists of
a single node containing the whole learning set L. Then each
leaf l of the growing tree is recursively divided. For that purpose, the most discriminant attribute f among the population P ⊂ L assigned to leaf l is searched for. P is then divided into new
child nodes, one for each possible answer to the test on f.

Fig. 2. Illustration of the learning process. At each step, a group of medical cases is divided into subgroups according to the value of the most discriminant attribute within that group.

In the
proposed method, the discriminant power of a test is measured by the Shannon entropy gain G obtained when dividing a node v0 into its child nodes vn, n = 1..N (c4.5 algorithm, see equation (1)):

G = I0 − Σ_{n=1..N} (|Pn| / |P0|) · In,   with In = − Σ_{c=1..C} pcn log(pcn), n = 0..N   (1)

where pcn is the percentage of cases assigned to class c in node vn, c = 1..C, |Pn| is the number of cases in node vn, I0 is the entropy in node v0 (before it is split) and In is the entropy in the nth child node vn, n = 1..N. Entropy measures the homogeneity of each node with respect to the class label. If no test can improve the entropy enough or if population P is too small, l is not divided.
The learning algorithm can manage missing information; we describe herein the mechanism provided by the c4.5 algorithm. Suppose that the value of an attribute f, tested at a node v0, is missing for a case. Then this case is assigned to each child vn of v0 with a weight w(e0n), 0 ≤ w(e0n) ≤ 1, where e0n denotes the edge from v0 to vn. w(e0n) is the percentage of those cases in v0 whose value for f is known that are assigned to vn (see Fig. 3). In other words, w(e0n) approximates the prior probability for a case in v0 to belong to vn. Consequently, at the end of the learning process, each learning case ci is assigned to each leaf lj, j = 1..M, with a weight wij such that Σ_{j=1..M} wij = 1 (wij = 0 or 1 if each tested attribute is known for ci, 0 < wij < 1 otherwise).
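The weight-propagation mechanism above can be sketched in a few lines of code. This is a minimal illustration, not the authors' implementation: the `Node` structure, the `assign_weights` helper and the toy tree (sex, then age) are our own assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, List

@dataclass
class Node:
    # Non-terminal node: 'attribute' is tested, 'test' maps its value to a
    # child index, and 'edge_weights' stores w(e0n): the share of training
    # cases with a known value that were routed to each child.
    attribute: Optional[str] = None
    test: Optional[Callable] = None
    edge_weights: List[float] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    leaf_id: Optional[int] = None  # set on leaves only

def assign_weights(node, case, weight=1.0, out=None):
    """Propagate a case down the tree; when the tested attribute is
    missing, the weight is split among children according to w(e0n)."""
    if out is None:
        out = {}
    if node.leaf_id is not None:  # reached a leaf l_j: accumulate w_ij
        out[node.leaf_id] = out.get(node.leaf_id, 0.0) + weight
        return out
    value = case.get(node.attribute)
    if value is None:  # missing attribute: fractional assignment
        for w, child in zip(node.edge_weights, node.children):
            assign_weights(child, case, weight * w, out)
    else:  # known attribute: follow a single edge
        assign_weights(node.children[node.test(value)], case, weight, out)
    return out

# Toy tree in the spirit of Fig. 1 and Fig. 3: root tests sex, left child tests age.
tree = Node(attribute="sex", test=lambda v: 0 if v == "male" else 1,
            edge_weights=[0.6, 0.4],
            children=[Node(attribute="age", test=lambda v: 0 if v < 40 else 1,
                           edge_weights=[0.5, 0.5],
                           children=[Node(leaf_id=0), Node(leaf_id=1)]),
                      Node(leaf_id=2)])
```

For a case whose sex is missing, `assign_weights(tree, {"age": 30})` yields `{0: 0.6, 2: 0.4}`: the leaf weights always sum to 1, as required.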
III. INCLUDING IMAGES IN A DECISION TREE
To include images in a DT, the principle of Content-Based
Image Retrieval (CBIR) is applied. CBIR involves 1) building
a feature vector characterizing each image — this feature vector
is referred to as signature —, and 2) defining a distance measure
between two signatures. In the proposed framework, images are characterized by their wavelet transform.

Fig. 3. Managing missing information. If the tested attribute is unavailable for some medical case, this case is assigned to all subgroups with a weight equal to the prior probability of being assigned to that subgroup.

Then, measuring the distance between two images comes down to measuring the distance between their signatures. Similarly, when building a DT,
we use the distance measure between image signatures to divide
a population of images into subgroups. An unsupervised classi-
fication algorithm is used to cluster similar image signatures, as
described in section III-C. By this process, image signatures can
be included in a DT like any other attribute.
A. Image signature
In previous studies on CBIR, we decided to extract signatures
for images from their wavelet transform. Using the wavelet transform for database management is convenient: images can be compressed in the JPEG-2000 format, which relies on the wavelet transform, and their signature can be extracted directly in the compressed domain. Moreover, wavelet-based image signatures have shown their superiority over other image signatures. The proposed signatures model the distribution
of the wavelet coefficients in each subband of the wavelet
decomposition; as a consequence, a multiscale description of
images is obtained. To characterize the distribution of wavelet coefficients in a given subband, Wouwer's work was applied: Wouwer showed that this distribution can be modeled by a generalized Gaussian function (see equation (2)):

p(x; α, β) = ( β / (2 α Γ(1/β)) ) · exp( −(|x|/α)^β ),   where Γ(z) = ∫_0^∞ e^(−t) t^(z−1) dt, z > 0   (2)
The maximum likelihood estimators (α̂, β̂) of the wavelet coefficient distribution in each subband are used as a signature. These estimators can be computed directly from JPEG-2000
compressed images, which can be useful when a large number of
images have to be processed. Any wavelet basis can be used to
decompose images. However, the effectiveness of the extracted
signatures largely depends on the choice of this basis. For this
reason, we proposed to search for an optimal wavelet basis within the lifting scheme framework, which is at the core of the JPEG-2000 compression standard.
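As an illustration of signature extraction, the generalized Gaussian parameters of a subband can be estimated by moment matching. This is a simpler alternative to the maximum-likelihood estimators used in the paper, sketched under our own assumptions (function names, bisection bounds):

```python
import math

def ggd_ratio(beta):
    """r(beta) = Gamma(2/b)^2 / (Gamma(1/b)*Gamma(3/b)), the ratio
    (E|x|)^2 / E[x^2] for a generalized Gaussian; increasing in beta."""
    g = math.gamma
    return g(2.0 / beta) ** 2 / (g(1.0 / beta) * g(3.0 / beta))

def estimate_ggd(coeffs, lo=0.1, hi=10.0, iters=60):
    """Moment-matching estimates (alpha, beta) of a generalized
    Gaussian fitted to a list of wavelet coefficients."""
    n = len(coeffs)
    m1 = sum(abs(x) for x in coeffs) / n   # E|x|
    m2 = sum(x * x for x in coeffs) / n    # E[x^2]
    target = m1 * m1 / m2
    for _ in range(iters):                 # bisection on r(beta) = target
        mid = 0.5 * (lo + hi)
        if ggd_ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    alpha = m1 * math.gamma(1.0 / beta) / math.gamma(2.0 / beta)
    return alpha, beta
```

For Gaussian-distributed coefficients (the special case β = 2), the estimator should return β close to 2 and α close to σ√2.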
B. Distance Measure
Do and Vetterli proposed the use of the Kullback-Leibler (KL) divergence between wavelet coefficient distributions in each subband to define a distance measure between signatures (see equation (4)):

KL(p ∥ q) = ∫ p(x) log( p(x) / q(x) ) dx   (4)

The Kullback-Leibler divergence is not symmetric, which is a requirement of clustering algorithms. A symmetric version of the divergence, Ds, is used instead (see equation (5)):

Ds(p, q) = KL(p ∥ q) + KL(q ∥ p)   (5)

By injecting equation (2) into (5), we obtain a closed-form expression of the distance measure between two wavelet coefficient distributions (see equation (6)); in the asymmetric case,

KL((α1, β1) ∥ (α2, β2)) = log( (β1 α2 Γ(1/β2)) / (β2 α1 Γ(1/β1)) ) + (α1/α2)^β2 · Γ((β2+1)/β1) / Γ(1/β1) − 1/β1   (6)

and Ds follows by summing the two directions.
Finally, the distance between two images is a weighted sum of these symmetric divergences over the subbands. The ability to select a weight vector and a wavelet basis makes this image representation suitable for specialized medical datasets.
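The distance computation can be sketched directly from Do and Vetterli's closed-form KL divergence between two generalized Gaussians. The function names and the uniform subband weights in the example are our own:

```python
import math

def kl_ggd(a1, b1, a2, b2):
    """Closed-form KL divergence between two generalized Gaussians
    with scales a1, a2 and shapes b1, b2 (Do & Vetterli)."""
    g = math.gamma
    return (math.log((b1 * a2 * g(1.0 / b2)) / (b2 * a1 * g(1.0 / b1)))
            + (a1 / a2) ** b2 * g((b2 + 1.0) / b1) / g(1.0 / b1)
            - 1.0 / b1)

def image_distance(sig1, sig2, weights):
    """Weighted sum over subbands of the symmetric divergence Ds;
    each signature is a list of (alpha, beta) pairs, one per subband."""
    return sum(w * (kl_ggd(a1, b1, a2, b2) + kl_ggd(a2, b2, a1, b1))
               for w, (a1, b1), (a2, b2) in zip(weights, sig1, sig2))
```

By construction the measure is symmetric, non-negative, and zero between identical signatures.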
C. Signature Clustering
Thanks to the image signatures and the associated distance
measure above, a population of cases can be divided into sub-
groups using an unsupervised classification algorithm, provided
that a custom distance measure can be specified. Because it is simple and fast, the Fuzzy C-Means algorithm (FCM) was used for this purpose; the Euclidean distance was replaced in FCM by the proposed distance between signatures. Finding the right number of clusters is generally a difficult problem. However, when the data is labeled, the mutual information between cluster and class labels can be used to determine the optimal number of clusters K̂ (see equation (7)):

K̂ = argmax_K Σ_{c=1..C} Σ_{k=1..K} p(c, k) log( p(c, k) / (p(c) · p(k)) )   (7)

where c = 1..C are the class labels, k = 1..K are the cluster labels, p(c, k) is the joint probability distribution function of the class and cluster labels, and p(c) and p(k) are the marginal probability distribution functions.
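Equation (7) can be evaluated directly from label counts. A minimal sketch (the helper names are ours; `clusterings` stands for the FCM output obtained for each candidate K):

```python
import math
from collections import Counter

def mutual_information(class_labels, cluster_labels):
    """I(C;K) = sum_{c,k} p(c,k) * log( p(c,k) / (p(c) p(k)) )."""
    n = len(class_labels)
    joint = Counter(zip(class_labels, cluster_labels))
    pc = Counter(class_labels)   # class counts
    pk = Counter(cluster_labels) # cluster counts
    mi = 0.0
    for (c, k), cnt in joint.items():
        # p(c,k)/(p(c)p(k)) simplifies to cnt*n/(pc[c]*pk[k])
        mi += (cnt / n) * math.log(cnt * n / (pc[c] * pk[k]))
    return mi

def best_cluster_count(class_labels, clusterings):
    """Pick K maximizing mutual information, as in equation (7);
    'clusterings' maps K -> cluster labels produced with K clusters."""
    return max(clusterings,
               key=lambda K: mutual_information(class_labels, clusterings[K]))
```

A clustering that reproduces the class partition reaches the maximum (log C for balanced classes), while a clustering independent of the classes scores 0.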
IV. DECISION TREE BASED RETRIEVAL
Let cq be a case placed as a query by a user. The objective of the proposed framework is to retrieve the R most similar cases in a reference database. For diabetic retinopathy follow-up, the number of cases retrieved by the system is set to R = 5, at the ophthalmologists' request; they consider this number sufficient for time reasons and in view of the results provided by the system. Consequently, the satisfaction of the user's needs is assessed by the precision at R, denoted πR, defined as the percentage of cases relevant for cq among the topmost R results.
B. Single Tree Based Indexing
To find the R most similar cases, we need to compute a
similarity measure between cqand each case ciin the reference
database. To that purpose, we propose to compare their assign-
ment weights to each leaf lj: wqj and wij, j = 1..M. These
weights have been computed for each learning case (subset L of the reference database) while building the tree (see section II-B). They can be computed a posteriori for each remaining case in the reference database — in particular those added after the learning phase — and for the query cq. For that purpose, the weight w(e) of each edge e in the tree is stored (see section II-B). The retrieval system is illustrated on an example in Fig. 4.

Fig. 4. Illustration of the retrieval process. In (b), where attribute 'late angio.' is missing, the leaves for Groups 1 to 3 will be browsed, whereas in (c), where this attribute is available, only the leaves for Groups 1 and 3 will be browsed.
A similarity measure Sab between two cases ca and cb is defined in equation (8); Sab relies on the assignment weights (waj)j=1..M (resp. (wbj)j=1..M) of ca (resp. cb) to each leaf lj, j = 1..M:

Sab = Σ_{j=1..M} waj · wbj   (8)

This similarity measure, the scalar product of (waj)j=1..M and (wbj)j=1..M, maps to [0; 1]. It is maximal when ca and cb are completely assigned to the same leaf. It is minimal when there is no leaf to which both cases are at least partially assigned.
The similarity measure between the query cq and each case ci
in the reference database can be computed very quickly. It does
not require browsing the entire reference database:
• For each leaf lj in the DT, a list Lj containing every case ci such that wij ≠ 0 is built during the learning process. These lists are updated each time a new case is included in the reference database.
• At the beginning of the retrieval process, each similarity measure Sqi is set to 0.
• For each leaf lj such that wqj ≠ 0, Lj is browsed: for each case ci ∈ Lj, Sqi is increased by wqj · wij.
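The three steps above amount to an inverted index over leaves. A minimal sketch (the data layout and function names are our own):

```python
from collections import defaultdict

def build_leaf_lists(weights):
    """weights[i] maps leaf j -> w_ij for reference case i;
    returns leaf j -> list of (i, w_ij) with w_ij != 0."""
    lists = defaultdict(list)
    for i, wi in enumerate(weights):
        for j, w in wi.items():
            if w:
                lists[j].append((i, w))
    return lists

def retrieve(query_weights, leaf_lists, R=5):
    """Accumulates S_qi = sum_j w_qj * w_ij, browsing only the
    leaves the query reaches; returns the R best case indices."""
    scores = defaultdict(float)
    for j, wqj in query_weights.items():
        for i, wij in leaf_lists.get(j, ()):
            scores[i] += wqj * wij
    return sorted(scores, key=scores.get, reverse=True)[:R]
```

Only cases sharing at least one leaf with the query are ever touched, so the cost depends on the leaf populations rather than on the full database size.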
C. Multiple Tree Based Indexing
Because of the hierarchical architecture of the system above,
some attributes might be given too much weight. In the example
of Fig. 4 for instance, a male and a female both aged 30 would be
regarded as completely dissimilar, because of their different sex,
whereas age might play a significant role. To solve this problem,
we propose a retrieval system relying not only on one DT but
rather on several (say τ) DTs. Retrieving similar cases from a
single DT or from several DTs can be done similarly: instead
of computing the scalar product between the assignment weights
to the leaves of one DT, we simply compute the scalar product
between the assignment weights to the leaves of each of these τ
DTs. The expression of the new similarity measure S′ab is given in equation (9):

S′ab = (1/τ) · Σ_{t=1..τ} Σ_{j=1..Mt} watj · wbtj   (9)

where watj is the assignment weight of case ca to the jth leaf of the tth tree and Mt is the number of leaves in the tth tree.
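With per-tree leaf weights stored as dictionaries, equation (9) is only a few lines. This is a sketch; the 1/τ normalization follows our reading of the text, and the names are ours:

```python
def committee_similarity(weights_a, weights_b):
    """S'_ab: average over the trees of the scalar product of leaf
    assignment weights; weights_x[t] maps leaf j -> w_xtj in tree t."""
    tau = len(weights_a)  # number of trees in the committee
    total = 0.0
    for wa, wb in zip(weights_a, weights_b):  # one pair of maps per tree
        total += sum(w * wb.get(j, 0.0) for j, w in wa.items())
    return total / tau
```

Averaging over trees dampens the influence of any single attribute tested near one tree's root, which is the motivation given above for using a committee.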
Several methods have been proposed in the literature to generate sets, or committees, of DTs: Random Forests or randomized c4.5, for instance. They usually perform better as classifiers than single DTs. To generate DT committees, the learning algorithm is randomized as follows: to decide which test should be selected for dividing a tree node, the k most discriminant attributes, according to the entropy measure (see equation (1)), are identified, and one of them is picked randomly with uniform probability.
D. Retrieval System Boosting
When applied to unbalanced datasets, DTs tend to be biased towards the largest classes. If DTs are used as classifiers, this problem can be alleviated thanks to boosting. Boosting algorithms typically build a DT committee in iterations, by incrementally adding weak classifiers (i.e. with a predictive accuracy at least better than chance) to a final strong classifier. At each iteration k, a weak classifier hk is learnt from the learning set with respect to a distribution (learning cases are assigned more or less weight); the weight distribution is initially uniform. The weak classifier is then added to the final strong classifier and the learning cases are reweighted: misclassified cases gain weight and correctly classified cases lose weight. We followed the example of AdaBoost, the most popular boosting algorithm, to define a boosting strategy for our retrieval system. In our application, hk denotes a set of DTs used as a “weak retriever”
(see section IV-C). At each iteration k, the weight dk(ci) of case ci is updated as follows:
1) the weighted average retrieval error ǫk of hk is computed: ǫk = 1 − Σ_i dk(ci) · πk(ci), where πk(ci) denotes the precision at R achieved by hk for query ci;
2) the weight αk of hk is updated: αk = (1/2) · ln((1 − ǫk) / ǫk);
3) a variable γk(ci), indicating whether dk(ci) should be increased (γk(ci) > 0) or decreased (γk(ci) < 0), is computed: γk(ci) = 1 − 2 · πk(ci);
4) dk(ci) is updated according to αk and γk(ci): dk+1(ci) ∝ dk(ci) · exp(αk · γk(ci)).
The final “strong retriever” H is thus a set of DT sets hk weighted by αk. Equivalently, H is a DT set in which each tree t in hk is assigned a weight αt = αk in H. Consequently, the final similarity measure becomes S′′ab, given in equation (10):

S′′ab = (1 / Σ_t αt) · Σ_t αt · Σ_{j=1..Mt} watj · wbtj   (10)
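One reweighting iteration can be sketched as follows. This is our reconstruction of the AdaBoost-style update described above, with toy weights and precisions; the function name is ours:

```python
import math

def boost_update(d, precisions):
    """One AdaBoost-style reweighting step for the retrieval committee:
    d[i] is the current weight of case i, precisions[i] the precision
    at R achieved for query i by the current weak retriever h_k."""
    # 1) weighted average retrieval error
    eps = 1.0 - sum(di * pi for di, pi in zip(d, precisions))
    # 2) weight of the weak retriever
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # 3) direction of the update per case
    gamma = [1.0 - 2.0 * pi for pi in precisions]
    # 4) reweight and renormalize the distribution
    new_d = [di * math.exp(alpha * gi) for di, gi in zip(d, gamma)]
    s = sum(new_d)
    return alpha, [x / s for x in new_d]
```

Cases for which the current weak retriever performs poorly (low precision) gain weight, so the next weak retriever focuses on them.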
TABLE I
STRUCTURED CONTEXTUAL INFORMATION FOR DIABETIC RETINOPATHY PATIENTS

General clinical context and diabetes context:
• family clinical context: diabetes, glaucoma, blindness, misc.
• medical clinical context: arterial hypertension, dyslipidemia, proteinuria, renal dialysis, allergy, misc.
• surgical clinical context: cardiovascular, pancreas transplant, renal transplant, misc.
• ophthalmologic clinical context: cataract, myopia, AMD, glaucoma, unclear medium, cataract surgery, glaucoma surgery, misc.
• diabetes type: none, type I, type II
• diabetes duration: < 1 year, 1 to 5 years, 5 to 10 years, > 10 years
• diabetes stability: good, bad, fast modifications, glycosylated hemoglobin
• diabetes treatments: insulin injection, insulin pump, anti-diabetic drug + insulin, anti-diabetic drug, pancreas transplant
• circumstances of diagnosis: none, systematic ophthalmologic screening - known diabetes, recently diagnosed diabetes by check-up, diabetic diseases other than ophthalmic ones

Eye symptoms reported before the angiography test:
• symptoms: none, infection, unilateral decreased visual acuity (DVA), bilateral DVA, neovascular glaucoma, intra-retinal hemorrhage, retinal detachment, misc.
• maculopathy: focal edema, diffuse edema, none, ischemic
V. APPLICATION TO TWO MEDICAL DATASETS
The proposed framework has been applied to two heteroge-
neous medical datasets. The first dataset (DRD) is being built
at the LaTIM laboratory (Inserm U650), in collaboration with
ophthalmologists from Brest University Hospital. The second one (DDSM) is a well-known public-access dataset.
A. Diabetic Retinopathy Dataset (DRD)
Diabetic retinopathy is damage to the retina caused by com-
plications of diabetes, which can eventually lead to blindness.
The diabetic retinopathy dataset contains retinal images of dia-
betic patients, with associated anonymized information on the
pathology. The dataset consists of 86 patient files containing
1399 photographs altogether. Patients have been recruited at
Brest University Hospital since June 2003 and images were
acquired by experts using a Topcon Retinal Digital Camera
(TRC-50IA) connected to a computer. Images have a definition of 1280 pixels/line and 1008 lines/image, and are losslessly compressed. An image series is given in Fig. 5. The
contextual information available is the patients’ age and sex and
structured medical information (see table I). If patient records were complete, each would consist of 10 images per eye (see Fig. 5) and of 13 contextual attributes. However, in our dataset, 11.9% of the images and 39.7% of the contextual attribute values are missing. The disease severity level, according to the ICDRS classification, was assessed for each patient by an expert with three years of experience. The distribution of the disease severity among the above-mentioned 86 patients is given in table II.
Fig. 5. Photograph sequence of a patient's eye. Photographs (a), (b) and (c) were obtained with different color filters. Photographs (d) to (j) constitute a temporal angiographic series: a contrast product is injected and photographs are taken at different stages (early (d), intermediate (e), (g)-(j) and late (f)). (g)-(j) are images from the periphery of the retina.
TABLE II
PATIENT DISEASE SEVERITY DISTRIBUTION
(columns: dataset, disease severity, number of patients; the DRD severity levels include “no apparent diabetic retinopathy” and “treated/non-active diabetic retinopathy”)
B. Digital Database for Screening Mammography (DDSM)
The DDSM project, involving the Massachusetts General Hospital, the University of South Florida and the Sandia National Laboratories, has built a mammographic image database
for research on breast cancer screening. It consists of 2277
patient files. Each one includes two images of each breast, along
with some associated patient information (age at time of study,
subtlety rating for abnormalities, American College of Radiology
breast density rating and keyword description of abnormalities)
and image information (scanner, spatial resolution, etc.). The
following contextual attributes were included in the system:
• age at time of study
• breast density rating
The remaining attributes were not used, either because they are
regarded as useless (date of study, date digitized, etc.) or because
they require advanced expert interaction (the description of the
lesions visible in images). Images have a varying definition of about 2000 pixels/line for 5000 lines/image. Each patient file has been graded by a physician. Patients are then classified into three groups: normal, benign and cancer. The distribution of grades among the patients is given in table II.
C. Attributes of a medical case
In those datasets, each patient file consists of a mixture
of digital images and contextual information. Contextual at-
tributes (13 in DRD, 3 in DDSM) do not require advanced
preprocessing: textual attributes (such as “treatments” in DRD)
were translated into codes and processed as nominal attributes;
numerical contextual attributes (such as “breast density rating” in
DDSM) did not require any preprocessing at all. Images, on the