Intelligent Clinical Decision Support Systems for Non-invasive Bladder Cancer Diagnosis

Authors:
  • Regional Oncology Institute

Proceedings of CIBB 2010 1
BLADDER CANCER NON-INVASIVE DIAGNOSIS
I-BIOMARKERS BASED ON PLASMA MICRORNA WITH
100% ACCURACY
Alexandru G. Floares (1), Carmen Floares (2), Oana Vermesan (1), Tiberiu Popa (1),
Michael Williams (3), Sulaimon Ajibode (3), Liu Chang-Gong (3), Diao Lixia (3),
Wang Jing (3), Traila Nicola (5), David Jackson (4), Colin Dinney (3) and Liana Adam (3)
(1) OncoPredict & SAIA & IOCN
Artificial Intelligence Department, Str. Vlahuta, Bloc Lama C/45, email:
alexandru.floares@ieee.org
(2) OncoPredict & IOCN
email: carmen.floares@iocn.ro
(3) UT MD Anderson Cancer Center, Houston, TX; Departments of Urology and
Bioinformatics
(4) Life Biosystems, Heidelberg, Germany
(5) UMF, Timisoara, Romania
Keywords: intelligent clinical decision support systems, knowledge discovery in data,
genetic programming, diagnosis biomarker, bladder cancer.
Abstract. Medicine is undergoing several paradigm shifts: from the search for a single
ideal biomarker to the search for panels of molecules, and from a reductionistic to a
systemic view that places these molecules on functional networks. There is also a general
trend to favor non-invasive biomarkers. Identifying non-invasive biomarkers in
high-throughput data, with thousands of features and only tens of samples, is not trivial.
Here, we propose a methodology and the related concepts for developing intelligent
molecular biomarkers via knowledge discovery in data, illustrated on bladder cancer
diagnosis. A knowledge discovery in data approach, with computational intelligence
methods, is used to identify the relevant features and discover their relationships with
the diagnosis. The intelligent non-invasive diagnosis system is based on a team of
mathematical models, discovered with genetic programming, that take plasma microRNA as
inputs. This system shares with other intelligent systems we have built, using this
methodology but different computational/artificial intelligence techniques and clinical
situations (chronic hepatitis, bladder cancer progression, and prostate cancer), the best
published accuracy, even 100%. Computational intelligence could be a strong foundation for
the newly emerging Knowledge-Based-Medicine. The impact of this paradigm shift on
medical practice could be enormous. Instead of offering just hints or evidence to
clinicians, like Evidence-Based-Medicine, Knowledge-Based-Medicine, which is made
possible by and co-exists with Evidence-Based-Medicine, offers intelligent clinical
decision support systems.
1 Introduction
It is estimated that 2010 will bring 70530 new cases of bladder cancer and 14680
deaths. Bladder cancer is about four times more common in men than in women and two
times more common in white men than in African American men. To date, no screening
method is recommended for individuals at average risk. The diagnosis is established by
microscopic examination of cells from urine or bladder tissue and by examination of the
bladder wall with a cystoscope. The 5-year relative survival rate is 80% for all stages
combined. It ranges from 97% when the cancer is discovered while the tumor is in situ
down to 6% for distant stage disease, which makes early detection very important
[Society, 2010].
As Rabiya S. Tuma pointed out [Tuma, 2008], personalized medicine is the ultimate
goal of modern cancer treatment, and its success depends on the availability of tumor
biomarkers that can be used to guide treatment. Molecular biomarkers represent
alterations in gene sequences, expression levels, protein structure or function which can
be used to detect cancers at an early stage, determine prognosis, and monitor disease
progression or therapeutic response [Sidransky, 2002].
They are invaluable tools for both cancer research and clinical practice, yet few
biomarkers are in clinical use despite decades of intense effort. This is because "It
is a hard statistical problem, it is a hard clinical problem, and it is a hard biological
problem," as Marc Buyse emphasized (cited in [Tuma, 2008]).
In our opinion, there are some general paradigm shifts in medicine that affect the
biomarker field too. One is from the search for a single molecule, functioning like an
ideal biomarker, to the search for panels of biomarkers. This is a natural consequence of
the (gen)omics enterprise. Another one is from a reductionistic to a systemic view, placing
these molecules on functional networks and pathways. There is also a general trend to
favor non-invasive biomarkers, usually from serum, urine, and other body fluids.
Usually, in high-throughput experiments one investigates thousands of molecules in
parallel. Statistical and bioinformatics tools must be used to select and rank a subset of
molecules, hundreds or preferably tens, capable of discriminating between two or more
medical situations. Most studies end up with such lists of ranked molecules, and the
p-value is the most common ranking criterion. One can also place these molecular
alterations on networks and pathways, entering the realm of systems biology. These systems
have a completely or partially known structure but no dynamics. While this is real
scientific progress of the last decade, it does not do much for our understanding of
complex molecular dynamical systems, nor is it very helpful in clinical practice.
Important questions are: 1) Can artificial/computational intelligence help our
understanding of complex molecular dynamical systems? 2) Can artificial/computational
intelligence help in developing clinically useful tools?
Our answer to the first question is yes. We developed RODES [Floares, 2010a], a
class of algorithms based on artificial intelligence to automatically extract mathematical
models, in the form of systems of differential equations (dynamical systems), from high-
throughput time-series data. It is based on a combination of knowledge discovery in
data and knowledge mining, making use of genetic programming (GP), when all the
variables of the system are available, and neural networks control when some variables
are missing.
In this paper we focus mainly on the second question. More precisely, the
question is: Can we use artificial intelligence to transform an interesting but not very
useful list of ranked genes into an intelligent system, based on the most relevant
subset(s) of these genes, capable of predicting a diagnosis or any other important clinical
outcome, and supporting clinical decisions in this way? The answer to this challenging
biomedical informatics question is yes too, and we illustrate this with our recent
investigations on the non-invasive diagnosis of bladder cancer (BCa); more details will be
given elsewhere (manuscript in preparation).
The goal of this paper was to develop intelligent molecular biomarkers for the
non-invasive diagnosis of bladder cancer, based on plasma microRNAs (or miRs), via
a knowledge discovery in data approach, using computational intelligence methods. To
the best of our knowledge, this is the first time that plasma miRs are combined, using
artificial intelligence, to predict in a non-invasive way the pathologist's diagnosis, and
the prediction accuracy of most of the intelligent systems is 100%.
2 Methods
In [Floares et al., 2010] we proposed a general methodology for developing i-Biomarkers,
a basic taxonomy, and some relationships with other intelligent clinical decision support
systems (i-CDSSs; the prefix "i-" comes from intelligent) that we have developed. These
are illustrated here by presenting the main steps for developing a non-invasive bladder
cancer diagnosis i-Biomarker. i-Biomarkers are a subset of i-CDSSs, a concept first
introduced by us in [Floares, 2010b]. Stated briefly, these are clinical decision support
systems [Berner, 1998] based on artificial intelligence. Generally, i-CDSSs are the result
of a knowledge discovery in data approach: 1) Extracting and integrating information from
various biomedical data sources, after a careful preprocessing consisting mainly of: a)
cleaning features and patients, b) treating missing data, c) background correction,
d) normalization, e) various transformations, f) ranking features, g) selecting
features, h) balancing data, etc. 2) Testing various classifiers or predictive modeling
algorithms. 3) Testing various ensemble methods for combining classifiers.
The dataset used to develop the non-invasive BCa i-Biomarker was acquired using a
customized microRNA array. The first steps consist of exploratory data analysis and
data pre-processing. Background correction aims to adjust the intensity readings for
technical variability between arrays due to subtle differences in handling, labeling,
hybridization and scanning. It is essential to use background correction in order to obtain
good sensitivity from the data. The background correction was made by subtracting
B635 from F635 Median; the resulting values represent the gene expression data. The
F635 Median values represent the median feature pixel intensity for the 635 nm channel
and the B635 values represent the median feature background intensity for the same
channel. The next step was to transform the raw data to the log2 scale and normalize them
with the quantile normalization method. The quality of the data was assessed using density
plots (not shown), boxplots, unsupervised hierarchical clustering (based on Pearson
correlation and the Ward linkage rule), and principal components analysis (PCA).
The replicates of each probe were averaged after normalization and the non-human
microRNA probes were filtered out.
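The pre-processing chain described above (background subtraction, log2 transform,
quantile normalization) can be illustrated with a minimal sketch. It is written in plain
Python for readability and is not the Bioconductor code used in the study. The function
names, the `floor` parameter (which keeps corrected intensities positive so that log2 is
defined) and the tie handling are assumptions made for the illustration.

```python
import math

def background_correct(foreground, background, floor=1.0):
    """Subtract the background median (B635) from the feature median (F635 Median).

    `floor` is an assumption of this sketch: it clamps non-positive differences
    so that the later log2 transform is defined.
    """
    return [max(f - b, floor) for f, b in zip(foreground, background)]

def log2_transform(values):
    """Move corrected intensities to the log2 scale."""
    return [math.log2(v) for v in values]

def quantile_normalize(arrays):
    """Give every array (one list of probe values per sample) the same distribution.

    The k-th smallest value of each array is replaced by the mean of the k-th
    smallest values across all arrays. Ties are broken by position (no averaging).
    """
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays)
                 for k in range(n_probes)]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized
```

After normalization every sample contains exactly the same set of values, only in a
different probe order, which is what makes the arrays directly comparable.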
After preprocessing and filtering out all non-human microRNA, the number of total
probesets for analysis was 2247, and we tried to find first the miRNAs that are differ-
entially expressed between two sample groups. A two-sample t-test was used for this
purpose. The distributions of p-values are shown in Fig 1.
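As an illustration of the feature-by-feature test, the following minimal sketch computes
the two-sample t statistic for one miRNA. The pooled-variance (Student) form is an
assumption, since the text does not say which variant was used, and the conversion of the
statistic into a p-value is left to a statistics library.

```python
import math
from statistics import mean, variance

def two_sample_t(group_a, group_b):
    """Return (t statistic, degrees of freedom) for one feature.

    Pooled-variance Student form; this choice is an assumption of the sketch.
    """
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    t = (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    return t, na + nb - 2
```

Applying this function to each of the probesets in turn yields one statistic per feature,
whose p-values are then examined jointly rather than one by one.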
Because of the multiple testing involved in this approach (feature-by-feature), the
individual p-values are not particularly meaningful. However, when we look across
the entire set of tests, the distribution of p-values (under the null hypothesis that no
miRs were differentially expressed) should be uniform (indicated by the black line in
Figure 1). If, on the other hand, some features are differentially expressed, we would
expect an overabundance of small p-values. We can capture this situation by mod-
eling the distribution of the p-values as a beta-uniform mixture (BUM) described by
Pounds and Morris in [Pounds and Morris, 2003]. For all these pre-processing steps we
used the freely available bioinformatics collection of software packages Bioconductor
(http://www.bioconductor.org/).
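To make the beta-uniform mixture concrete, here is a minimal sketch of its density and
of the p-value cutoff that corresponds to a target FDR, as we understand the method of
Pounds and Morris. The parameters `a` and `lam` are assumed to have been estimated
already (for example by maximum likelihood); the fitting step is not shown, and the
function names are ours.

```python
def bum_density(p, a, lam):
    """BUM density: lam * 1 + (1 - lam) * a * p**(a - 1), with 0 < a < 1."""
    return lam + (1.0 - lam) * a * p ** (a - 1.0)

def bum_cdf(p, a, lam):
    """BUM cumulative distribution: expected fraction of p-values below p."""
    return lam * p + (1.0 - lam) * p ** a

def null_fraction_upper_bound(a, lam):
    """Density at p = 1: the largest uniform (null) share compatible with the fit."""
    return lam + (1.0 - lam) * a

def pvalue_cutoff_for_fdr(alpha, a, lam):
    """Largest p-value cutoff whose estimated FDR does not exceed alpha."""
    pi_ub = null_fraction_upper_bound(a, lam)
    if alpha >= pi_ub:
        return 1.0  # every feature can be called significant at this FDR
    return ((pi_ub - alpha * lam) / (alpha * (1.0 - lam))) ** (1.0 / (a - 1.0))
```

With the cutoff tau returned by `pvalue_cutoff_for_fdr`, the estimated FDR
`null_fraction_upper_bound(a, lam) * tau / bum_cdf(tau, a, lam)` equals `alpha`, which is
how a table of FDR levels and p-value cutoffs can be produced from a single fit.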
The best i-Biomarker combines potentially contradictory properties: a) high
accuracy, b) transparency, and c) a low number of predictors. Five artificial intelligence
methods were tested for developing intelligent clinical decision support systems for
diagnosis prediction: 1) Classification and Regression Trees (CART) [Breiman et al., 1984],
2) C5, the latest version of Quinlan's C4.5 algorithm [Quinlan, 1993], 3) Artificial Neural
Networks (ANN) [Bishop, 1995], 4) Logistic Regression (LR) [Hastie et al., 2000], and
5) Genetic Programming (GP) [Koza, 1992]. A good reference for using the first four
computational intelligence methods for knowledge discovery in data is [Nisbet et al., 2009],
and [Langdon, 2008] for the last one. The data were split into a training set (50% of the
patients) and a testing set (50% of the patients); the reported error is on the training
set for choosing the best algorithm (see Table 3) and on the test set for building the
final intelligent system. We used a linear version of steady state genetic programming
proposed by Banzhaf (see [Brameier and Banzhaf, 2007] for a detailed introduction and the
literature cited there). In linear genetic programming the individuals are computer programs
represented as a sequence of instructions from an imperative programming language or
machine language. Nordin introduced the use of machine code in this context (cited in
[Brameier and Banzhaf, 2007]).
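To illustrate what an individual looks like in linear genetic programming, the sketch
below interprets a program as a flat sequence of register instructions. It is a toy
example under our own assumptions (instruction format, operation names, register layout,
decision threshold) and does not reproduce the implementation or the patent-pending
formula of the study.

```python
import math

# Each instruction is (operation, destination register, source 1, source 2).
OPS = {
    "add": lambda x, y: x + y,
    "sub": lambda x, y: x - y,
    "mul": lambda x, y: x * y,
    "sin": lambda x, y: math.sin(x),  # unary: the second operand is ignored
}

def run_program(program, inputs, n_registers=4):
    """Execute the instructions in order and return register 0 as the output.

    Registers 0..n_registers-1 are working registers initialised to zero;
    the input features are appended after them.
    """
    registers = [0.0] * n_registers + list(inputs)
    for op, dst, src1, src2 in program:
        registers[dst] = OPS[op](registers[src1], registers[src2])
    return registers[0]

def classify(program, inputs, threshold=0.0):
    """Map the numeric output to a class label (1 = cancer, 0 = normal)."""
    return 1 if run_program(program, inputs) > threshold else 0

# Hypothetical two-instruction program over two input features (registers 4 and 5):
# r1 = x0 * x1; r0 = r1 - x0
example_program = [("mul", 1, 4, 5), ("sub", 0, 1, 4)]
```

Evolution then acts on such instruction lists through mutation and crossover, and the
fitness of each list is its classification performance on the training patients.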
The major preparatory steps for GP consist in determining: a) the set of terminals, b)
the set of functions, c) the fitness measure, d) the parameters for the run, e) the method
for designating a result, and f) the criterion for terminating a run. The function set, also
called instructions set in linear GP, can be composed of standard arithmetic or program-
ming operations, standard mathematical functions, logical functions, or domain-specific
functions. The terminals are the attributes or variables and parameters. There are several
good commercial and non-commercial GP implementations.
In data cleaning, we removed variables that have a very small standard deviation
(almost constant) and variables that have a coefficient of variation CV < 0.1 (CV =
standard deviation/mean). For ranking the features, an important step of feature
selection that is also important for understanding the biomedical problem, we used a
simple but effective method which considers one feature at a time, to see how well each
feature alone predicts the target variable. For each feature, the value of its importance
is calculated as (1 - p), where p is the p value of the corresponding statistical test of
association between the candidate feature and the target variable. The target variable was
categorical with two categories for all investigated problems and all inputs were
continuous.
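The data cleaning rule can be sketched as follows. The CV threshold of 0.1 comes from the
text; the standard deviation threshold `min_sd` is a hypothetical value, since the text
only says "very small", and taking the absolute value of the mean is our assumption for
data that may be negative after a log transform.

```python
from statistics import mean, stdev

def keep_feature(values, min_sd=1e-8, min_cv=0.1):
    """Return True if the feature survives the cleaning step.

    min_sd is a hypothetical threshold for "almost constant" features.
    """
    sd = stdev(values)
    if sd < min_sd:
        return False
    m = abs(mean(values))
    cv = float("inf") if m == 0 else sd / m
    return cv >= min_cv
```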
For these continuous input variables, p values based on the F statistic are used.
For each continuous variable X, a one-way ANOVA F test is performed to see whether all the
different classes of Y have the same mean of X. The p value based on the F statistic is
calculated as the probability that $F(J-1, N-J) > F$, where $F(J-1, N-J)$ is
a random variable that follows an F distribution with $J-1$ and $N-J$ degrees of
freedom, and

$$F = \frac{\sum_{j=1}^{J} N_j (\bar{x}_j - \bar{x})^2 / (J-1)}{\sum_{j=1}^{J} (N_j - 1) s_j^2 / (N-J)}$$

If the denominator for a feature was zero, the p value of that feature was set to zero.
The features were ranked first by sorting them by p value in ascending order, and if ties
occurred, they were sorted by F in descending order. If ties still occurred, they were
sorted by N in descending order.
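A minimal sketch of the F statistic and of the resulting ranking is given below. It
computes F directly from the class-wise values of one feature; converting F into a
p value requires the F distribution from a statistics library and is not shown. Ranking by
F in descending order gives the same order as ranking by p value in ascending order when
all features share the same J and N, which is the assumption made here. Returning
infinity for a zero denominator mirrors the rule of setting the p value to zero.

```python
from statistics import mean, variance

def anova_f(groups):
    """One-way ANOVA F for one feature; `groups` holds one list of values per class."""
    n_total = sum(len(g) for g in groups)
    n_groups = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups) / (n_groups - 1)
    within = sum((len(g) - 1) * variance(g) for g in groups) / (n_total - n_groups)
    if within == 0:
        return float("inf")  # treated as maximally important (p value set to zero)
    return between / within

def rank_features(feature_groups):
    """Sort feature names from most to least discriminative.

    `feature_groups` maps a feature name to its class-wise value lists.
    """
    return sorted(feature_groups, key=lambda name: anova_f(feature_groups[name]),
                  reverse=True)
```

For two classes, as in all problems investigated here, F is the square of the pooled
two-sample t statistic, so both tests rank the features identically.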
Based on the features' importance (1 - p), with p calculated as explained above, we
ranked and grouped features in three categories: 1) important features, with (1 - p)
between 0.95 and 1, 2) moderately important features, with (1 - p) between 0.90 and
0.95, and 3) unimportant features, with (1 - p) less than 0.90.
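The three categories translate into a simple mapping from a p value to a label. How the
boundary values 0.95 and 0.90 themselves are assigned is not stated in the text; treating
them as inclusive lower bounds is an assumption of this sketch.

```python
def importance_category(p_value):
    """Map a p value to the importance category of the feature (importance = 1 - p)."""
    importance = 1.0 - p_value
    if importance >= 0.95:   # boundary handling is an assumption
        return "important"
    if importance >= 0.90:
        return "moderately important"
    return "unimportant"
```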
3 Results
3.1 Patients Data
This study is based on a group of 38 individuals: 20 (52%) patients with bladder cancer,
5 females and 15 males, 10 (26%) with invasive BCa and 10 (26%) with superficial
BCa, and 18 (48%) individuals without any known cancer. Among the BCa patients, 7
(35%) had stage Ta, 3 (15%) stage T1, 7 (35%) stage T2, 1 (5%) stage T3 and 2 (10%)
stage T4; 2 (5%) had grade 1, 5 (25%) grade 2, and 14 (70%) grade 3. The molecular
biology data consist of 38 samples of microRNAs isolated from the plasma of these
individuals. A total of 19200 miRs were measured from blood plasma.

Table 1: Summary of significant miRNAs by different FDR cutoff values
FDR    No. significant   P-value cutoff
0.20   18                1.74e-03
0.25   23                2.60e-03
0.30   30                3.70e-03
0.35   31                5.09e-03
0.40   34                6.85e-03
0.45   45                9.11e-03
0.50   55                1.20e-02

Figure 1: Histogram showing the distribution of p-values from the feature-by-feature two sample t-test
3.2 Preprocessing Results
The results of the preprocessing steps performed to find the differentially expressed
miRs are summarized in the following figures and tables: 1) Figure 1 shows the distribu-
tions of p-values from the feature-by-feature two sample T-test, 2) Table 1 summarizes
the numbers of significant features using different false discovery rates (FDR) cutoff
values from the feature-by-feature two sample T-test, and 3) Figure 2 shows a heatmap
of the most differentially expressed features selected at an FDR level of 0.20.
Using Bioconductor we obtained a list of features that are differentially expressed
between the sample groups by performing feature-by-feature two-sample t-tests. The genes
were ordered by their p-value and the first 252 were selected for feature selection, in
order to obtain the most significant genes that influence the output, normal or BCa. A
list of 63 genes was created; it included the genes that had a chi square value equal to
or higher than 0.97. With these genes we performed the following experiments. The
first experiment had as inputs all 63 genes and as output the diagnosis, normal or
BCa. The best program and the best team had an accuracy of 100%. The genes that
appeared in the best program formula were hsa-mir-212-P, hsa-miR-134,
hsa-miR-331-5p-AS, hsa-miR-521, hsa-miR-1306, hsa-miR-340, hsa-miR-631 and
hsa-miR-1301-AS. The second experiment had as inputs the genes that remained after
removing those that appeared in the first experiment. Again, both the best program and the
best team reached 100% accuracy, and the genes included in the formula were hsa-miR-493*,
hsa-mir-647-P, hsa-miR-412-AS, hsa-miR-1206-AS, hsa-miR-29a*-AS, hsa-miR-425* and
hsa-miR-1250. Four more experiments were performed that had 100% accuracy.
All had as inputs the genes that remained after removing those which resulted from the
previous experiments. Table 2 contains all the genes that were included in the best
program formula of each experiment. We stopped these experiments when the
best program's accuracy dropped below 100%.

Figure 2: Heatmap of the most differentially expressed miRNAs selected at an FDR level of 0.2

Table 2: Experiment results
1st experiment 2nd experiment 3rd experiment 4th experiment 5th experiment
hsa-miR-33b hsa-miR-96* hsa-miR-1826 hsa-miR-92b hsa-miR-923
hsa-mir-565-A-AS hsa-miR-1290-AS hsa-miR-1302-5-P hsa-miR-1246 hsa-miR-487a
hsa-miR-548d-3p-AS hsa-mir-1914 hsa-miR-220a hsa-miR-548i4-P hsa-miR-638
hsa-miR-758-AS hsa-mir-603-P hsa-miR-608-AS hsa-miR-1273 hsa-miR-455-3p
hsa-mir-1913-AS hsa-miR-369-3p-AS hsa-mir-646-P
hsa-miR-25 hsa-miR-190b-AS hsa-miR-634
hsa-miR-16-AS hsa-miR-92a
hsa-miR-194* hsa-miR-1274b-P
hsa-mir-320
hsa-miR-1268-AS
hsa-miR-1181
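The sequence of experiments described in this section, in which the features used by the
best program are removed and the search is repeated until the accuracy drops, can be
sketched as a simple loop. The callable `fit_and_select` is hypothetical: it stands in
for one complete GP run and is assumed to return the accuracy (in percent) and the
features that appear in the best program.

```python
def peel_signatures(features, fit_and_select, min_accuracy=100.0):
    """Repeatedly fit a model, record the features it used, and remove them.

    fit_and_select(features) -> (accuracy_percent, selected_features) is a
    hypothetical stand-in for one genetic programming run.
    """
    remaining = list(features)
    signatures = []
    while remaining:
        accuracy, selected = fit_and_select(remaining)
        if accuracy < min_accuracy or not selected:
            break  # stop when the best program is no longer accurate enough
        signatures.append(list(selected))
        used = set(selected)
        remaining = [f for f in remaining if f not in used]
    return signatures, remaining
```

Each recorded signature is a disjoint set of features that, on its own, reached the
required accuracy.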
3.3 Algorithms Selection and i-Biomarkers Panel Development
Genetic Programming was the most accurate algorithm (see Table 3) and it was used
to develop intelligent clinical decision support systems. Similar results (not shown)
were obtained with neural networks (manuscript in preparation).
The three GP i-CDSSs developed each have one categorical output with two diagnosis
categories: bladder cancer versus normal, invasive versus superficial, and recurrence
and metastasis versus no recurrence and metastasis. The results are similar but only the
first i-CDSS is presented here. The data were split into a training set (50% of the
patients) and a testing set (50% of the patients); the reported error (hit-rate) is on the
test set.
Following the aforementioned preparatory steps (see section 2), after some experiments
we found the parameter settings leading to the best evolved program (see Table 4).
The accuracy (hit-rate) and the Area Under the Curve (AUC) of the Receiver Operating
Characteristic (ROC) graphs are presented in Table 5. The resulting formula (patent
pending) contains arithmetic, trigonometric and logical operations. Then, we tried to
increase the accuracy using ensemble methods. The results for the best team (patent
pending) are presented in Table 6.

Table 3: Best five algorithms compared
Model   Accuracy (%)   No. fields used   AUC
CART    95.91          7                 0.971
GP      95.45          5                 0.958
C5      85.71          2                 0.889
NN      75.51          3                 0.710
LR      73.46          7                 0.776

Table 4: Genetic Programming Parameters
Parameter                                 Setting
Population size                           500
Mutation frequency                        95%
Block mutation rate                       30%
Instruction mutation rate                 30%
Instruction data mutation rate            40%
Crossover frequency                       50%
Homologous crossover                      95%
Program Size                              512
Demes
  Crossover between demes                 0%
  Number of demes                         10
  Migration rate                          1%
Dynamic Subset Selection
  Target subset size                      50
  Selection by age                        50%
  Selection by difficulty                 50%
  Stochastic selection                    0%
  Frequency (in generation equivalents)   1
Function set                              complex
Terminal set                              64 = j + k
Constants                                 j
Inputs                                    k
The confusion matrix, for the best program, shows how often a prediction of each
class is actually correct. There are 6 patients observed as Malign Class and predicted as
Malign Class and 14 patients observed as Benign Class and predicted as Benign Class.
Also there is a misclassification, because there are 2 patients observed as Malign Class
and predicted as Benign Class.
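The hit-rates follow directly from these counts. The short sketch below redoes the
arithmetic for the best program (6 Malign patients correctly predicted, 2 Malign patients
predicted as Benign, 14 Benign patients correctly predicted); the function name is ours.

```python
def hit_rates(tp, fn, tn, fp):
    """Overall and per-class hit-rates (in percent) from confusion matrix counts."""
    total = tp + fn + tn + fp
    return {
        "overall": 100.0 * (tp + tn) / total,
        "class_one": 100.0 * tp / (tp + fn),    # Malign
        "class_zero": 100.0 * tn / (tn + fp),   # Benign
    }

rates = hit_rates(tp=6, fn=2, tn=14, fp=0)
# overall = 20/22, class_one = 6/8, class_zero = 14/14
```

These values (about 90.91%, 75% and 100%) agree with the validation row reported for the
best program.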
The ROC chart shows how well the program ranks the patients in terms of how likely
they are to be in the Malign class. It shows how many false positives are encountered
to reach each level of true positives. Figure 3 shows that 75% of the Malign patients
can be correctly classified without classifying any Benign patient as Malign.
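The ranking interpretation of the ROC chart can be made concrete: the AUC equals the
probability that a randomly chosen Malign patient receives a higher score than a randomly
chosen Benign patient. The sketch below computes it by pairwise comparison; the scores in
the usage note are hypothetical and are not data from the study.

```python
def auc(scores_malign, scores_benign):
    """AUC as the fraction of (Malign, Benign) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for sm in scores_malign:
        for sb in scores_benign:
            if sm > sb:
                wins += 1.0
            elif sm == sb:
                wins += 0.5
    return wins / (len(scores_malign) * len(scores_benign))
```

For example, with hypothetical scores `[0.9, 0.8, 0.4]` for Malign and `[0.5, 0.3, 0.2]`
for Benign patients, 8 of the 9 pairs are ordered correctly.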
Table 5: Best program
             All classes                                  Class One      Class Zero
             Hit-rate (%)   AUC       Probability of AUC   Hit-rate (%)   Hit-rate (%)
Training     100            1         0.00002              100            100
Validation   90.91          0.875     0.00207              75             100
Combined     95.45          0.95      0                    87.5           100

Table 6: Best team
             All classes                                  Class One      Class Zero
             Hit-rate (%)   AUC       Probability of AUC   Hit-rate (%)   Hit-rate (%)
Training     100            1         0.00002              100            100
Validation   100            0.84375   0.0043               100            100
Combined     100            0.96635   0                    100            100
In the confusion matrix for the best team, there are 8 patients observed and
predicted as Class One (Malign) and 14 patients observed and predicted as Class Zero
(Benign). Figure 3 shows that, to correctly predict all of the patients who belong to
the Malign class, 28.57% of the Benign patients would be predicted as belonging to
the Malign class.
The GP-based clinical decision support system has 100% accuracy. As previously
stated, these systems are molecular i-Biomarkers which can be used as i-CDSSs for
non-invasive bladder cancer diagnosis. This is an important step toward replacing
biopsy when bladder cancer is suspected, or at least a means of reducing the number of
biopsies needed.
4 Discussion
The main thesis of this work is that artificial intelligence could be a strong
foundation of the newly emerging knowledge based medicine (KBM), having roots in and
co-existing with evidence based medicine. KBM is the medicine of the postgenomic
and knowledge age, and i-biomarkers and dynamic i-biomarkers, based on artificial
intelligence combined with dynamical systems biology (manuscript in preparation), are
powerful and important KBM building blocks. The postgenomic era is characterized by
high-throughput experiments investigating thousands of molecules (messenger RNA,
microRNA, proteins, etc.) in parallel. Bioinformatics and statistical tools are used to
select and rank subsets of hundreds or preferably tens of molecules, capable of
discriminating between two or more medical situations. The results of most studies are
lists of ranked molecules, and the p-value is the most common ranking criterion. Using a
systems biology approach, these molecular alterations can be placed on biological networks
and pathways. These systems have a completely or partially known structure but no
dynamics.

Figure 3: ROC Chart for best program and best team
Unfortunately, this impressive scientific progress of the last decade does not do
much for our understanding of complex molecular dynamical systems, and it is
not very helpful in clinical practice. Important questions are: 1) Can
artificial/computational intelligence help our understanding of complex molecular
dynamical systems? 2) Can artificial/computational intelligence help in developing
clinically useful tools?
Our answer to the first question is yes. We developed RODES [Floares, 2010a], a
class of algorithms based on artificial intelligence to automatically extract mathematical
models, in the form of systems of differential equations (dynamical systems), from high-
throughput time-series data. It is based on a combination of knowledge discovery in
data and knowledge mining, making use of genetic programming (GP), when all the
variables of the system are available, and neural networks control when some variables
are missing.
In this paper we focus mainly on the second question. More precisely, the question
is: Can we use artificial intelligence to transform an interesting but not very useful
list of ranked genes into an intelligent system, based on the most relevant subset(s) of
these genes, capable of predicting a diagnosis or any other important clinical outcome, and
supporting clinical decisions in this way? The approach of the present study, knowledge
discovery in data using artificial intelligence, is totally different. While the first
approach offers clinicians some useful suggestions, the second one offers intelligent
clinical decision support systems, which can be directly used in clinical practice.
We strongly believe that in the Information Age scientific research should be viewed
as a cycle, from data to information and from information to knowledge and back. For
most of the present day biomedical investigations this cycle is broken—the results are
limited to data and their (elementary) statistics. Nevertheless, data plus statistics is the
foundation of the Evidence-Based-Medicine, but nowadays all conditions are present to
move to the next level - Knowledge-Based-Medicine (KBM).
The other one is knowledge discovery in data, using soft computing to develop i-
CDSS, by identifying the relevant features and discovering relationships between them
and the clinical outcome or output.
These particular intelligent clinical decision support systems are non-invasive molec-
ular i-Biomarkers.
The usefulness of the proposed methodology and the related concepts is corroborated
by the results of our two recent studies on non-invasive prostate cancer diagnosis
and bladder cancer progression. Both are based on knowledge discovery in
high-throughput data, and the resulting i-biomarkers, one for progression prediction (see
[Williams et al., 2010]) and one for non-invasive diagnosis (see [Floares et al., 2010]),
have 100% accuracy and are patent pending.
Five soft computing methods were tested for developing molecular i-Biomarkers for
prostate cancer pathological diagnosis prediction: two types of decision trees
(Classification and Regression Trees, and C5), Artificial Neural Networks, Logistic
Regression, and Genetic Programming. Usually, but not always, logistic regression is
outperformed by neural networks. An important drawback of neural networks, as a base for
intelligent clinical decision support systems, is that they are black-box systems, while
decision trees and the GP mathematical formula are transparent. Usually, when the
accuracies of different algorithms are similar, the white-box systems are preferred.
5 Conclusion
Soft computing or artificial/computational intelligence could be a strong foundation
for the newly emerging Knowledge-Based-Medicine. The impact of this paradigm shift
on medical practice could be enormous. Instead of offering just hints or evidence to
clinicians, like Evidence-Based-Medicine, Knowledge-Based-Medicine, which is made
possible by and co-exists with Evidence-Based-Medicine, offers intelligent clinical
decision support systems. The methodology and concepts proposed here, for developing
intelligent clinical decision support systems, proved to be effective, allowing us to build
various intelligent systems for diagnosis and prognosis in chronic hepatitis, bladder
cancer, and prostate cancer, all having the best published accuracy, some of them even
100%.
Acknowledgments
References
[Berner, 1998] Berner, E. S. (1998). Clinical Decision Support Systems: Theory and Practice. Springer,
New York.
[Bishop, 1995] Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University
Press, Inc., New York, NY, USA.
[Brameier and Banzhaf, 2007] Brameier, M. and Banzhaf, W. (2007). Linear Genetic Programming.
Genetic and Evolutionary Series. Springer.
[Breiman et al., 1984] Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and
Regression Trees. Wadsworth and Brooks, Monterey, CA.
[Floares et al., 2010] Floares, A., Balacescu, O., Floares, C., Balacescu, L., Popa, T., and Vermesan, O.
(2010). Mining knowledge and data to discover intelligent molecular biomarkers: prostate cancer
i-biomarkers. In Proceedings of the 4th International Workshop on Soft Computing Applications.
[Floares, 2010a] Floares, A. G. (2010a). New Trends in Technologies, chapter Toward Personal-
ized Therapy Using Artificial Intelligence Tools to Understand and Control Drug Gene Networks.
INTECH. Available from: http://sciyo.com/articles/show/title/toward-personalized-therapy-using-
artificial-intelligence-tools-to-understand-and-control-drug-gene-.
[Floares, 2010b] Floares, A. G. (2010b). Using computational intelligence to develop intelligent clin-
ical decision support systems. In Francesco Masulli, Leif E. Peterson, R. T., editor, Computational
Intelligence Methods for Bioinformatics and Biostatistics. Springer.
[Hastie et al., 2000] Hastie, T., Tibshirani, R., and Friedman, J. (2000). The Elements of Statistical
Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics).
Springer, 2nd ed. 2009. corr. 3rd printing edition.
[Koza, 1992] Koza, J. R. (1992). Genetic Programming: On the Programming of Computers by Means
of Natural Selection. MIT Press, Cambridge, MA.
[Langdon, 2008] Langdon, W. B. (2008). A field guide to genetic programming.
[Nisbet et al., 2009] Nisbet, R., Elder, J., and Miner, G. (2009). Handbook of Statistical Analysis and
Data Mining Applications. Academic Press.
[Pounds and Morris, 2003] Pounds, S. and Morris, S. W. (2003). Estimating the occurrence of false
positives and false negatives in microarray studies by approximating and partitioning the empirical
distribution of p-values. Bioinformatics, 19(10):1236–1242.
[Quinlan, 1993] Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann
Publishers Inc., San Francisco, CA, USA.
[Sidransky, 2002] Sidransky, D. (2002). Emerging molecular markers of cancer. Nat Rev Cancer,
2(3):210–219.
[Society, 2010] American Cancer Society (2010). Cancer Facts and Figures 2010.
[Tuma, 2008] Tuma, R. S. (2008). Biomarker developers face big hurdles. J. Natl. Cancer Inst.,
100(7):456–461.
[Williams et al., 2010] Williams, M., Floares, A., Choi, W., Siefker-Radtke, A., McConkey, D., Dinney,
C., and Adam, L. (2010). Prognostic significance of miR-200 family in bladder cancer progression. In
EMT and Cancer Progression and Treatment Proceedings, Arlington, Virginia.