Using deep learning to enhance cancer diagnosis and classification
Rasool Fakoor
Faisal Ladhak
Azade Nazi
Manfred Huber
Computer Science and Engineering Dept, University of Texas at Arlington, Arlington, TX 76019 USA
Abstract

Using automated computer tools, and in particular machine learning, to facilitate and enhance medical analysis and diagnosis is a promising and important area. In this paper, we show how unsupervised feature learning can be used for cancer detection and cancer type analysis from gene expression data. The main advantage of the proposed method over previous cancer detection approaches is the possibility of applying data from various types of cancer to automatically form features which help to enhance the detection and diagnosis of a specific one. The technique is here applied to the detection and classification of cancer types based on gene expression data. In this domain we show that the performance of this method is better than that of previous methods, therefore promising a more comprehensive and generic approach for cancer detection and diagnosis.
1. Introduction
Studying the correlation between gene expression profiles and disease states or stages of cells plays an important role in biological and clinical applications (Tan & Gilbert, 2003). The gene expression profiles can here be obtained from multiple tissue samples, and by comparing the genes expressed in normal tissue with the ones in diseased tissue, one can obtain better insight into the disease pathology (Tan & Gilbert, 2003). One of the challenges that has been addressed in this way is to determine the difference between cancerous gene expression in tumor cells and the gene expression in normal, non-cancerous tissues.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

To address this, quite a number of machine learning classification techniques have been used to classify tissue into cancerous and normal. However, due to the high dimensionality of gene expression data (a.k.a. the high dimensionality of the feature space) and the availability of only a few hundred samples for a given tumor, this application requires a number of specific considerations. The first challenge here is how to reduce the dimensionality of the feature space in a way that ensures that the resulting feature space still contains sufficient information to perform accurate classification. In addition, small sample sets (i.e. a small number of training examples) make the problem much harder to solve and increase the risk of overfitting. For years, many solutions have been proposed to address the cancer detection problem, most of which perform feature space reduction by deriving compact feature sets, selecting and constructing features either manually or in supervised ways. As a result, these methods are mostly not scalable and cannot be generalized to new cancer types without the design of new features. In addition, these techniques cannot take effective advantage of tissue samples from other cancers when, for example, breast cancer detection is to be learned, being effectively restricted to only data from breast cancer and normal tissue when building the classifier. This restriction, in turn, likely leads to limitations in the way these methods scale to new cancer detection tasks when only a handful of samples are available.
To deal with this problem, and to facilitate and develop more generalized versions of cancer classifiers, we propose in this paper a more general way of learning features by applying unsupervised feature learning and deep learning methods. We use a sparse autoencoder method to learn a concise feature representation from unlabeled data. In contrast to the previous methods, where data has to be strictly from the cancer type to be detected in order to provide the appropriate label for supervised learning, the unlabeled data can here be obtained by combining data from different tumor cells, provided that they are generated using the same microarray platform (i.e. given that they contain the same gene expression information). For example, for the feature learning that forms the basis for prostate cancer classification we can use samples from breast cancer, lung cancer, and many other cancers which are available on that platform. The resulting features from all these sets are then used as a basis for the construction of the classifier.
The remainder of this paper is organized as follows: Section 2 provides some background about gene expression. Section 3 reviews prior research. Section 4 outlines the proposed method, and Section 5 shows results of our method and compares them to results achieved using other methods. Finally, Section 6 concludes the paper.
2. Gene Expression
Gene expression data measures the level of activity of genes within a given tissue and thus provides information regarding the complex activities within the corresponding cells. This data is generally obtained by measuring the amount of messenger ribonucleic acid (mRNA) produced during transcription which, in turn, is a measure of how active or functional the corresponding gene is (Aluru, 2005). As cancer is associated with multiple genetic and regulatory aberrations in the cell, these should be reflected in the gene expression data. To capture these abnormalities, microarrays, which permit the simultaneous measurement of expression levels of tens of thousands of genes, have been increasingly utilized to characterize the global gene-expression profiles of tumor cells and matched normal cells of the same origin. Specifically, microarrays are used to identify the differential expression of genes between two experiments, typically test vs. control, and to identify similarly expressed genes over multiple experiments. The processing pipeline of microarray data involves pre-processing the raw data to obtain a gene expression matrix, and then analyzing the matrix for differences and/or similarities of expression. The gene expression matrix (GEM) contains pre-processed expression values with genes in the rows and experiments in the columns. Thus, each column corresponds to an array (or gene-chip) experiment, and could contain multiple experiments if there were replicates. Each row in the matrix represents a gene expression profile (Aluru, 2005).
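As a concrete illustration of this layout, the following sketch builds a toy GEM in NumPy; all values are synthetic and hypothetical, standing in for preprocessed expression levels, and the tumor/normal split is an assumed example design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy gene expression matrix (GEM): rows = genes, columns = experiments.
# Values are synthetic stand-ins for preprocessed expression levels.
n_genes, n_experiments = 2000, 30
gem = rng.normal(loc=0.0, scale=1.0, size=(n_genes, n_experiments))

# Each column is one array (gene-chip) experiment; assume, hypothetically,
# the first 15 columns come from tumor tissue and the rest from normal tissue.
tumor_cols = gem[:, :15]
normal_cols = gem[:, 15:]

# Differential expression between the two conditions can be summarized,
# for example, by the per-gene difference of mean expression.
diff = tumor_cols.mean(axis=1) - normal_cols.mean(axis=1)
print(gem.shape, diff.shape)  # (2000, 30) (2000,)
```

Each row of `gem` is one gene's expression profile across all experiments, matching the GEM convention described above.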
Gene-chips can hold probes for tens of thousands of genes, whereas the number of experiments, limited by resources like time and money, is much smaller, at most in the hundreds. Thus, the gene expression matrix is typically very narrow (i.e. the number of genes, n, is significantly larger than the number of experiments, m). This is known as the dimensionality curse, and it is a serious problem in gene network inference (Aluru, 2005).
3. Related Work
Quite a few methods have been proposed to detect cancer using gene expression data. In (C. Aliferis et al., 2003), Aliferis et al. used recursive feature elimination and univariate association filtering approaches to select a small subset of the gene expressions as a reduced feature set. Ramaswamy et al. (Ramaswamy et al., 2001) similarly applied recursive feature elimination using SVMs to find a small number of gene expressions to be used as the feature space for the classification. In (Wang et al., 2005b), Wang et al. showed that by combining a correlation-based feature selector with different classification approaches, it is possible to select relevant genes with high confidence and obtain good classification accuracy compared with other methods. Sharma et al. (Sharma et al., 2012) proposed a feature selection method aimed at finding an informative subset of gene expressions. In their method, genes were divided into small subsets; informative genes in these smaller subsets were then selected and merged, ending up with an informative subset of genes. Nanni et al. (Nanni et al., 2012) proposed a method for gene microarray classification that combines different feature reduction approaches. In most of those methods the focus was on how to learn features and reduce the dimensionality of the gene expression data. The majority of these methods use manually designed feature selectors (i.e. feature engineering) to reduce the dimensionality of gene expression data and select informative sets of genes. The potential problems with these feature selection methods are the scalability and generality of the features (i.e. whether the selected/designed features can be extended and applied to new classification tasks and data sets). In addition, since specific cancer data are usually rare and most of the mentioned methods cannot efficiently take advantage of data from cancers other than the one to be detected or classified, these methods have to operate with very small data sets, limiting the effectiveness of the automatic feature learning approaches used. For example, prostate cancer data cannot be used in selecting features for breast cancer detection, reducing the basis for feature learning. In contrast to these methods, our proposed method can use data from different cancer types in the feature learning step, promising the potential for effective feature learning in the presence of very limited data sets.
4. Approach
Unsupervised feature learning methods and deep learning have been widely used for image and audio applications (Lee et al., 2009b; Huang et al., 2012). In these domains, these techniques have shown strong promise in automatically representing the feature space using unlabeled data in order to increase the accuracy of subsequent classification tasks. Using additional properties of the data, these capabilities have been further extended to facilitate learning in very high dimensional feature spaces. For example, by using image characteristics such as locality and stationarity, Lee et al. (Lee et al., 2009a) proposed a method to scale unsupervised feature learning and deep learning methods to high dimensional, full-sized images. Similarly, Le et al. (Le et al., 2012) applied an unsupervised feature learning method (in particular, Reconstruction Independent Subspace Analysis) in the context of cancer detection, using it for the classification of histological image signatures and of tumor architecture. However, to the best of our knowledge, unsupervised feature learning methods have not been applied to gene expression analysis (it should be noted that Le's method (Le et al., 2012) was applied to images, not gene expression). Some of the reasons for this can be seen in the extremely high dimensionality of gene expression data, the lack of sufficient data samples, and the lack of globally known characteristics such as locality in gene expression data, which limit the applicability of techniques such as convolution or pooling that have been highly successful on the above-mentioned image data.
In the method proposed here, we try to address this dimensionality problem in the area of gene expression data. In our method, we first reduce the dimensionality of the feature space using PCA. The result of PCA, a compressed feature representation which still encodes the information available in the sample set, is then combined with some randomly selected original gene expressions (i.e. original raw features) and fed, as a more compact feature space, to either a one- or a multi-layered sparse autoencoder to find a sparse representation of the data that is then used for the classification. This overall approach to building and training a system to detect and classify cancer from gene expression data is shown in Figure 1. As shown in the figure, the approach proposed here consists of two parts, the feature learning phase and the classifier learning phase.
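The two phases can be sketched roughly as follows. This is an illustrative approximation, not the paper's implementation: all data are synthetic, the dimensions are arbitrary, and scikit-learn's `MLPRegressor` trained to reconstruct its own input stands in for the sparse autoencoder (it lacks the sparsity penalty):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Hypothetical stand-ins for gene expression data: rows = samples, cols = genes.
X_unlabeled = rng.normal(size=(200, 500))   # pooled samples from many cancers
X_labeled = rng.normal(size=(60, 500))      # samples for the target cancer
y_labeled = rng.integers(0, 2, size=60)     # e.g. 1=tumor-like, 0=normal-like

# Phase 1a: PCA compresses the raw feature space.
pca = PCA(n_components=50).fit(X_unlabeled)

# Phase 1b: augment PCA projections with a few randomly chosen raw features,
# so the next stage can still see structure that PCA may hide.
raw_idx = rng.choice(X_unlabeled.shape[1], size=20, replace=False)

def augment(X):
    return np.hstack([pca.transform(X), X[:, raw_idx]])

# Phase 1c: an autoencoder learns features from the unlabeled data.
# MLPRegressor reconstructing its input is a plain (non-sparse) stand-in.
Z_u = augment(X_unlabeled)
ae = MLPRegressor(hidden_layer_sizes=(30,), activation="logistic",
                  max_iter=500, random_state=0).fit(Z_u, Z_u)

def encode(X):
    # Hidden-layer activations = learned feature representation.
    Z = augment(X)
    return 1.0 / (1.0 + np.exp(-(Z @ ae.coefs_[0] + ae.intercepts_[0])))

# Phase 2: a softmax/logistic classifier is trained on the learned features
# using the (small) labeled set for the target cancer.
clf = LogisticRegression(max_iter=1000).fit(encode(X_labeled), y_labeled)
print(clf.predict(encode(X_labeled[:5])).shape)  # (5,)
```

The key point the sketch illustrates is that the autoencoder is fit only on pooled unlabeled data, while the labeled set is needed only for the final classifier.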
Figure 1. Overall Approach. (Diagram: the gene expression profile supplies raw features and PCA components as input to Autoencoder I; in the stacked variant, its Features I feed Autoencoder II to produce Features II; the learned features then serve as input to the output classifier.)
4.1. Feature Learning
Our proposed feature learning approach uses two phases: first, a PCA-based phase aimed at reducing the dimensionality of the feature space while maintaining the information content of the data; second, a phase that, based on an augmented form of the PCA features together with some random raw features, develops a sparse encoding of the data samples to obtain high-level and complex features for use by the classification approach.
The main reason for this two-phase approach is that, since the dimensionality of gene expression data is extremely high (on the order of 20000 to 50000 features) and the data contain redundant and noisy measurements, we apply PCA to reduce the dimensionality of the data without a significant loss of information. However, there is a problem with directly using the PCA components as features for classification: PCA performs a linear transformation on the data. In other words, after applying PCA, the resulting extracted features are simply a linear function of the original input data (Raina et al., 2007). In order to provide an opportunity to also capture the non-linearity of the relations between expressions of different genes, a different feature learning approach is needed. To facilitate this, and to obtain more discriminating features, we use an unsupervised feature learning method in the second stage and, in order to provide it with the opportunity to capture additional non-linear relations that are hidden by the PCA features, we randomly add some of the original raw features to the PCA features to form an augmented basis for the second-stage feature learning.
For the second phase of the feature learning approach, we use the framework of the sparse autoencoder (Coates et al., 2011; Bengio et al., 2007; Ng). The autoencoder neural network is an unsupervised feature learning method in which the input is used as the target for the output layer (Ng). In this way it learns a function h_{w,b}(x) ≈ x that represents an
approximation of the input data constructed from a
limited number of feature activations represented by
the hidden units of the network. The sparse autoencoder is constructed from three layers in the neural network (i.e. input layer, hidden layer, and output layer), in which the hidden layer contains K nodes. The units in the hidden layer force the network to learn a representation of the input with only K hidden unit activations, representing K features. To train the network, back-propagation is used to minimize the squared reconstruction error with an additional sparsity penalty (Coates et al., 2011; Raina et al., 2007):
minimize_{b,a}  Σ_i || x_u^(i) − Σ_j a_j^(i) b_j ||_2^2 + β || a^(i) ||_1
s.t. || b_j ||_2 ≤ 1,  ∀ j = 1, ..., s

where x_u^(i) is an unlabeled training example, b_j is a basis vector, and a^(i) is the vector of activations of the basis (Raina et al., 2007). The sparsity penalty, included in the form of the one-norm of the activation vector a, biases the learner towards features b_j that allow
the data items to be represented using a combination
of a small number of these features. The rationale
for using a sparse encoding for feature learning in the
gene regulation data is that features that allow for a
sparse representation are more likely to encode dis-
criminatory and functionally monolithic properties of
the original data and thus are more likely to form a
good basis for classification learning. Within the neu-
ral network, the sigmoid function has been selected as
the activation function g:
g(z) = 1 / (1 + exp(−z))
As an additional option, and for further comparison, we have used a stacked autoencoder with two layers, in which greedy layer-wise learning has been used to train the deep network (Bengio et al., 2007). In this greedy layer-wise approach, we train each network separately; the output of the first network then serves as the input to the second network.
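A minimal NumPy sketch of such a sparse autoencoder follows. It is illustrative only: an L1 penalty on the hidden activations stands in for the sparsity term, plain full-batch gradient descent replaces any tuned optimizer, and all sizes and rates are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One-hidden-layer sparse autoencoder trained by gradient descent.
X = rng.normal(size=(100, 20))          # 100 unlabeled samples, 20 features
n_in, K = X.shape[1], 8                 # K hidden units = K learned features
W1 = rng.normal(scale=0.1, size=(n_in, K)); b1 = np.zeros(K)
W2 = rng.normal(scale=0.1, size=(K, n_in)); b2 = np.zeros(n_in)
lr, beta = 0.05, 0.01                   # learning rate, sparsity weight

for _ in range(200):
    A = sigmoid(X @ W1 + b1)            # hidden activations (the features)
    X_hat = A @ W2 + b2                 # linear reconstruction of the input
    R = X_hat - X                       # reconstruction error
    # Backprop of squared error + beta * ||A||_1 sparsity penalty.
    dA = R @ W2.T + beta * np.sign(A)
    dZ1 = dA * A * (1 - A)              # sigmoid derivative
    W2 -= lr * A.T @ R / len(X); b2 -= lr * R.mean(axis=0)
    W1 -= lr * X.T @ dZ1 / len(X); b1 -= lr * dZ1.mean(axis=0)

loss = np.mean((sigmoid(X @ W1 + b1) @ W2 + b2 - X) ** 2)
print(round(float(loss), 4))
```

After training, the hidden activations `A` would be used as the learned feature representation; stacking a second autoencoder on `A` gives the two-layer variant described above.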
4.2. Classifier Learning
In order to perform the task of cancer detection and
cancer type classification, the features learned in the
proposed unsupervised feature learning approach are
subsequently used with a set of labeled data for specific
cancer types to learn a classifier. For the results in
this paper we used softmax regression as the learning
approach for the classifier.
For comparison in the experiments presented in this paper, a sparse autoencoder with one layer and one with two layers (a.k.a. a stacked autoencoder) have been used as the unsupervised feature learning methods to learn a sparse representation from unlabeled data, which then served as the input representation for classifier learning using the softmax regression classifier.
In addition, we performed an additional experiment in
which we used the fine-tuning method (Bengio et al.,
2007) in order to tune the weights of the features of the
stacked autoencoder to better match the requirements
of the classification task. In this method, the weights
of the features learned by the unsupervised feature
learner are tuned through the classifier using labeled
data. While this makes the features less generic by
tuning them towards the specific classification task,
it also promises the possibility of higher classification
accuracy in some situations.
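A minimal NumPy sketch of softmax regression on top of learned features follows; the features and labels are synthetic placeholders, and the sizes and learning rate are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned features (e.g. autoencoder activations) and labels.
F = rng.normal(size=(80, 8))                  # 80 samples x 8 learned features
y = rng.integers(0, 2, size=80)               # e.g. 0=normal, 1=cancer
n_classes = 2
W = np.zeros((F.shape[1], n_classes)); b = np.zeros(n_classes)
Y = np.eye(n_classes)[y]                      # one-hot targets

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))  # numerically stable
    return E / E.sum(axis=1, keepdims=True)

# Gradient descent on the cross-entropy loss.
for _ in range(300):
    P = softmax(F @ W + b)                    # class probabilities
    G = (P - Y) / len(F)                      # gradient of cross-entropy
    W -= 0.5 * F.T @ G
    b -= 0.5 * G.sum(axis=0)

pred = np.argmax(softmax(F @ W + b), axis=1)
print(pred.shape)  # (80,)
```

Fine-tuning, as described above, would additionally propagate this classifier's gradient back through the autoencoder weights rather than treating the features as fixed.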
Overall, the method presented in this paper takes
its strength from the combination of dimensionality
reduction through PCA and unsupervised non-linear
sparse feature learning for the construction of effective
features for general classification tasks. This method-
ology allows for the effective use of unlabeled data, and
thus of microarray data unrelated to the specific clas-
sification task, to assist in and improve the classifica-
tion accuracy. As mentioned earlier, since the number
of gene expression data samples for a specific cancer
type is generally low, other cancer data from the same
platform (i.e. with the same genes in the microarray)
are a good candidate to be used in this method as
unlabeled data for feature learning. One of the signifi-
cant advantages of this approach as compared to most
previous work is that it generalizes the feature sets
across different types of cancers. For instance, data
from prostate, lung, and other cancers can be used as
unlabeled data for feature learning in a breast cancer
detection or classification problem. The potential of this is further demonstrated by the results of Lu et al. (Lu et al., 2007), who showed via comprehensive gene analyses that it is possible to find common cancer genes across different cancer data. This finding strengthens our argument for having generalized feature sets across data from various cancer types.
5. Results
To demonstrate the feasibility and applicability of the proposed method, we first obtained 13 different data sets from various papers/sources, as summarized in Table 1. In Table 1, columns 2 and 3 show the dimensionality of the data. Column 3 shows the labeled data which is used for training the classifier. For the feature learning, we use the unlabeled data in column 2. Some of the feature sets have been expanded to include samples of various different types of cancer from other data sets of the same microarray platform. This feature expansion provides the feature learning algorithm with the ability to learn more generalized features that are not specific to an individual cancer but rather reflect features of interest in cancers in general.
Due to the high dimensionality of the data, the data was, as described in the approach section, preprocessed by applying PCA to reduce its dimensionality. Three different sparse encoders have been used to learn features: a sparse autoencoder which contains just one hidden layer, a two-layer stacked autoencoder, and a stacked autoencoder with fine-tuning, which is trained based on a greedy layer-wise approach. The fine-tuning method uses labeled data to tune the parameters from the stacked autoencoder during the classifier training stage. Columns 2, 3, and 4 in Table 2 correspond to the results of each of these methods, respectively.

To evaluate the robustness of the classifier, we performed 10-fold cross-validation, and results are presented in terms of the average classification accuracy.
In addition, the standard deviation of the classification accuracy across the different learning trials is represented in the table. We also compare our proposed algorithm against two baselines: SVM with a Gaussian kernel, and softmax regression. Note that these methods also use the principal component projections as features to address the very high dimensionality and relatively small number of samples in the datasets. Table 2 reports the results on the data sets, where only the result of the better of the two baseline algorithms is reported. From this we can see that the proposed method, which uses the sparse autoencoder features derived from PCA and randomly selected raw features, outperforms the baseline algorithms, which do not use unsupervised sparse features. The only exceptions are the second and ninth data sets, for which our method does not outperform the baseline algorithms. We believe that we could improve those results by adding more unlabeled data to the feature sets. We were unable to do so because the platforms of these data sets were either very specialized micro-arrays for which not many samples were available, or we could not find data from the same platform.
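The evaluation protocol can be sketched as follows, again with synthetic stand-in features and labels; `cross_val_score` returns one accuracy per fold, from which a mean and standard deviation of the kind reported in Table 2 are computed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical learned features and labels standing in for one dataset.
F = rng.normal(size=(100, 10))
y = rng.integers(0, 2, size=100)

# 10-fold cross-validation of a softmax/logistic classifier on the features.
scores = cross_val_score(LogisticRegression(max_iter=1000), F, y,
                         cv=10, scoring="accuracy")
print(scores.mean(), scores.std())
```

With random features and labels as here, the mean accuracy would hover around chance; on real learned features it is the quantity tabulated per dataset.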
6. Conclusion
In this paper, we propose a method to enhance cancer diagnosis and classification from gene expression data using unsupervised feature learning and deep learning methods.
The proposed method, which uses PCA to address
the very high dimensionality of the initial raw feature
space followed by sparse feature learning techniques
to construct discriminative and sparse features for the
final classification step, provides the potential to over-
come problems of traditional approaches with feature
dimensionality as well as very limited size data sets. It
does this by allowing data from different cancers and
other tissue samples to be used during feature learning
independently of their applicability to the final classifi-
cation task. Applying this method to cancer data and comparing it to baseline algorithms shows not only that our method can be used to improve the accuracy in cancer classification problems, but also that it provides a more general and scalable approach to deal with gene expression data across different cancer types.

References
Alon, U., Barkai, N., Notterman, D. A., Gish, K.,
Ybarra, S., Mack, D., and Levine, A. J. Broad
patterns of gene expression revealed by clustering
analysis of tumor and normal colon tissues probed
by oligonucleotide arrays. Proc. of the National
Academy of Sciences of the United States of Amer-
ica, 96:6745–6750, Jun. 1999.
Aluru, S. Handbook of computational molecular biol-
ogy, volume 9. Chapman & Hall/CRC, 2005.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, pp. 153–160, 2007.
Aliferis, C., Tsamardinos, I., Massion, P., Statnikov, A., Fananapazir, N., and Hardin, D. Machine learning models for classification of lung cancer and selection of genomic markers using array gene expression data. In FLAIRS Conf., 2003.
Cheok, M. H., Yang, W., Pui, C. H., Downing, J. R.,
Cheng, C., Naeve, C. W., Relling, M. V., and Evans,
W. E. Treatment-specific changes in gene expres-
sion discriminate in vivo drug response in human
leukemia cells. Nat Genet, 34:85–90, May 2003.
Coates, A., Lee, H., and Ng, A. Y. An analysis of
single-layer networks in unsupervised feature learn-
ing. In AISTATS, 2011.
Fujiwara, T., Hiramatsu, M., Isagawa, T., Ninomiya,
H., Inamura, K., Ishikawa, S., Ushijima, M., Mat-
suura, M., Jones, M.H., Shimane, M., Nomura,
H., Ishikawa, Y., and Aburatani, H. ASCL1-
coexpression profiling but not single gene expression
Table 1. Datasets

ID | Dataset | Feature Matrix (unlabeled) | Data Matrix (labeled) | Data Labels
1 | AML (Mills et al., 2009) | 54613×2341 | 54613×183 | 1=AML, 2=MDS
2 | Adenocarcinoma (Fujiwara et al., 2011) | 34749×193 | 34749×28 | 1=adenocarcinoma, 2=squamous cell
3 | Breast Cancer (Woodward et al., 2013) | 30006×1047 | 30006×1047 | 1=non-IBC, 2=IBC
4 | Leukemia (Klein et al., 2009) | 54675×2284 | 54675×125 | 1=NPM1+, 2=NPM1-
5 | Leukemia (Cheok et al., 2003) | 12600×658 | 12600×60 | 1=MP, 2=HDMTX
6 | AML (Yagi et al., 2003) | 12625×625 | 12625×27 | 1=Complete Remission, 2=Relapse
7 | Breast Cancer (Wang et al., 2005a) | 22277×2301 | 22277×143 | 1=ER+, 2=ER-
8 | Seminoma (Gashaw et al., 2005) | 12625×618 | 12625×20 | 1=stage I, 2=stage II and III
9 | Ovarian Cancer (Petricoin et al., 2002) | 15154×153 | 15154×100 | 1=cancer, 2=normal
10 | Colon Cancer (Alon et al., 1999) | 2000×32 | 2000×30 | 1=cancer, 2=non-cancer
11 | Medulloblastoma (Pomeroy et al., 2002) | 7129×30 | 7129×30 | 1=class0, 2=class1
12 | Prostate Cancer (Singh et al., 2002) | 12600×102 | 12600×34 | 1=tumor, 2=normal
13 | Leukemia (Verhaak et al., 2009) | 54613×2389 | 54613×230 | 1=NPM1+, 2=NPM1-
Table 2. Results

ID | Sparse Autoencoder | Stacked Autoencoder | Stacked Autoencoder with Fine-Tuning | PCA + Softmax / SVM (with Gaussian kernel)
1 | 74.36±0.062 | 51.35±0.019 | 95.15±0.047 | 94.04±0.03
2 | 91.67±0.18 | 87.5±0.16 | 87.5±0.16 | 93.33±0.14
3 | 86.67±0.219 | 63.33±0.204 | 83.33±0.272 | 85.0±0.241
4 | 56.09±0.024 | 56.09±0.024 | 93.65±0.049 | 92.95±0.09
5 | 46.76±0.23 | 33.71±0.038 | 33.71±0.038 | 46.33±0.18
6 | 81.67±0.298 | 55.0±0.137 | 55.0±0.137 | 73.33±0.196
7 | 85.454±0.10 | 73.48±0.02 | 73.48±0.020 | 84.07±0.069
8 | 35.0±0.337 | 56.67±0.161 | 80.0±0.258 | 76.67±0.251
9 | 75.45±0.135 | 55.03±0.0336 | 99.0±0.032 | 100.0±0.0
10 | 66.67±0.0 | 66.67±0.0 | 83.33±0.176 | 83.33±0.236
11 | 66.67±0.0 | 66.67±0.0 | 76.67±0.225 | 76.67±0.274
12 | 97.5±0.079 | 73.33±0.102 | 73.33±0.102 | 94.167±0.124
13 | 69.18±0.108 | 65.66±0.01 | 91.26±0.055 | 90.39±0.081
profiling defines lung adenocarcinomas of neuroen-
docrine nature with poor prognosis. Lung Cancer,
Jul. 2011. ISSN 01695002.
Gashaw, I., Grümmer, R., Klein-Hitpass, L., Dushaj,
O., Bergmann, M., Brehm, R., Grobholz, R., Kli-
esch, S., Neuvians, T., Schmid, K., Ostau, C., and
Winterhager, E. Gene signatures of testicular semi-
noma with emphasis on expression of ets variant
gene 4. Cellular and Molecular Life Sciences, 62
(19):2359–2368, Oct. 2005.
Huang, G. B., Lee, H., and Learned-Miller, E. Learn-
ing hierarchical representations for face verification
with convolutional deep belief networks. In IEEE
Conf. on Computer Vision and Pattern Recognition,
pp. 2518–2525, 2012.
Klein, H.U., Ruckert, C., Kohlmann, A., Bullinger, L.,
Thiede, C., Haferlach, T., and Dugas, M. Quanti-
tative comparison of microarray experiments with
published leukemia related gene expression signa-
tures. BMC Bioinformatics, 10, 2009.
Le, Q.V., Han, J., Gray, J.W., Spellman, P.T.,
Borowsky, A., and Parvin, B. Learning invariant
features of tumor signatures. In ISBI, pp. 302–305.
IEEE, 2012. ISBN 978-1-4577-1858-8.
Lee, H., Grosse, R., Ranganath, R., and Ng, A.Y. Con-
volutional deep belief networks for scalable unsu-
pervised learning of hierarchical representations. In
Proc. of the 26th Int’l Conf. on Machine Learning,
pp. 609–616, 2009a.
Lee, H., Largman, Y., Pham, P., and Ng, A.Y. Unsu-
pervised feature learning for audio classification us-
ing convolutional deep belief networks. In Advances
in Neural Information Processing Systems 22, pp.
1096–1104. 2009b.
Lu, Y., Yi, Y., Liu, P., Wen, W., James, M., Wang, D.,
and You, M. Common human cancer genes discov-
ered by integrated gene-expression analysis. PLoS
ONE, pp. e1149, 11 2007.
Mills, K.I., Kohlmann, A., Williams, P.M., Wieczorek,
L., Liu, W., Li, R., Wei, W., Bowen, D.T., Loef-
fler, H., Hernandez, J.M., Hofmann, W., and Hafer-
lach, T. Microarray-based classifiers and progno-
sis models identify subgroups with distinct clini-
cal outcomes and high risk of AML transformation of myelodysplastic syndrome. Blood, 114:1063–72, 2009.
Nanni, L., Brahnam, S., and Lumini, A. Combin-
ing multiple approaches for gene microarray clas-
sification. Bioinformatics, 28:1151–1157, Apr. 2012.
ISSN 1367-4803.
Ng, A.Y. Unsupervised feature learning and deep learning (online). URL http://ufldl.
Petricoin, E. F., Ardekani, A. M., Hitt, B. A., Levine,
P. J., Fusaro, V. A., Steinberg, S. M., Mills, G. B.,
Simone, C., Fishman, D. A., Kohn, E. C., and Li-
otta, L. A. Use of proteomic patterns in serum to
identify ovarian cancer. Lancet, 359(9306):572–577,
Feb 2002.
Pomeroy, S.L., Tamayo, P., Gaasenbeek, M., Sturla,
L.M., Angelo, M., Mclaughlin, M.E., Kim, J.Y.H.,
Goumnerova, L.C., Black, P.M., Lau, C., Allen,
J.C., Zagzag, D., Olson, J.M., Curran, T., Wet-
more, C., Biegel, J. A., Poggio, T., Mukherjee,
S., Rifkin, R., Califano, A., Stolovitzky, G., Louis,
D.N., Mesirov, J.P., Lander, E.S., and Golub, T.R.
Prediction of central nervous system embryonal tu-
mour outcome based on gene expression. Nature,
415(6870):436–442, Jan. 2002.
Raina, R., Battle, A., Lee, H., Packer, B., and Ng,
A.Y. Self-taught learning: transfer learning from
unlabeled data. In Proc. of the 24th Int’l Conf. on
Machine Learning, pp. 759–766. ACM, 2007.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee,
S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Lat-
ulippe, E., Mesirov, J.P., Poggio, T., Gerald, W.,
Loda, M., Lander, E.S., and Golub, T.R. Multiclass
cancer diagnosis using tumor gene expression signa-
tures. Proc. of the National Academy of Sciences of
the United States of America, 98(26):15149–15154,
Dec. 2001.
Sharma, A., Imoto, S., and Miyano, S. A top-r feature
selection algorithm for microarray gene expression
data. IEEE/ACM Trans. Comput. Biol. Bioinfor-
matics, 9:754–764, May 2012. ISSN 1545-5963.
Singh, D., Febbo, P.G., Ross, K., Jackson, D.G.,
Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A.,
D’Amico, A.V., and Richie, J.P. Gene expression
correlates of clinical prostate cancer behavior. Can-
cer Cell, 1:203–209, Mar. 2002.
Tan, Aik C.C. and Gilbert, D. Ensemble machine
learning on gene expression data for cancer clas-
sification. Applied Bioinformatics, 2, 2003.
Verhaak, R.G.W., Wouters, B.J., Erpelinck, C.A.J.,
Abbas, S., Beverloo, H.B., Lugthart, S., Lwenberg,
B., Delwel, H.R., and Valk, P.J.M. Prediction of
molecular subtypes in acute myeloid leukemia based
on gene expression profiling. Haematologica, 94:131–
134, Jan. 2009.
Wang, Y., Klijn, J.G.M., Zhang, Y., Sieuwerts, A.M.,
Look, M.P., Yang, F., Talantov, D., Timmermans,
M., Meijer-van Gelder, M.E., Yu, J., Jatkoe, T.,
Berns, E.M.J.J., Atkins, D., and Foekens, J.A.
Gene-expression profiles to predict distant metas-
tasis of lymph-node-negative primary breast cancer.
The Lancet, 365(9460):671–679, Feb. 2005a.
Wang, Y., Tetko, I.V., Hall, M.A., Frank, E., Facius,
A., Mayer, K.F. X., and Mewes, H.W. Gene selec-
tion from microarray data for cancer classification-a
machine learning approach. Comput. Biol. Chem.,
29(1):37–46, Feb. 2005b. ISSN 1476-9271.
Woodward, W.A., Krishnamurthy, S., Yamauchi, H.,
El-Zein, R., Ogura, D., Kitadai, E., Niwa, S., Cristo-
fanilli, M., Vermeulen, P., Dirix, L., Viens, P., Laere,
S., Bertucci, F., Reuben, J.M., and Ueno, N.T. Ge-
nomic and expression analysis of microdissected in-
flammatory breast cancer. Breast Cancer Research
and Treatment, pp. 1–12, 2013. ISSN 0167-6806.
Yagi, T., Morimoto, A., Eguchi, M., Hibi, S., Sako,
M., Ishii, E., Mizutani, S., Imashuku, S., Ohki, M.,
and Ichikawa, H. Identification of a gene expression
signature associated with pediatric aml prognosis.
Blood, 102:1849–56, 2003. ISSN 0006-4971.