All content in this area was uploaded by Ashkan Zeinalzadeh on Nov 30, 2016
A Neural Network Model to Classify Liver Cancer Patients Using Data
Expansion and Compression
Ashkan Zeinalzadeh1, Tom Wenska, Gordon Okimoto
Abstract— We develop a neural network model to classify
liver cancer patients into high-risk and low-risk groups using
genomic data. Our approach provides a novel technique to
classify big data sets using neural network models. We prepro-
cess the data before training the neural network models. We
first expand the data using wavelet analysis. We then compress
the wavelet coefficients by mapping them onto a new scaled
orthonormal coordinate system. Then the data is used to train a
neural network model that enables us to classify cancer patients
into two different classes of high-risk and low-risk patients. We
use the leave-one-out approach to build a neural network model.
This neural network model enables us to classify a patient using
genomic data as a high-risk or low-risk patient without any
information about the survival time of the patient. The results
from genomic data analysis are compared with survival time
analysis. It is shown that the expansion and compression of
data using wavelet analysis and singular value decomposition
(SVD) are essential to train the neural network model.
I. INTRODUCTION
The goal of this study is to build a neural network model
to classify patients into high-risk and low-risk patients based
on genomic data. To build this model we use the genomic
data of 390 patients. This model enables us to determine
the risk status for a new patient without any knowledge
about the patient’s survival time, although the results of
classification using the neural network must be comparable
with the classification result from survival-time analysis.
The liver cancer data are big data sets, and neural network
models are not computationally efficient for big data sets.
The first challenge in applying a neural network to the
genomic data is therefore the size of the data. The genomic
data include over 20000 genes. We regard the genes as the
parameters of a desired signal. The high number of parameters
increases the computational complexity of classifying the data
with neural networks [1]-[4]. The authors have developed an
algorithm to reduce the number of genes (parameters) to fewer
than 40 [5]-[7]. We consider the data as a matrix in which
the rows are genes and the columns are samples (patients).
Based on the survival-time analysis, about 17 percent of the
patients are low-risk and the rest are high-risk. Data of this
type are called imbalanced in the literature [2]: one or a few
subsets of the clustered data are significantly smaller than
the remaining subsets. There are different techniques to analyze
imbalanced data, e.g. regenerating the data by resampling [8].
Resampling techniques are data dependent. The authors of [9]
develop a method to resample from data based on the statistics
of the data. We do not regenerate the data using resampling.
Building a neural network model to cluster imbalanced data is
difficult because the small subsets of the data tend to be
ignored by the neural model. Although the low-risk patients are
a small subset of the patients, it is important not to recognize
a low-risk patient as a high-risk patient because of the
consequences of the subsequent treatment.
1 Ashkan Zeinalzadeh was with the Cancer Research Center, University
of Hawaii at Manoa, HI, USA. azeinalz@nd.edu
The data is composed of two different sets of censored data
and uncensored data. The censored data corresponds to the
patients for which survival time is unknown (live patients)
and uncensored data corresponds to those for which survival
time is known (deceased patients).
We compute the one-dimensional continuous wavelet coef-
ficients for each patient separately. We perform the wavelet
analysis using the Mexican hat wavelet. We vectorize the
output of the wavelet and construct a new matrix by replacing
each column of the original data matrix with the vectorized
form of the wavelet coefficients. We map the data onto
orthonormal bases, which are the left singular vectors of the
new data matrix. The new data matrix is used to train a neural
network model. The parameters of the neural network models
are trained based on an iterative method. These parameters
are optimized to provide classification results comparable
with those from survival time analysis. It is shown that the
expansion and compression of data leads to a better neural
network model to classify the liver cancer patients.
The rest of the paper is structured as follows. In Section II-A,
survival-time analysis is used to classify the patients.
In Section II-B, wavelet analysis and singular value
decomposition are applied to expand the data into time-frequency
waveforms and to compress them, respectively. In Section II-C,
a neural network model is trained to classify the patients, and
we describe how the parameters of the neural network model
are optimized. In Section II-D, we show the numerical results
obtained through simulation evaluations. Finally, concluding
remarks are provided in Section III.
II. METHODS
A. Survival-Time Analysis
Patients with a survival time of more than 5 years are
called low-risk patients. Deceased (uncensored) patients
with a survival time of less than 5 years are called
high-risk patients. The risk status of a live (censored)
patient with a survival time, ST, of less than 5 years is
determined using the Kaplan-Meier estimator [10]. The
advantage of the Kaplan-Meier estimator is that it takes the
censored data into account. Of the 390 patients, 173 (44%)
are censored and 217 (56%) are uncensored.
arXiv:1611.07588v2 [stat.ML] 25 Nov 2016
Let ST be a random variable that denotes the survival time and
let P be the Kaplan-Meier cumulative distribution function for
ST. For a given t less than 5, the conditional cumulative
distribution function (CDF) is defined as follows
$$P(ST \geq 5 \mid ST \geq t) = \frac{1 - P(ST \leq 5)}{1 - P(ST \leq t)}. \tag{1}$$
If the conditional CDF P(ST ≥ 5 | ST ≥ t) is equal to
or greater than 0.75, then the patient is considered low-risk;
otherwise the patient is considered high-risk. Out of
390 patients, 67 patients (17%) are low-risk and 323 (83%)
are high-risk. Only a small fraction of the patients are
low-risk, which can complicate the training of the neural
network model that classifies the patients. In
Figure 1, the Kaplan-Meier CDF for the 390 patients is plotted.
The vertical axis is the probability and the
horizontal axis is the survival time in days. The
red point in Figure 1 marks the five-year survival time.
It is observed that P(ST ≤ 5) = 0.71.
Fig. 1. The Kaplan-Meier CDF for the survival time of 390 patients.
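The labelling rule in (1) can be illustrated with a short Python sketch. This is only a minimal illustration under stated assumptions, not the authors' implementation: survival times are assumed to be given in years, the product-limit form of the Kaplan-Meier estimator is used, and the names `kaplan_meier_cdf` and `is_low_risk` as well as the toy data are hypothetical.

```python
def kaplan_meier_cdf(times, deceased):
    """Return a function t -> P(ST <= t) from the Kaplan-Meier product-limit estimate.

    times    : observed survival times (assumed to be in years)
    deceased : True for uncensored (deceased) patients, False for censored ones
    """
    event_times = sorted({t for t, d in zip(times, deceased) if d})
    steps = []          # (event time, estimated survival probability just after it)
    surv = 1.0
    for u in event_times:
        at_risk = sum(1 for t in times if t >= u)
        deaths = sum(1 for t, d in zip(times, deceased) if d and t == u)
        surv *= 1.0 - deaths / at_risk
        steps.append((u, surv))

    def cdf(t):
        s = 1.0
        for u, s_u in steps:
            if u > t:
                break
            s = s_u
        return 1.0 - s

    return cdf


def is_low_risk(t, cdf, horizon=5.0, cutoff=0.75):
    """Apply rule (1) to a censored patient last observed at time t < horizon."""
    conditional = (1.0 - cdf(horizon)) / (1.0 - cdf(t))
    return conditional >= cutoff


# Hypothetical toy cohort: seven patients, three of them censored.
times = [1, 2, 3, 4, 4.5, 6, 7]
deceased = [True, False, True, True, False, True, False]
cdf = kaplan_meier_cdf(times, deceased)
print(cdf(5.0), is_low_risk(2, cdf), is_low_risk(4.5, cdf))
```

In this toy cohort the patient censored at 2 years has a conditional probability of about 0.6 and is labelled high-risk, while the patient censored at 4.5 years is labelled low-risk because no death is observed between 4.5 and 5 years.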
B. Eigen Wavelet Features
Let X = [x1 x2 ··· xn] be an m × n matrix. Each
column of X contains the data for one patient. The rows
of the matrix X correspond to genes. We use a continuous
wavelet transform to analyze how the frequency content
of a signal changes over a patient’s genes. The wavelet
transform compares the signal x with shifted and scaled
copies of a basic wavelet. We use the Mexican hat wavelet
as the mother wavelet. This wavelet is well suited to
analyzing genetic signals because of its explicit mathematical
expression, smoothness, symmetry, and rapid decay. Let
Ψ(k) denote the Mexican hat function of width σ,
$$\Psi(k) = \frac{2}{\sqrt{3\sigma}\,\pi^{1/4}} \left(1 - \frac{k^2}{\sigma^2}\right) \exp\left(-\frac{k^2}{2\sigma^2}\right). \tag{2}$$
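As a numerical check of (2), the following Python sketch evaluates the Mexican hat function and approximates its energy by a Riemann sum. It is only an illustration: the helper names `mexican_hat` and `energy` and the integration grid are assumptions made here, not part of the paper.

```python
import math


def mexican_hat(k, sigma=1.0):
    """Mexican hat wavelet of width sigma, written as in (2)."""
    norm = 2.0 / (math.sqrt(3.0 * sigma) * math.pi ** 0.25)
    u2 = (k / sigma) ** 2
    return norm * (1.0 - u2) * math.exp(-u2 / 2.0)


def energy(sigma, half_width=12.0, steps=24000):
    """Riemann-sum approximation of the integral of Psi(k)^2 over k."""
    lo = -half_width * sigma
    dk = 2.0 * half_width * sigma / steps
    return sum(mexican_hat(lo + i * dk, sigma) ** 2 for i in range(steps + 1)) * dk


for sigma in (0.5, 1.0, 4.0):
    # energy, peak value at k = 0, and value at the zero crossing k = sigma
    print(sigma, energy(sigma), mexican_hat(0.0, sigma), mexican_hat(sigma, sigma))
```

With the normalization in (2) the energy stays close to one for every width, the zero crossings sit at k = ±σ, and the peak value decreases as σ grows, so a wider wavelet distributes the same energy over more genes.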
The larger the value of σ, the more the energy of Ψ(k) is
spread out over the genes (horizontal axis). The continuous
wavelet transform (CWT) of a signal xi at a scale s > 0 and
position τ ∈ R is expressed by the following integral
$$W_i(s, \tau; x_i(t), \Psi(t)) = \frac{1}{\sqrt{|s|}} \int x_i(t)\, \Psi^{*}\!\left(\frac{t - \tau}{s}\right) dt, \tag{3}$$
where ∗ denotes the complex conjugate. The CWT coefficients
are affected by the scale s, the position τ, and the mother
wavelet function Ψ. In this analysis, the value of the scale s
is fixed and the position τ is between 1 and T. An
appropriate window size T is chosen for the time-frequency
localization. The value of T is chosen by visually inspecting
the wavelet coefficients. Wi is a T × m matrix. Let Vi
be the vectorized form of the matrix Wi. We construct the
matrix H = [V1 ··· Vn]. The wavelet coefficient matrix H is of
size Tm × n. The wavelet transform (w) expands the
data from the space m × n to the space Tm × n,
$$X \xrightarrow{\;w\;} H. \tag{4}$$
The expansion using wavelet analysis maps the data onto a
larger space. It also expands a signal (through the genes)
into a waveform whose time and frequency properties are
the same as the original signal. We center the data matrix H
by taking the average of each row of data and subtracting it
from each entry in that row.
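The expansion step in (3) and (4), followed by the row centering, can be sketched in Python with NumPy as follows. This is a sketch under assumptions rather than the authors' code. In particular, the text does not spell out how the T rows of each Wi are indexed; here they are assumed to correspond to the wavelet widths 1, ..., T evaluated at each of the m gene positions, which is one way to obtain a T × m matrix. The integral in (3) is replaced by a sum over gene indices, and the function names are hypothetical.

```python
import numpy as np


def mexican_hat(k):
    """Unit-width Mexican hat wavelet, i.e. (2) with sigma = 1."""
    norm = 2.0 / (np.sqrt(3.0) * np.pi ** 0.25)
    return norm * (1.0 - k ** 2) * np.exp(-k ** 2 / 2.0)


def cwt_mexh(x, scales):
    """Discrete approximation of (3): one row of coefficients per scale."""
    m = len(x)
    t = np.arange(m)
    W = np.empty((len(scales), m))
    for row, s in enumerate(scales):
        # kernel[tau, t] = Psi((t - tau) / s); the wavelet is real, so Psi* = Psi
        kernel = mexican_hat((t[None, :] - t[:, None]) / s)
        W[row] = kernel @ x / np.sqrt(s)
    return W


def expand(X, T):
    """Map the m x n data matrix X to the centered (T*m) x n matrix H."""
    scales = np.arange(1, T + 1)          # assumption: T rows = T wavelet widths
    columns = [cwt_mexh(X[:, i], scales).ravel() for i in range(X.shape[1])]
    H = np.column_stack(columns)
    return H - H.mean(axis=1, keepdims=True)   # subtract each row's average


# Hypothetical toy data: 6 genes, 4 patients.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))
H = expand(X, T=3)
print(H.shape)
```

The order in which each Wi is flattened into Vi is also an assumption (row-major here); any fixed order gives the same singular values in the compression step.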
We then compress the wavelet coefficients by mapping
them onto a number of bases. These bases are the left
singular vectors of the waveform coefficients H. The singular
value decomposition of the matrix H is given as
$$H = L S R^{T}. \tag{5}$$
We project the data onto the first k left singular vectors of the
matrix H. Let li and ri be the ith columns of the orthonormal
matrices L and R, respectively, and let σi be the ith singular
value on the diagonal of S. Then
$$\hat{H} = [l_1 \cdots l_k]^{T} H = [\sigma_1 r_1 \cdots \sigma_k r_k]^{T}. \tag{6}$$
The mapping in (6) compresses the waveform coefficients
onto a new scaled orthonormal coordinate system
[σ1 r1 ··· σk rk],
$$H \xrightarrow{\;[l_1 \cdots l_k]\;} \hat{H}. \tag{7}$$
Thus we shrink the size of the waveform coefficients from
Tm × n to k × n and filter unwanted signals and noise.
Choosing the right value for the parameter k is crucial to
classifying the patients correctly. The experimental results
show that the value of k should be less than the number
of genes m. We iterate over values of k from 1 to m and
finally select a value of k that classifies the low-risk and
high-risk patients effectively. The choice of k depends on the
other parameters of the neural network model, as will be
described in the next section.
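The compression in (5) to (7) can be checked numerically with a few lines of NumPy. The sketch below is illustrative only; the function name `compress` and the random stand-in for H are assumptions. It verifies that projecting H onto the first k left singular vectors yields the scaled right singular vectors, as stated in (6).

```python
import numpy as np


def compress(H, k):
    """Project H onto its first k left singular vectors, as in (6)."""
    L, S, Rt = np.linalg.svd(H, full_matrices=False)   # H = L @ diag(S) @ Rt
    H_hat = L[:, :k].T @ H                             # k x n compressed data
    return H_hat, S[:k], Rt[:k]


# Hypothetical stand-in for the centered wavelet coefficient matrix (Tm x n).
rng = np.random.default_rng(0)
H = rng.normal(size=(20, 6))
H = H - H.mean(axis=1, keepdims=True)

k = 3
H_hat, sigma, Rt_k = compress(H, k)
# Row i of H_hat should equal sigma_i * r_i^T.
print(H_hat.shape, np.allclose(H_hat, sigma[:, None] * Rt_k))
```

The identity follows from the orthonormality of the columns of L: multiplying (5) on the left by [l1 ··· lk]^T leaves only the first k rows of S R^T.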
C. Neural Network Model
We first classify the patients into two groups of high-risk
and low-risk patients using survival-time analysis, as
explained in Section II-A. We randomly choose a fraction P
of the high-risk and low-risk patients as the training data
and the remaining fraction 1 − P of the high-risk and low-risk
patients as the validation data. The value of P is chosen by an
iterative method that is explained in the next paragraph.
The training and validation data are disjoint. The neural
models are trained n separate times, each time on all the
patients’ data except for one patient, and a prediction is then
made for that patient. The output of the neural network is a
number between zero and one. We therefore choose a threshold
between zero and one. If the output of the neural network is
larger than the threshold, then the patient is considered
high-risk; otherwise the patient is considered low-risk. Choosing
the right value for the threshold is crucial for the analysis, as
described below.
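The leave-one-out procedure and the thresholding of the network output can be sketched as follows. This is a minimal illustration, not the authors' model: the architecture (one tanh hidden layer with a sigmoid output), the plain gradient-descent training, the hyperparameter values, the threshold of 0.5, and the synthetic two-cluster data are all assumptions made for the example. Samples are stored as rows here, whereas the paper stores patients as columns.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_network(X, y, hidden=3, epochs=500, lr=0.5, seed=0):
    """Train a tiny one-hidden-layer network; return a function giving outputs in (0, 1)."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=hidden)
    b2 = 0.0
    for _ in range(epochs):
        A = np.tanh(X @ W1 + b1)                 # hidden activations
        p = sigmoid(A @ W2 + b2)                 # network output
        g = (p - y) / len(y)                     # cross-entropy gradient w.r.t. the logit
        gA = np.outer(g, W2) * (1.0 - A ** 2)    # back-propagated to the hidden layer
        W2 -= lr * (A.T @ g)
        b2 -= lr * g.sum()
        W1 -= lr * (X.T @ gA)
        b1 -= lr * gA.sum(axis=0)
    return lambda Z: sigmoid(np.tanh(Z @ W1 + b1) @ W2 + b2)


def leave_one_out_scores(X, y, **kwargs):
    """Train n times, each time holding out one patient and scoring that patient."""
    n = len(y)
    scores = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        predict = train_network(X[keep], y[keep], **kwargs)
        scores[i] = predict(X[i:i + 1])[0]
    return scores


# Synthetic stand-in for compressed features: label 1 = high-risk, 0 = low-risk.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.5, size=(6, 2)), rng.normal(2.0, 0.5, size=(6, 2))])
y = np.array([0] * 6 + [1] * 6)
scores = leave_one_out_scores(X, y)
predicted_high = scores > 0.5     # output above the threshold means high-risk
print(np.mean(predicted_high == (y == 1)))
```

Each held-out patient is scored by a model that never saw that patient, so the n scores can be compared against the survival-time labels without reusing training data.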
The parameters for the expansion, the compression, and the
training of the neural network are chosen by an iterative
method. These parameters are summarized as follows:
• Window size T for the expansion of the data using wavelet
analysis.
• Number of right singular vectors k for the compression
of the data.
• Fraction P of the data that is used for training the
neural network model; the remaining fraction 1 − P is used
for validation.
• Number of hidden layers h.
• Threshold Th for the output of the neural network.
After iterating the algorithm over a range of values, a set
of values for these parameters is chosen that gives a high
true positive rate for the low-risk patients and a
small false positive rate for the same
group of patients. The number of hidden layers h must be
smaller than or equal to the number of right singular vectors
k. The threshold Th is a number between zero and one.
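One way to organize such an iteration is an exhaustive search over a grid of candidate values, sketched below in Python. The paper does not state how a high true positive rate and a small false positive rate are combined into a single criterion; the difference TPR − FPR used here is an assumption, as are the function names, the candidate grid, and the `toy_evaluate` stand-in, which replaces the actual expansion, compression, and training pipeline with a made-up formula so that the example runs on its own.

```python
from itertools import product


def select_parameters(evaluate, grid):
    """Return the parameter set with the largest TPR - FPR (assumed criterion).

    evaluate : callable taking T, k, P, h, Th and returning (tpr, fpr)
    grid     : dict mapping each parameter name to its candidate values
    """
    best, best_score = None, float("-inf")
    for T, k, P, h, Th in product(grid["T"], grid["k"], grid["P"], grid["h"], grid["Th"]):
        if h > k:                      # constraint from the text: h <= k
            continue
        tpr, fpr = evaluate(T=T, k=k, P=P, h=h, Th=Th)
        score = tpr - fpr
        if score > best_score:
            best, best_score = {"T": T, "k": k, "P": P, "h": h, "Th": Th}, score
    return best, best_score


def toy_evaluate(T, k, P, h, Th):
    """Hypothetical stand-in for training and validating one model."""
    tpr = 1.0 - abs(Th - 0.84)
    fpr = 0.05 * abs(k - 7) + 0.02 * abs(T - 5) + 0.01 * h
    return tpr, fpr


grid = {"T": [3, 5], "k": [5, 7, 9], "P": [0.7], "h": [3, 8], "Th": [0.5, 0.84]}
best, score = select_parameters(toy_evaluate, grid)
print(best, score)
```

In practice `evaluate` would run the full pipeline for one parameter setting and return the rates measured on the validation patients.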
D. Validation Results
In this work, the False Positive Rate (FPR) is defined as the
probability that a high-risk patient is recognized as a low-risk
patient. The True Positive Rate (TPR) is defined as the probability
that a low-risk patient is recognized as a low-risk patient. Our
class of interest is the low-risk patients. We consider two sets
of genes, one including 36 genes and the other including
40 genes. These genes were found by the authors by analyzing
larger genomic data sets, as described in [7]. The
genes in these signatures carry signal information that classifies
the ovarian cancer patients by their response to standard
chemotherapy. We do the analysis for two separate groups of
patients. The first group has 54 patients. Out of 54 patients,
20 patients (37%) are low-risk and the rest are high-risk. The
second group has 99 patients. Out of 99 patients, 19 patients
(19%) are low-risk and the rest are high-risk. In Figures 2-5,
the receiver operating characteristics (ROCs) for the two
sets of genes and the two groups of patients are plotted. The
vertical axis is the True Positive Rate and the horizontal axis
is the False Positive Rate. An optimal threshold for the neural
network model and a set of parameters for the data expansion
and compression are chosen by an iterative algorithm;
they are given in Table II. The true positive rate, false positive
rate, and area under the ROC for the optimal parameters are
given in Table I.
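The rates defined above, with the low-risk patients as the positive class, can be computed as in the following Python sketch. It is an illustration only: the function names and the six toy scores are hypothetical, and the area under the ROC is computed through the equivalent pairwise-ranking formulation rather than by integrating a plotted curve.

```python
def tpr_fpr(scores, is_low_risk, threshold):
    """TPR and FPR with low-risk as the positive class.

    A patient is predicted low-risk when the network output does not exceed
    the threshold (outputs above the threshold mean high-risk).
    """
    predicted_low = [s <= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted_low, is_low_risk))
    fp = sum(p and not l for p, l in zip(predicted_low, is_low_risk))
    n_low = sum(is_low_risk)
    n_high = len(is_low_risk) - n_low
    return tp / n_low, fp / n_high


def roc_area(scores, is_low_risk):
    """Area under the ROC: chance that a low-risk patient scores below a high-risk one."""
    low = [s for s, l in zip(scores, is_low_risk) if l]
    high = [s for s, l in zip(scores, is_low_risk) if not l]
    wins = sum(1.0 if a < b else 0.5 if a == b else 0.0 for a in low for b in high)
    return wins / (len(low) * len(high))


# Hypothetical network outputs and survival-time labels for six patients.
scores = [0.1, 0.4, 0.9, 0.8, 0.3, 0.95]
is_low_risk = [True, True, False, False, False, True]
tpr, fpr = tpr_fpr(scores, is_low_risk, threshold=0.5)
print(tpr, fpr, roc_area(scores, is_low_risk))
```

Sweeping the threshold from zero to one and plotting the resulting (FPR, TPR) pairs traces out an ROC of the kind shown in the figures.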
Second, we classify the patients into two groups of high-risk
and low-risk patients using a regular neural network
model. We consider the 390 patients as the training data and
train a model to classify the 54 patients. Out of 390 patients,
67 patients (17%) are low-risk and 323 (83%) are high-risk.
We randomly select 164 of the high-risk patients and all the
low-risk patients (67 patients) as the training data. The rest
of the high-risk patients are not considered in the analysis. We
thus train the neural network model on 231 patients, of whom
29% are low-risk and 71% are high-risk. Then, a prediction is
made for the 54 patients. In Figures 6-7, the receiver operating
characteristics (ROCs) for the two sets of genes are plotted.
It is observed that the result from the leave-one-out approach
is similar to that of the regular neural network when the number
of high-risk patients is reduced. As in the leave-one-out
approach, a fraction P of the high-risk and low-risk patients is
used as the training data and the remaining fraction 1 − P as the
validation data. The training and validation data are disjoint.
Similarly, a threshold is chosen for the output of the neural
network.
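The reduction of the high-risk group described above amounts to random undersampling of the majority class. A minimal Python sketch, with a hypothetical function name and an arbitrary random seed, reproduces the class proportions quoted in the text:

```python
import random


def undersample_majority(labels, n_majority, majority="high", seed=0):
    """Keep every minority patient and a random subset of the majority class."""
    rng = random.Random(seed)
    majority_idx = [i for i, label in enumerate(labels) if label == majority]
    minority_idx = [i for i, label in enumerate(labels) if label != majority]
    return sorted(minority_idx + rng.sample(majority_idx, n_majority))


# 67 low-risk and 323 high-risk patients, as in the 390-patient data set.
labels = ["low"] * 67 + ["high"] * 323
train_idx = undersample_majority(labels, n_majority=164)
n_low = sum(labels[i] == "low" for i in train_idx)
print(len(train_idx), round(100 * n_low / len(train_idx)))
```

Keeping 164 of the 323 high-risk patients raises the share of low-risk patients in the training data from 17% to 29%.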
Fig. 2. ROC for 36 genes and 54 patients using leave-one-out approach.
Fig. 3. ROC for 36 genes and 99 patients using leave-one-out approach.
Fig. 4. ROC for 40 genes and 54 patients using leave-one-out approach.
Fig. 5. ROC for 40 genes and 99 patients using leave-one-out approach.
TABLE I
TRUE POSITIVE RATE (TPR), FALSE POSITIVE RATE (FPR), AREA
UNDER THE ROC WITH EXPANSION AND COMPRESSION (AREA1), AND
AREA UNDER THE ROC WITHOUT EXPANSION AND COMPRESSION
(AREA2)
Genes  Patients  TPR   FPR   AREA1  AREA2
40     99        0.83  0.07  0.80   0.24
40     54        0.97  0.11  0.88   0.27
36     99        0.92  0.13  0.83   0.36
36     54        0.97  0.07  0.91   0.36
III. CONCLUSIONS
It is observed that expansion and compression of the
data enable the neural network model to classify patients
significantly better. The results from the leave-one-out approach
are comparable with the results from the regular neural network.
Choosing the right parameters for the compression, expansion,
and training of the model is crucial for the analysis. Reducing
the number of high-risk patients helps to train a neural
network model to classify high-risk and low-risk patients.
TABLE II
THE PARAMETERS OF EXPANSION, COMPRESSION AND NEURAL
NETWORK
Genes  Patients  T  k  P    h  Th
40     99        5  7  0.7  5  0.84
40     54        5  7  0.8  6  0.83
36     99        5  7  0.7  6  0.84
36     54        5  7  0.8  3  0.83
Fig. 6. ROC for 36 genes and 54 patients using regular neural network
approach.
Fig. 7. ROC for 40 genes and 54 patients using regular neural network
approach.
REFERENCES
[1] M. A. Jarrahi, H. Samet, H. Raayatpisheh, A. Jafari, M. Rakhshan, An
ANFIS-based fault classification approach in double-circuit transmission
line using current samples, International Work-Conference on
Artificial Neural Networks, pp. 225–236, 2015.
[2] M. Rakhshan, F. Shabani-nia, M. ShaSadeghi, ANFIS approach for
tracking control of MEMS triaxial gyroscope, Modeling and Simulation
in Electrical and Electronics Engineering, 1 (1), pp. 35–40, 2015.
[3] M. Rakhshan, E. Moula, F. Shabani-nia, B. Safarinejadian, S. Khorshidi,
Active noise control using wavelet function and network approach,
Journal of Low Frequency Noise, Vibration and Active Control, 35
(1), pp. 4–16, 2016.
[4] M. Rakhshan, S. Khorshidi, B. Safarinejadian, Active noise control in
presence of disturbance using adaptive neuro fuzzy inference system,
Journal of Computational Intelligence and Electronic Systems, 3 (2),
pp. 99–105, 2014.
[5] A. Zeinalzadeh, T. Wenska, G. Okimoto, Integrated analysis of multiple
high-dimensional data sets by joint rank-1 matrix approximations,
2015 54th IEEE Conference on Decision and Control (CDC), pp. 3852–3857.
[6] A. Zeinalzadeh, An iterated version of the generalized singular
value decomposition for the joint analysis of two high-dimensional
data sets, University of Hawaii at Manoa, 2013.
[7] G. Okimoto, A. Zeinalzadeh, T. Wenska, M. Loomis, J. B. Nation, T.
Fabre, M. Tiirikainen, B. Hernandez, O. Chan, L. Wong, S. Kwee,
Joint analysis of multiple high-dimensional data types using sparse
matrix approximations of rank-1 with applications to ovarian and liver
cancer, BioData Mining, 9 (1), 24.
[8] R. Longadge, S. S. Dongre, L. Malik, Class imbalance problem in
data mining, International Journal of Computer Science and Network,
Vol. 2, Issue 1, pp. 83–87, 2013.
[9] R. T. Hadke, P. Khobragade, An approach for class imbalance using
oversampling technique, International Journal of Innovative Research
in Computer and Communication Engineering, Vol. 3, Issue 11, pp.
11451–11455, 2015.
[10] E. L. Kaplan and P. Meier, Nonparametric estimation from incomplete
observations, Journal of the American Statistical Association, Vol. 53,
No. 282, pp. 457–481, 1958.