All content in this area was uploaded by Dulani Meedeniya on Jan 22, 2021
Survival prediction and risk estimation of Glioma
patients using mRNA expressions
Navodini Wijethilake
Department of Computer Science Engineering
University of Moratuwa
Sri Lanka
navodiniw@cse.mrt.ac.lk
Dulani Meedeniya
Department of Computer Science Engineering
University of Moratuwa
Sri Lanka
dulanim@cse.mrt.ac.lk
Charith Chitraranjan
Department of Computer Science Engineering
University of Moratuwa
Sri Lanka
charithc@cse.mrt.ac.lk
Indika Perera
Department of Computer Science Engineering
University of Moratuwa
Sri Lanka
indika@cse.mrt.ac.lk
Abstract: Gliomas are a lethal type of central nervous system tumor with a poor prognosis. Recent advances in microarray technologies have made expression data for thousands of genes available for glioma patients, enabling analysis from many perspectives. As a result, genomics has emerged as a tool for prognosis analysis. In this work, we identify a survival-related 7-gene signature and explore two approaches, one for survival prediction and one for risk estimation. For survival prediction, we propose a novel probabilistic programming based approach, which outperforms traditional machine learning algorithms and achieves an average 4-fold cross-validation accuracy of 74%. Further, we construct a prognostic risk model for risk estimation of glioma patients. This model reflects the survival of glioma patients, assigning high risk to patients with low survival.
Index Terms: Glioma, gene expression, risk score, probabilistic programming, Bayesian neural networks
I. INTRODUCTION
Gliomas are the most common central nervous system tumors and derive from neuroglial cells and progenitor cells [1]. Gliomas account for 30% of primary brain tumors and 80% of malignant brain tumors, causing the majority of deaths from primary brain tumors [2]. Despite treatment, patients with aggressive forms of glioma often die within months. Currently, treatment planning for glioma patients relies mostly on histology and other clinical parameters such as age. Based on histology, gliomas are classified into astrocytic, oligodendroglial and ependymal tumors. Additionally, considering the malignancy, the natural course of the disease, and the absence or presence of anaplastic features, they are assigned World Health Organization (WHO) grades I-IV [3]. Yet intra-tumor heterogeneity and molecular-level alterations are more strongly associated with the prognosis of glioma patients than the underlying histology [4].
Recently, with the progress of genomic, transcriptomic and epigenetic profiling, and with technological advances in microarrays and in the sequencing of ribonucleic acid (RNA), a molecule that supports several biological roles of genes, novel approaches for classifying and analysing gliomas have emerged. Moreover, the study of the underlying molecular pathogenesis has identified genetic alterations that can cause gliomas and that complement histological classification and diagnostics [2]. Following the WHO 2016 classification, isocitrate dehydrogenase (IDH) family IDH1/2 mutant, 1p/19q co-deleted tumors, mostly with oligodendroglial histology, have the best prognosis and belong to WHO grade II. WHO grade III gliomas are IDH1/2 mutant, 1p/19q non-co-deleted, telomerase reverse transcriptase (TERT) promoter wild-type, tumor protein p53 (TP53) mutant tumors with an astrocytic morphology and have an intermediate survival. IDH wild-type, 1p/19q non-co-deleted tumors have a poor prognosis and are mostly grade IV glioblastoma multiforme (GBM). In addition, several other molecular signatures are associated with the diagnosis and prognosis of gliomas, such as TERT, TP53, O-6-methylguanine deoxyribonucleic acid (DNA) methyltransferase (MGMT) and phosphatase and tensin homolog (PTEN) [5].
Generally, oncologists assess the survival of patients based on their experience and on clinical factors, which include the size, location and stage of the cancer. Hence, these decisions can be biased, optimistic and inaccurate [6]. Accurate survival prediction, in contrast, leads to less invasive and better treatment with optimal use of resources [7]. Gene expression data have been used for survival prediction in several types of cancer and show promising improvements over other approaches such as radiomics-based algorithms [8]–[10]. Artificial neural networks, a supervised machine learning technique, have been developed for survival prediction using gene expression data for many cancer types [11], [12]. However, machine learning algorithms have rarely been applied to gene expression data to predict the survival of glioma patients, either in days or by survival class.
arXiv:2011.00659v1 [q-bio.GN] 2 Nov 2020
In our previous work [9], we demonstrated that combining imaging biomarkers and gene expression biomarkers improves the survival prediction accuracy for glioblastoma patients compared with using either feature set alone. However, the lack of data was a limitation of that study, which motivated us to use a comparatively large dataset comprising all glioma cases, both higher-grade and lower-grade, of The Cancer Genome Atlas (TCGA) and Chinese Glioma Genome Atlas (CGGA) cohorts. We expect this work to open new directions in survival prediction for glioma patients, with a stronger focus on genomics.
The proposed study addresses survival estimation in glioma patients in two parts. First, we apply machine learning algorithms for survival prediction after performing feature selection on high-dimensional gene expression data, and we propose a novel probabilistic programming based Bayesian neural network that predicts the survival of glioma patients in 3 categories: short, medium and long survival. Second, the selected prognostic genes are used to develop a prognostic risk score model. The outline of the work is shown in Fig. 1. The main objective of this study is to identify a prognostic gene signature and construct a prognostic model for risk estimation that provides useful insights into the prognosis of glioma patients. Thus, the 7-gene signature based survival prediction algorithm with probabilistic programming and the prognostic risk model are the novel contributions of this study.
[Figure 1: flowchart. Recoverable labels: TCGA GBM & LGG cohort (252 cases), 16510 genes, testing; CGGA glioma cohort (315 cases), 24326 genes, training (80%) and validating (20%); cleaning and filtering; log2 and median centering; 13298 genes; iterative Bayesian Model Averaging algorithm; 7 genes; prognostic risk score development; overall survival prediction into short (<300 days), medium (300-450 days) and long (>450 days).]
Fig. 1. Overview of the proposed solution.
The paper is structured as follows: Section II explores the related work, Section III explains our approach, and Section IV presents our observations and the evaluation summary. In Section V we discuss the important aspects and limitations of our research.
II. BACKGROUND
Mutations, methylation and other phenomena that occur at the molecular level are associated with the expression levels of the corresponding biomarkers [13]. TP53 is recognized as a negative prognostic marker, associated with poor prognosis in astrocytic and oligoastrocytic gliomas. Hence, the expression of the TP53 gene is inversely correlated with the survival of glioma patients [14]. Similarly, TERT promoter mutation is common in gliomas and also indicates a negative prognosis [13]. In fact, the expression of the TERT gene is significantly higher in tumors with mutated TERT promoters [13]. Given the above, the expression of genes in gliomas is significantly associated with other molecular biomarkers and also affects the survival and prognosis of glioma patients.
Gene expression profiling is commonly used for clustering
and subtype classifications of gliomas, using both unsuper-
vised and supervised approaches [15], [16]. Artificial neural
network (ANN) based subtype classification is also performed
on gene expression profiles of glioma patients [17].
Several studies on cancer prognosis have used the expression of genetic biomarkers to predict cancer occurrence, cancer recurrence, and outcomes after diagnosis such as mortality, life expectancy and drug sensitivity. The first such application of machine learning, an artificial neural network, dates from 1995 [18]. O'Neill et al. [19] employed a neural network for diagnosing diffuse large B-cell lymphoma with microarray gene expression profiles. Later, Chen et al. [11] proposed a gene expression based artificial neural network for predicting the survival time of lymphoma patients. A similar methodology was employed by Lancashire et al. [20] for breast cancer survival outcome prediction.
Gene expression profiling is also used for estimating the survival of glioma patients, as it can reveal the otherwise unrecognized heterogeneity of gliomas through hierarchical clustering [21]. Bonata et al. [22] proposed a Bayesian ensemble model for survival prediction with high-dimensional gene expression profiles by selecting the genes potentially related to gliomagenesis. Identifying potential biomarkers related to glioma survival also typically involves gene expression profile analysis [23]. Similarly, neural networks have been used with expression data to predict the survival outcome of neuroblastoma patients 5 years after diagnosis [24].
Risk score formulation, as a prognosis estimation tool, has also been established with gene expression data for glioma patients [25], [26]. For this, the most prominent features are chosen with statistical methods based on the relationship between each gene and survival. The predominant statistical approaches are univariate and multivariate Cox proportional hazards regression analysis, in which the features associated with survival are identified from the hazard ratio and the p value of each gene. In some studies [25], the features have also been chosen based on the methylation status of each gene. However, some studies have noted that the risk score is not an accurate reflection of the survival probability [26].
Nomograms are another approach for estimating the survival probability of glioblastoma patients after a particular time period [27], [28].
III. SYSTEM MODEL AND METHODOLOGY
A. Dataset
The publicly available GBM and lower-grade glioma (LGG) gene expression datasets of TCGA and CGGA are downloaded for our study; the expressed genes serve as biomarkers to classify subjects into risk groups. The TCGA dataset comprises 252 cases with overall survival information and gene expression profiles of 16510 genes, obtained using the Illumina HiSeq RNA sequencing platform. WHO grades II and III are included in the TCGA-LGG cohort and WHO grade IV is included in the TCGA-GBM cohort. The CGGA dataset consists of 315 cases with overall survival data and gene expression profiles of 24326 genes, acquired using the Illumina HiSeq 2000 platform, a high-throughput sequencing system. The initial gene expression values, provided in comma-separated values (CSV) format, are normalized by gene length into fragments per kilobase per million mapped reads (FPKM) [29].
$$\mathrm{FPKM} = \frac{\text{Total fragments}}{\text{Mapped reads (millions)} \times \text{Exon length (kilobase pairs)}} \tag{1}$$
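Equation (1) can be illustrated with a minimal Python sketch. The function name and the example counts are hypothetical and only serve to show the arithmetic; they are not taken from the TCGA or CGGA data.

```python
def fpkm(fragments, exon_length_bp, total_mapped_reads):
    """Fragments per kilobase of exon per million mapped reads (Equation 1)."""
    millions_mapped = total_mapped_reads / 1e6   # mapped reads, in millions
    exon_length_kb = exon_length_bp / 1e3        # exon length, in kilobase pairs
    return fragments / (millions_mapped * exon_length_kb)


# Hypothetical gene: 500 fragments, 2,000 bp of exon, 20 million mapped reads.
example_fpkm = fpkm(500, 2_000, 20_000_000)
```

Dividing by both library size and exon length is what makes values comparable across samples and across genes of different lengths.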
Moreover, all gene expression features are log transformed and median centered before analysis. After normalization, the 13094 genes common to both the TCGA and CGGA datasets are chosen. The CGGA dataset is divided into training and validating datasets, while the TCGA dataset is used for testing. Each patient is assigned to one of three classes based on overall survival in days: short survival (<300 days), medium survival (300-450 days) and long survival (>450 days). The CGGA dataset is divided into 4 folds with the same class distribution in each fold, so that the ratio of classes is maintained in the training and validating folds and no case appears in both. Table I shows the class distribution in the CGGA and TCGA cohorts and in the training and validating datasets of the CGGA cohort.
TABLE I
DATASET DESCRIPTION
Class     CGGA training   CGGA validating   TCGA testing
Short     63              21                68
Medium    29              9                 44
Long      145             48                140
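The preprocessing, class assignment and stratified 4-fold split described in this subsection can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' pipeline. The pseudocount of 1 follows the Log2(FPKM+1) axis label of Fig. 6; centering each gene (column) on its median is an assumption, since the text does not say whether centering is per gene or per sample. The function names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def preprocess(fpkm_matrix):
    """Log2-transform with a pseudocount of 1, then median-center each gene column."""
    logged = np.log2(np.asarray(fpkm_matrix, dtype=float) + 1.0)
    return logged - np.median(logged, axis=0, keepdims=True)


def survival_class(days):
    """Map overall survival in days to 0 = short (<300), 1 = medium (300-450), 2 = long (>450)."""
    days = np.asarray(days)
    return np.where(days < 300, 0, np.where(days <= 450, 1, 2))


rng = np.random.default_rng(0)
expression = rng.gamma(2.0, 5.0, size=(40, 7))   # synthetic FPKM values, 40 cases x 7 genes
os_days = np.tile([100, 350, 800, 1200], 10)     # synthetic overall survival in days
X = preprocess(expression)
y = survival_class(os_days)

# Stratified folds keep the class ratio the same in every training/validating split.
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
splits = list(folds.split(X, y))
```

Stratification matters here because the medium class is small (Table I); an unstratified split could leave a validating fold with almost no medium cases.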
B. Prognostic Gene Identification
We use a Bayesian Model Averaging (BMA) [30] based feature selection approach in order to obtain a robust learning model. The BMA algorithm accounts for model uncertainty by averaging over the posterior probability distributions of several models. Thus, the posterior probability of $\Phi$ given the training dataset $D$ can be written as follows.
$$\Pr(\Phi \mid D) = \sum_{i \in S} \Pr(\Phi \mid D, M_i) \cdot \Pr(M_i \mid D) \tag{2}$$
where $M_i$ is a given model in the subset $S = \{M_1, M_2, \ldots, M_n\}$. Initially, the genes are ranked based on Cox proportional hazards (Cox-PH) analysis, which can deal with censored data. The Cox hazard function in Equation (3), for the covariate vector $z_p = (z_{1p}, \ldots, z_{ip})$ of subject $p$, describes the instantaneous risk of dying at a given time $T$, given that the patient has survived until time $T$.
$$\lambda(T, z_p) = \lambda_0(T) \exp(z_p \alpha) \tag{3}$$
where $\lambda_0$ is the baseline hazard function and the coefficients of the covariates $(z_{1p}, \ldots, z_{ip})$ are given by $\alpha = (\alpha_{1p}, \ldots, \alpha_{ip})$. Since the baseline hazard function is the same for every subject at a given time $T$, it cancels out and can be neglected. Thus, only an estimate of $\alpha$ is required, which can be calculated from the following partial likelihood.
$$PL(\alpha) = \prod_{p=1}^{n} \left( \frac{\exp(z_p \alpha)}{\sum_{\ell \in R_p} \exp(z_\ell \alpha)} \right)^{\delta_p} \tag{4}$$
In Equation (4), $R_p$ is the risk set, i.e. the subjects that have not experienced an event by time $t_p$, and $\delta_p$ is the event status (censored or not) of patient $p$. The parameter $\alpha$ is obtained by maximizing the partial likelihood, and the genes are then ranked in descending order of the resulting log likelihood.
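The partial likelihood in Equation (4) can be made concrete with a short NumPy sketch. It evaluates the log of Equation (4) for a single gene and finds the maximizing coefficient with a coarse grid search; the grid search, the function name and the toy data are assumptions for illustration, whereas a real analysis would use a dedicated Cox-PH solver and handle tied event times explicitly.

```python
import numpy as np


def cox_partial_log_likelihood(alpha, z, time, event):
    """Log of Equation (4) for covariates z (n x k), follow-up times and event flags (1 = death)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    z = np.asarray(z, dtype=float).reshape(len(time), -1)
    eta = z @ np.atleast_1d(alpha)                 # linear predictor z_p * alpha
    # at_risk[p, l] is True when subject l has not had an event before time t_p
    at_risk = time[None, :] >= time[:, None]
    log_risk_sum = np.log((at_risk * np.exp(eta)[None, :]).sum(axis=1))
    # Censored subjects (event = 0) contribute only through the risk sets.
    return float(np.sum(event * (eta - log_risk_sum)))


# Toy single-gene example: higher expression goes with earlier death.
time = [5, 8, 12, 20, 30]
event = [1, 1, 0, 1, 1]
gene = [2.0, 1.5, 0.3, -0.5, -1.2]
grid = np.linspace(-3, 3, 121)
scores = [cox_partial_log_likelihood(a, gene, time, event) for a in grid]
best_alpha = float(grid[int(np.argmax(scores))])
best_log_likelihood = max(scores)
```

Repeating this per gene and sorting by `best_log_likelihood` would give a ranking of the kind described above.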
The 25 top-ranked genes are assigned to a window and the traditional BMA algorithm is applied for survival analysis. Genes with low posterior probabilities (<1%) are eliminated, and the genes with high posterior probabilities are retained. The window is moved along the ranked genes until 7 genes with high posterior probabilities are obtained. These 7 genes and their posterior probabilities are given in Table II. TXLNA has the highest posterior probability, and the other genes have posterior probabilities between 1% and 100%.
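The sliding-window procedure above can be outlined as a short Python sketch. The BMA step itself is abstracted behind `posterior_fn`, a hypothetical callable that returns a posterior probability (in %) for each gene in the current window. The stopping rule used here (stop when no gene is dropped or the ranked list is exhausted) is an assumption, since the text only states that the window moves until 7 genes remain.

```python
def iterative_window_selection(ranked_genes, posterior_fn, window_size=25, threshold=1.0):
    """Slide a window over ranked genes, dropping genes with posterior probability below threshold (%)."""
    window = list(ranked_genes[:window_size])
    next_index = len(window)
    while True:
        posterior = posterior_fn(window)             # hypothetical BMA step on the current window
        kept = [g for g in window if posterior[g] >= threshold]
        if len(kept) == len(window) or next_index >= len(ranked_genes):
            return {g: posterior[g] for g in kept}
        # Refill the window with the next genes in the ranking.
        refill = ranked_genes[next_index:next_index + window_size - len(kept)]
        next_index += len(refill)
        window = kept + list(refill)


# Synthetic stand-in for BMA: every sixth gene gets a high posterior probability.
genes = [f"g{i}" for i in range(40)]


def toy_posterior(window):
    return {g: (100.0 if int(g[1:]) % 6 == 0 else 0.5) for g in window}


selected = iterative_window_selection(genes, toy_posterior)
```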
C. Survival Prediction
We utilized the most commonly used machine learning
algorithms to predict the overall survival class, short, medium
and long. For all these the input is the selected 7 gene
signature. In fact, To overcome the class imbalance, shown
in Table I, the minority class (i.e. short and medium classes)
entries in the training dataset are over sampled randomly.
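Random oversampling of the minority classes can be sketched as below. The target of matching every class to the size of the largest class is an assumption, as the text does not state the oversampling ratio; the class counts come from the CGGA training column of Table I, and the feature matrix is a placeholder.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Resample minority classes with replacement until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=target - count, replace=True)  # empty for the majority class
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]


# CGGA training split in Table I: 63 short (0), 29 medium (1), 145 long (2).
y_train = np.array([0] * 63 + [1] * 29 + [2] * 145)
X_train = np.zeros((len(y_train), 7))   # placeholder for the 7-gene expression features
X_bal, y_bal = random_oversample(X_train, y_train)
```

Oversampling should be applied to the training folds only, so that duplicated cases never leak into the validating fold.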
TABLE II
SELECTED 7 GENES AND THEIR CORRESPONDING POSTERIOR PROBABILITIES
Gene       Posterior probability (%)
TXLNA      100.0
WDR77      50.5
TAF12      48.9
STK40      35.9
YTHDF2     14.7
SNRNP40    8.0
SLC30A7    2.0
1) Decision Tree: A decision tree (DT) [31] is a tree-based structure in which each node specifies a condition on an input covariate that determines the distribution or the final class. The decision tree is tuned to obtain the best results, with a maximum tree depth of 4 and a minimum of 10 samples required to make a decision at a node.
2) Random Forest Classification: The Random Forest Classifier (RFC) [32] is a widely used ensemble model that outperforms a single decision tree by reducing instability and variance through multiple decision trees. In the RFC, the number of trees in the forest is tuned to 100.
3) XGBoost: XGBoost [33] is an ensemble model in which each tree is trained sequentially, boosting the prediction accuracy. The tuned XGBoost model consists of 50 sequentially trained trees, each with a maximum depth of 4.
4) CatBoost: CatBoost [34] is a sequentially trained boosting algorithm that is reported to outperform existing boosting algorithms in efficiency. We use a multi-class loss function with a learning rate of 0.005 for 2000 epochs, with 4-layered symmetric trees.
5) Support Vector Machine: The Support Vector Machine (SVM) [35] is a common machine learning algorithm that has been used for survival prediction of glioma patients. We therefore explore SVM-based support vector classification (SVC) with a linear kernel.
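The scikit-learn baselines above can be configured as in the following sketch, fitted here on synthetic 7-feature data. Mapping "minimum samples required to make a decision at a node" to `min_samples_split=10` is an assumption, and the fixed `random_state` values are added only for reproducibility. XGBoost and CatBoost are omitted because they come from separate packages.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "DT": DecisionTreeClassifier(max_depth=4, min_samples_split=10, random_state=0),
    "RFC": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVC": SVC(kernel="linear"),
}

# Synthetic stand-in for the 7-gene signature with three survival classes.
X, y = make_classification(
    n_samples=200, n_features=7, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)
fitted = {name: model.fit(X[:150], y[:150]) for name, model in models.items()}
val_accuracy = {name: model.score(X[150:], y[150:]) for name, model in fitted.items()}
```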
Moreover, we propose a novel Bayesian neural network based on deep probabilistic programming. Deep probabilistic programming combines the power of deep neural networks with probabilistic models to enhance performance while optimizing the related costs [36]. The traditional, more frequently used algorithms perform less well, as discussed in the results section. This created the need for a better-performing algorithm that can also capture the uncertainty in its predictions.
6) Proposed Bayesian Neural Network: We propose a probabilistic programming based Bayesian neural network (BNN) for the survival class prediction of glioma patients. The first layer receives the chosen 7 features as input, and the final layer outputs the predicted class: short, medium or long. In between, the architecture consists of 2 hidden layers with 24 and 12 neurons, respectively. Both hidden layers use 50% dropout; the 1st hidden layer is followed by a Tanh activation and the 2nd hidden layer by a rectified linear unit (ReLU) activation. All nodes are fully connected to the nodes in the adjacent layers. The BNN differs from a traditional artificial neural network in that it assigns a distribution, instead of a single value, to every parameter (weights and biases). The mean and scale parameters of all distributions are initialized to 0.01 and 0.1, respectively. Stochastic gradient descent with a learning rate of 0.001 is used for training the BNN. Further, stochastic variational inference optimizes the trace implementation of the evidence lower bound (ELBO) in order to fit the probability distributions of the parameters. The proposed BNN architecture is shown in Fig. 2.
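The idea of placing a distribution on every weight and bias can be illustrated with a dependency-light NumPy sketch. It covers only the sampling forward pass of a 7-24-12-3 network with Normal distributions initialized to mean 0.01 and scale 0.1; it uses neither Pyro nor stochastic variational inference, and it omits dropout and training entirely. All names are hypothetical, and the input data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
LAYER_SIZES = [7, 24, 12, 3]   # input, two hidden layers, three survival classes

# One Normal(mean, scale) per weight and bias, initialized to 0.01 and 0.1.
params = [
    {
        "w_mean": np.full((n_in, n_out), 0.01), "w_scale": np.full((n_in, n_out), 0.1),
        "b_mean": np.full(n_out, 0.01), "b_scale": np.full(n_out, 0.1),
    }
    for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:])
]


def sample_forward(x):
    """Draw one set of weights and return class probabilities for inputs x (n x 7)."""
    activations = [np.tanh, lambda h: np.maximum(h, 0.0), lambda h: h]  # Tanh, ReLU, linear
    h = x
    for layer, act in zip(params, activations):
        w = rng.normal(layer["w_mean"], layer["w_scale"])
        b = rng.normal(layer["b_mean"], layer["b_scale"])
        h = act(h @ w + b)
    h = h - h.max(axis=1, keepdims=True)   # stabilize the softmax
    e = np.exp(h)
    return e / e.sum(axis=1, keepdims=True)


def predict(x, n_samples=200):
    """Average class probabilities over weight samples; the spread reflects predictive uncertainty."""
    draws = np.stack([sample_forward(x) for _ in range(n_samples)])
    return draws.mean(axis=0), draws.std(axis=0)


x = rng.normal(size=(5, 7))   # synthetic median-centered expression values
mean_prob, std_prob = predict(x)
```

In a trained BNN the means and scales would be the fitted variational parameters; the Monte Carlo averaging in `predict` stays the same.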
Implementations are developed with the Pyro (version 1.3.1) probabilistic programming language on PyTorch (version 1.5.0) [37]. For the other machine learning algorithms, the Scikit-learn library [38] is used. To measure and compare the performance of the applied machine learning algorithms, 4 metrics are used: accuracy, precision, recall and F1 score.
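The four metrics can be computed for a three-class problem as in the sketch below. The averaging scheme is not stated in the text; support-weighted averaging is assumed here. Under that assumption weighted recall always equals accuracy, which matches the identical accuracy and recall values reported for the non-Bayesian models in Tables III and IV. The labels are synthetic.

```python
import numpy as np


def weighted_metrics(y_true, y_pred, n_classes=3):
    """Accuracy plus support-weighted precision, recall and F1 for multi-class labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))
    precision = recall = f1 = 0.0
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        predicted = np.sum(y_pred == c)
        actual = np.sum(y_true == c)
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = actual / len(y_true)        # class support as the averaging weight
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": float(precision),
            "recall": float(recall), "f1": float(f1)}


# Synthetic labels: 0 = short, 1 = medium, 2 = long.
y_true = [0, 0, 1, 2, 2, 2, 2, 1]
y_pred = [0, 2, 1, 2, 2, 0, 2, 2]
scores = weighted_metrics(y_true, y_pred)
```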
[Figure 2: network diagram. Input layer (TXLNA, STK40, SLC30A7, YTHDF2, SNRNP40, TAF12, WDR77); 1st layer (24 neurons), dropout 50%, Tanh activation; 2nd layer (12 neurons), dropout 25%, ReLU activation; output layer (short, medium, long). Weight distributions are assigned based on Bayes' theorem.]
Fig. 2. Proposed Bayesian neural network architecture.
D. Prognostic Risk model Construction
Further, in order to estimate risk, we calculate a risk score from the survival-related genes of glioma patients. We use univariate Cox proportional hazards regression analysis to evaluate the association between the 7 genes obtained from the iterative Bayesian Model Averaging algorithm and the survival status and time. Genes with p values <0.01 and a hazard ratio (HR) >1 are considered candidate genes for survival estimation. The $\beta$ values, i.e. the regression coefficients obtained from the Cox analysis, are used together with the expression values of the corresponding genes ($\mathrm{exp}_i$) to calculate the risk score. The univariate Cox analysis is performed with the R package survival (version 3.2-3) [39]. Equation (5) gives the formula used to obtain the prognostic risk score.
$$\text{Risk score} = \sum_{i=1}^{n} \beta_i \cdot \mathrm{exp}_i \tag{5}$$
where $n$ is the number of genes included in the prognostic signature. After obtaining the risk score for the CGGA cohort, the median value of the prognostic risk score is used to separate the patients into high and low risk groups. The performance of the proposed signature is then validated on the TCGA cohort.
IV. SYSTEM EVALUATION
A. Overall survival prediction
The accuracy, precision, recall and F1 score were calculated to compare the techniques used to predict the survival class. Each algorithm was trained on the CGGA training split and validated on the validation split of the corresponding fold. Table III shows the metrics averaged over the 4 CGGA validation folds. On all 4 metrics, the proposed BNN algorithm outperforms the other frequently used ML techniques, with 74.50% average accuracy. For comparison, the highest accuracy reported in the literature for radiomics is 68% [40], with a cohort of more than 100 patients used for training. In our results, the second best performing algorithm was the Random Forest Classifier, with over 60% accuracy.
TABLE III
COMPARISON OF OVERALL SURVIVAL PREDICTION WITH MACHINE LEARNING - 4 FOLD CROSS VALIDATION ON CGGA COHORT
ML methods      Accuracy   Precision   Recall    F1 score
DT [31]         55.25%     57.50%      55.25%    56.00%
RFC [32]        62.25%     61.25%      62.25%    61.00%
XGBoost [33]    56.25%     62.75%      56.25%    58.25%
CatBoost [34]   57.25%     62.75%      57.25%    59.00%
SVC [35]        52.75%     67.25%      52.75%    56.50%
BNN             74.50%     67.50%      73.00%    70.25%
Further, we evaluated the models trained on each fold on the TCGA cohort and averaged the resulting metrics over the 4 folds. The BNN achieved the highest accuracy, precision and recall in the testing phase, although its F1 score was comparable to those of the other algorithms. The support vector machine and the decision tree, which have been widely used in survival prediction, showed comparatively low performance with gene expression data. The boosting algorithms showed average performance, and the ensemble algorithm RFC achieved the second best accuracy. The performance comparison on the TCGA cohort is given in Table IV.
B. Prognostic risk model
We built a prognostic model based on the coefficients of the univariate Cox regression analysis and the expression of the 7 most prominent genes associated with survival: TXLNA, STK40, SLC30A7, YTHDF2, SNRNP40, TAF12 and WDR77.
TABLE IV
COMPARISON OF OVERALL SURVIVAL PREDICTION WITH MACHINE LEARNING - TESTING ON TCGA COHORT
ML methods   Accuracy   Precision   Recall    F1 score
DT           44.50%     46.00%      44.50%    45.00%
RFC          54.25%     51.25%      54.25%    51.00%
XGBoost      51.00%     52.25%      51.00%    51.75%
CatBoost     50.25%     52.75%      50.25%    50.75%
SVC          47.25%     56.00%      47.25%    50.00%
BNN          59.75%     57.25%      55.25%    51.00%
The univariate Cox regression analysis results are shown in Table V.
TABLE V
UNIVARIATE COX REGRESSION ANALYSIS ON THE CHOSEN 7 GENES
Gene       Coef in coxPH   Hazard Ratio (95% CI)   p value
TXLNA      0.87            2.4 (2.1-2.7)           2.2e-36
STK40      0.77            2.2 (1.9-2.4)           4.8e-33
SLC30A7    0.8             2.2 (2-2.5)             2.6e-33
YTHDF2     0.74            2.1 (1.8-2.4)           1.5e-29
SNRNP40    0.73            2.1 (1.8-2.4)           1.2e-31
TAF12      0.73            2.1 (1.8-2.3)           1.5e-34
WDR77      0.77            2.2 (1.9-2.5)           5.1e-32
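In a Cox model the hazard ratio of a covariate is the exponential of its coefficient, so the two columns of Table V can be cross-checked with a few lines of Python. The coefficients are copied from Table V; rounding to one decimal place mirrors the precision reported there.

```python
import math

# Cox coefficients from Table V.
coefficients = {
    "TXLNA": 0.87, "STK40": 0.77, "SLC30A7": 0.80, "YTHDF2": 0.74,
    "SNRNP40": 0.73, "TAF12": 0.73, "WDR77": 0.77,
}

# Hazard ratio = exp(coefficient), rounded as in the table.
hazard_ratios = {gene: round(math.exp(beta), 1) for gene, beta in coefficients.items()}
```

A hazard ratio of about 2 means that a one-unit increase in the normalized expression of a gene roughly doubles the hazard of death.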
This analysis shows that the genes chosen with the iterative BMA algorithm are strongly associated with survival, with an HR above 2 for all 7 genes. All Cox coefficients were statistically significant (p values <0.01 for all 7 genes) and positive. Accordingly, we conclude that high expression of these genes is associated with poor survival of glioma patients. The resulting risk score formula is shown in Equation (6).
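$$\begin{aligned}
\text{Risk score} = {} & 0.87 \cdot \mathrm{exp}_{\mathrm{TXLNA}} + 0.77 \cdot \mathrm{exp}_{\mathrm{STK40}} + 0.8 \cdot \mathrm{exp}_{\mathrm{SLC30A7}} + 0.74 \cdot \mathrm{exp}_{\mathrm{YTHDF2}} \\
& + 0.73 \cdot \mathrm{exp}_{\mathrm{SNRNP40}} + 0.73 \cdot \mathrm{exp}_{\mathrm{TAF12}} + 0.77 \cdot \mathrm{exp}_{\mathrm{WDR77}}
\end{aligned} \tag{6}$$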
Based on the above prognostic risk model, the risk score was calculated for all cases in the CGGA and TCGA cohorts. The threshold separating high risk from low risk was taken as the median of the risk scores of the CGGA cohort.
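The risk score and the median split can be sketched in Python as follows. The coefficients are those of Equation (6); the expression matrices are synthetic, the function names are hypothetical, and assigning scores strictly above the threshold to the high risk group is an assumption, since the handling of scores equal to the median is not stated.

```python
import numpy as np

GENES = ["TXLNA", "STK40", "SLC30A7", "YTHDF2", "SNRNP40", "TAF12", "WDR77"]
BETAS = np.array([0.87, 0.77, 0.80, 0.74, 0.73, 0.73, 0.77])   # coefficients of Equation (6)


def risk_scores(expression):
    """Weighted sum of normalized expression values, one score per patient (rows)."""
    return np.asarray(expression, dtype=float) @ BETAS


def assign_risk_groups(scores, threshold):
    """Label patients above the threshold as high risk, the rest as low risk."""
    return np.where(np.asarray(scores) > threshold, "high", "low")


rng = np.random.default_rng(0)
cgga_expr = rng.normal(size=(10, len(GENES)))   # synthetic development cohort
tcga_expr = rng.normal(size=(6, len(GENES)))    # synthetic validation cohort

cgga_scores = risk_scores(cgga_expr)
threshold = float(np.median(cgga_scores))       # threshold learned on the development cohort only
cgga_groups = assign_risk_groups(cgga_scores, threshold)
tcga_groups = assign_risk_groups(risk_scores(tcga_expr), threshold)
```

Reusing the development-cohort threshold on the validation cohort, rather than recomputing a median there, keeps the validation independent of the validation data.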
As shown in Fig. 3(a), the high risk patients have high expression of all 7 genes, and their survival was mostly short or medium. Conversely, when the 7 genes have low expression, the majority of the patients have long survival and thus low risk.
This prognostic risk model was validated on the TCGA cohort by computing the risk score using Equation (6). The same threshold was used to divide the patients into high and low risk groups. The same relationship can be seen in Fig. 3(b), obtained for the TCGA cohort: most of the patients with high expression of the 7 genes also had low survival and high risk.
[Figure 3: heat maps of the expression of TXLNA, STK40, SLC30A7, YTHDF2, SNRNP40, TAF12 and WDR77, annotated with risk group (high risk, low risk) and survival class (short, medium, long); each panel has a color key (value, about -3 to 3) with a count histogram. Panels: (a), (b).]
Fig. 3. Heat map for the (a) CGGA cohort (b) validation TCGA cohort with proposed gene signature.
This can be further clarified by observing the distribution of overall survival in days in each risk group, shown in Fig. 4. Patients with high risk had significantly lower survival, and patients with low risk had a wide range of overall survival in days.
[Figure 4: distribution plot. x-axis: TCGA and CGGA cohorts, each for the high risk and low risk groups; y-axis: overall survival in days (0 to 4000).]
Fig. 4. Overall survival distribution of high and low risk groups.
For the CGGA cohort, the high risk and low risk groups have an overall survival of 572.2658 ± 662.5993 days and 2244.361 ± 1385.437 days, respectively. For testing on TCGA, the overall survival of the high risk and low risk groups was 468.1624 ± 412.3076 days and 1124.43 ± 1131.1 days, respectively. Thus, the high risk patient group has lower survival than the low risk patient group.
Kaplan-Meier plots were obtained for both the TCGA and CGGA cohorts, confirming the lower survival percentage over time for high risk patients. Fig. 5 demonstrates these observations and thereby verifies the performance of the prognostic risk model. The high risk group of the TCGA cohort showed a low survival percentage over time, as shown in Fig. 5(b), which is consistent with the behaviour of the prognostic risk model on the CGGA cohort, shown in Fig. 5(a).
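The Kaplan-Meier estimate behind such curves can be sketched with NumPy. This is a minimal product-limit estimator on synthetic follow-up data; the function name is hypothetical, and a real analysis would typically compute one curve per risk group and compare them with a log-rank test.

```python
import numpy as np


def kaplan_meier(time, event):
    """Return distinct event times and the estimated survival probability after each."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    event_times = np.unique(time[event == 1])
    survival, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)                    # patients still under observation at t
        deaths = np.sum((time == t) & (event == 1))    # deaths observed exactly at t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return event_times, np.array(survival)


# Synthetic follow-up in days; event = 1 is death, 0 is censored.
times, surv = kaplan_meier([100, 200, 200, 400, 500], [1, 1, 0, 1, 0])
```

Censored patients leave the risk set without lowering the curve, which is why the estimator can use cases whose death was not observed.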
[Figure 5: Kaplan-Meier curves for the high risk and low risk groups. x-axis: overall survival in days (0 to 6000 in one panel, 0 to 5000 in the other); y-axis: percent survival (0 to 100). Panels: (a), (b).]
Fig. 5. Kaplan-Meier curves obtained for the high risk and low risk groups. (a) CGGA (b) TCGA dataset.
Fig. 6 shows the expression distribution, in the high risk and low risk groups, of the most prominent genes chosen for prognostic risk model development. The mean expression value of each gene in the low risk group is lower than the mean expression value of the corresponding gene in the high risk group.
[Figure 6: distribution plot. x-axis: gene (TXLNA, STK40, SLC30A7, YTHDF2, SNRNP40, TAF12, WDR77); y-axis: Log2(FPKM+1) (-4 to 4); groups: high risk, low risk.]
Fig. 6. mRNA expression value distribution of each selected gene for the high and low risk groups.
V. DISCUSSION
Many studies reveal that genetic alterations and molecular heterogeneity play a vital role in gliomagenesis and prognosis. Therefore, many prognosis assessment tools have recently been developed with the different types of omics data related to gliomas. In the current study we identified 7 molecular biomarkers with the potential for survival prediction, which reflect the underlying heterogeneity. State-of-the-art survival prediction in gliomas relies mainly on radiomic features extracted from the tumor region [40], [41]. In the Brain Tumor Segmentation challenge 2018, the highest accuracy obtained with imaging features was 68% [40]. Here, we proposed a machine learning based approach for survival prediction with gene expression data that achieves an accuracy over 70% in cross-validation on the CGGA cohort. Moreover, we established a prognostic model to estimate the risk of glioma patients. High-throughput genomics captures the underlying molecular biology of gliomas and has thus shown improvements over radiomics for survival prediction.
Of the 7 genes considered, namely TXLNA, STK40, SLC30A7, YTHDF2, SNRNP40, TAF12 and WDR77, TAF12 has been associated with gliomas in previous studies. Most grade II and III gliomas have mutated IDH1/IDH2, and, as reported by Ren et al. [42], mutated IDH1 down-regulates TAF12 expression. Our observation of low TAF12 expression in the low risk category with long survival, as shown in Fig. 3, is consistent with these findings. Our findings further indicate that there are 6 other genes, not previously identified as associated with survival, that can be used for risk and survival estimation.
However, most previous studies rely on a single cohort to avoid the inconsistencies that occur between different datasets. These inconsistencies often arise from technical limitations, as the datasets are acquired on different high-throughput platforms, and can also be affected by probe cross-hybridization, redundancy and annotation [43]. The significantly lower performance on TCGA compared with CGGA in our survival prediction is therefore most likely due to these reasons. Nevertheless, as shown in Fig. 3(b), the prognostic model, although developed on the CGGA cohort, is still able to separate the patients of the TCGA cohort into risk groups, which makes it a relatively promising prognostic model.
For the prognostic risk score model, unlike previous studies, we consider the complete cohort of gliomas without separating them into grades. Hence, the prognostic risk model can be deployed for risk estimation without a prior diagnosis of the grade. Further, this prognostic risk score model requires only 7 genes, making it less complex for clinicians.
According to previous studies, probabilistic programming languages are capable of learning from a small number of samples [44]. Thus, in our application, with its shortage of training samples, this algorithm showed improvements over traditional machine learning algorithms. Further, Deep Probabilistic Programming Languages (DPPL), which combine deep learning models with probabilistic programming, have shown strong potential to achieve promising outcomes in deep learning computations [36]. In the future, the overall survival prediction task will be extended with an explainable method in order to identify the contribution of the expression of each gene while maintaining high accuracy.
VI. CONCLUSION
This study has presented a solution for survival prediction of glioma patients based on genomics with a comparatively large dataset. We identified a 7-gene signature associated with survival and proposed two approaches for prognosis prediction that can separate glioma patients into groups based on their survival and risk. We proposed a novel Bayesian neural network for survival prediction that surpasses the state-of-the-art survival prediction approaches. Moreover, we established a comprehensive prognostic risk model based on Cox-PH that estimates the risk of glioma patients, including both GBM and LGG. Both approaches show promising predictive ability for the survival and risk of glioma patients, with over 70% cross-validation accuracy for overall survival class prediction.
VII. ACKNOWLEDGEMENT
We acknowledge the support from the Senate Research Committee Grant SRC/LT/2019/18, University of Moratuwa, Sri Lanka.
REFERENCES
[1] E. Aquilanti, J. Miller, S. Santagata, D. P. Cahill, and P. K. Brastianos,
“Updates in prognostic markers for gliomas,” Neuro-Oncology, vol. 20,
no. suppl 7, pp. vii17–vii26, 2018.
[2] M. Weller, W. Wick, K. Aldape, M. Brada, M. Berger, S. M. Pfister,
R. Nishikawa, M. Rosenthal, P. Y. Wen, R. Stupp et al., “Glioma,”
Nature reviews Disease primers, vol. 1, no. 1, pp. 1–18, 2015.
[3] G. N. Fuller and B. W. Scheithauer, “The 2007 revised world health
organization (who) classification of tumours of the central nervous
system: newly codified entities,” Brain pathology, vol. 17, no. 3, pp.
304–307, 2007.
[4] L. A. Gravendeel, M. C. Kouwenhoven, O. Gevaert, J. J. de Rooi, A. P.
Stubbs, J. E. Duijm, A. Daemen, F. E. Bleeker, L. B. Bralten, N. K.
Kloosterhof et al., “Intrinsic gene expression profiles of gliomas are a
better predictor of survival than histology,” Cancer research, vol. 69,
no. 23, pp. 9065–9072, 2009.
[5] K. Ludwig and H. I. Kornblum, “Molecular markers in glioma,” Journal
of neuro-oncology, vol. 134, no. 3, pp. 505–512, 2017.
[6] M. Moghtadaei, M. R. H. Golpayegani, F. Almasganj, A. Etemadi, M. R.
Akbari, and R. Malekzadeh, “Predicting the risk of squamous dysplasia
and esophageal squamous cell carcinoma using minimum classification
error method,” Computers in biology and medicine, vol. 45, pp. 51–57,
2014.
[7] M. S. Bal, V. K. Bodal, J. Kaur, M. Kaur, and S. Sharma, “Patterns
of cancer: A study of 500 punjabi patients,” Asian Pac J Cancer Prev,
vol. 16, no. 12, pp. 5107–10, 2015.
[8] A. Bashiri, M. Ghazisaeedi, R. Safdari, L. Shahmoradi, and H. Ehte-
sham, “Improving the prediction of survival in cancer patients by using
machine learning techniques: experience of gene expression data: a
narrative review,” Iranian journal of public health, vol. 46, no. 2, p.
165, 2017.
[9] N. Wijethilake, M. Islam, and H. Ren, “Radiogenomics model for overall
survival prediction of glioblastoma,” Medical & Biological Engineering
& Computing, 2020.
[10] N. Wijethilake, M. Islam, D. Meedeniya, C. Chitraranjan, I. Perera, and
H. Ren, “Radiogenomics of glioblastoma: Identification of radiomics as-
sociated with molecular subtypes,” 2nd MICCAI workshop on Radiomics
and Radiogenomics in Neuro-oncology using AI, Springer, LNCS, (to
appear), 2020.
[11] Y.-C. Chen, W.-W. Yang, and H.-W. Chiu, “Artificial neural network
prediction for cancer survival time by gene expression data,” in 2009
3rd International Conference on Bioinformatics and Biomedical Engi-
neering. IEEE, 2009, pp. 1–4.
[12] W. Rasanjana, S. Rajapaksa, I. Perera, and D. Meedeniya, “A svm
model for candidate y-chromosome gene discovery in prostate cancer,”
in Proceedings of 11th International Conference, vol. 60, 2019, pp. 129–
138.
[13] M. Labussiere, A. Di Stefano, V. Gleize, B. Boisselier, M. Giry,
S. Mangesius, A. Bruno, R. Paterra, Y. Marie, A. Rahimian et al.,
“Tert promoter mutations in gliomas, genetic associations and clinico-
pathological correlations,” British journal of cancer, vol. 111, no. 10,
pp. 2024–2032, 2014.
[14] X. Wang, J.-x. Chen, J.-p. Liu, C. You, Y.-h. Liu, and Q. Mao, “Gain
of function of mutant tp53 in glioblastoma: prognosis and response to
temozolomide,” Annals of surgical oncology, vol. 21, no. 4, pp. 1337–
1344, 2014.
[15] R. G. Verhaak, K. A. Hoadley, E. Purdom, V. Wang, Y. Qi, M. D.
Wilkerson, C. R. Miller, L. Ding, T. Golub, J. P. Mesirov et al.,
“Integrated genomic analysis identifies clinically relevant subtypes of
glioblastoma characterized by abnormalities in pdgfra, idh1, egfr, and
nf1,” Cancer cell, vol. 17, no. 1, pp. 98–110, 2010.
[16] M. Vitucci, D. Hayes, and C. Miller, “Gene expression profiling of
gliomas: merging genomic and histopathological classification for per-
sonalised therapy,” British journal of cancer, vol. 104, no. 4, pp. 545–
553, 2011.
[17] L. P. Petalidis, A. Oulas, M. Backlund, M. T. Wayland, L. Liu, K. Plant,
L. Happerfield, T. C. Freeman, P. Poirazi, and V. P. Collins, “Improved
grading and survival prediction of human astrocytic brain tumors by
artificial neural network analysis of gene expression microarray data,”
Molecular cancer therapeutics, vol. 7, no. 5, pp. 1013–1024, 2008.
[18] D. Faraggi and R. Simon, “A neural network model for survival data,”
Statistics in medicine, vol. 14, no. 1, pp. 73–82, 1995.
[19] M. C. O’Neill and L. Song, “Neural network analysis of lymphoma
microarray data: prognosis and diagnosis near-perfect,” BMC bioinfor-
matics, vol. 4, no. 1, p. 13, 2003.
[20] L. J. Lancashire, D. Powe, J. Reis-Filho, E. Rakha, C. Lemetre,
B. Weigelt, T. Abdel-Fatah, A. R. Green, R. Mukta, R. Blamey et al.,
“A validated gene expression profile for detecting clinical outcome in
breast cancer using artificial neural networks,” Breast cancer research
and treatment, vol. 120, no. 1, pp. 83–93, 2010.
[21] W. A. Freije, F. E. Castro-Vargas, Z. Fang, S. Horvath, T. Cloughesy,
L. M. Liau, P. S. Mischel, and S. F. Nelson, “Gene expression profiling
of gliomas strongly predicts survival,” Cancer research, vol. 64, no. 18,
pp. 6503–6510, 2004.
[22] V. Bonato, V. Baladandayuthapani, B. M. Broom, E. P. Sulman, K. D.
Aldape, and K.-A. Do, “Bayesian ensemble methods for survival pre-
diction in gene expression data,” Bioinformatics, vol. 27, no. 3, pp. 359–
367, 2011.
[23] J. B.-K. Hsu, T.-H. Chang, G. A. Lee, T.-Y. Lee, and C.-Y. Chen,
“Identification of potential biomarkers related to glioma survival by gene
expression profile analysis,” BMC medical genomics, vol. 11, no. 7,
p. 34, 2019.
[24] J. S. Wei, B. T. Greer, F. Westermann, S. M. Steinberg, C.-G. Son, Q.-
R. Chen, C. C. Whiteford, S. Bilke, A. L. Krasnoselsky, N. Cenacchi
et al., “Prediction of clinical outcome using gene expression profiling
and artificial neural networks for patients with neuroblastoma,” Cancer
research, vol. 64, no. 19, pp. 6883–6891, 2004.
[25] W.-J. Zeng, Y.-L. Yang, Z.-Z. Liu, Z.-P. Wen, Y.-H. Chen, X.-L. Hu,
Q. Cheng, J. Xiao, J. Zhao, and X.-P. Chen, “Integrative analysis of
dna methylation and gene expression identify a three-gene signature for
predicting prognosis in lower-grade gliomas,” Cellular Physiology and
Biochemistry, vol. 47, no. 1, pp. 428–439, 2018.
[26] S. Zuo, X. Zhang, and L. Wang, “A rna sequencing-based six-gene
signature for survival prediction in patients with glioblastoma,” Scientific
reports, vol. 9, no. 1, pp. 1–10, 2019.
[27] L. Wang, Z. Yan, X. He, C. Zhang, H. Yu, and Q. Lu, “A 5-gene
prognostic nomogram predicting survival probability of glioblastoma
patients,” Brain and behavior, vol. 9, no. 4, p. e01258, 2019.
[28] H. Gittleman, A. E. Sloan, and J. S. Barnholtz-Sloan, “An independently
validated survival nomogram for lower-grade glioma,” Neuro-oncology,
vol. 22, no. 5, pp. 665–674, 2020.
[29] F. Abbas-Aghababazadeh, Q. Li, and B. L. Fridley, “Comparison of
normalization approaches for gene expression studies completed with
high-throughput sequencing,” PloS one, vol. 13, no. 10, 2018.
[30] A. Annest, R. E. Bumgarner, A. E. Raftery, and K. Y. Yeung, “Iterative
bayesian model averaging: A method for the application of survival
analysis to high-dimensional microarray data,” BMC bioinformatics,
vol. 10, no. 1, p. 72, 2009.
[31] E. Marubini, A. Morabito, and M. Valsecchi, “Prognostic factors and risk
groups: some results given by using an algorithm suitable for censored
survival data,” Statistics in medicine, vol. 2, no. 2, pp. 295–303, 1983.
[32] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp.
5–32, 2001.
[33] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,”
in Proceedings of the 22nd acm sigkdd international conference on
knowledge discovery and data mining, 2016, pp. 785–794.
[34] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin,
“Catboost: unbiased boosting with categorical features,” in Advances in
neural information processing systems, 2018, pp. 6638–6648.
[35] A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,”
Statistics and computing, vol. 14, no. 3, pp. 199–222, 2004.
[36] I. Rubasinghe and D. Meedeniya, “Ultrasound nerve segmentation
using deep probabilistic programming,” Journal of ICT Research and
Applications, vol. 13, no. 3, pp. 241–256, 2019.
[37] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan,
T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman,
“Pyro: Deep universal probabilistic programming,” The Journal of
Machine Learning Research, vol. 20, no. 1, pp. 973–978, 2019.
[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vander-
plas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch-
esnay, “Scikit-learn: Machine learning in Python,” Journal of Machine
Learning Research, vol. 12, pp. 2825–2830, 2011.
[39] T. M. Therneau, A Package for Survival Analysis in R, 2020, R
package version 3.2-3. [Online]. Available:
https://CRAN.R-project.org/package=survival
[40] Z. A. Shboul, M. Alam, L. Vidyaratne, L. Pei, M. I. Elbakary, and K. M.
Iftekharuddin, “Feature-guided deep radiomics for glioblastoma patient
survival prediction,” Frontiers in Neuroscience, vol. 13, 2019.
[41] M. Islam, V. S. Vibashan, V. J. M. Jose, N. Wijethilake, U. Utkarsh, and
H. Ren, “Brain Tumor Segmentation and Survival Prediction Using 3D
Attention UNet,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and
Traumatic Brain Injuries, A. Crimi and S. Bakas, Eds. Cham: Springer
International Publishing, 2020, pp. 262–272.
[42] J. Ren, M. Lou, J. Shi, Y. Xue, and D. Cui, “Identifying the genes
regulated by idh1 via gene-chip in glioma cell u87,” International
journal of clinical and experimental medicine, vol. 8, no. 10, p. 18090,
2015.
[43] S. Zhao, W.-P. Fung-Leung, A. Bittner, K. Ngo, and X. Liu, “Compar-
ison of rna-seq and microarray in transcriptome profiling of activated t
cells,” PloS one, vol. 9, no. 1, p. e78644, 2014.
[44] L. Fei-Fei, R. Fergus, and P. Perona, “One-shot learning of object cate-
gories,” IEEE transactions on pattern analysis and machine intelligence,
vol. 28, no. 4, pp. 594–611, 2006.