Survival prediction and risk estimation of Glioma patients using mRNA expressions

Navodini Wijethilake
Department of Computer Science Engineering
University of Moratuwa
Sri Lanka
navodiniw@cse.mrt.ac.lk
Dulani Meedeniya
Department of Computer Science Engineering
University of Moratuwa
Sri Lanka
dulanim@cse.mrt.ac.lk
Charith Chitraranjan
Department of Computer Science Engineering
University of Moratuwa
Sri Lanka
charithc@cse.mrt.ac.lk
Indika Perera
Department of Computer Science Engineering
University of Moratuwa
Sri Lanka
indika@cse.mrt.ac.lk
Abstract—Gliomas are a lethal type of central nervous system tumor with a poor prognosis. Recent advances in microarray technologies have made gene expression data available for thousands of genes in glioma patients, enabling analysis from many perspectives. As a result, genomics has emerged as a tool for prognosis analysis. In this work, we identify a survival-related 7-gene signature and explore two approaches, one for survival prediction and one for risk estimation. For survival prediction, we propose a novel probabilistic programming based approach, which outperforms traditional machine learning algorithms. The proposed algorithm achieves an average accuracy of 74% over 4-fold cross-validation. Further, we construct a prognostic risk model for risk estimation of glioma patients. This model reflects the survival of glioma patients, assigning high risk to patients with low survival.
Index Terms—Glioma, gene expression, risk score, probabilistic programming, Bayesian neural networks
I. INTRODUCTION
Gliomas are the most common central nervous system tumors and derive from neuroglial cells and progenitor cells [1]. Gliomas account for 30% of primary brain tumors and 80% of malignant brain tumors, and cause the majority of deaths from primary brain tumors [2]. Despite treatment, the aggressive forms of glioma lead to death within months. At present, treatment planning for glioma patients relies mostly on histology and other clinical parameters such as age. Based on histology, gliomas are classified into astrocytic, oligodendroglial and ependymal tumors. Additionally, considering the malignancy, the natural course of the disease, and the absence or presence of anaplastic features, they are assigned World Health Organization (WHO) grades I-IV [3]. Yet intra-tumor heterogeneity and molecular-level alterations are more strongly associated with the prognosis of glioma patients than the underlying histology [4].
Recently, with progress in genomic, transcriptomic and epigenetic profiling and with technological advances in microarrays and in the sequencing of ribonucleic acid (RNA), a molecule that supports several biological roles of genes, novel approaches for classifying and analysing gliomas have been recognized. Moreover, the underlying molecular pathogenesis helps to identify genetic alterations that can cause gliomas and that can complement histological classification and diagnostics [2]. Thus, following the WHO 2016 classification, tumors that are isocitrate dehydrogenase (IDH) family IDH1/2 mutant and 1p/19q co-deleted, mostly with oligodendroglial histology, have the best prognosis and belong to WHO grade II. WHO grade III gliomas are IDH1/2 mutant, 1p/19q non-co-deleted, Telomerase Reverse Transcriptase (TERT) promoter wild-type, tumor protein p53 (TP53) mutant tumors with an astrocytic morphology and have intermediate survival. IDH wild-type, 1p/19q non-co-deleted tumors have a poor prognosis and are mostly grade IV glioblastoma multiforme (GBM). In addition, several other molecular signatures are associated with the diagnosis and prognosis of gliomas, such as TERT, TP53, O-6-methylguanine deoxyribonucleic acid (DNA) methyltransferase (MGMT) and phosphatase and tensin homolog (PTEN) [5].
Oncologists generally assess the survival of patients based on their experience and on clinical factors, which include the size, location and stage of the cancer. Hence, these decisions can be biased, optimistic and inaccurate [6]. Accurate survival prediction, in contrast, leads to less invasive and better treatment with optimal use of resources [7]. Gene expression data has been used for survival prediction in several types of cancer and shows promising improvements over other approaches such as radiomics-based algorithms [8]–[10]. Artificial neural networks, a class of supervised machine learning algorithms, have been developed for survival prediction from gene expression data for many cancer types [11], [12]. However, machine learning algorithms have rarely been applied to gene expression data to predict the survival of glioma patients, either in days or as survival classes.
arXiv:2011.00659v1 [q-bio.GN] 2 Nov 2020
In our previous work [9], we demonstrated that incorporating both imaging biomarkers and gene expression biomarkers improves the survival prediction accuracy for glioblastoma patients compared to using either feature type separately. However, the lack of data was a limitation of that study, which motivated us to build a comparatively large dataset from all the glioma cases, both higher-grade and lower-grade, of The Cancer Genome Atlas (TCGA) and Chinese Glioma Genome Atlas (CGGA) cohorts. We expect that this work will open new directions in survival prediction for glioma patients, with a stronger focus on genomics.
The proposed study addresses two major parts of survival estimation in glioma patients. We apply machine learning algorithms for survival prediction of glioma patients after performing feature selection on high-dimensional gene expression data. Further, we propose a novel probabilistic programming based Bayesian neural network that predicts the survival of glioma patients in 3 categories: short, medium and long survival. The selected prognostic genes are also used to develop a prognostic risk score model. The main outline of the work is shown in Fig. 1. The main objective of this study is to identify a prognostic gene signature and construct a prognostic model for risk estimation that provides useful insights into the prognosis of glioma patients. Thus, the 7-gene-signature based survival prediction algorithm with probabilistic programming and the prognostic risk model are the novel contributions of this study.
[Figure 1: flowchart. Recoverable labels: TCGA GBM & LGG cohort (252 cases, 16510 genes); CGGA glioma cohort (315 cases, 24326 genes); cleaning and filtering; log2 and median centering; 13298 genes; training (80%), validating (20%), testing; iterative Bayesian Model Averaging algorithm; 7 genes; prognostic risk score development; overall survival prediction into short (<300 days), medium (300-450 days) and long (>450 days).]
Fig. 1. Overview of the proposed solution.
The paper is structured as follows: Section II explores the related work, Section III explains our approach for the analysis and Section IV presents our observations and the evaluation summary. In Section V we discuss the important aspects and limitations of our research.
II. BACKGROUND
Mutations, methylations and other phenomena that occur at the molecular level are associated with the expression levels of the corresponding biomarkers [13]. TP53 is recognized as a negative prognostic marker, indicating poor prognosis in astrocytic and oligoastrocytic gliomas. Hence, the expression of the TP53 gene is inversely correlated with the survival of glioma patients [14]. Correspondingly, TERT promoter mutation is common in gliomas and also indicates a negative prognosis in glioma patients [13]. In fact, the expression of the TERT gene is significantly higher with mutated TERT promoters [13]. Given the above, it is clear that the expression of genes in gliomas is significantly associated with other molecular biomarkers and also affects the survival and prognosis of glioma patients.
Gene expression profiling is commonly used for clustering
and subtype classifications of gliomas, using both unsuper-
vised and supervised approaches [15], [16]. Artificial neural
network (ANN) based subtype classification is also performed
on gene expression profiles of glioma patients [17].
Several studies on cancer prognosis have used gene expression data of genetic biomarkers to predict cancer occurrence, cancer recurrence, and outcomes after diagnosis such as mortality, life expectancy and drug sensitivity. The first application of machine learning in this area, an artificial neural network, was reported in 1995 [18]. O’Neill et al. [19] employed a neural network for diagnosing diffuse large B-cell lymphoma from microarray gene expression profiles. Later, Chen et al. [11] proposed a gene expression based artificial neural network for predicting the survival time of lymphoma patients. A similar methodology was employed by Lancashire et al. [20] for breast cancer survival outcome prediction.
Gene expression profiling is also used for estimating the survival of glioma patients, as it can reveal otherwise unrecognized heterogeneity of gliomas through hierarchical clustering [21]. Bonata et al. [22] proposed a Bayesian ensemble model for survival prediction with high-dimensional gene expression profiles, selecting the genes potentially related to gliomagenesis. Identifying potential biomarkers related to glioma survival also typically involves gene expression profile analysis [23]. Similarly, neural networks have been used with expression data to predict the survival outcome of neuroblastoma patients 5 years after diagnosis [24].
Risk score formulation, as a prognosis estimation tool, has also been established with gene expression data for glioma patients [25], [26]. For this, the most prominent features are chosen with statistical methods based on the relationship between each gene and survival. The predominant statistical approaches are univariate and multivariate Cox proportional hazards regression analysis, in which the features associated with survival are identified from the hazard ratio and the p value of each gene. In some studies [25], the features have also been chosen based on the methylation status of each gene. However, some studies have noted that the risk score is not an accurate reflection of the survival probability [26].
Nomograms are another approach for estimating the survival probability of glioblastoma patients after a particular time period [27], [28].
III. SYSTEM MODEL AND METHODOLOGY
A. Dataset
The publicly available GBM and lower-grade glioma (LGG) gene expression datasets of TCGA and CGGA, whose expression values serve as biomarkers to classify subjects into risk groups, are downloaded for our study. The TCGA dataset comprises 252 cases with overall survival information and gene expression profiles of 16510 genes, obtained using the Illumina HiSeq RNA Sequencing platform. WHO grades II and III are included in the TCGA-LGG cohort and WHO grade IV is included in the TCGA-GBM cohort. The CGGA dataset consists of 315 cases with overall survival data and gene expression profiles of 24326 genes, acquired using the Illumina HiSeq 2000 platform, a high-throughput sequencing system. The initial gene expression values, provided in comma-separated values (CSV) format, are normalized by gene length into fragments per kilobase per million mapped reads (FPKM) [29].
$$\mathrm{FPKM} = \frac{\text{Total fragments}}{\text{Mapped reads (millions)} \times \text{Exon length (kilobase pairs)}} \tag{1}$$
Moreover, all gene expression features are log-transformed and median-centered before analysis. After normalization, the 13094 genes common to both the TCGA and CGGA datasets are chosen. The CGGA dataset is divided into training and validating datasets, while the TCGA dataset is used for testing. The class of each patient is determined from the overall survival in days, with three classes: short survival (<300 days), medium survival (300-450 days) and long survival (>450 days). The CGGA dataset is divided into 4 folds with the same class distribution in each fold, so that the class ratio is maintained in the training and validating folds and no case appears in both. Table I shows the class distribution in the CGGA and TCGA cohorts and in the training and validating datasets of the CGGA cohort.
TABLE I
DATASET DESCRIPTION
Class     CGGA training   CGGA validating   TCGA testing
Short     63              21                68
Medium    29              9                 44
Long      145             48                140
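The class thresholds described above can be sketched as a small labelling function. This is a minimal illustration, not the authors' code: the function name, the constants and the example survival values are hypothetical, and treating exactly 300 and 450 days as medium survival follows the "300-450 days" wording.

```python
from collections import Counter

SHORT_MAX_DAYS = 300  # short survival: fewer than 300 days
LONG_MIN_DAYS = 450   # long survival: more than 450 days


def survival_class(overall_survival_days):
    """Map overall survival in days to 'short', 'medium' or 'long'."""
    if overall_survival_days < SHORT_MAX_DAYS:
        return "short"
    if overall_survival_days > LONG_MIN_DAYS:
        return "long"
    return "medium"  # 300 to 450 days, boundaries included (assumption)


# Hypothetical overall-survival values in days, for illustration only.
example_days = [120, 299, 300, 450, 451, 2000]
labels = [survival_class(d) for d in example_days]
class_counts = Counter(labels)
print(labels)
```

Applied to a real cohort, `class_counts` would give the per-class totals of the kind reported in Table I.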
B. Prognostic Gene Identification
We utilize a Bayesian Model Averaging (BMA) [30] based feature selection approach in order to obtain a robust learning model. The BMA algorithm accounts for model uncertainty by averaging over the posterior probability distributions of several models. Thus, the posterior probability of a quantity of interest $\Phi$ given the training dataset $D$ can be written as follows.
$$\Pr(\Phi \mid D) = \sum_{M_i \in S} \Pr(\Phi \mid D, M_i) \cdot \Pr(M_i \mid D) \tag{2}$$
where $M_i$ is a given model in the subset $S = \{M_1, M_2, \ldots, M_n\}$. Initially, the genes are ranked based on Cox proportional hazards analysis (Cox-PH), which can handle censored data. The Cox hazard function (3) for the covariate vector of subject $p$, $z_p = (z_{1p}, \ldots, z_{ip})$, describes the instantaneous risk of dying at a given time $T$, given that the patient has survived until time $T$.
$$\lambda(T, z_p) = \lambda_0(T) \exp(z_p \alpha) \tag{3}$$
where $\lambda_0$ is the baseline hazard function and $\alpha = (\alpha_{1p}, \ldots, \alpha_{ip})$ are the coefficients of the covariates $(z_{1p}, \ldots, z_{ip})$. Because the baseline hazard function is shared by all subjects, it cancels in the ratio of hazards and does not need to be specified. Thus, only an estimate of $\alpha$ is required, and it can be calculated with the following partial likelihood.
$$PL(\alpha) = \prod_{p=1}^{n} \left( \frac{\exp(z_p \alpha)}{\sum_{\ell \in R_p} \exp(z_\ell \alpha)} \right)^{\delta_p} \tag{4}$$
In Equation (4), $R_p$ is the risk set, i.e. the subjects that have not experienced an event by time $t_p$, and $\delta_p$ is the event status (censored or not) of patient $p$. The parameter $\alpha$ is obtained by maximizing the partial likelihood, and the genes are then ranked in descending order of the resulting log-likelihood.
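To make the partial likelihood concrete, the sketch below evaluates its logarithm for a single gene and locates the maximizing coefficient with a coarse grid search. It is an illustration under stated assumptions: the data are invented, ties are not corrected for, and the grid search stands in for whatever optimizer a Cox-PH implementation actually uses.

```python
import numpy as np


def cox_partial_log_likelihood(alpha, z, time, event):
    """Log of the Cox partial likelihood for one covariate (no tie correction).

    z: covariate value per subject; time: follow-up time per subject;
    event: 1 if the event (death) was observed, 0 if censored.
    """
    z = np.asarray(z, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    linear = alpha * z
    total = 0.0
    for p in np.flatnonzero(event == 1):
        at_risk = time >= time[p]  # risk set R_p
        total += linear[p] - np.log(np.sum(np.exp(linear[at_risk])))
    return total


# Hypothetical expression values, follow-up times (days) and event flags.
gene_expression = [1.5, 0.2, 1.0, -0.8, -0.3, 0.5]
followup_days = [100, 150, 300, 900, 700, 500]
death_observed = [1, 1, 1, 0, 1, 1]

# Coarse grid search for the maximizing coefficient (illustrative only).
alphas = np.linspace(-3.0, 3.0, 601)
values = [
    cox_partial_log_likelihood(a, gene_expression, followup_days, death_observed)
    for a in alphas
]
best_alpha = float(alphas[int(np.argmax(values))])
print(best_alpha, max(values))
```

Repeating this per gene and sorting by the maximized log-likelihood gives a ranking of the kind described above; the positive `best_alpha` here reflects that higher expression goes with earlier death in the toy data.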
The 25 top-ranked genes are assigned to a window and the traditional BMA algorithm is applied for survival analysis. Genes with low posterior probabilities (<1%) are eliminated, retaining the genes with high posterior probabilities. The window is then moved along the ranked genes until 7 genes with high posterior probabilities are obtained. These 7 genes and their posterior probabilities are given in Table II. TXLNA has the highest posterior probability (100%), and the other genes have posterior probabilities between 1% and 100%.
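The averaging in Equation (2) can be illustrated numerically. The three models and all probabilities below are invented for the example; only the weighting rule comes from the text.

```python
import numpy as np

# Hypothetical values for three candidate models M_1..M_3.
# Pr(M_i | D): posterior model probabilities (they sum to 1).
model_posteriors = np.array([0.5, 0.3, 0.2])
# Pr(Phi | D, M_i): probability of the quantity of interest under each model.
prob_given_model = np.array([0.9, 0.6, 0.1])

# Equation (2): weight each model's probability by its posterior probability.
bma_probability = float(np.sum(prob_given_model * model_posteriors))
print(round(bma_probability, 3))
```

A model with a small posterior probability contributes little to the average, which is why genes that only appear in weakly supported models end up with low posterior probabilities.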
C. Survival Prediction
We utilize the most commonly used machine learning algorithms to predict the overall survival class: short, medium or long. For all of them, the input is the selected 7-gene signature. To overcome the class imbalance shown in Table I, the entries of the minority classes (i.e. the short and medium classes) in the training dataset are randomly oversampled.
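Random oversampling of the minority classes can be sketched as follows. The source does not state the target size, so resampling every class up to the size of the largest class is an assumption; the function name and the random features are hypothetical, while the class counts are the CGGA training counts from Table I.

```python
import numpy as np


def random_oversample(features, labels, seed=0):
    """Resample each minority class with replacement up to the majority size."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    chosen = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        # Draw the missing number of samples from this class at random.
        extra = rng.choice(idx, size=target - count, replace=True)
        chosen.append(np.concatenate([idx, extra]))
    chosen = np.concatenate(chosen)
    return features[chosen], labels[chosen]


# Class counts of the CGGA training split (Table I); features are random.
labels = np.array(["short"] * 63 + ["medium"] * 29 + ["long"] * 145)
features = np.random.default_rng(1).normal(size=(labels.size, 7))
balanced_x, balanced_y = random_oversample(features, labels)
print(balanced_x.shape)
```

Oversampling is applied to the training data only, so the validation and test splits keep their original class distribution.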
TABLE II
SELECTED 7 GENES AND THEIR CORRESPONDING POSTERIOR PROBABILITIES
Gene       Posterior probability (%)
TXLNA      100.0
WDR77      50.5
TAF12      48.9
STK40      35.9
YTHDF2     14.7
SNRNP40    8.0
SLC30A7    2.0
1) Decision Tree: A decision tree (DT) [31] is a tree-based structure in which each node specifies a condition on an input covariate that determines the distribution or the final class. The decision tree is fine-tuned to obtain the best results, with a maximum tree depth of 4 and a minimum of 10 samples required to make a decision at a node.
2) Random Forest Classification: The Random Forest Classifier (RFC) [32] is a widely used ensemble model that outperforms a single decision tree by reducing instability and variance with multiple decision trees. In the RFC, the number of trees in the forest is tuned to 100.
3) XGBoost: XGBoost [33] is an ensemble model in which the trees are trained sequentially, boosting the prediction accuracy. The fine-tuned XGBoost model consists of 50 sequentially trained trees with a maximum tree depth of 4 layers.
4) CatBoost: CatBoost [34] is a sequentially trained boosting algorithm, which outperforms the existing boosting algorithms in efficiency. We utilize a multi-class loss function with a learning rate of 0.005 for 2000 epochs with 4-layered symmetric trees.
5) Support Vector Machine: The Support Vector Machine (SVM) [35] is a common machine learning algorithm that has been used for survival prediction of glioma patients. Thus, we explore support vector classification (SVC) based on an SVM with a linear kernel.
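A minimal scikit-learn sketch of three of these baselines with the stated hyperparameters is given below, evaluated with stratified 4-fold cross-validation. Several points are assumptions: the data are synthetic stand-ins for the 7-gene matrix, `min_samples_split=10` is one reading of "minimum samples required to make a decision at a node", the random seeds are arbitrary, and XGBoost and CatBoost are left out to avoid extra dependencies. Oversampling of the training folds is also omitted for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 7-gene expression matrix and 3 survival classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 7))
y = rng.integers(0, 3, size=120)

models = {
    "DT": DecisionTreeClassifier(max_depth=4, min_samples_split=10, random_state=0),
    "RFC": RandomForestClassifier(n_estimators=100, random_state=0),
    "SVC": SVC(kernel="linear"),
}

# Stratified folds keep the class ratio the same in every split.
folds = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
mean_accuracy = {
    name: float(cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean())
    for name, model in models.items()
}
print(mean_accuracy)
```

On random labels the accuracies hover around chance level; the point of the sketch is the configuration and the evaluation loop, not the numbers.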
Moreover, we propose a novel Bayesian neural network based on deep probabilistic programming. Deep probabilistic programming combines the power of deep neural networks with probabilistic models to enhance performance while optimizing the related costs [36]. The traditional, more frequently utilized algorithms show lower performance, as discussed in the results section. This created the need for a better-performing algorithm that can account for the uncertainty in its predictions.
6) Proposed Bayesian Neural Network: We propose a probabilistic programming based Bayesian neural network (BNN) for the survival class prediction of glioma patients. The first layer receives the chosen 7 features as the input. The final layer outputs the predicted class: short, medium or long. In between, the architecture consists of 2 hidden layers with 24 and 12 neurons, respectively. Both hidden layers use 50% dropout; the 1st hidden layer is followed by a Tanh activation and the 2nd hidden layer by a rectified linear unit (ReLU) activation. All nodes are fully connected with the nodes in the adjacent layers. A BNN differs from a traditional artificial neural network in that it assigns distributions, instead of single values, to all the parameters, i.e. the weights and biases. The mean and scale parameters of all the distributions are initialized to 0.01 and 0.1, respectively. Stochastic gradient descent optimization with a learning rate of 0.001 is utilized for training the BNN. Further, stochastic variational inference optimizes the trace implementation of the evidence lower bound (ELBO) in order to approximate the posterior probability distributions of the parameters. The proposed BNN architecture is shown in Fig. 2.
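For reference, the ELBO optimized by stochastic variational inference has the standard form below. The notation is introduced here and is not taken from the source: $w$ denotes the weights and biases, $p(w)$ their prior, $q_\phi(w)$ the variational distribution with parameters $\phi$ (here the means and scales), and $D$ the training data.

$$\mathrm{ELBO}(\phi) = \mathbb{E}_{q_\phi(w)}\left[\log p(D \mid w)\right] - \mathrm{KL}\left(q_\phi(w) \,\|\, p(w)\right) \le \log p(D)$$

Maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the Kullback-Leibler divergence between $q_\phi(w)$ and the true posterior $p(w \mid D)$, which is how the fitted distributions come to approximate the posterior over the network parameters.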
Implementations are developed with the Pyro (version 1.3.1) probabilistic programming language on PyTorch (version 1.5.0) [37]. For the other machine learning algorithms, the Scikit-learn library [38] is utilized. To measure and compare the performance of the applied machine learning algorithms, 4 metrics are used: Accuracy, Precision, Recall and F1 score.
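The four metrics can be sketched for the 3-class case as follows. The source does not state how per-class precision, recall and F1 are averaged; support-weighted averaging is assumed here, under which recall always equals accuracy, as it does in the baseline rows of Tables III and IV. The function name and the example labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        cls_precision = tp / predicted if predicted else 0.0
        cls_recall = tp / count
        denom = cls_precision + cls_recall
        cls_f1 = 2 * cls_precision * cls_recall / denom if denom else 0.0
        weight = count / n  # share of true samples in this class
        precision += weight * cls_precision
        recall += weight * cls_recall
        f1 += weight * cls_f1
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical true and predicted survival classes.
y_true = ["short", "short", "medium", "long", "long", "long"]
y_pred = ["short", "long", "medium", "long", "long", "short"]
metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```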
[Figure 2: network diagram. Input layer: TXLNA, STK40, SLC30A7, YTHDF2, SNRNP40, TAF12, WDR77. 1st layer: 24 neurons, dropout 50%, Tanh activation. 2nd layer: 12 neurons, dropout 25%, ReLU activation. Output layer: short, medium, long. Weight distributions assigned based on Bayes' theorem.]
Fig. 2. Proposed Bayesian neural network architecture.
D. Prognostic Risk Model Construction
Further, in order to estimate risk, we calculate a risk score from the survival-related genes of glioma patients. We use univariate Cox proportional hazards regression analysis to evaluate the association between the 7 genes obtained from the iterative Bayesian Model Averaging algorithm and the survival status and time. Genes with p values < 0.01 and a hazard ratio (HR) > 1 are considered candidate genes for survival estimation. The obtained $\beta_i$ values, i.e. the regression coefficients from the Cox analysis, are used together with the expression values of the corresponding genes ($\mathit{exp}_i$) to calculate the risk score. The univariate Cox analysis is performed with the R package survival (version 3.2-3) [39]. Equation (5) gives the formula used to obtain the prognostic risk score.
$$\text{Risk score} = \sum_{i=1}^{n} \beta_i \, \mathit{exp}_i \tag{5}$$
where $n$ is the number of genes included in the prognostic signature and $\mathit{exp}_i$ is the expression value of gene $i$. After obtaining the risk score for the CGGA cohort, the median value of the prognostic risk score is used to separate the patients into high and low risk groups. The performance of the proposed signature is then validated on the TCGA cohort.
IV. SYSTEM EVALUATION
A. Overall survival prediction
The Accuracy, Precision, Recall and F1 scores were calculated to compare the techniques utilized to predict the survival class. Each algorithm was trained on the CGGA training split of each fold and validated on the corresponding validation split. Table III shows the metrics averaged over the 4 CGGA validation splits. On all 4 metrics, the proposed BNN algorithm outperforms the other frequently utilized ML techniques, with 74.50% average accuracy. For comparison, the highest accuracy reported in the literature is 68% for radiomics [40], with a cohort of more than 100 patients used for training. According to our results, the second-best performing algorithm was the Random Forest Classifier, with over 60% accuracy.
TABLE III
COMPARISON OF OVERALL SURVIVAL PREDICTION WITH MACHINE LEARNING - 4 FOLD CROSS VALIDATION ON CGGA COHORT
ML methods      Accuracy   Precision   Recall   F1 score
DT [31]         55.25%     57.50%      55.25%   56.00%
RFC [32]        62.25%     61.25%      62.25%   61.00%
XGBoost [33]    56.25%     62.75%      56.25%   58.25%
CatBoost [34]   57.25%     62.75%      57.25%   59.00%
SVC [35]        52.75%     67.25%      52.75%   56.50%
BNN             74.50%     67.50%      73.00%   70.25%
Further, we evaluated the models trained in each fold on the TCGA cohort and averaged the resulting test metrics over the folds. In the testing phase, the BNN achieved the highest accuracy, precision and recall, while its F1 score (51.00%) was comparable to those of RFC and XGBoost. The support vector machine and decision tree, which are widely utilized in survival prediction, showed comparatively low performance with gene expression data. The boosting algorithms showed average performance, and the ensemble algorithm RFC showed the second-best accuracy among the other algorithms. The performance comparison on the TCGA cohort is given in Table IV.
B. Prognostic risk model
We built a prognostic model based on the univariate Cox regression coefficients and the expression of the 7 most prominent genes associated with survival: TXLNA, STK40, SLC30A7, YTHDF2, SNRNP40, TAF12 and WDR77.
TABLE IV
COMPARISON OF OVERALL SURVIVAL PREDICTION WITH MACHINE LEARNING - TESTING ON TCGA COHORT
ML methods   Accuracy   Precision   Recall   F1 score
DT           44.50%     46.00%      44.50%   45.00%
RFC          54.25%     51.25%      54.25%   51.00%
XGBoost      51.00%     52.25%      51.00%   51.75%
CatBoost     50.25%     52.75%      50.25%   50.75%
SVC          47.25%     56.00%      47.25%   50.00%
BNN          59.75%     57.25%      55.25%   51.00%
The univariate Cox regression analysis results are shown in Table V.
TABLE V
UNIVARIATE COX REGRESSION ANALYSIS ON THE CHOSEN 7 GENES
Gene       Coef in coxPH   Hazard Ratio (95% CI)   p value
TXLNA      0.87            2.4 (2.1-2.7)           2.2e-36
STK40      0.77            2.2 (1.9-2.4)           4.8e-33
SLC30A7    0.8             2.2 (2-2.5)             2.6e-33
YTHDF2     0.74            2.1 (1.8-2.4)           1.5e-29
SNRNP40    0.73            2.1 (1.8-2.4)           1.2e-31
TAF12      0.73            2.1 (1.8-2.3)           1.5e-34
WDR77      0.77            2.2 (1.9-2.5)           5.1e-32
This analysis showed that the genes chosen with the iterative BMA algorithm are strongly associated with survival, with an HR over 2 for all 7 genes. All the Cox coefficients were statistically significant (p values < 0.01 for all 7 genes) and positive. Accordingly, we can conclude that high expression of these genes is associated with poor survival of glioma patients. The resulting risk score formula is shown in Equation (6).
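$$\begin{aligned}\text{Risk score} = {} & 0.87\,\mathit{exp}_{TXLNA} + 0.77\,\mathit{exp}_{STK40} + 0.8\,\mathit{exp}_{SLC30A7} + 0.74\,\mathit{exp}_{YTHDF2} \\ & + 0.73\,\mathit{exp}_{SNRNP40} + 0.73\,\mathit{exp}_{TAF12} + 0.77\,\mathit{exp}_{WDR77}\end{aligned} \tag{6}$$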
Based on the above prognostic risk model, the risk score was calculated for all the cases in the CGGA and TCGA cohorts. The threshold separating high risk from low risk was set to the median of the risk scores of the CGGA cohort.
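The risk score and the median split can be sketched in a few lines. The coefficients are those of Table V; the patient profiles, the function names and the choice to assign a score equal to the threshold to the low risk group are hypothetical.

```python
from statistics import median

# Cox regression coefficients from Table V.
COEFFICIENTS = {
    "TXLNA": 0.87, "STK40": 0.77, "SLC30A7": 0.80, "YTHDF2": 0.74,
    "SNRNP40": 0.73, "TAF12": 0.73, "WDR77": 0.77,
}


def risk_score(expression):
    """Coefficient-weighted sum of the 7 gene expression values."""
    return sum(COEFFICIENTS[gene] * expression[gene] for gene in COEFFICIENTS)


def assign_risk_groups(scores, threshold):
    """Label scores above the threshold as high risk, the rest as low risk."""
    return ["high" if s > threshold else "low" for s in scores]


# Hypothetical normalized expression profiles for three patients.
patients = [
    {gene: 1.0 for gene in COEFFICIENTS},
    {gene: 0.0 for gene in COEFFICIENTS},
    {gene: -1.0 for gene in COEFFICIENTS},
]
scores = [risk_score(p) for p in patients]
threshold = median(scores)  # in the paper: median over the CGGA cohort
groups = assign_risk_groups(scores, threshold)
print(scores, groups)
```

For validation on another cohort, the threshold computed on the development cohort is reused rather than recomputed, as described for TCGA.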
As shown in Fig. 3(a), the high risk patients have high expression of all 7 genes and their survival was mostly short or medium. Conversely, when the 7 genes have low expression, the majority of the patients have long survival and thus low risk.
This prognostic risk model was validated on the TCGA cohort by computing the risk score with Equation (6). The same threshold was used to divide the patients into high and low risk groups. The same relationship can be seen in Fig. 3(b) for the TCGA cohort: most of the patients with high expression of the 7 genes also had low survival and high risk.
[Figure 3: two heat maps of the expression of TXLNA, STK40, SLC30A7, YTHDF2, SNRNP40, TAF12 and WDR77, annotated with risk group (high risk, low risk) and survival class (short, medium, long); each panel has a colour key showing count against value.]
Fig. 3. Heat map for the (a) CGGA cohort (b) validation TCGA cohort with proposed gene signature.
This is further clarified by the distribution of overall survival in days in each risk group, shown in Fig. 4. Patients with high risk had significantly lower survival, and patients with low risk had a large span of overall survival in days.
[Figure 4: distribution plot. x-axis: TCGA and CGGA cohorts for the high risk and low risk groups; y-axis: overall survival in days (0 to 4000).]
Fig. 4. Overall survival distribution of high and low risk groups.
For the CGGA cohort, the high risk and low risk groups have an overall survival of 572.2658 ± 662.5993 days and 2244.361 ± 1385.437 days, respectively. For testing on TCGA, the overall survival of the high risk and low risk groups was 468.1624 ± 412.3076 days and 1124.43 ± 1131.1 days, respectively. Thus, the high risk patient group has lower survival than the low risk patient group.
Kaplan-Meier plots were obtained for both the TCGA and CGGA cohorts, verifying the lower survival percentage over time for high risk patients. Fig. 5 verifies the performance of the prognostic risk model by demonstrating the above observations. The high risk group of the TCGA cohort showed a lower percentage of survival over the follow-up period, as shown in Fig. 5(b), which matches the behaviour of the prognostic risk model on the CGGA cohort, shown in Fig. 5(a).
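The Kaplan-Meier estimate behind such curves can be sketched as below. This is a plain product-limit implementation on invented data, not the tool used to produce Fig. 5; the function name and the follow-up values are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times: follow-up time per patient; events: 1 = death observed, 0 = censored.
    """
    survival = 1.0
    curve = []
    for t in sorted({ti for ti, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        # Product-limit step: multiply by the fraction surviving time t.
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up data in days.
times = [100, 200, 200, 400, 500]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
print(curve)
```

Computing one such curve per risk group and plotting survival probability (as a percentage) against days gives plots of the kind shown in Fig. 5; censored patients leave the risk set without lowering the curve.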
[Figure 5: two Kaplan-Meier plots, (a) and (b). x-axis: overall survival in days; y-axis: percent survival (0 to 100); curves: high risk and low risk.]
Fig. 5. Kaplan-Meier curves obtained for the high risk and low risk groups. (a) CGGA (b) TCGA dataset.
Fig. 6 shows the expression distribution of the most prominent genes chosen for prognostic risk model development, in the high risk and low risk groups. The mean expression value of each gene in the low risk group is lower than the mean expression value of the corresponding gene in the high risk group.
[Figure 6: distribution plot. x-axis: gene (TXLNA, STK40, SLC30A7, YTHDF2, SNRNP40, TAF12, WDR77); y-axis: Log2(FPKM+1), from -4 to 4; groups: high risk and low risk.]
Fig. 6. mRNA expression value distribution of each selected gene for the high and low risk groups.
V. DISCUSSION
Many studies reveal that genetic alterations and molecular heterogeneity play a vital role in gliomagenesis and prognosis. Therefore, many prognosis assessment tools have recently been developed with the different types of omics data related to gliomas. In the current study, we identified 7 molecular biomarkers with potential for survival prediction, which reflect the underlying heterogeneity. State-of-the-art survival prediction in gliomas relies mainly on radiomic features extracted from the tumor region [40], [41]. In the Brain Tumor Segmentation challenge 2018, the highest accuracy obtained with imaging features was 68% [40]. Here, we proposed a machine learning based approach for survival prediction with gene expression data, with a cross-validation accuracy over 70% on the CGGA cohort. Moreover, we established a prognostic model to estimate the risk of glioma patients. High-throughput genomics reflects the underlying molecular biology of gliomas and thus has shown improvements over radiomics for survival prediction.
Of the 7 genes considered, namely TXLNA, STK40, SLC30A7, YTHDF2, SNRNP40, TAF12 and WDR77, TAF12 has been associated with gliomas in previous studies. Most grade II and III gliomas have mutated IDH1/IDH2. According to Ren et al. [42], mutated IDH1 down-regulates TAF12 expression. Our observation of low TAF12 expression in the low risk category with long survival, as shown in Fig. 3, is consistent with these findings. Our findings further indicate that there are 6 other genes, not previously identified as associated with survival, that can be used for risk and survival estimation.
However, most previous studies rely on a single cohort to avoid the inconsistencies that occur between different datasets. These inconsistencies often arise from technical limitations, as the datasets have been acquired on different high-throughput platforms. They can also be affected by cross-hybridization, redundancy and annotation of the probes [43]. Therefore, the significantly lower survival prediction performance on TCGA compared to CGGA is mostly due to these reasons. Nevertheless, the prognostic model, as shown in Fig. 3(b), reveals that although it was developed on the CGGA cohort, it can still separate patients in the TCGA cohort into risk groups, making it a relatively promising prognostic model.
For the prognostic risk score model, unlike previous studies, we consider the complete cohort of gliomas without separating them into grades. Hence, the prognostic risk model can be deployed for risk estimation without a prior diagnosis of the grade. Further, this prognostic risk score model requires only 7 genes, making it less complex for clinicians.
According to previous studies, probabilistic programming languages are capable of learning from a small number of samples [44]. Thus, in our application, which has a shortage of training samples, this algorithm showed improvements over traditional machine learning algorithms. Further, Deep Probabilistic Programming Languages (DPPL), which combine deep learning models with probabilistic programming, have shown strong potential to achieve promising outcomes in deep learning computations [36]. In the future, the overall survival prediction task will be extended with an explainable method in order to identify the contribution of the expression of each gene with high accuracy.
VI. CONCLUSION
This study has presented a solution for survival prediction of glioma patients based on genomics with a comparatively large dataset. We identified a 7-gene signature associated with survival and proposed two approaches for prognosis prediction that can separate glioma patients into groups based on their survival and risk. We proposed a novel Bayesian neural network for survival prediction that surpasses state-of-the-art survival prediction approaches. Moreover, we established a comprehensive prognostic risk model based on Cox-PH that estimates the risk of glioma patients, including both GBM and LGG. Both of these approaches show promising predictive ability for the survival and risk of glioma patients, with over 70% accuracy for overall survival class prediction.
VII. ACKNOWLEDGEMENT
We acknowledge the support from the Senate Research
Committee Grant SRC/LT/2019/18, University of Moratuwa,
Sri Lanka.
REFERENCES
[1] E. Aquilanti, J. Miller, S. Santagata, D. P. Cahill, and P. K. Brastianos,
“Updates in prognostic markers for gliomas,” Neuro-Oncology, vol. 20,
no. suppl 7, pp. vii17–vii26, 2018.
[2] M. Weller, W. Wick, K. Aldape, M. Brada, M. Berger, S. M. Pfister,
R. Nishikawa, M. Rosenthal, P. Y. Wen, R. Stupp et al., “Glioma,”
Nature reviews Disease primers, vol. 1, no. 1, pp. 1–18, 2015.
[3] G. N. Fuller and B. W. Scheithauer, “The 2007 revised world health
organization (who) classification of tumours of the central nervous
system: newly codified entities,” Brain pathology, vol. 17, no. 3, pp.
304–307, 2007.
[4] L. A. Gravendeel, M. C. Kouwenhoven, O. Gevaert, J. J. de Rooi, A. P.
Stubbs, J. E. Duijm, A. Daemen, F. E. Bleeker, L. B. Bralten, N. K.
Kloosterhof et al., “Intrinsic gene expression profiles of gliomas are a
better predictor of survival than histology,” Cancer research, vol. 69,
no. 23, pp. 9065–9072, 2009.
[5] K. Ludwig and H. I. Kornblum, “Molecular markers in glioma,” Journal
of neuro-oncology, vol. 134, no. 3, pp. 505–512, 2017.
[6] M. Moghtadaei, M. R. H. Golpayegani, F. Almasganj, A. Etemadi, M. R.
Akbari, and R. Malekzadeh, “Predicting the risk of squamous dysplasia
and esophageal squamous cell carcinoma using minimum classification
error method,” Computers in biology and medicine, vol. 45, pp. 51–57,
2014.
[7] M. S. Bal, V. K. Bodal, J. Kaur, M. Kaur, and S. Sharma, “Patterns
of cancer: A study of 500 punjabi patients,” Asian Pac J Cancer Prev,
vol. 16, no. 12, pp. 5107–10, 2015.
[8] A. Bashiri, M. Ghazisaeedi, R. Safdari, L. Shahmoradi, and H. Ehte-
sham, “Improving the prediction of survival in cancer patients by using
machine learning techniques: experience of gene expression data: a
narrative review,” Iranian journal of public health, vol. 46, no. 2, p.
165, 2017.
[9] N. Wijethilake, M. Islam, and H. Ren, “Radiogenomics model for overall
survival prediction of glioblastoma,” Medical & Biological Engineering
& Computing, 2020.
[10] N. Wijethilake, M. Islam, D. Meedeniya, C. Chitraranjan, I. Perera, and
H. Ren, “Radiogenomics of glioblastoma: Identification of radiomics as-
sociated with molecular subtypes,” 2nd MICCAI workshop on Radiomics
and Radiogenomics in Neuro-oncology using AI, Springer, LNCS, (to
appear), 2020.
[11] Y.-C. Chen, W.-W. Yang, and H.-W. Chiu, “Artificial neural network
prediction for cancer survival time by gene expression data,” in 2009
3rd International Conference on Bioinformatics and Biomedical Engi-
neering. IEEE, 2009, pp. 1–4.
[12] W. Rasanjana, S. Rajapaksa, I. Perera, and D. Meedeniya, “A svm
model for candidate y-chromosome gene discovery in prostate cancer,”
in Proceedings of 11th International Conference, vol. 60, 2019, pp. 129–
138.
[13] M. Labussiere, A. Di Stefano, V. Gleize, B. Boisselier, M. Giry,
S. Mangesius, A. Bruno, R. Paterra, Y. Marie, A. Rahimian et al.,
“Tert promoter mutations in gliomas, genetic associations and clinico-
pathological correlations,” British journal of cancer, vol. 111, no. 10,
pp. 2024–2032, 2014.
[14] X. Wang, J.-x. Chen, J.-p. Liu, C. You, Y.-h. Liu, and Q. Mao, “Gain
of function of mutant tp53 in glioblastoma: prognosis and response to
temozolomide,” Annals of surgical oncology, vol. 21, no. 4, pp. 1337–
1344, 2014.
[15] R. G. Verhaak, K. A. Hoadley, E. Purdom, V. Wang, Y. Qi, M. D.
Wilkerson, C. R. Miller, L. Ding, T. Golub, J. P. Mesirov et al.,
“Integrated genomic analysis identifies clinically relevant subtypes of
glioblastoma characterized by abnormalities in pdgfra, idh1, egfr, and
nf1,” Cancer cell, vol. 17, no. 1, pp. 98–110, 2010.
[16] M. Vitucci, D. Hayes, and C. Miller, “Gene expression profiling of
gliomas: merging genomic and histopathological classification for
personalised therapy,” British journal of cancer, vol. 104, no. 4,
pp. 545–553, 2011.
[17] L. P. Petalidis, A. Oulas, M. Backlund, M. T. Wayland, L. Liu, K. Plant,
L. Happerfield, T. C. Freeman, P. Poirazi, and V. P. Collins, “Improved
grading and survival prediction of human astrocytic brain tumors by
artificial neural network analysis of gene expression microarray data,”
Molecular cancer therapeutics, vol. 7, no. 5, pp. 1013–1024, 2008.
[18] D. Faraggi and R. Simon, “A neural network model for survival data,”
Statistics in medicine, vol. 14, no. 1, pp. 73–82, 1995.
[19] M. C. O’Neill and L. Song, “Neural network analysis of lymphoma
microarray data: prognosis and diagnosis near-perfect,” BMC bioinfor-
matics, vol. 4, no. 1, p. 13, 2003.
[20] L. J. Lancashire, D. Powe, J. Reis-Filho, E. Rakha, C. Lemetre,
B. Weigelt, T. Abdel-Fatah, A. R. Green, R. Mukta, R. Blamey et al.,
“A validated gene expression profile for detecting clinical outcome in
breast cancer using artificial neural networks,” Breast cancer research
and treatment, vol. 120, no. 1, pp. 83–93, 2010.
[21] W. A. Freije, F. E. Castro-Vargas, Z. Fang, S. Horvath, T. Cloughesy,
L. M. Liau, P. S. Mischel, and S. F. Nelson, “Gene expression profiling
of gliomas strongly predicts survival,” Cancer research, vol. 64, no. 18,
pp. 6503–6510, 2004.
[22] V. Bonato, V. Baladandayuthapani, B. M. Broom, E. P. Sulman, K. D.
Aldape, and K.-A. Do, “Bayesian ensemble methods for survival pre-
diction in gene expression data,” Bioinformatics, vol. 27, no. 3, pp. 359–
367, 2011.
[23] J. B.-K. Hsu, T.-H. Chang, G. A. Lee, T.-Y. Lee, and C.-Y. Chen,
“Identification of potential biomarkers related to glioma survival by gene
expression profile analysis,” BMC medical genomics, vol. 11, no. 7,
p. 34, 2019.
[24] J. S. Wei, B. T. Greer, F. Westermann, S. M. Steinberg, C.-G. Son, Q.-
R. Chen, C. C. Whiteford, S. Bilke, A. L. Krasnoselsky, N. Cenacchi
et al., “Prediction of clinical outcome using gene expression profiling
and artificial neural networks for patients with neuroblastoma,” Cancer
research, vol. 64, no. 19, pp. 6883–6891, 2004.
[25] W.-J. Zeng, Y.-L. Yang, Z.-Z. Liu, Z.-P. Wen, Y.-H. Chen, X.-L. Hu,
Q. Cheng, J. Xiao, J. Zhao, and X.-P. Chen, “Integrative analysis of
dna methylation and gene expression identify a three-gene signature for
predicting prognosis in lower-grade gliomas,” Cellular Physiology and
Biochemistry, vol. 47, no. 1, pp. 428–439, 2018.
[26] S. Zuo, X. Zhang, and L. Wang, “A rna sequencing-based six-gene
signature for survival prediction in patients with glioblastoma,” Scientific
reports, vol. 9, no. 1, pp. 1–10, 2019.
[27] L. Wang, Z. Yan, X. He, C. Zhang, H. Yu, and Q. Lu, “A 5-gene
prognostic nomogram predicting survival probability of glioblastoma
patients,” Brain and behavior, vol. 9, no. 4, p. e01258, 2019.
[28] H. Gittleman, A. E. Sloan, and J. S. Barnholtz-Sloan, “An independently
validated survival nomogram for lower-grade glioma,” Neuro-oncology,
vol. 22, no. 5, pp. 665–674, 2020.
[29] F. Abbas-Aghababazadeh, Q. Li, and B. L. Fridley, “Comparison of
normalization approaches for gene expression studies completed with
high-throughput sequencing,” PloS one, vol. 13, no. 10, 2018.
[30] A. Annest, R. E. Bumgarner, A. E. Raftery, and K. Y. Yeung, “Iterative
bayesian model averaging: A method for the application of survival
analysis to high-dimensional microarray data,” BMC bioinformatics,
vol. 10, no. 1, p. 72, 2009.
[31] E. Marubini, A. Morabito, and M. Valsecchi, “Prognostic factors and risk
groups: some results given by using an algorithm suitable for censored
survival data,” Statistics in medicine, vol. 2, no. 2, pp. 295–303, 1983.
[32] L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp.
5–32, 2001.
[33] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,”
in Proceedings of the 22nd acm sigkdd international conference on
knowledge discovery and data mining, 2016, pp. 785–794.
[34] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin,
“Catboost: unbiased boosting with categorical features,” in Advances in
neural information processing systems, 2018, pp. 6638–6648.
[35] A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,”
Statistics and computing, vol. 14, no. 3, pp. 199–222, 2004.
[36] I. Rubasinghe and D. Meedeniya, “Ultrasound nerve segmentation
using deep probabilistic programming,” Journal of ICT Research and
Applications, vol. 13, no. 3, pp. 241–256, 2019.
[37] E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan,
T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman,
“Pyro: Deep universal probabilistic programming,” The Journal of
Machine Learning Research, vol. 20, no. 1, pp. 973–978, 2019.
[38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg,
J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and
E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine
Learning Research, vol. 12, pp. 2825–2830, 2011.
[39] T. M. Therneau, A Package for Survival Analysis in R, 2020, R
package version 3.2-3. [Online]. Available:
https://CRAN.R-project.org/package=survival
[40] Z. A. Shboul, M. Alam, L. Vidyaratne, L. Pei, M. I. Elbakary, and K. M.
Iftekharuddin, “Feature-guided deep radiomics for glioblastoma patient
survival prediction,” Frontiers in Neuroscience, vol. 13, 2019.
[41] M. Islam, V. S. Vibashan, V. J. M. Jose, N. Wijethilake, U. Utkarsh, and
H. Ren, “Brain Tumor Segmentation and Survival Prediction Using 3D
Attention UNet,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and
Traumatic Brain Injuries, A. Crimi and S. Bakas, Eds. Cham: Springer
International Publishing, 2020, pp. 262–272.
[42] J. Ren, M. Lou, J. Shi, Y. Xue, and D. Cui, “Identifying the genes
regulated by idh1 via gene-chip in glioma cell u87,” International
journal of clinical and experimental medicine, vol. 8, no. 10, p. 18090,
2015.
[43] S. Zhao, W.-P. Fung-Leung, A. Bittner, K. Ngo, and X. Liu, “Compar-
ison of rna-seq and microarray in transcriptome profiling of activated t
cells,” PloS one, vol. 9, no. 1, p. e78644, 2014.
[44] L. Fei-Fei, R. Fergus, and P. Perona, “One-shot learning of object cate-
gories,” IEEE transactions on pattern analysis and machine intelligence,
vol. 28, no. 4, pp. 594–611, 2006.
... Ren et al. using gene over-expression and gene-chip techniques, found mutated IDH1 down regulates TAF12 expression in U87 cell lines [7]. Wijethilake et al. reported the high expression of TAF12 with six other genes is associated with poor survival of glioma patients, based on novel probabilistic programming [8]. However, they did not show the expression pattern and potential functions of TAF12 in glioma. ...
... Meta-analysis further confirmed the relationship between TAF12 overexpression and worse prognosis in glioma patients. These results are consistent with previous findings [7,8]. ...
Article
Full-text available
TATA box-binding protein-associated factor 12 (TAF12) has been identified as an oncogene in choroid plexus carcinoma, but its role in glioma is poorly understood because of a lack of previous studies. This study investigated the relationship of TAF12 expression with the clinicopathologic features of glioma cases, as well as its prognostic value and biological function, using large-scale databases and clinical samples. TAF12 mRNA expression and clinicopathologic characteristics of glioma cases were assessed in three public databases, and bioinformatics analyses were conducted to explore the prognostic value and biological functions of TAF12 in glioma. High TAF12 expression was commonly associated with reduced survival time and poor clinical indexes, including higher World Health Organization grade, wild-type isocitrate dehydrogenase 1 expression, and 1p19q non-codeletion status (p < 0.0001). Multivariate Cox regression analysis showed that high TAF12 expression was an independent poor prognostic factor for glioma patients (hazard ratio = 1.42, 95% confidence interval, 1.19–1.68, p < 0.001). Functional enrichment analysis revealed involvement of TAF12 in immune and inflammatory responses in glioma. Also, expression of several immune checkpoint molecules was significantly higher in samples with high TAF12 expression. TAF12 is a potential independent prognostic factor for glioma, and these findings provide a foundation for further investigation of the potential role of TAF12 in immunotherapy.
... During the past decade, many efforts have been made to promote data-driven approaches for precision medicine in cancer research field including brain gliomas [12], including the usage of genetic multi-omics data, radiomics data, protein structure data and a variety of clinical data [13][14][15][16]. In this study, we aimed to address a full view of functional and clinical significance of the pivotal oncogenic gene LMO2 in gliomas primarily with a Weighted Gene Co-Expression Network Analysis (WGCNA) based bioinformatical strategy, and the results did give a well resolution of LMO2 functions in such complicated tumor environments. ...
... Till now, a variety of approaches have been developed for characteristic identification and precision medicine in gliomas, such as using integrated genetic multi-omics and radiomics data [13,14]. In this study, we aimed to apply a WGCNA-based strategy to investigate the functional aspects of certain genes (LMO2 in this study) from transcriptome data in a complicated glioma environment, and the results confirmed that the LMO2 functional associations had been indeed dissected into several aspects exactly and reasonably in brain gliomas, including common endothelium and pattern recognition receptor (PRR) response and LGG-specific cytotoxic T-lymphocyte (CTL) infiltration. ...
Article
Full-text available
Brain glioma is one of the cancer types with worst prognosis, and LMO2 has been reported to play oncogenic functions in brain gliomas. Herein, analysis of datasets from The Cancer Genome Atlas (TCGA) indicated that higher LMO2 level in patient samples indicated worse prognosis in lower grade gliomas (LGG) but not glioblastoma multiforme (GBM). Further, in tumor tissues consisting of a variety of cell types, LMO2 level indicated intratumoral endothelium and pattern recognition receptor (PRR) response in both LGGs and GBMs, and additionally indicated cytotoxic T-lymphocyte, M2 macrophage infiltration and fibroblast specifically in LGGs. Moreover, only in LGGs these aspects were significantly associated with patient survival, in either risky or protective manner, and these dissected associations can give a better prediction on patient prognosis than LMO2 alone. This study not only provided more detailed understandings of LMO2 functional representatives in brain gliomas but also demonstrated that dealing with certain gene (LMO2 in this study) in transcriptome data with the Weighted Gene Co-Expression Network Analysis (WGCNA) method was a robust strategy for dissecting exact and reasonable gene functions/associations in a complicated tumor environment.
... Our results show some interesting differences when compared to the existing literature. The 70% accuracy of the combined model we used (RFE + LRLasso) is 17.25% higher than the 52.75% accuracy of its SVC compared to the use of the conventional SVM model in the study of Navodini et al. (36). This result may provide richer feature information with our use of image data from multiple MRI sequences. ...
Article
Full-text available
Objective Telomerase reverse transcriptase (TERT) promoter mutation status in gliomas is a key determinant of treatment strategy and prognosis. This study aimed to analyze the radiogenomic features and construct radiogenomic models utilizing medical imaging techniques to predict the TERT promoter mutation status in gliomas. Methods This was a retrospective study of 304 patients with gliomas. T1-weighted contrast-enhanced, apparent diffusion coefficient, and diffusion-weighted imaging MRI sequences were used for radiomic feature extraction. A total of 3,948 features were extracted from MRI images using the FAE software. These included 14 shape features, 18 histogram features, 24 gray level run length matrix, 14 gray level dependence matrix, 16 gray level run length matrix, 16 gray level size zone matrix (GLSZM), 5 neighboring gray tone difference matrix, and 744 wavelet transforms. The dataset was randomly divided into training and testing sets in a ratio of 7:3. Three feature selection methods and six classification algorithms were used to model the selected features. Predictive performance was evaluated using receiver operating characteristic curve analysis. Results Among the evaluated classification algorithms, the combination model of recursive feature elimination (RFE) with linear regression (LR) using six features showed the best diagnostic performance (area under the curve: 0.733, 0.562, and 0.633 in the training, validation, and testing sets, respectively). The next best-performing models were naive Bayes, linear discriminant analysis, autoencoder, and support vector machine. Regarding the three feature selection algorithms, RFE showed the most consistent performance, followed by relief and ANOVA. T1-enhanced entropy and GLSZM derived from T1-enhanced images were identified as the most critical radiomics features for distinguishing TERT promoter mutation status. 
Conclusion The LR and LRLasso models, mainly based on T1-enhanced entropy and GLSZM, showed good predictive ability for TERT promoter mutations in gliomas using radiomics models.
... Consequently, they are less intelligible to humans. Regardless of human explainability, most existing methods aim to increase accuracy [21][22][23][24][25][26][27]. In addition, the model should be understood by medical professionals. ...
Article
Full-text available
Brain tumors (BT) present a considerable global health concern because of their high mortality rates across diverse age groups. A delay in diagnosing BT can lead to death. Therefore, a timely and accurate diagnosis through magnetic resonance imaging (MRI) is crucial. A radiologist makes the final decision to identify the tumor through MRI. However, manual assessments are flawed, time-consuming, and rely on experienced radiologists or neurologists to identify and diagnose a BT. Computer-aided classification models often lack performance and explainability for clinical translation, particularly in neuroscience research, resulting in physicians perceiving the model results as inadequate due to the black box model. Explainable deep learning (XDL) can advance neuroscientific research and healthcare tasks. To enhance the explainability of deep learning (DL) and provide diagnostic support, we propose a new classification and localization model, combining existing methods to enhance the explainability of DL and provide diagnostic support. We adopt a pre-trained visual geometry group (pre-trained-VGG-19), scratch-VGG-19, and EfficientNet model that runs a modified form of the class activation mapping (CAM), gradient-weighted class activation mapping (Grad-CAM) and Grad-CAM++ algorithms. These algorithms, introduced into a convolutional neural network (CNN), uncover a crucial part of the classification and can provide an explanatory interface for diagnosing BT. The experimental results demonstrate that the pre-trained-VGG-19 with Grad-CAM provides better classification and visualization results than the scratch-VGG-19, EfficientNet, and cutting-edge DL techniques regarding visual and quantitative evaluations with increased accuracy. The proposed approach may contribute to reducing the diagnostic uncertainty and validating BT classification.
... full genome sequencing or proteomics), and/or necessitated multi-modal combination with medical imaging. [53][54][55][56][57][58][59][60][61][62][63][64][65][66][67][68][69][70] While these areas are undoubtedly interesting and add value to the field, our focus was to provide a means of forecasting survival with genetic data acquired in routine clinical care across the range of diagnoses available to us. Therefore, it was deemed appropriate to derive comparator models that would be tested against the graph-representations criterion on the original genetic data, and the WHO CNS5 diagnosis. ...
Article
Full-text available
Tumour heterogeneity is increasingly recognized as a major obstacle to therapeutic success across neuro-oncology. Gliomas are characterized by distinct combinations of genetic and epigenetic alterations, resulting in complex interactions across multiple molecular pathways. Predicting disease evolution and prescribing individually optimal treatment requires statistical models complex enough to capture the intricate (epi)genetic structure underpinning oncogenesis. Here, we formalize this task as the inference of distinct patterns of connectivity within hierarchical latent representations of genetic networks. Evaluating multi-institutional clinical, genetic and outcome data from 4023 glioma patients over 14 years, across 12 countries, we employ Bayesian generative stochastic block modelling to reveal a hierarchical network structure of tumour genetics spanning molecularly confirmed glioblastoma, IDH-wildtype; oligodendroglioma, IDH-mutant and 1p/19q codeleted; and astrocytoma, IDH-mutant. Our findings illuminate the complex dependence between features across the genetic landscape of brain tumours and show that generative network models reveal distinct signatures of survival with better prognostic fidelity than current gold standard diagnostic categories.
... Our approach involves using age-adjusted representations of nuclear morphometric features or their organization in WSIs and utilizing linear associations to improve interpretation while reducing the number of parameters. We suggest that incorporating the Cox Hazard Model in the loss function (as done in [22]) increases the likelihood of finding associations by noise or chance. Therefore, we advocate for using linear associations instead, as they offer simplicity and using a single computed index at a time improves interpretability and robustness. ...
Article
Full-text available
Simple Summary Identifying biomarkers of survival from a large-scale cohort of Glioblastoma Multiforme (GBM) pathology images is hindered by heterogeneity of tumor signature compounded by age being the single most important confounder in predicting survival in GBM. The main contributions of this manuscript are to define (i) metrics for identifying tumor subtypes of tumor heterogeneity and (ii) relevant statistics for incorporating age for evaluating competing hypotheses. As a result, the GBM cohort are stratified based on interpretable morphometric features with or without preconditioning on published genomic subtypes. Abstract Tumor Whole Slide Images (WSI) are often heterogeneous, which hinders the discovery of biomarkers in the presence of confounding clinical factors. In this study, we present a pipeline for identifying biomarkers from the Glioblastoma Multiforme (GBM) cohort of WSIs from TCGA archive. The GBM cohort endures many technical artifacts while the discovery of GBM biomarkers is challenged because “age” is the single most confounding factor for predicting outcomes. The proposed approach relies on interpretable features (e.g., nuclear morphometric indices), effective similarity metrics for heterogeneity analysis, and robust statistics for identifying biomarkers. The pipeline first removes artifacts (e.g., pen marks) and partitions each WSI into patches for nuclear segmentation via an extended U-Net for subsequent quantitative representation. Given the variations in fixation and staining that can artificially modulate hematoxylin optical density (HOD), we extended Navab’s Lab method to normalize images and reduce the impact of batch effects. The heterogeneity of each WSI is then represented either as probability density functions (PDF) per patient or as the composition of a dictionary predicted from the entire cohort of WSIs. 
For PDF- or dictionary-based methods, morphometric subtypes are constructed based on distances computed from optimal transport and linkage analysis or consensus clustering with Euclidean distances, respectively. For each inferred subtype, Kaplan–Meier and/or the Cox regression model are used to regress the survival time. Since age is the single most important confounder for predicting survival in GBM and there is an observed violation of the proportionality assumption in the Cox model, we use both age and age-squared coupled with the Likelihood ratio test and forest plots for evaluating competing statistics. Next, the PDF- and dictionary-based methods are combined to identify biomarkers that are predictive of survival. The combined model has the advantage of integrating global (e.g., cohort scale) and local (e.g., patient scale) attributes of morphometric heterogeneity, coupled with robust statistics, to reveal stable biomarkers. The results indicate that, after normalization of the GBM cohort, mean HOD, eccentricity, and cellularity are predictive of survival. Finally, we also stratified the GBM cohort as a function of EGFR expression and published genomic subtypes to reveal genomic-dependent morphometric biomarkers.
... Thus, they are less understandable to humans. In addition, most of the existing systems focus on improving the prediction accuracy [2][3][4][5] domain, other than the accuracy, the model should be interpretable for the domain experts. This study presents the Tumour-Analyser web application using interpretable deep learning [6], to address the issues in model understandability. ...
Article
Full-text available
Tumour-Analyser is a web application that classifies a brain tumour into three classes, namely, lower-grade astrocytoma (A), oligodendroglioma (O), glioblastoma & diffuse astrocytic glioma (G). We use a magnetic resonance imaging (MRI) sequence and a whole slide imaging (WSI) that are classified using DenseNet and ResNet, respectively. The tool interprets the decision-making process of each classification model. Tumour-Analyser provides a viable solution to the less human understandability of existing models due to the inherent black-box nature of deep learning models and less transparency, by applying interpretability.
Article
Full-text available
Effective diagnosis and treatment in cancer is a barrier for the development of personalized medicine, mostly due to tumor heterogeneity. In the particular case of gliomas, highly heterogeneous brain tumors at the histological, cellular and molecular levels, and exhibiting poor prognosis, the mechanisms behind tumor heterogeneity and progression remain poorly understood. The recent advances in biomedical high-throughput technologies have allowed the generation of large amounts of molecular information from the patients that combined with statistical and machine learning techniques can be used for the definition of glioma subtypes and targeted therapies, an invaluable contribution to disease understanding and effective management.In this work sparse and robust sparse logistic regression models with the elastic net penalty were applied to glioma RNA-seq data from The Cancer Genome Atlas (TCGA), to identify relevant transcriptomic features in the separation between lower-grade glioma (LGG) subtypes and identify putative outlying observations. In general, all classification models yielded good accuracies, selecting different sets of genes. Among the genes selected by the models, TXNDC12, TOMM20, PKIA, CARD8 and TAF12 have been reported as genes with relevant role in glioma development and progression. This highlights the suitability of the present approach to disclose relevant genes and fosters the biological validation of non-reported genes.
Article
Full-text available
Gliomas are tumors of the central nervous system, which usually start within the glial cells of the brain or the spinal cord. These are extremely migratory and diffusive tumors, which quickly expand to the surrounding regions in the brain. There are different grades of gliomas, hinting about their growth patterns and aggressiveness and potential response to the treatment. As part of routine clinical procedure for gliomas, both radiology images (rad), such as multiparametric MR images, and digital pathology images (path) from tissue samples are acquired. Each of these data streams are used separately for prediction of the survival outcome of gliomas, however, these images provide complimentary information, which can be used in an integrated way for better prediction. There is a need to develop an image-based method that can utilise the information extracted from these imaging sequences in a synergistic way to predict patients’ outcome and to potentially assist in building comprehensive and patient-centric treatment plans. The objective of this study is to improve survival prediction outcomes of gliomas by integrating radiology and pathology imaging. Multiparametric magnetic resonance imaging (MRI), rad images, and path images of glioma patients were acquired from The Cancer Imaging Archive. Quantitative imaging features were extracted from tumor regions in rad and path images. The features were given as input to an ensemble regression machine learning pipeline, including support vector regression, AdaBoost, gradient boost, and random forest. The performance of the model was evaluated in several configurations, including leave-one-out, five-fold cross-validation, and split-train-test. Moreover, the quantitative performance evaluations were conducted separately in the complete cohort (n = 171), high-grade gliomas (HGGs), n = 75, and low-grade gliomas (LGGs), n = 96. 
The combined rad and path features outperformed individual feature types in all the configurations and datasets. In leave-one-out configuration, the model comprising both rad and path features was successfully validated on the complete dataset comprising HGFs and LGGs (R=0.84 p=2.2×10−16). The Kaplan–Meier curves generated on the predictions of the proposed model yielded a hazard ratio of 3.314 [95%CI:1.718−6.394], log−rank(P)=2×10−4 on combined rad and path features. Conclusion: The proposed approach emphasizes radiology experts and pathology experts’ clinical workflows by creating prognosticators upon ‘rad’ radiology images and digital pathology ‘path’ images independently, as well as combining the power of both, also through delivering integrated analysis, that can contribute to a collaborative attempt between different departments for administration of patients with gliomas.
Conference Paper
Full-text available
Glioblastoma is the most malignant type of central nervous system tumor with GBM subtypes cleaved based on molecular level gene alterations. These alterations are also happened to affect the histology. Thus, it can cause visible changes in images, such as enhancement and edema development. In this study, we extract intensity, volume, and texture features from the tumor subregions to identify the correlations with gene expression features and overall survival. Consequently, we utilize the radiomics to find associations with the subtypes of glioblastoma. Accordingly , the fractal dimensions of the whole tumor, tumor core, and necrosis regions show a significant difference between the Proneural, Classical and Mesenchymal subtypes. Additionally, the subtypes of GBM are predicted with an average accuracy of 79% utilizing radiomics and accuracy over 90% utilizing gene expression profiles.
Article
Glioblastoma multiforme (GBM) is a very aggressive and infiltrative brain tumor with a high mortality rate. There are radiomic models with handcrafted features to estimate glioblastoma prognosis. In this work, we evaluate to what extent combining genomic with radiomic features affects the prognosis of overall survival (OS) in patients with GBM. We apply a hypercolumn-based convolutional network to segment tumor regions from magnetic resonance images (MRI), extract radiomic features (geometric, shape, histogram), and fuse them with gene expression profiling data to predict the survival rate for each patient. Several state-of-the-art regression models, such as linear regression, support vector machine, and neural network, are used to conduct the prognosis analysis. The Cancer Genome Atlas (TCGA) dataset of MRI and gene expression profiling is used in the study to observe model performance with radiomic, genomic, and radiogenomic features. The results demonstrate that genomic data are correlated with the GBM OS prediction, and that the radiogenomic model outperforms both the radiomic and genomic models. We further illustrate the most significant genes, such as IL1B, KLHL4, ATP1A2, IQGAP2, and TMSL8, which contribute highly to the prognosis analysis. Overview of the proposed fully automated "radiogenomic" approach for survival prediction: it fuses geometric, intensity, volumetric, genomic, and clinical information to predict OS.
Chapter
In this work, we develop an attention convolutional neural network (CNN) to segment brain tumors from Magnetic Resonance Images (MRI). Further, we predict the survival rate using various machine learning methods. We adopt a 3D UNet architecture and integrate channel and spatial attention with the decoder network to perform segmentation. For survival prediction, we extract some novel radiomic features based on the geometry, location, and shape of the segmented tumor and combine them with clinical information to estimate the survival duration for each patient. We also perform extensive experiments to show the effect of each feature on overall survival (OS) prediction. The experimental results indicate that radiomic features such as the histogram, location, and shape of the necrosis region, and clinical features such as age, are the most critical parameters for estimating OS.
Article
Deep probabilistic programming brings the strengths of deep learning to probabilistic modeling for efficient and flexible computation in practice. As this is an evolving field, only a few expressive programming languages exist for uncertainty management. This paper discusses an application to biomedical image analysis, namely ultrasound nerve segmentation. Our method uses the probabilistic programming language Edward with the U-Net model and generative adversarial networks under different optimizers. With the U-Net model, the Adam optimizer gave the lowest Dice loss (-0.54) and the highest accuracy (0.99), with the least time consumption compared to other optimizers. The smallest generative network loss obtained with the generative adversarial network model was 0.69, with the Adam optimizer. The Dice loss, accuracy, time consumption, and output image quality in the results show the applicability of deep probabilistic programming in the long run. Thus, we further propose a neuroscience decision support system based on the proposed approach.
Article
Glioblastoma is recognized as World Health Organization (WHO) grade IV glioma with an aggressive growth pattern. The current clinical practice in diagnosis and prognosis of Glioblastoma using MRI involves multiple steps including manual tumor sizing. Accurate identification and segmentation of multiple abnormal tissues within tumor volume in MRI is essential for precise survival prediction. Manual tumor and abnormal tissue detection and sizing are tedious, and subject to inter-observer variability. Consequently, this work proposes a fully automated MRI-based glioblastoma and abnormal tissue segmentation, and survival prediction framework. The framework includes radiomics feature-guided deep neural network methods for tumor tissue segmentation; followed by survival regression and classification using these abnormal tumor tissue segments and other relevant clinical features. The proposed multiple abnormal tumor tissue segmentation step effectively fuses feature-based and feature-guided deep radiomics information in structural MRI. The survival prediction step includes two representative survival prediction pipelines that combine different feature selection and regression approaches. The framework is evaluated using two recent widely used benchmark datasets from Brain Tumor Segmentation (BraTS) global challenges in 2017 and 2018. The best overall survival pipeline in the proposed framework achieves leave-one-out cross-validation (LOOCV) accuracy of 0.73 for training datasets and 0.68 for validation datasets, respectively. These training and validation accuracies for tumor patient survival prediction are among the highest reported in literature. Finally, a critical analysis of radiomics features and efficacy of these features in segmentation and survival prediction performance is presented as lessons learned.
Conference Paper
Prostate cancer is widely known to be one of the most common cancers among men around the world. Due to its high heterogeneity, many of the studies carried out to identify the molecular-level causes of the cancer have been only partially successful. Among the techniques used in cancer studies, gene expression profiling is one of the most widely used. Gene expression profiles reveal information about the functionality of genes in different body tissues under different conditions. To identify cancer-decisive genes, differential gene expression analysis is carried out using statistical and machine learning methodologies. It helps to extract information about genes that have significant expression differences between healthy and cancerous tissues. In this paper, we discuss a comprehensive supervised classification approach using Support Vector Machine (SVM) models to investigate differentially expressed Y-chromosome genes in prostate cancer. Eight SVM models, tuned to 98.3% average accuracy, were used for the analysis. Genes such as CD99 (MIC2), ASMTL, DDX3Y, and TXLNGY emerged as the best candidates. Some of our results support existing findings, while others introduce novel genes as possible prostate cancer candidates.
Article
Background: Recent studies have proposed several gene signatures as biomarkers for different grades of gliomas from various perspectives. However, most of these genes can only be used appropriately for patients with specific grades of gliomas. Methods: In this study, we aimed to identify survival-relevant genes shared between glioblastoma multiforme (GBM) and lower-grade glioma (LGG), which could be used as potential biomarkers to classify patients into different risk groups. A Cox proportional hazards regression model (Cox model) was used to extract relevant genes, and the effectiveness of the genes was estimated against random forest regression. Finally, risk models were constructed with logistic regression. Results: We identified 104 key genes shared between GBM and LGG that could be significantly correlated with patients' survival, based on next-generation sequencing data obtained from The Cancer Genome Atlas for gene expression analysis. The effectiveness of these genes in the survival prediction of GBM and LGG was evaluated, and the average area under the receiver operating characteristic (ROC) curve ranged from 0.7 to 0.8. Gene set enrichment analysis revealed that these genes were involved in eight significant pathways and 23 molecular functions. Moreover, the expression of ten of these genes (CTSZ, EFEMP2, ITGA5, KDELR2, MDK, MICALL2, MAP2K3, PLAUR, SERPINE1, and SOCS3) was significantly higher in GBM than in LGG, and comparing their expression levels to those of the proposed control genes (TBP, IPO8, and SDHA) could potentially classify patients into high- and low-risk groups, which differ significantly in overall survival. Signatures of candidate genes were validated on multiple microarray datasets from Gene Expression Omnibus to increase the robustness of using these potential prognostic factors.
In both the GBM and LGG cohorts, most of the patients in the high-risk group had the IDH1 wild-type gene, and those in the low-risk group had IDH1 mutations. Moreover, most of the high-risk patients with LGG possessed a 1p/19q-noncodeletion. Conclusion: In this study, we identified survival-relevant genes shared between GBM and LGG, which made it possible to classify patients into high- and low-risk groups based on expression level analysis. Both risk groups could be correlated with well-known genetic variants, suggesting their potential prognostic value in clinical application.
Article
Background: Glioblastoma (GBM) remains the most biologically aggressive subtype of gliomas, with an average survival of 10 to 12 months. Considering that the overall survival (OS) of each GBM patient is a key factor in individual treatment, it is meaningful to predict the survival probability for newly diagnosed GBM patients in clinical practice. Material and Methods: Using the TCGA dataset and two independent GEO datasets, we identified genes that are associated with OS and differentially expressed between GBM tissues and the adjacent normal tissues. A robust likelihood-based survival modeling approach was applied to select the best genes for modeling. After the prognostic nomogram was generated, an independent dataset from a different platform was used to evaluate its effectiveness. Results: We identified 168 differentially expressed genes associated with OS. Five of these genes were selected to generate a gene prognostic nomogram. The external validation demonstrated that the 5-gene prognostic nomogram is capable of predicting the OS of GBM patients. Conclusion: We developed a novel and convenient prognostic tool based on five genes that exhibited clinical value in predicting the survival probability for newly diagnosed GBM patients, and all five genes could represent potential target genes for the treatment of GBM. The development of this model will provide a good reference for cancer researchers.
Article
Glioblastoma (GBM) is an aggressive tumor of the central nervous system that has a poor prognosis despite extensive therapy. Therefore, it is essential to identify a gene expression-based signature for predicting GBM prognosis. RNA sequencing data of GBM patients from the Chinese Glioma Genome Atlas (CGGA) and The Cancer Genome Atlas (TCGA) databases were employed in our study. Univariate and multivariate regression models were used to assess the relative contribution of each gene to survival prediction in both cohorts, and the genes common to the two cohorts were taken as the final prognostic model. A prognostic risk score was calculated from this gene signature and used to stratify the patients into low- and high-risk groups. Multivariate regression and stratification analyses were implemented to determine whether the gene signature was an independent prognostic factor. We identified a 6-gene signature through the univariate and multivariate regression models. This prognostic signature stratified the patients into low- and high-risk groups, implying improved and poor outcomes, respectively. Multivariate regression and stratification analyses demonstrated that the predictive value of the 6-gene signature was independent of other clinical factors. This study highlights the significant implications of having a gene signature as a prognostic predictor in GBM, and its potential application in personalized therapy.
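A prognostic risk score of the kind described in this abstract is commonly computed as a weighted sum of the expression values of the signature genes, with patients split into groups at the cohort median. The sketch below assumes that form, which the abstract does not spell out. The gene names, coefficients, and expression values are hypothetical placeholders, not the 6-gene signature from the study.

```python
from statistics import median


def risk_score(expression, coefficients):
    """Weighted sum of expression values over the signature genes."""
    return sum(coefficients[gene] * expression[gene] for gene in coefficients)


def stratify(scores):
    """Label a patient 'high' if the score exceeds the cohort median, else 'low'."""
    cutoff = median(scores.values())
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}


# Hypothetical regression coefficients for two placeholder genes (assumption).
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25}

# Hypothetical expression values for four patients (assumption).
cohort = {
    "P1": {"GENE_A": 2.0, "GENE_B": 4.0},
    "P2": {"GENE_A": 6.0, "GENE_B": 0.0},
    "P3": {"GENE_A": 4.0, "GENE_B": 4.0},
    "P4": {"GENE_A": 8.0, "GENE_B": 8.0},
}

scores = {pid: risk_score(expr, coefficients) for pid, expr in cohort.items()}
groups = stratify(scores)
```

A positive coefficient raises the score as expression increases, while a negative coefficient lowers it, so a gene can count as either a risk factor or a protective factor in the same score.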
Article
Background: Gliomas are the most common primary malignant brain tumor. Diffuse low-grade and intermediate-grade gliomas, which together comprise the lower-grade gliomas [LGG] (WHO grades II and III), present a therapeutic challenge to physicians due to the heterogeneity of their clinical behavior. Nomograms are useful tools for individualized estimation of survival. This study aimed to develop and independently validate a survival nomogram for patients with newly diagnosed LGG. Methods: Data were obtained for newly diagnosed LGG patients from The Cancer Genome Atlas (TCGA) and the Ohio Brain Tumor Study (OBTS) with the following variables: tumor grade (II or III), age at diagnosis, sex, Karnofsky Performance Status (KPS), and molecular subtype (IDH mutant with 1p/19q codeletion [IDHmut-codel], IDH mutant without 1p/19q codeletion [IDHmut-non-codel], IDH wild-type [IDHwt]). Survival was assessed using Cox proportional hazards regression, random survival forests, and recursive partitioning analysis, with adjustment for known prognostic factors. The models were developed using TCGA data and independently validated using the OBTS data. Models were internally validated using 10-fold cross-validation and externally validated with calibration curves. Results: A final nomogram was validated for newly diagnosed LGG. Factors that increased the probability of survival included grade II tumor, younger age at diagnosis, having a high KPS, and the IDHmut-codel molecular subtype. Conclusions: A nomogram that calculates individualized survival probabilities for patients with newly diagnosed LGG could be useful to healthcare providers for counseling patients regarding treatment decisions and optimizing therapeutic approaches. Free online software for implementing this nomogram is provided: https://hgittleman.shinyapps.io/LGG_Nomogram_H_Gittleman/.