A SVM Model for Candidate Y-chromosome Gene Discovery in Prostate Cancer

Abstract
Prostate cancer is widely known to be one of the most common cancers among men around the world.
Due to its high heterogeneity, many of the studies carried out to identify the molecular-level causes of
the cancer have been only partially successful. Among the techniques used in cancer studies, gene
expression profiling is one of the most widely used. Gene expression profiles reveal information about
the functionality of genes in different body tissues under different conditions. To identify
cancer-decisive genes, differential gene expression analysis is carried out using statistical and machine
learning methodologies. It helps to extract information about genes that have significant expression
differences between healthy and cancerous tissues. In this paper, we discuss a comprehensive
supervised classification approach using Support Vector Machine (SVM) models to investigate
differentially expressed Y-chromosome genes in prostate cancer. Eight SVM models, tuned to an
average accuracy of 98.3%, were used for the analysis. The genes CD99 (MIC2), ASMTL, DDX3Y and
TXLNGY emerged as the best candidates. Some of our results support existing findings, while others
introduce novel candidates for prostate cancer.
keywords: Support Vector Machines; prostate cancer; Y-chromosome; differential expression;
microarray data; log fold change.
Wageesha Rasanjana, Sandun Rajapaksa, Indika Perera, Dulani Meedeniya
Department of Computer Science and Engineering
University of Moratuwa, Sri Lanka
rasanjana.wageesha@gmail.com, {sandunip, indika, dulanim}@cse.mrt.ac.lk
EPiC Series in Computing, Volume 60, 2019, Pages 129–138
Proceedings of 11th International Conference on Bioinformatics and Computational Biology
O. Eulenstein, H. Al-Mubaid and Q. Ding (eds.), BiCOB 2019 (EPiC Series in Computing, vol. 60),
pp. 129–138
1. Introduction
Advances in computing technology have made it possible to turn microscopic biological information
into data that humans can interpret, driving massive growth in the fields of bioinformatics and
computational biology. Collecting biological information about humans and other species, once a
virtually impossible task, has now become routine. The depth of available cellular information has
gone from the cell level down to DNA sequences. Furthermore, with the development of microarray
technology, gene expression data can be acquired for analysis on computers [18] [21]. Improvements
in retrieval methods and in the purity of microarray data have given insights into previously
inaccessible aspects of medicine and biology [9], thus benefiting the treatment of various diseases.
Cancer treatments have become increasingly cell-based as bioinformaticians and scientists have
discovered genes and their interactions over the past decade. These discoveries could lead to
treatments for a number of cancers and other currently untreatable diseases. In fact, genome-level
studies have been carried out to find genes that are notably differentially expressed in cancerous
tissues [3] [14] [23]. However, the heterogeneity of these data makes gene expression profiles hard to
analyze, which in turn has prompted the development of novel data-analysis techniques.
Prostate cancer is one of the most common, yet hard to treat, cancers among men around the
globe [2]. In 2015, 84,861 men in the USA [20] and 47,151 men in the UK [5] were diagnosed with
prostate cancer, and 11,631 deaths were reported in the UK in 2016. The disease has two basic stages,
primary prostate tumour and metastatic prostate cancer, of which the primary tumour carries the lower
risk. The primary prostate tumour is confined to the prostate gland, while metastatic disease has
spread to other organs of the body [24].
The probability of a man being diagnosed with prostate cancer rises sharply with age, making this
cancer common among the geriatric population. The 84,861 men diagnosed with prostate cancer in the
USA in 2015 include 12,489 men aged between 60 and 79 and 14,529 men above the age of 80. This has
led to research on gene expression data related to prostate cancer, which have turned out to be
massively heterogeneous [1] [4] [23]. Many studies have been carried out in both the biological sciences
and statistical domains, highlighting candidate genes that can be vital in prostate cancer [1] [22].
Since prostate cancer is male-specific, the importance of analyzing the effect of Y-chromosome genes
on the proliferation of the disease has been recognized. However, most of the research carried out on
prostate cancer focuses on chromosomes other than the Y-chromosome.
This paper presents an approach for identifying candidate Y-chromosome genes that can have an
impact on the growth of prostate cancer. Section 2 presents an overview of commonly used analysis
techniques in both the statistical and supervised learning domains and describes the important
statistical terms used in our approach. Section 3 outlines the analysis criteria and methodology of the
model. The experimental results and comparisons are presented in Section 4. Finally, Section 5
concludes the paper with the inferences drawn from the results, possible future extensions, and the
importance of the research.
2. Background
A common practice in big data analysis is to build statistical models that can interpret the
probabilistic behaviour of data. In this paper, our interest lies in gene expression profiles and their
differential expression. Biological data are highly variable and very sensitive due to their microscopic
scale, especially gene expression data. The variance of the data depends strongly on the individual
gene expression values, which vary widely with the tissue from which they were collected. To reduce
the variability and noise in the data, pre-processing should be carried out as a matter of course.
Table 1 and Figure 1 show the average and the variation of expression for 5 randomly selected genes
from the GSE6919 dataset.
Table 1. Average expression across 171 patient-samples
Gene                  AR        IGHV3_23   EIF2AK2   RPS19      PLAGL1
Average expression    1760.5    3.2        1342.7    10398.5    81.8
A SVM Model for Candidate Y-chromosome Gene Discovery in Prostate Cancer Rasanjana et al.
130
In statistics, the log2 transform is widely used to reduce the high variance of data because of its
simplicity [8] [13] [19]. This transform can remove the dependency between the mean and the variance
of the data, a dependency generally known as heteroskedasticity. This is important when dealing with
noise (error) in microarray data because the errors largely depend on the population mean, especially
when measuring the quantitative change of a statistical variable.
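The effect of the log2 transform on the mean-variance dependency can be illustrated with a minimal sketch. It assumes purely multiplicative noise: the gene averages are those of Table 1, while the noise factors are hypothetical values chosen for illustration and are not measurements from GSE6919.

```python
import math
import statistics

# Average expression values taken from Table 1.
gene_means = {"AR": 1760.5, "IGHV3_23": 3.2, "EIF2AK2": 1342.7,
              "RPS19": 10398.5, "PLAGL1": 81.8}

# Hypothetical multiplicative noise factors (an assumption for illustration,
# not measurements from GSE6919).
noise_factors = [0.8, 0.9, 1.0, 1.1, 1.25]

raw_sd = {}
log_sd = {}
for gene, mean in gene_means.items():
    samples = [mean * factor for factor in noise_factors]
    # On the raw scale the spread grows in proportion to the mean.
    raw_sd[gene] = statistics.pstdev(samples)
    # On the log2 scale the mean becomes an additive offset, so the spread
    # depends only on the noise factors.
    log_sd[gene] = statistics.pstdev([math.log2(s) for s in samples])

for gene in gene_means:
    print(f"{gene:10s} raw sd = {raw_sd[gene]:10.2f}   log2 sd = {log_sd[gene]:.4f}")
```

Under this assumption the raw standard deviations differ by several orders of magnitude between RPS19 and IGHV3_23, whereas the log2 standard deviations are the same for every gene.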
Figure 1. Variation of expression across 171 samples
Fold change (FC) is a statistical quantity that measures the quantitative change of a variable from
one state to another. In microarray gene expression data, FC can be calculated between tumour (or
cancer) samples and normal samples. It indicates how much the expression of a gene changes when the
tissue goes from a normal state to a cancerous state. When the data are log-transformed before
calculating the fold change, the resulting value is interpreted as the log fold change (LogFC). In
addition, many other statistical quantities can be used to evaluate the differential expression of
genes, such as t-statistics or Bayesian log odds (B-statistics) [19].
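As a small worked example of the LogFC definition above, the sketch below follows one common convention: the mean log2 expression of the tumour group minus that of the normal group. The function name and the expression values are hypothetical and do not come from GSE6919.

```python
import math

def log_fold_change(tumour, normal):
    """Mean log2 expression in tumour minus mean log2 expression in normal."""
    mean_log_tumour = sum(math.log2(v) for v in tumour) / len(tumour)
    mean_log_normal = sum(math.log2(v) for v in normal) / len(normal)
    return mean_log_tumour - mean_log_normal

# Hypothetical expression values for a single probe (illustrative only).
normal_samples = [100.0, 100.0, 100.0]
tumour_samples = [400.0, 400.0, 400.0]

# A four-fold increase corresponds to a LogFC of about +2.
logfc_up = log_fold_change(tumour_samples, normal_samples)
# Swapping the groups models under-expression and flips the sign.
logfc_down = log_fold_change(normal_samples, tumour_samples)

print(round(logfc_up, 6), round(logfc_down, 6))
```

A positive value indicates over-expression in the tumour group and a negative value indicates under-expression.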
Apart from these statistical estimation methods, machine-learning methods have been widely used to
analyze gene expression data and sequence data [11]. Since gene expression data are highly variable,
statistical methods may provide incorrect estimates when the expression differences are insignificant
between tissue samples. Pirooznia et al. [16] have carried out a study to perform comparisons between
several machine learning algorithms regarding the applicability in microarray expression analysis. This
study concludes that SVM results in better performance compared to other supervised learning
algorithms for microarray expression analysis.
A study by Khosravi et al. [10] found a set of Y-chromosome genes in cancerous tissues that exhibit
highly differentiated expression levels compared to normal tissues. This phenomenon is used in many
other studies to classify cancer candidate genes by analyzing their differential gene expression
profiles. Genes whose expression is either up- or down-regulated are classified as cancer candidates.
The occurrence of up- or down-regulation during the metastatic transformation process is highlighted
in a comprehensive study by Chandran et al. [4]. In our research, this information is used extensively
to extract candidate Y-chromosome genes that are differentially expressed between normal tissue cells
and cancerous cells. Moreover, SVM is used in our approach to predict the genes, since it was reported
to be the best-performing learning algorithm for microarray data analysis [16]. However, further
laboratory testing and comprehensive analysis are required to strengthen these results and confirm the
genes as truly vital candidates in prostate cancer.
3. Analysis Criteria and Methodology
The proposed approach employs a dataset extracted from the GEO database under the accession number
GSE6919 with platform ID GPL8300 [15]. The dataset has 171 patient-samples acquired from four
distinct conditions: normal prostate tissue samples without any pathological alterations, samples
adjacent to the primary prostate tumour, primary prostate tumour samples, and metastatic prostate
cancer samples. These sample types contain 18, 63, 65 and 25 samples respectively. Healthy prostate
tissue samples without any pathology and samples adjacent to a prostate tumour are together labelled
as the normal category, while primary tumour samples and metastatic cancer samples are put into the
cancerous category. Throughout this paper, we refer to normal samples and cancerous samples according
to this categorization. The patient-samples, each having 12625 gene expression values, are used for
categorization, while training, testing and prediction are performed on gene-probe samples, where one
gene-probe sample has 171 values. Overall, the microarray dataset is a matrix with 12625 rows and 171
columns. The proposed approach consists of a number of steps combining statistical and supervised
learning methods. The dataset is restructured into 8 categories, each of which contains more than one
sample type, as listed in Table 2. The purpose of the categorization is to identify differentially
expressed genes at each step of the cancer's progression from the normal stage to the metastatic
stage. Our study compares every normal sample type with every cancerous sample type, creating the
need for 8 categories. Therefore, 8 SVM models were built, one for each category. The SVM models are
trained using a subset of ranked gene probes from the whole set of 12625 genes, while the
Y-chromosome gene set of 45 genes is extracted for classification. The Limma package in R is used for
the probe ranking process since it is widely used among computational biology researchers for the
statistical expression analysis of genes.
Table 2. Categorization of samples
Category
NOR-MET (Normal & Metastatic samples)
ADJ-MET (Adjacent to tumour & Metastatic samples)
NOR-ADJ-MET (Normal, Adjacent to a tumour & Metastatic samples)
NOR-TUM (Normal & Primary tumour samples)
ADJ-TUM (Adjacent to tumour & Primary tumour samples)
NOR-ADJ-TUM (Normal, Adjacent to tumour & Primary tumour samples)
TUM-MET (Primary tumour & Metastatic samples)
ALL (All samples)
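The categorization in Table 2 can be written down as a small lookup structure. The sketch below is only illustrative: the abbreviations follow Table 2 and the sample counts are those reported above, while the variable and function names are hypothetical.

```python
# Sample counts per tissue type, as reported in Section 3.
SAMPLE_COUNTS = {"NOR": 18, "ADJ": 63, "TUM": 65, "MET": 25}

# The eight model categories of Table 2, each a combination of tissue types.
CATEGORIES = {
    "NOR-MET": ("NOR", "MET"),
    "ADJ-MET": ("ADJ", "MET"),
    "NOR-ADJ-MET": ("NOR", "ADJ", "MET"),
    "NOR-TUM": ("NOR", "TUM"),
    "ADJ-TUM": ("ADJ", "TUM"),
    "NOR-ADJ-TUM": ("NOR", "ADJ", "TUM"),
    "TUM-MET": ("TUM", "MET"),
    "ALL": ("NOR", "ADJ", "TUM", "MET"),
}

def category_size(name):
    """Number of patient-samples (matrix columns) available to one model."""
    return sum(SAMPLE_COUNTS[tissue] for tissue in CATEGORIES[name])

sizes = {name: category_size(name) for name in CATEGORIES}
for name, size in sizes.items():
    print(f"{name:12s} {size:4d} samples")
```

Each category therefore selects a different subset of the 171 columns of the expression matrix, from 43 columns for NOR-MET up to all 171 for ALL.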
The gene probes are ranked according to their log fold change (LogFC) values. Large LogFC values of
the top genes provide evidence that their differential expression is significant. Under-expressed
genes exhibit a negative LogFC value while over-expressed genes exhibit a positive one. To emphasize
the patterns in the training dataset, it is divided into two subsets according to the sign of the
LogFC value. Highly over-expressed genes tend to have significantly increased expression in the
metastatic region, while the expression of highly under-expressed genes is significantly reduced
there, which highlights patterns in both cases. This distinct pattern gradually diminishes as the
magnitude of LogFC decreases: top-ranked genes show a strong pattern in differential expression while
the others barely display one. In our investigation, none of the top genes was from the Y-chromosome.
In short, Y-chromosome genes that are differentially expressed in prostate cancer cannot be
accurately identified by the statistical ranking method alone; a more sophisticated approach is
required. Therefore, a combined approach consisting of a statistical ranking method and a supervised
learning model was used in this research.
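The ranking and sign-based split described above can be sketched as follows. This is a plain-Python stand-in for the ranking step, not the Limma procedure itself; the probe names and LogFC values are hypothetical.

```python
# Hypothetical probe IDs and LogFC values (illustrative only).
logfc_by_probe = {
    "probe_a": 2.4, "probe_b": -1.9, "probe_c": 0.1,
    "probe_d": -0.05, "probe_e": 1.6, "probe_f": -2.7,
}

def rank_and_split(logfc):
    """Split probes by the sign of LogFC and rank each subset by magnitude."""
    over = sorted(((p, v) for p, v in logfc.items() if v > 0),
                  key=lambda item: item[1], reverse=True)
    under = sorted(((p, v) for p, v in logfc.items() if v < 0),
                   key=lambda item: item[1])
    return over, under

over_expressed, under_expressed = rank_and_split(logfc_by_probe)
print("over :", [p for p, _ in over_expressed])
print("under:", [p for p, _ in under_expressed])
```

In both lists the strongest pattern comes first and the probes with LogFC close to zero, which barely display a pattern, come last.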
3.1 Model Building and Classification
For each SVM model, ~200 genes were classified as vital for cancer and ~200 genes were considered
to have no effect on cancer. These classifications were based on LogFC values. The top-ranked genes
from each category, having the highest LogFC values (LogFC > 1.5) [17], are classified as vital
(boolean 1), and the genes from the bottom of the ranking table, having the lowest LogFC, are
considered neutral (boolean 0), creating a training dataset of ~400 genes out of 12625 total genes. In
addition, all NA values were replaced by zero. We used a 70:30 dataset split ratio based on the size
of the available dataset [7]. Therefore, our training set has ~280 samples, leaving ~120 samples for
the test data. In statistics, training datasets of fewer than ~100 samples tend to have greater
variance, resulting in low-accuracy models; for such small datasets, around 85% of the data is
recommended for the training set. Since our training set contains more samples, we refrained from
using a higher split ratio.
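A minimal sketch of how such a labelled dataset and its 70:30 split could be assembled is given below. It assumes the probes are already ranked from highest to lowest LogFC magnitude, and it reads "lowest LogFC" as closest to zero, which is an interpretation rather than something stated above. The synthetic LogFC values and the helper names are hypothetical.

```python
import random

def build_training_set(ranked, n_each=200, threshold=1.5):
    """Label the top-ranked probes 1 (vital) and the bottom-ranked probes 0 (neutral).

    ranked: list of (probe_id, logfc_magnitude), sorted from highest to lowest.
    """
    vital = [(probe, 1) for probe, fc in ranked[:n_each] if fc > threshold]
    neutral = [(probe, 0) for probe, _ in ranked[-n_each:]]
    return vital + neutral

def split_70_30(examples, seed=0):
    """Shuffle reproducibly and split into 70% training and 30% test examples."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = (7 * len(shuffled)) // 10
    return shuffled[:cut], shuffled[cut:]

# Synthetic ranking of 12625 probes with magnitudes falling from 4.0 to 0.0
# (an assumption for illustration, not values from GSE6919).
ranked_probes = [(f"probe_{i}", 4.0 * (1 - i / 12624)) for i in range(12625)]

examples = build_training_set(ranked_probes)
train_set, test_set = split_70_30(examples)
print(len(examples), len(train_set), len(test_set))
```

With 200 probes per class the split yields 280 training and 120 test examples, matching the approximate sizes quoted above.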
Using this dataset, both linear and non-linear SVM models were evaluated with 10-fold
cross-validation. The evaluation resulted in better accuracy for the linear model, as illustrated by
the bar plot in Figure 2. Therefore, a linear SVM model was adopted for our approach. Moreover, a cost
grid ranging from 0.05 to 35 was used to find the best regularization parameter for each model, as
illustrated in Figure 3. The train function selects the optimum cost value from the range and applies
it to the trained model. The accuracy tended to be constant after a cost of 25; hence, we limited the
range to 35. After classification, the prediction results from each of the 8 models were examined to
derive conclusions. Figure 4 shows the architectural setup of the procedure followed in our study.
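The tuning procedure, 10-fold cross-validation over a grid of cost values for a linear SVM, can be sketched in plain Python. This is a simplified stand-in and not the R tooling used in the study: the SVM is fitted by subgradient descent on the hinge loss, the toy data are synthetic, and all function names, the learning rate, the number of epochs and the grid points between 0.05 and 35 are assumptions.

```python
import random

def train_linear_svm(X, y, cost, epochs=200, lr=0.01):
    """Subgradient descent on 0.5*||w||^2 + cost * mean(hinge loss); labels are +1/-1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0  # gradient of the regulariser is w itself
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # only margin violators contribute to the hinge term
                for j in range(d):
                    grad_w[j] -= cost * yi * xi[j] / n
                grad_b -= cost * yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(model, x):
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

def cross_val_accuracy(X, y, cost, n_folds=10, seed=0):
    """Mean held-out accuracy over n_folds folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in idx if i not in held]
        model = train_linear_svm([X[i] for i in train_idx],
                                 [y[i] for i in train_idx], cost)
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / n_folds

# Synthetic, well-separated two-class data (illustrative only).
rng = random.Random(42)
X, y = [], []
for label, centre in ((1, 2.0), (-1, -2.0)):
    for _ in range(20):
        X.append([centre + rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)])
        y.append(label)

cost_grid = [0.05, 1, 5, 15, 25, 35]
cv_scores = {cost: cross_val_accuracy(X, y, cost) for cost in cost_grid}
best_cost = max(cost_grid, key=lambda cost: cv_scores[cost])
print(cv_scores, best_cost)
```

On real expression data the cross-validated accuracy would vary with the cost, and the grid point with the highest mean accuracy would be kept for the final model; ties are resolved here in favour of the smallest cost.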
Figure 2. Comparison of linear & non-linear model accuracies
Figure 3. Variation of the accuracy of SVM models against a range of cost values
Figure 4. The architecture of the experimental setup
4. Results and Comparison
We analyzed 171 samples taken from 171 patients under different conditions in the human body. They
include both cancerous and healthy samples from tissues in the prostate gland and in other organs
(only metastatic samples are taken from other organs). Every sample contains 12625 gene probes, which
we use as samples for training, testing and prediction. This sums to 2,158,875 expression values, of
which 67 NA values are replaced by zero. We built 8 SVM models to identify and interpret the
Y-chromosome genes that are differentially expressed across different tissues. In addition, we
simultaneously investigated both over-expressed and under-expressed genes under those 8 models.
Finally, we compared our results with existing findings from the literature. Table 3 and Table 4
display the results and compare the expression changes of candidate Y-chromosome genes across the
different models.
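The size of the expression matrix and the handling of NA values can be checked with a few lines. The sketch below represents a missing value as None, which is an assumption about the data representation, and the helper name is hypothetical.

```python
N_PROBES = 12625   # rows of the expression matrix
N_SAMPLES = 171    # columns (patient-samples)
N_MISSING = 67     # NA values reported for the dataset

total_values = N_PROBES * N_SAMPLES
missing_share = N_MISSING / total_values

def replace_missing(row):
    """Replace missing entries (represented here as None) by zero."""
    return [0.0 if value is None else value for value in row]

cleaned = replace_missing([1.5, None, 3.0])
print(total_values, f"{missing_share:.6%}", cleaned)
```

The missing entries make up only about 0.003% of all values, so replacing them by zero affects a very small part of the matrix.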
Initially, we considered the most strongly differentiated samples, which are highly cancerous and
highly normal. The NOR-MET model contained genes from normal prostate tissue samples (samples that
are neither diagnosed nor pathologically altered) and metastatic cancer samples (samples taken from
different metastatic cancer locations such as lungs, liver or lymph nodes). In the ADJ-MET model, we
analyzed genes from tissue samples that are adjacent to the primary tumour and genes from metastatic
locations. In both models, we achieved over 99% model accuracy and the outputs were almost identical.
We identified seven over-expressed Y-chromosome genes and six under-expressed genes, namely VAMP7,
USP9Y, ASMTL, KDM5D, DDX3Y, SLC25A6 and CD99, LOC101928634, AKAP17A, TXLNGY, SLC25A6, RPS4Y1, as
illustrated in columns 3, 4 and 5 of Table 3. USP9Y and DAZ4 genes do not show over-expression in the
NOR-MET model. The NOR-ADJ-MET model was created to validate the findings of the NOR-MET and ADJ-MET
models. The under-expression results from this category were similar to those of the first two
categories except for the loss of DDX3Y, and over-expression differed for some genes. When the normal
and adjacent samples are combined into one category, the expression patterns become less significant
than when they are evaluated separately, which makes accurate classification more difficult.
Table 3. Under-expressed genes
SVM model categories: NOR-MET, ADJ-MET, NOR-ADJ-MET, NOR-TUM, ADJ-TUM, NOR-ADJ-TUM, TUM-MET, ALL
Type: Under-Expressed
Gene      Model categories marked (v), out of 8
VAMP7     5
USP9Y     3
ASMTL     7
KDM5D     4
DDX3Y     6
CD99      8
IL3RA     3
RBMY1J    1
UTY       1
EIF1AY    2
Genes with changes of expression pattern in the majority of classes
Then we analyzed genes that are differentially expressed during the tumour growth process. Here,
healthy normal prostate tissue samples and primary prostate tumour samples were considered. First, we
compared normal prostate gland samples and primary prostate tumour samples, which is the NOR-TUM
model. In the ADJ-TUM model, we considered samples adjacent to a primary prostate tumour and samples
taken directly from a primary prostate tumour. In addition, we analyzed the NOR-ADJ-TUM model to
validate the results. Some genes form a special scenario in which they exhibit differential expression
only within a primary tumour and none in the metastatic stage: they appear as candidates in the
NOR-TUM, ADJ-TUM and NOR-ADJ-TUM categories while displaying no expression changes in the NOR-MET,
ADJ-MET and NOR-ADJ-MET models. We could identify only one common under-expressed gene (CD99) across
these 3 models, though there are many common over-expressed genes such as BPY2, XKRY2 and SRY. These
genes, which exhibit only early changes in expression pattern, are shown in gray in Table 4.
Table 4. Over-expressed genes
SVM model categories: NOR-MET, ADJ-MET, NOR-ADJ-MET, NOR-TUM, ADJ-TUM, NOR-ADJ-TUM, TUM-MET, ALL
Type: Over-Expressed
Gene            Model categories marked (^), out of 8
LOC101928634    4
AKAP17A         4
TXLNGY          7
SLC25A6         4
RPS4Y1          4
USP9Y           1
DAZ4            2
IL9R            2
BPY2            4
XKRY2           4
SRY             3
ASMT            4
TSPY10          5
TTTY15          3
PCDH11Y         4
LOC100509646    2
ZFY             1
NLGN4Y          1
CDY1B           2
CSF2RA          1
SHOX            2
VCY1B           2
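One way to summarise Table 3 and Table 4 is to count, for each gene, the number of model categories in which it is marked and to keep the genes that recur most often. The sketch below uses the mark counts as they can be read from the two tables. The threshold of 6 out of 8 models is an assumption chosen for illustration; it is not a criterion stated in this paper, and the function name is hypothetical.

```python
N_MODELS = 8

# Number of model categories in which each gene is marked (read from Table 3).
under_expressed_marks = {
    "VAMP7": 5, "USP9Y": 3, "ASMTL": 7, "KDM5D": 4, "DDX3Y": 6,
    "CD99": 8, "IL3RA": 3, "RBMY1J": 1, "UTY": 1, "EIF1AY": 2,
}

# Number of model categories in which each gene is marked (read from Table 4).
over_expressed_marks = {
    "LOC101928634": 4, "AKAP17A": 4, "TXLNGY": 7, "SLC25A6": 4, "RPS4Y1": 4,
    "USP9Y": 1, "DAZ4": 2, "IL9R": 2, "BPY2": 4, "XKRY2": 4, "SRY": 3,
    "ASMT": 4, "TSPY10": 5, "TTTY15": 3, "PCDH11Y": 4, "LOC100509646": 2,
    "ZFY": 1, "NLGN4Y": 1, "CDY1B": 2, "CSF2RA": 1, "SHOX": 2, "VCY1B": 2,
}

def recurrent_genes(marks, min_models):
    """Genes marked in at least min_models of the model categories."""
    return sorted(gene for gene, count in marks.items() if count >= min_models)

# Hypothetical threshold: marked in at least 6 of the 8 models.
recurrent_under = recurrent_genes(under_expressed_marks, min_models=6)
recurrent_over = recurrent_genes(over_expressed_marks, min_models=6)
print(recurrent_under, recurrent_over)
```

With this threshold the under-expressed list contains ASMTL, CD99 and DDX3Y and the over-expressed list contains TXLNGY. A looser threshold of 5 would additionally admit VAMP7 and TSPY10.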
Finally, we analyzed the genes from the TUM-MET model (primary tumour samples and metastatic
samples) and the ALL model. In the TUM-MET model, we focus on the spread of cancer from a primary
tumour to other organs. In the ALL model, we focus on identifying the genes that are generally
differentially expressed in prostate cancer. Among the resulting candidates, some Y-chromosome genes
can be recognized as the most vital since their expression level varies significantly across all the
categories, as illustrated in Table 3 and Table 4. Figure 5.1 to Figure 5.8 depict heatmaps of
row-normalized expression for the critically identified genes across the 8 models, in which intensity
decreases from red to yellow. High intensity corresponds to a high normalized value in each of the 4
rows. The next section presents important conclusions about our findings and insights into future
research.
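Row normalization for a heatmap can be sketched as below. The exact normalization used for Figure 5 is not specified here, so min-max scaling of each row to the range 0 to 1 is assumed; the gene names and expression values are hypothetical.

```python
def row_normalize(row):
    """Scale one gene's expression values to [0, 1] (min-max scaling, an assumption)."""
    low, high = min(row), max(row)
    if high == low:
        # A constant row carries no pattern; map it to zeros.
        return [0.0 for _ in row]
    return [(value - low) / (high - low) for value in row]

# Hypothetical expression values of two genes across four samples.
expression = {
    "gene_a": [10.0, 20.0, 30.0, 50.0],
    "gene_b": [1000.0, 4000.0, 2000.0, 3000.0],
}

normalized = {gene: row_normalize(values) for gene, values in expression.items()}
for gene, values in normalized.items():
    print(gene, [round(v, 3) for v in values])
```

After normalization each row spans the same range, so a low-expression gene and a high-expression gene can be compared by colour intensity within the same heatmap.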
Figure 5.1. NOR-MET
Figure 5.2. ADJ-MET
Figure 5.3. NOR-ADJ-MET
Figure 5.4. NOR-TUM
Figure 5.5. ADJ-TUM
Figure 5.6. NOR-ADJ-TUM
Figure 5.7. TUM-MET
Figure 5.8. ALL
5. Conclusion and Future Work
The analysis carried out with the per-category SVM models, which have a minimum accuracy of 95%,
results in a set of decisive Y-chromosome genes, namely CD99 (also known as MIC2), ASMTL, DDX3Y, and
TXLNGY. Those genes are highlighted in yellow in Table 3 and Table 4. Given the high accuracy obtained
for the SVM models, it is highly probable that these Y-chromosome genes are actively involved in the
development and metastasis of prostate cancer. Many biological studies have focused on the CD99 gene
and its involvement in prostate and other types of cancers [3][25]. In addition, Lau et al. [12] have
reported the involvement of many Y-chromosome candidates, including ASMTL, ILR3, and RPS4Y1, in
prostate cancer. The genes highlighted in gray rows might also play a vital role in prostate tumour
development. Early medical precautions targeting them may help prevent the tumour from developing.
These Y-chromosome genes do not exhibit significant expression patterns when compared to the
top-ranked genes in differential expression analysis, but the changes in their expression from normal
tissue to cancerous tissue are significant enough to warrant closer observation. Table 3 and Table 4
contain information about many other genes from our findings that are consistent with Lau’s work.
Moreover, similar work by Dasari et al. [6] adds support to our findings.
However, it should be highlighted that future work is needed to confirm how these Y-chromosome
genes relate to the progression of prostate cancer. Microarray data may contain noise, which cannot be
removed completely by data pre-processing. Therefore, results obtained through computational methods
alone are not accurate enough for direct acceptance. We suggest carrying out thorough, narrowly
focused laboratory experiments on these genes to investigate the actual role they play in the disease.
Such experiments would be highly beneficial for the biological community in finding new cellular-level
treatments for prostate cancer.
6. References
[1] Burmester, James K., et al. 2004. Analysis of candidate genes for prostate cancer. Human
heredity 57, no. 4 (2004): 172-178.
[2] Cancer Research UK, Prostate cancer statistics, https://www.cancerresearchuk.org/health-
professional/cancer-statistics/statistics-by-cancer-type/prostate-cancer, October 2018.
[3] Carter, H. Ballentine. 2004. Prostate cancers in men with low PSA levels: must we find
them? The New England journal of medicine 350, no. 22 (2004): 2292.
[4] Chandran, Uma R., et al. 2007. Gene expression profiles of prostate cancer reveal involvement of
multiple molecular pathways in the metastatic process. BMC cancer 7, no. 1 (2007): 64.
[5] Cheng, Liang, Nagabhushan, Moolky, Pretlow, Theresa P., Amini, Saeid B. and Pretlow, Thomas
G. 1996. Expression of E-cadherin in primary and metastatic prostate cancer. The American
journal of pathology 148, no. 5 (1996): 1375.
[6] Dasari, Vijay K. et al. 2001. Expression analysis of Y chromosome genes in human prostate
cancer. J Urol. (2001).
[7] Dobbin, Kevin K., and Richard M. Simon. 2011. Optimally splitting cases for training and testing
high dimensional classifiers. BMC medical genomics 4, no. 1 (2011): 31.
[8] Friedman, Nir, Cai, Long and Xie, X. Sunney 2006. Linking Stochastic Dynamics to Population
Distribution: An Analytical Framework of Gene Expression. In: Physical review letters. (2006).
[9] Golub, Todd R., et al. 1999. Molecular classification of cancer: class discovery and class
prediction by gene expression monitoring. Science 286, no. 5439 (1999): 531-537.
[10] Khosravi, Pegah, Zahiri, Javad, Gazestani, Vahid H., Mirkhalaf, Samira, Akbarzadeh,
Mohammad, Sadeghi, Mehdi, Goliaei, Bahram 2014. Analysis of candidate genes has proposed
the role of y chromosome in human prostate cancer. Iranian journal of cancer prevention. (2014);
7(4):204.
[11] Larranaga, Pedro et al. 2005. Machine Learning in Bioinformatics. In: Briefings in
Bioinformatics. (2005): 86-112.
[12] Lau, Yun-Fai C., and Zhang, Jianqing 2000. Expression analysis of thirty one Y chromosome
genes in human prostate cancer. Mol. Carcinog. (2000), 308-321.
[13] Lin, Simon M., Du, Pan, Huber, Wolfgang and Kibbe, Warren A. 2008. Model-based variance-
stabilizing transformation for Illumina microarray data. Nucleic Acids Research 36, 2 (2008),
e11-e11.
[14] Myers, Jennifer S., von Lersner, Ariana K., Robbins, Charles J., and Sang, Qing-Xiang Amy
(2015). Differentially Expressed Genes and Signature Pathways of Human Prostate
Cancer. PLOS ONE, 10(12), p.e0145322.
[15] NCBI, GEO Accession viewer. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6919,
October 2018.
[16] Pirooznia, Mehdi, Yang, Jack Y., Yang, Mary Qu and Deng, Youping 2008. A Comparative
Study of Different Machine Learning Methods on Microarray Gene Expression Data. In: BMC
Genomics (2008).
[17] Raza, Khalid, Mishra, Akhilesh 2012. A Novel Anticlustering Filtering Algorithm for the
Prediction of Genes as Drug Target. In: American Journal of Biomedical Engineering. (2012):
206-211.
[18] Salome, J. Jacinth 2012. Efficient Retrieval Technique for Microarray Gene
Expression. International Journal of Information Retrieval Research (IJIRR), 2(2), (2012), 43-51.
[19] Satagopan, Jaya M. Olson, Sarah H., Elston, Robert C. 2017. Statistical interactions and Bayes
estimation of log odds in case-control studies. Statistical methods in medical research. (April
2017); 26(2):1021-38.
[20] Siegel, Rebecca L., Miller, Kimberly D. and Jemal, Ahmedin 2017. Cancer statistics, 2017. CA: A
Cancer Journal for Clinicians 67, 1 (2017), 7-30.
[21] Slonim, Donna K. and Yanai, Itai 2009. Getting Started in Gene Expression Microarray
Analysis. PLoS Computational Biology 5, 10 (2009), e1000543.
[22] Thibodeau, S. N., et al. 2015. Identification of candidate genes for prostate cancer-risk SNPs
utilizing a normal prostate tissue eQTL data set. Nature communications 6 (2015): 8653.
[23] Wang, Lipo, Chu, Feng and Xie, Wei 2007. Accurate cancer classification using expressions of
very few genes. IEEE/ACM Transactions on Computational Biology and Bioinformatics
(TCBB) 4, no. 1 (2007): 40-53.
[24] Yuan, Ye, Kishan, Amar U. and Nickols, Nicholas G. Treatment of the primary tumor in
metastatic prostate cancer. World Journal of Urology (2018).
[25] Zaccarini, Daniel, J. et al. 2018. Expression of TLE-1 and CD99 in Carcinoma: Pitfalls in
Diagnosis of Synovial Sarcoma. In: Applied Immunohistochemistry & Molecular Morphology.
(2018): 368-373.
A SVM Model for Candidate Y-chromosome Gene Discovery in Prostate Cancer Rasanjana et al.
138
Article
Prostate cancer is a major world health problem for men. This shows how important early detection and accurate diagnosis are for better treatment and patient outcomes. This study compares different ways to find Prostate Cancer (PCa) and label tumors as normal or abnormal, with the goal of speeding up current work in microarray gene data analysis. The study looks at how well several feature extraction methods work with three feature selection strategies: Harmonic Search (HS), Firefly Algorithm (FA), and Elephant Herding Optimization (EHO). The techniques tested are Expectation Maximization (EM), Nonlinear Regression (NLR), K-means, Principal Component Analysis (PCA), and Discrete Cosine Transform (DCT). Eight classifiers are used for the task of classification. These are Random Forest, Decision Tree, Adaboost, XGBoost, and Support Vector Machine (SVM) with linear, polynomial, and radial basis function kernels. This study looks at how well these classifiers work with and without feature selection methods. It finds that the SVM with radial basis function kernel, using DCT for feature extraction and EHO for feature selection, does the best of all of them, with an accuracy of 94.8 % and an error rate of 5.15 %.
Article
Full-text available
Survival analysis is a critical task in glioma patient management due to the inter and intra tumor heterogeneity. In clinical practice, clinicians estimate the survival with their experience, which can be biased and optimistic. Over the past decades, diverse survival analysis approaches were proposed incorporating distinct data such as imaging and genetic information. The remarkable advancements in imaging and high throughput omics and sequencing technologies have enabled the acquisition of this information of glioma patients eficiently, providing novel insights for survival estimation in the present day. Besides, in the past years, machine learning techniques and deep learning have emerged into the field of survival analysis of glioma patients trading off the traditional statistical analysis-based survival analysis approaches. In this survey paper, we explore the prognostic parameters acquired, utilizing diagnostic imaging techniques and genomic platforms for survival or risk estimation of glioma patients. Further, we review the techniques, learning and statistical analysis algorithms, along with their benefits and limitations used for prognosis prediction. Consequently, we highlight the challenges of the existing state-of-the-art survival prediction studies and propose future directions in the field of research.
Conference Paper
Gliomas are a lethal type of central nervous system tumor with a poor prognosis. Recently, with advancements in microarray technologies, large volumes of gene expression data from glioma patients have been acquired, enabling analysis in many respects. Thus, genomics has emerged in the field of prognosis analysis. In this work, we identify a survival-related 7-gene signature and explore two approaches for survival prediction and risk estimation. For survival prediction, we propose a novel probabilistic-programming-based approach, which outperforms existing traditional machine learning algorithms. An average 4-fold accuracy of 74% is obtained with the proposed algorithm. Further, we construct a prognostic risk model for risk estimation of glioma patients. This model reflects the survival of glioma patients, assigning high risk to patients with low survival.
Article
The cornerstone of treatment for metastatic prostate cancer patients has been conventional androgen deprivation therapy, with additional systemic therapy initiated only after castration resistance, and local therapy reserved for palliation. Compelling results from modern trials challenge this paradigm, arguing for initiating escalated hormone therapy and/or chemotherapy during the castration-sensitive disease state for many patients. Furthermore, modern radiotherapy techniques allow for local control of disease with low risk of toxicity. Finally, new PET probes with enhanced sensitivity and accuracy are likely to become a part of routine staging and will lead to an increased incidence of patients with metastatic disease at presentation, with a shift toward identification of patients with limited metastatic disease. As such, the landscape is primed for investigations aimed to explore the role of primary tumor therapy for patients with metastatic prostate cancer. We review the existing data evaluating primary tumor therapy for patients with metastatic prostate cancer and describe ongoing clinical trials testing the hypothesis that primary tumor therapy may benefit patients with metastatic prostate cancer.
Article
Genomic technologies including microarrays and next-generation sequencing have enabled the generation of molecular signatures of prostate cancer. Lists of differentially expressed genes between malignant and non-malignant states are thought to be fertile sources of putative prostate cancer biomarkers. However such lists of differentially expressed genes can be highly variable for multiple reasons. As such, looking at differential expression in the context of gene sets and pathways has been more robust. Using next-generation genome sequencing data from The Cancer Genome Atlas, differential gene expression between age- and stage- matched human prostate tumors and non-malignant samples was assessed and used to craft a pathway signature of prostate cancer. Up- and down-regulated genes were assigned to pathways composed of curated groups of related genes from multiple databases. The significance of these pathways was then evaluated according to the number of differentially expressed genes found in the pathway and their position within the pathway using Gene Set Enrichment Analysis and Signaling Pathway Impact Analysis. The "transforming growth factor-beta signaling" and "Ran regulation of mitotic spindle formation" pathways were strongly associated with prostate cancer. Several other significant pathways confirm reported findings from microarray data that suggest actin cytoskeleton regulation, cell cycle, mitogen-activated protein kinase signaling, and calcium signaling are also altered in prostate cancer. Thus we have demonstrated feasibility of pathway analysis and identified an underexplored area (Ran) for investigation in prostate cancer pathogenesis.
Article
Multiple studies have identified loci associated with the risk of developing prostate cancer but the associated genes are not well studied. Here we create a normal prostate tissue-specific eQTL data set and apply this data set to previously identified prostate cancer (PrCa)-risk SNPs in an effort to identify candidate target genes. The eQTL data set is constructed by the genotyping and RNA sequencing of 471 samples. We focus on 146 PrCa-risk SNPs, including all SNPs in linkage disequilibrium with each risk SNP, resulting in 100 unique risk intervals. We analyse cis-acting associations where the transcript is located within 2 Mb (±1 Mb) of the risk SNP interval. Of all SNP-gene combinations tested, 41.7% of SNPs demonstrate a significant eQTL signal after adjustment for sample histology and 14 expression principal component covariates. Of the 100 PrCa-risk intervals, 51 have a significant eQTL signal and these are associated with 88 genes. This study provides a rich resource to study biological mechanisms underlying genetic risk to PrCa.
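The cis-window criterion in the abstract above (a transcript located within ±1 Mb of the risk SNP interval) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the example coordinates, and the choice to count any overlap with the padded interval as "within the window" are hypothetical and are not taken from the study.

```python
# Hypothetical sketch of a cis-window test around a risk SNP interval.
# Only the 1 Mb flank size comes from the abstract; names and the
# overlap rule are assumptions made for illustration.

CIS_PAD_BP = 1_000_000  # 1 Mb on each side of the risk interval


def in_cis_window(interval_start, interval_end, tx_start, tx_end, pad=CIS_PAD_BP):
    """Return True if the transcript overlaps the risk interval padded by `pad` bp."""
    window_start = max(0, interval_start - pad)
    window_end = interval_end + pad
    # Two closed intervals overlap when each one starts before the other ends.
    return tx_start <= window_end and tx_end >= window_start


if __name__ == "__main__":
    # Made-up coordinates on a single chromosome.
    print(in_cis_window(5_000_000, 5_020_000, 5_900_000, 5_950_000))
    print(in_cis_window(5_000_000, 5_020_000, 7_000_000, 7_050_000))
```

A real eQTL pipeline would also need to match chromosomes and decide whether the distance is measured to the transcript start site or to the whole transcript; the abstract does not specify either.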
Article
Prostate cancer, a serious genetic disease, is known as the most widespread cancer in men, but the molecular changes required for its progression are not fully understood. The availability of high-throughput gene expression data has led to the development of various computational methods for identifying the critical genes involved in the cancer. In this paper, we show that constructing co-expression networks using Y-chromosome genes provides an alternative strategy for detecting new candidate genes that might be involved in prostate cancer. In our approach, we constructed independent co-expression networks for the normal and cancerous stages using a reverse engineering approach. We then highlighted crucial Y-chromosome genes involved in prostate cancer by analyzing the networks based on party and date hubs. Our results led to the detection of 19 critical genes related to prostate cancer, 12 of which have previously been shown to be involved in this cancer. In addition, essential Y-chromosome genes were searched for based on the reconstruction of sub-networks, which led to the identification of 4 experimentally established and 4 new Y-chromosome genes that might putatively be linked to prostate cancer. Correct inference of the master genes that mediate molecular changes during cancer progression is one of the major challenges in cancer genomics. In this paper, we have shown the role of Y-chromosome genes in finding prostate cancer susceptibility genes. Applying our approach to prostate cancer confirmed previous knowledge about this cancer and predicted additional new genes.
Article
The characteristic immunoprofile for the diagnosis of synovial sarcoma, a neoplasm of unclear tissue origin, is expression of transducer-like enhancer of split 1 (TLE-1), CD99, partial expression of cytokeratin, and epithelial membrane antigen by immunohistochemistry (IHC). Diagnostic dilemma or misdiagnosis can occur due to overlap in IHC and morphology with carcinomas, and particularly poorly differentiated and metastatic tumors. The frequency of TLE-1 and CD99 expression in carcinomas by IHC has not been previously assessed. We evaluated TLE-1 and CD99 expression in various carcinomas and evaluated the expression of the SS18 (SYT) gene rearrangement (a characteristic biomarker for synovial sarcoma) in tumors with TLE-1 and/or CD99 expression. Immunostains of TLE-1 and CD99 were performed in 100 various carcinomas. Seven of the 98 cases (7%) of carcinomas showed TLE-1 expression, including 1 each of prostate adenocarcinoma (ADCA), esophageal ADCA, basal cell carcinoma, adrenocortical carcinoma, endometrial ADCA, ovarian serous carcinoma, and small cell carcinoma. Twenty-one of the 100 cases (21%) of carcinomas demonstrated CD99 expression, including 6 prostate ADCA, 3 esophageal ADCA, 5 squamous cell carcinomas, 2 hepatocellular carcinomas, 1 each for endometrial ADCA, renal cell carcinoma, urothelial cell carcinoma, neuroendocrine carcinoma, and mucoepidermoid carcinoma. An esophageal ADCA was positive for both TLE-1 and CD99. None of the carcinomas with positive TLE-1 (n=7) or CD99 (n=21) by IHC showed SS18 gene rearrangement by fluorescent in situ hybridization. TLE-1 and CD99 expression were identified in 7% and 21% of carcinomas, respectively. This is a potential pitfall in the IHC interpretation for diagnosis of synovial sarcoma. SS18 gene rearrangement by fluorescent in situ hybridization is helpful for the diagnostically challenging cases, either for confirmation or exclusion of synovial sarcoma.
Article
DNA microarray gene data consist of the expression levels of thousands of genes for a small number of samples. Extracting the required knowledge from microarray gene data remains an open challenge. Acquiring knowledge from such gene data is intricate, although a growing body of research aims to do so. Gene classification is vital for retrieving the required information; however, the task is complex because of the data characteristics: high dimensionality and small sample size. Initially, dimensionality reduction is carried out with the aid of the LPP and PCA techniques in order to shrink the microarray data without losing information, and the reduced data are used for information retrieval. In this paper, we propose an effective gene retrieval technique based on LPP and PCA, called LPCA. LPP and PCA are chosen for dimensionality reduction to enable efficient retrieval of microarray gene data. An application of microarray gene data to classification by SVM is included. The SVM is trained on the dimensionality-reduced gene data for effective classification. A comparative study of these dimensionality reduction techniques is made.
Article
This paper is concerned with the estimation of the logarithm of disease odds (log odds) when evaluating two risk factors, whether or not interactions are present. Statisticians define interaction as a departure from an additive model on a certain scale of measurement of the outcome. Certain interactions, known as removable interactions, may be eliminated by fitting an additive model under an invertible transformation of the outcome. This can potentially provide more precise estimates of log odds than fitting a model with interaction terms. In practice, we may also encounter nonremovable interactions. The model must then include interaction terms, regardless of the choice of the scale of the outcome. However, in practical settings, we do not know at the outset whether an interaction exists, and if so whether it is removable or nonremovable. Rather than trying to decide on significance levels to test for the existence of removable and nonremovable interactions, we develop a Bayes estimator based on a squared error loss function. We demonstrate the favorable bias-variance trade-offs of our approach using simulations, and provide empirical illustrations using data from three published endometrial cancer case-control studies. The methods are implemented in an R program, and available freely at http://www.radicalpsychology.org/vol9-2/roper.html. © The Author(s) 2015 Reprints and permissions: sagepub.co.uk/journalsPermissions.nav.
Article
The high-throughput data generated by microarray experiments provide the complete set of genes being expressed in a given cell or in an organism under particular conditions. The analysis of these enormous data has opened a new dimension for researchers. In this paper we describe a novel algorithm for microarray data analysis focusing on the identification of genes that are differentially expressed under particular internal or external conditions and which could be potential drug targets. The algorithm uses time-series gene expression data as input and recognizes genes which are expressed differentially. This algorithm implements standard statistics-based gene functional investigations, such as the log transformation, mean, log-sigmoid function, coefficient of variation, etc. It does not use clustering analysis. The proposed algorithm has been implemented in Perl. The time-series gene expression data on the yeast Saccharomyces cerevisiae from the Stanford Microarray Database (SMD), consisting of 6154 genes, were taken for the validation of the algorithm. The developed method extracted 48 genes out of the total 6154 genes. These genes are mostly responsible for the yeast's resistance to high temperature.
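Two of the statistics named in the abstract above, the log transformation and the coefficient of variation, can be illustrated with a short Python sketch. The abstract does not say how the original Perl algorithm combines these quantities, so the functions, the threshold, and the toy data below are assumptions for illustration only, not a reconstruction of that algorithm.

```python
import math
import statistics

# Hypothetical sketch: flag genes whose time-series expression varies strongly.
# Function names, the threshold, and the data are illustrative assumptions.


def log2_ratios(series):
    """Log2 ratio of each time point relative to the first (values must be > 0)."""
    baseline = series[0]
    return [math.log2(value / baseline) for value in series]


def coefficient_of_variation(series):
    """Population standard deviation divided by the mean of the raw values."""
    mean = statistics.mean(series)
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return statistics.pstdev(series) / mean


def select_variable_genes(expression, cv_threshold):
    """Return genes with CV >= cv_threshold, ordered from most to least variable."""
    scores = {gene: coefficient_of_variation(s) for gene, s in expression.items()}
    selected = [gene for gene, cv in scores.items() if cv >= cv_threshold]
    return sorted(selected, key=lambda gene: scores[gene], reverse=True)


if __name__ == "__main__":
    toy = {
        "geneA": [100, 100, 100, 100],  # flat profile
        "geneB": [50, 100, 200, 400],   # doubles at every time point
        "geneC": [90, 100, 110, 100],   # small fluctuation
    }
    print(select_variable_genes(toy, cv_threshold=0.5))
    print(log2_ratios(toy["geneB"]))
```

The coefficient of variation is computed on the raw values rather than on the log-transformed ones, since a mean log value near zero would make the ratio unstable; that choice is also an assumption of this sketch.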
Article
Rapid advances in positional cloning studies have identified most of the genes on the human Y chromosome, thereby providing resources for studying the expression of its genes in prostate cancer. Using a semiquantitative reverse transcription–polymerase chain reaction (RT–PCR) procedure, we examined the expression of the Y chromosome genes in a panel of prostate samples diagnosed with benign prostatic hyperplasia (BPH), low and/or high grade carcinoma, and the prostatic cell line, LNCaP, stimulated by androgen treatment. Results from this expression analysis of 31 of the 33 genes isolated so far from the Y chromosome revealed three types of expression patterns: i) specific expression in other tissues (e.g., AMELY, BPY1, BPY2, CDY, and RBM); ii) ubiquitous expression among prostate and control testis samples, similar to those of house-keeping genes (e.g., ANT3, XE7, ASMTL, IL3RA, SYBL1, TRAMP, MIC2, DBY, RPS4Y, and SMCY); iii) differential expression in prostate and testis samples. The last group includes X-Y homologous (e.g., ZFY, PRKY, DFFRY, TB4Y, EIF1AY, and UTY) and Y-specific genes (e.g., SRY, TSPY, PRY, and XKRY). Androgen stimulation of the LNCaP cells resulted in up-regulation of PGPL, CSFR2A, IL3RA, TSPY, and IL9R and down-regulation of SRY, ZFY, and DFFRY. The heterogeneous and differential expression patterns of the Y chromosome genes raise the possibility that some of these genes are either involved in or are affected by the oncogenic processes of the prostate. The up- and down-regulation of several Y chromosome genes by androgen suggests that they may play a role(s) in the hormonally stimulated proliferation of the responsive LNCaP cells. Mol. Carcinog. 27:308–321, 2000. © 2000 Wiley-Liss, Inc.