ABSTRACT: Genome-wide association studies (GWAS) have identified approximately three dozen single nucleotide polymorphisms (SNPs) consistently associated with prostate cancer (PCa) risk. Despite the reproducibility of these associations, the molecular mechanisms underlying most of these SNPs have not been well elucidated, as most lie within non-coding regions of the genome. Androgens play a key role in prostate carcinogenesis. Recently, using ChIP-on-chip technology, 22,447 androgen receptor (AR) binding sites have been mapped throughout the genome, greatly expanding the genomic regions potentially involved in androgen-mediated activity.
To test the hypothesis that sequence variants in AR binding sites are associated with PCa risk, we performed a systematic evaluation in two existing PCa GWAS cohorts: the Johns Hopkins Hospital and the Cancer Genetic Markers of Susceptibility (CGEMS) study populations. We demonstrate that regions containing AR binding sites are significantly enriched for PCa risk-associated SNPs, that is, more than expected by chance alone. In addition, compared with the genome as a whole, the newly observed risk-associated SNPs in these regions are significantly more likely to overlap with established PCa risk-associated SNPs from previous GWAS. These results are consistent with our previous finding, from a bioinformatics analysis, that one-third of the 33 known PCa risk-associated SNPs discovered by GWAS are located in regions of the genome containing AR binding sites.
The results to date provide novel statistical evidence suggesting an androgen-mediated mechanism by which some PCa-associated SNPs act to influence PCa risk. However, these results are hypothesis generating and ultimately warrant testing through in-depth molecular analyses.
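The enrichment claim above ("more than expected by chance alone") is the kind of statement commonly assessed with a permutation test. The following is a minimal sketch under simplifying assumptions (uniform random SNP placement on a toy genome; all coordinates hypothetical), not the study's actual analysis:

```python
import random

def enrichment_p_value(snp_positions, regions, genome_length,
                       n_permutations=1000, seed=0):
    """Permutation test: do the SNPs overlap the regions more often
    than uniformly placed random positions would by chance?"""
    rng = random.Random(seed)

    def count_overlaps(positions):
        return sum(any(start <= p < end for start, end in regions)
                   for p in positions)

    observed = count_overlaps(snp_positions)
    n = len(snp_positions)
    hits = 0
    for _ in range(n_permutations):
        random_positions = [rng.randrange(genome_length) for _ in range(n)]
        if count_overlaps(random_positions) >= observed:
            hits += 1
    # Add-one correction so the empirical p-value is never exactly zero.
    return observed, (hits + 1) / (n_permutations + 1)

# Toy example with hypothetical coordinates: the regions cover 10% of the
# "genome", yet all five SNPs fall inside them.
regions = [(100, 200), (500, 600)]
snps = [110, 150, 199, 520, 590]
observed, p = enrichment_p_value(snps, regions, genome_length=2000)
```

A real analysis would additionally match permuted SNPs on allele frequency and local SNP density rather than placing them uniformly.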
ABSTRACT: Advanced prostate cancer can progress to systemic metastatic tumors, which are generally androgen insensitive and ultimately lethal. Here, we report a comprehensive genomic survey of somatic events in systemic metastatic prostate tumors, using both high-resolution copy number analysis and a targeted mutational survey of 3508 exons from 577 cancer-related genes by next-generation sequencing. Focal homozygous deletions were detected at 8p22, 10q23.31, 13q13.1, 13q14.11, and 13q14.12. Key genes mapping within these deleted regions include PTEN, BRCA2, C13ORF15, and SIAH3. Focal high-level amplifications were detected at 5p13.2-p12, 14q21.1, 7q22.1, and Xq12. Key amplified genes mapping within these regions include SKP2, FOXA1, and AR. Furthermore, targeted mutational analysis of normal-tumor pairs identified somatic mutations in genes known to be associated with prostate cancer, including AR and TP53, and also revealed novel somatic point mutations in genes including MTOR, BRCA2, ARHGEF12, and CHD5. Finally, in one patient for whom multiple independent metastatic tumors were available, we show common and divergent somatic alterations occurring at both the copy number and point mutation levels, supporting a model of a common clonal progenitor with metastatic tumor-specific divergence. Our study represents a deep genomic analysis of advanced metastatic prostate tumors and reveals candidate somatic alterations possibly contributing to lethal prostate cancer.
ABSTRACT: SUMMARY: Large volumes of data generated by high-throughput sequencing instruments present non-trivial challenges in data storage, content access, and transfer. We present G-SQZ, a Huffman-coding-based, sequencing-read-specific representation scheme that compresses data without altering the relative order. G-SQZ has achieved 65% to 81% compression on benchmark datasets, and it allows selective access without scanning and decoding from the start. This article focuses on describing the underlying encoding scheme and its software implementation; the more theoretical problem of optimal compression is out of scope. The immediate practical benefits include reduced infrastructure and informatics costs in managing and analyzing large sequencing data. AVAILABILITY: http://public.tgen.org/sqz. Academic/non-profit: source code available at no cost under a non-open-source license upon request from the website; binaries available for direct download at no cost. For-profit: submit a request for a for-profit license via the website.
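G-SQZ's exact on-disk format is not described in the abstract; the sketch below only illustrates the Huffman-coding core on a toy nucleotide string (stdlib only), not the G-SQZ implementation:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code table {symbol: bitstring} from symbol
    frequencies; more frequent symbols receive shorter codes."""
    freq = Counter(text)
    if len(freq) == 1:  # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, unique tiebreak, {symbol: code-so-far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Prepend a bit for each branch of the merged subtree.
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

def encode(text, codes):
    return "".join(codes[s] for s in text)

reads = "ACGTACGTAAAACCCGGT"  # toy base-call stream
codes = huffman_codes(reads)
bits = encode(reads, codes)
```

Because Huffman codes are prefix-free and optimal among prefix codes, the encoded stream never exceeds a fixed-width 2-bits-per-base encoding, and decoding can proceed greedily bit by bit.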
ABSTRACT: As a first step in analyzing high-throughput data in genome-wide studies, several algorithms are available to identify and prioritize candidate lists for downstream fine-mapping. The prioritized candidates could be differentially expressed genes, aberrations in comparative genomic hybridization studies, or single nucleotide polymorphisms (SNPs) in association studies. Different analysis algorithms are subject to various experimental artifacts and analytical features that lead to different candidate lists. However, little research has been carried out to theoretically quantify the consensus between different candidate lists and to compare the study-specific accuracy of the analytical methods based on a known reference candidate list. Within the context of genome-wide studies, we propose a generic mathematical framework to statistically compare ranked lists of candidates from different algorithms with each other or, if available, with a reference candidate list. To cope with the growing need for intuitive visualization of high-throughput data in genome-wide studies, we describe a complementary customizable visualization tool. As a case study, we demonstrate application of our framework to the comparison and visualization of candidate lists generated in a DNA-pooling-based genome-wide association study of CEPH data in the HapMap project, where prior knowledge from individual genotyping can be used to generate a true reference candidate list. The results provide a theoretical basis to compare the accuracy of various methods and to identify redundant methods, thus providing guidance for selecting the most suitable analysis method in genome-wide studies.
Preview · Article · May 2009 · Journal of computational biology: a journal of computational molecular cell biology
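The framework's exact statistic is not given in the abstract above; one simple, commonly used way to compare two ranked candidate lists is the overlap of their top-k entries at each depth k. A minimal sketch with hypothetical SNP identifiers:

```python
def overlap_at_depth(list_a, list_b, k):
    """Fraction of the top-k candidates shared by two ranked lists."""
    return len(set(list_a[:k]) & set(list_b[:k])) / k

def overlap_curve(list_a, list_b):
    """Overlap at every depth; values near 1.0 throughout indicate
    high consensus between the two algorithms' rankings."""
    n = min(len(list_a), len(list_b))
    return [overlap_at_depth(list_a, list_b, k) for k in range(1, n + 1)]

# Hypothetical SNP rankings produced by two analysis algorithms.
alg1 = ["rs11", "rs42", "rs07", "rs99", "rs23"]
alg2 = ["rs42", "rs11", "rs99", "rs07", "rs55"]
curve = overlap_curve(alg1, alg2)
```

A reference candidate list, where available, can simply be substituted for one of the two lists to measure study-specific accuracy rather than consensus.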
ABSTRACT: High-throughput distributed data analysis based on clustered computing is gaining increasing importance in the field of computational biology. This paper describes a parallel programming approach and its software implementation using the Message Passing Interface (MPI) to parallelize a computationally intensive algorithm for identifying cellular contexts. We report a successful implementation on a 1,024-processor Beowulf cluster to analyze microarray data consisting of hundreds of thousands of measurements from different datasets. Detailed performance evaluation shows that data analysis that could have taken months on a stand-alone computer was accomplished in less than a day.
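As a rough illustration of the scatter/gather pattern underlying such MPI programs (not the authors' code), the block-partitioning logic that assigns data rows to processors can be sketched as:

```python
def partition(n_items, n_workers):
    """Split n_items as evenly as possible across n_workers,
    mirroring an MPI scatter: rank r processes the rows in blocks[r]."""
    base, extra = divmod(n_items, n_workers)
    blocks, start = [], 0
    for r in range(n_workers):
        size = base + (1 if r < extra else 0)  # first `extra` ranks get one more
        blocks.append(range(start, start + size))
        start += size
    return blocks

def gather(results_per_worker):
    """Concatenate per-worker results back into global row order,
    mirroring an MPI gather on the root rank."""
    return [x for block in results_per_worker for x in block]

# 10 expression profiles across 4 workers -> blocks of 3, 3, 2, 2 rows.
blocks = partition(10, 4)
```

In the MPI setting each rank would run the expensive per-row computation on its own block concurrently; the even split keeps load balanced when rows cost roughly the same.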
ABSTRACT: This chapter presents PAGE, a parallel program for analyzing the codetermination of gene transcriptional states from large-scale simultaneous gene expression measurements with cDNA microarrays, and its application to a large set of genes. Using PAGE, it was possible to compute coefficients of determination for all possible three-predictor sets from 587 genes for 58 targets in a reasonable amount of time. Given the limited sample sizes currently being used for microarray analysis, it is not necessary to go beyond three predictors at this time, since the data are insufficient for four-predictor CoD estimation. As shown in Tables 13.1, 13.2, and 13.3, significant speedups are achieved by the parallelization when compared to the sequential version of the program modules.
A related data visualization program, VOGE, helps geneticists navigate, manipulate, and interpret the massive amount of computationally derived information produced by PAGE. Tools provided in VOGE enhance the ability of researchers to interpret and apply the computationally derived information for understanding the functional roles of genes.
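The coefficient of determination (CoD) that PAGE computes in bulk can be illustrated on a toy discretized expression matrix. This is a minimal sketch of the statistic itself, not the PAGE implementation:

```python
from collections import Counter, defaultdict

def cod(predictor_cols, target_col, data):
    """Coefficient of determination (CoD) of a predictor-gene set for a
    target gene over discretized expression data:
        CoD = (e0 - e_opt) / e0,
    where e0 is the error of the best constant prediction and e_opt the
    error of the best lookup-table predictor on the predictor genes."""
    targets = [row[target_col] for row in data]
    n = len(data)
    e0 = 1 - Counter(targets).most_common(1)[0][1] / n
    if e0 == 0:
        return 1.0  # convention: the target is already constant
    table = defaultdict(Counter)
    for row in data:
        key = tuple(row[c] for c in predictor_cols)
        table[key][row[target_col]] += 1
    # Best achievable prediction per observed predictor pattern.
    correct = sum(c.most_common(1)[0][1] for c in table.values())
    e_opt = 1 - correct / n
    return (e0 - e_opt) / e0

# Toy binarized expression matrix: columns = genes, rows = samples.
# Gene 2 equals the XOR of genes 0 and 1, so the pair predicts it perfectly
# while either gene alone predicts nothing.
data = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0),
        (0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
score = cod((0, 1), 2, data)
```

PAGE's job is precisely the combinatorial part: evaluating this quantity for every three-predictor set over hundreds of genes, which is why parallelization pays off.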
ABSTRACT: A cluster operator takes a set of data points and partitions the points into clusters (subsets). As with any scientific model, the scientific content of a cluster operator lies in its ability to predict results. This ability is measured by its error rate relative to cluster formation. To estimate the error of a cluster operator, a sample of point sets is generated, the algorithm is applied to each point set, the resulting clusters are evaluated relative to the partition known from the generating distributions, and the errors are then averaged over the point sets composing the sample. Many validity measures have been proposed for evaluating clustering results based on a single realization of the random-point-set process. In this paper we consider a number of proposed validity measures and examine how well they correlate with error rates across a number of clustering algorithms and random-point-set models. Validity measures fall broadly into three classes: internal validation is based on calculating properties of the resulting clusters; relative validation is based on comparisons of partitions generated by the same algorithm with different parameters or different subsets of the data; and external validation compares the partition generated by the clustering algorithm with a given partition of the data. To quantify the degree of similarity between the validation indices and the clustering errors, we use Kendall's rank correlation between their values. Our results indicate that, overall, the performance of validity indices is highly variable. For complex models, or when a clustering algorithm yields complex clusters, both the internal and relative indices fail to predict the error of the algorithm. Some external indices appear to perform well, whereas others do not.
We conclude that one should not put much faith in a validity score unless there is evidence, either in terms of sufficient data for model estimation or prior model knowledge, that the validity measure is well correlated with the error rate of the clustering algorithm.
Full-text · Article · Mar 2007 · Pattern Recognition
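Kendall's rank correlation, used above to relate validity indices to clustering errors, has a direct O(n²) definition (tau-a, assuming no ties). A minimal sketch with hypothetical index and error values:

```python
def kendall_tau(x, y):
    """Kendall's tau-a rank correlation (assumes no ties):
    (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                concordant += 1  # pair ordered the same way in x and y
            elif s < 0:
                discordant += 1  # pair ordered oppositely
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical validity-index values vs. clustering error rates for five
# algorithm/model settings: the index falls exactly as the error rises.
validity = [0.9, 0.7, 0.6, 0.4, 0.2]
error = [0.05, 0.10, 0.20, 0.25, 0.40]
tau = kendall_tau(validity, error)
```

A tau near -1 (as here) means a higher index reliably signals lower error; the paper's finding is that for many index/algorithm/model combinations tau is nowhere near that ideal.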
ABSTRACT: Bias and variance for small-sample error estimation are typically posed in terms of statistics for the distributions of the true and estimated errors. On the other hand, a salient practical issue asks: given an error estimate, what can be said about the true error? This question relates to the joint distribution of the true and estimated errors, specifically, the conditional expectation of the true error given the error estimate. A critical issue is that of confidence bounds for the true error given the estimate. We consider the joint distribution of the true error and the estimated error, assuming a random feature-label distribution. From it, we derive the marginal distributions, the conditional expectation of the estimated error given the true error, the conditional expectation of the true error given the estimated error, the conditional variance of the true error given the estimated error, and the 95% upper confidence bound for the true error given the estimated error. Numerous classification and estimation rules are considered across a number of models. Massive simulation is used for continuous models, and analytic results are derived for discrete classification. We also consider a breast-cancer study to illustrate how the theory might be applied in practice. Although specific results depend on the classification rule, error-estimation rule, and model, some general trends are seen: (I) if the true error is small (large), then the conditional estimated error is generally biased high (low); (II) the conditional expected true error tends to be larger (smaller) than the estimated error for small (large) estimated errors; and (III) the confidence bounds tend to be well above the estimated error for low error estimates, becoming much less so for large estimates.
Full-text · Article · Jan 2007 · Technology in cancer research & treatment
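The joint behavior of true and estimated errors can be illustrated with a small Monte Carlo sketch: a nearest-mean threshold classifier on two unit-variance Gaussian classes, with resubstitution as the error estimator. This toy setup is an assumption for illustration, not one of the paper's models:

```python
import math
import random

def simulate(n_train=10, n_reps=2000, sep=1.0, seed=1):
    """For each repetition, draw a small sample from two unit-variance
    Gaussian classes (means -sep/2 and +sep/2), fit a midpoint
    threshold, and record (resubstitution estimate, exact true error)."""
    rng = random.Random(seed)
    phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF
    pairs = []
    for _ in range(n_reps):
        x0 = [rng.gauss(-sep / 2, 1) for _ in range(n_train)]
        x1 = [rng.gauss(+sep / 2, 1) for _ in range(n_train)]
        t = (sum(x0) + sum(x1)) / (2 * n_train)  # midpoint of class means
        # Resubstitution: error counted on the training data itself.
        resub = (sum(x >= t for x in x0) + sum(x < t for x in x1)) / (2 * n_train)
        # Exact true error of threshold t under the known class densities.
        true = 0.5 * (1 - phi(t + sep / 2)) + 0.5 * phi(t - sep / 2)
        pairs.append((resub, true))
    return pairs

pairs = simulate()
mean_est = sum(e for e, _ in pairs) / len(pairs)
mean_true = sum(t for _, t in pairs) / len(pairs)

def conditional_mean(pairs, lo, hi):
    """Crude estimate of E[true error | estimate in [lo, hi)] by binning."""
    sel = [t for (e, t) in pairs if lo <= e < hi]
    return sum(sel) / len(sel) if sel else None

low_bin = conditional_mean(pairs, 0.0, 0.2)
```

Binning the pairs by the estimate approximates the conditional expectation of the true error given the estimate; in this setting the true error can never fall below the Bayes error, so low estimates necessarily sit well below the conditional expected true error, matching trend (II).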
ABSTRACT: In this paper, we consider the joint distribution of the true error and the estimated error, assuming a random feature-label distribution. From it, we derive the conditional expectation of the true error and the 95% upper confidence bound for the true error given the estimated error. Numerous classification and estimation rules are considered across a number of models. Although specific results depend on the classification rule, error-estimation rule, and model, some general trends are seen: (1) the conditional expected true error is larger (smaller) than the estimated error for small (large) estimated errors; and (2) the confidence bounds tend to be well above the estimated error for low error estimates, becoming much less so for large estimates.
ABSTRACT: When using cDNA microarrays, normalization to correct labeling bias is a common preliminary step before further data analysis is applied, its objective being to reduce the variation between arrays. To date, assessment of the effectiveness of normalization has mainly been confined to the ability to detect differentially expressed genes. Since a major use of microarrays is expression-based phenotype classification, it is important to evaluate microarray normalization procedures relative to classification. Using a model-based approach, we model the systematic-error process to generate synthetic gene-expression values with known ground truth. These synthetic expression values are subjected to typical normalization methods and passed through a set of classification rules, the objective being to carry out a systematic study of the effect of normalization on classification. Three normalization methods are considered: offset, linear regression, and Lowess regression. Seven classification rules are considered: 3-nearest neighbor, linear support vector machine, linear discriminant analysis, regular histogram, Gaussian kernel, perceptron, and multiple perceptron with majority voting. The results for the first three are presented in the paper, with the full results being given on a complementary website. The conclusion from the different experimental models considered in the study is that normalization can have a significant benefit for classification under difficult experimental conditions, with linear and Lowess regression slightly outperforming the offset method.
Full-text · Article · Feb 2006 · EURASIP Journal on Bioinformatics and Systems Biology
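The offset and linear-regression normalization methods named above can be sketched generically (Lowess replaces the single global linear fit with local fits). Synthetic spots with a purely linear intensity-dependent bias; not the authors' code:

```python
def offset_normalize(log_ratios):
    """Offset method: subtract the mean log-ratio so the array is
    centered at zero (most genes assumed non-differential)."""
    mu = sum(log_ratios) / len(log_ratios)
    return [m - mu for m in log_ratios]

def regression_normalize(intensities, log_ratios):
    """Linear-regression method: fit M = a + b*A by least squares and
    subtract the fit, removing intensity-dependent dye bias."""
    n = len(log_ratios)
    mean_a = sum(intensities) / n
    mean_m = sum(log_ratios) / n
    sxx = sum((a - mean_a) ** 2 for a in intensities)
    sxy = sum((a - mean_a) * (m - mean_m)
              for a, m in zip(intensities, log_ratios))
    b = sxy / sxx
    a0 = mean_m - b * mean_a
    return [m - (a0 + b * a) for a, m in zip(intensities, log_ratios)]

# Synthetic spots whose log-ratio bias is exactly M = 0.5 + 0.1*A,
# where A is spot intensity: regression removes it, offset cannot.
A = [1.0, 2.0, 3.0, 4.0, 5.0]
M = [0.5 + 0.1 * a for a in A]
centered = offset_normalize(M)
residual = regression_normalize(A, M)
```

The example makes the paper's ranking plausible: the offset method only recenters the array and leaves the intensity-dependent trend intact, whereas the regression fit removes it entirely.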
ABSTRACT: Prostate cancer represents a significant worldwide public health burden. Epidemiological and genetic epidemiological studies have consistently provided data supporting the existence of inherited prostate cancer susceptibility genes. Segregation analyses of prostate cancer suggest that a multigene model may best explain familial clustering of this disease. Therefore, modeling gene-gene interactions in linkage analysis may improve the power to detect chromosomal regions harboring these disease susceptibility genes. In this study, we systematically screened for prostate cancer linkage by modeling two-locus gene-gene interactions for all possible pairs of loci across the genome in 426 prostate cancer families from Johns Hopkins Hospital, the University of Michigan, the University of Umeå, and the University of Tampere. We found suggestive evidence for an epistatic interaction for six sets of loci (target chromosome-wide / reference marker-specific P ≤ 0.0001). Evidence for these interactions was found in two independent subsets of the 426 families. While the validity of these results requires confirmation from independent studies and the identification of the specific genes underlying this linkage evidence, our approach of systematically assessing gene-gene interactions across the entire genome represents a promising alternative approach to gene identification for prostate cancer.
ABSTRACT: When using cDNA microarrays, normalization to correct biases is a common preliminary step before carrying out any data analysis, its objective being to reduce the systematic variation between arrays. The biases are due to various systematic factors: scanner settings, the amount of mRNA in the sample pool, and dye response characteristics between the channels. Since expression-based phenotype classification is a major use of microarrays, it is important to evaluate microarray normalization procedures relative to classification. Using a model-based approach, we model the systematic-error process to generate synthetic gene-expression values with known ground truth. Three normalization methods and three classification rules are then considered. Our simulation shows that normalization can have a significant benefit for classification under difficult experimental conditions.
ABSTRACT: It is widely hypothesized that the interactions of multiple genes influence individual risk of prostate cancer. However, current efforts at identifying prostate cancer risk genes primarily rely on single-gene approaches. In an attempt to fill this gap, we carried out a study to explore the joint effect of multiple genes in the inflammation pathway on prostate cancer risk. We studied 20 genes in the Toll-like receptor signaling pathway as well as several cytokines. For each of these genes, we selected and genotyped haplotype-tagging single nucleotide polymorphisms (SNPs) in 1,383 cases and 780 controls from the CAPS (CAncer Prostate in Sweden) study population. A total of 57 SNPs were included in the final analysis. A data mining method, multifactor dimensionality reduction, was used to explore the interaction effects of SNPs on prostate cancer risk. Interaction effects were assessed for all possible n-SNP combinations, where n = 2, 3, or 4. For each n, the model providing the lowest prediction error among 100 cross-validations was chosen. The statistical significance levels of the best models for each n were determined using permutation tests. A four-SNP interaction (one SNP each from IL-10, IL-1RN, TIRAP, and TLR5) had the lowest prediction error (43.28%, P = 0.019). Our ability to analyze a large number of SNPs in a large sample is one of the first efforts in exploring the effect of high-order gene-gene interactions on prostate cancer risk, and an important contribution to this new and quickly evolving field.
Preview · Article · Dec 2005 · Cancer Epidemiology Biomarkers & Prevention
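The core multifactor dimensionality reduction (MDR) step, pooling multilocus genotype cells into high- and low-risk groups by their case:control ratio, can be sketched as follows. Tie handling and the surrounding cross-validation and permutation machinery vary by implementation and are omitted here:

```python
from collections import defaultdict

def mdr_error(genotypes, labels, snp_idx):
    """One MDR evaluation for a given SNP combination: label each
    multilocus genotype cell high-risk if its case:control ratio
    exceeds the overall ratio, then return the classification error
    of the resulting high/low-risk rule on the same data."""
    cells = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for g, y in zip(genotypes, labels):
        cells[tuple(g[i] for i in snp_idx)][y] += 1
    n_cases = sum(labels)
    n_controls = len(labels) - n_cases
    threshold = n_cases / n_controls
    errors = 0
    for ctrl, case in cells.values():
        # Cells with no controls default to high-risk; ties go low-risk here.
        high_risk = (case / ctrl) > threshold if ctrl else True
        # High-risk cells predict "case", so their controls are errors,
        # and vice versa for low-risk cells.
        errors += ctrl if high_risk else case
    return errors / len(labels)

# Toy data, 2 SNPs coded 0/1/2: risk is purely epistatic, driven by the
# combination of SNPs 0 and 1 (neither SNP is informative alone).
genotypes = [(0, 0), (0, 0), (2, 2), (2, 2), (0, 2), (2, 0), (0, 2), (2, 0)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = case
err_pair = mdr_error(genotypes, labels, (0, 1))
err_single = mdr_error(genotypes, labels, (0,))
```

The epistatic toy pattern shows why MDR searches over combinations: the two-SNP model classifies perfectly while each single SNP performs no better than chance.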
ABSTRACT: Given a large set of potential features, it is usually necessary to find a small subset with which to classify. The task of finding an optimal feature set is inherently combinatorial, and therefore suboptimal algorithms are typically used to find feature sets. If feature selection is based directly on classification error, then a feature-selection algorithm must base its decisions on error estimates. This paper addresses the impact of error estimation on feature selection using two performance measures: comparison of the true error of the optimal feature set with the true error of the feature set found by a feature-selection algorithm, and the number of features among the truly optimal feature set that appear in the feature set found by the algorithm. The study considers seven error estimators applied to three standard suboptimal feature-selection algorithms and exhaustive search, and it considers three different feature-label model distributions. It draws two conclusions for the cases considered: (1) depending on the sample size and the classification rule, feature-selection algorithms can produce feature sets whose corresponding classifiers possess errors far in excess of the classifier corresponding to the optimal feature set; and (2) for small samples, differences in performance among the feature-selection algorithms are less significant than performance differences among the error estimators used to implement the algorithms. Moreover, keeping in mind that results depend on the particular classifier-distribution pair, for the error estimators considered in this study, bootstrap and bolstered resubstitution usually outperform cross-validation, and bolstered resubstitution usually performs as well as or better than bootstrap.
Full-text · Article · Dec 2005 · Pattern Recognition
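A minimal skeleton of one standard suboptimal algorithm studied in such work, sequential forward selection driven by an error estimator, is sketched below. The estimator here is a hypothetical fixed table standing in for, e.g., bolstered resubstitution computed on real data:

```python
def forward_selection(features, error_estimate, k):
    """Sequential forward selection (SFS): starting from the empty set,
    greedily add the feature whose inclusion yields the lowest
    estimated error, until k features are selected."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = min(remaining, key=lambda f: error_estimate(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical error estimates for feature subsets (indices 0-2); any
# subset not listed is treated as poor. In practice these numbers come
# from an error estimator such as cross-validation or bolstering.
scores = {
    (0,): 0.30, (1,): 0.25, (2,): 0.28,
    (1, 0): 0.20, (1, 2): 0.15,
}
est = lambda s: scores.get(tuple(s), 0.5)
chosen = forward_selection([0, 1, 2], est, k=2)
```

The paper's point is visible in the structure: every greedy decision flows through `error_estimate`, so a noisy small-sample estimator can steer the search to a badly suboptimal set regardless of which search algorithm wraps it.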
ABSTRACT: cDNA microarray technology allows us to estimate the expression of thousands of genes in a given tissue. It is natural, then, to use such information to classify different cell states, such as healthy or diseased, or one particular type of cancer or another. However, the number of microarray samples is usually very small, leading to a classification problem with only tens of samples and thousands of features. Recently, Kim et al. proposed using a parameterized distribution based on the original sample set as a way to attenuate this difficulty. Genes that contribute to good classifiers in such a setting are called strong. In this paper, we investigate how to use feature selection techniques to speed up the quest for strong genes. The idea is to use a feature selection algorithm to filter the gene set before applying the original strong-feature technique, which is based on a combinatorial search. The filtering helps us find very good strong gene sets without resorting to supercomputers. We have tested several filter options and compared the strong genes obtained with those found by the original full combinatorial search.
ABSTRACT: Motivation: Given the joint feature-label distribution, increasing the number of features always results in decreased classification error; however, this is not the case when a classifier is designed via a classification rule from sample data. Typically (but not always), for fixed sample size, the error of a designed classifier decreases and then increases as the number of features grows. The potential downside of using too many features is most critical for small samples, which are commonplace for gene-expression-based classifiers for phenotype discrimination. For fixed sample size and feature-label distribution, the issue is to find an optimal number of features. Results: Since only in rare cases is there a known distribution of the error as a function of the number of features and sample size, this study employs simulation for various feature-label distributions and classification rules, across a wide range of sample and feature-set sizes. To achieve the desired end, finding the optimal number of features as a function of sample size, it employs massively parallel computation. Seven classifiers are treated: 3-nearest-neighbor, Gaussian kernel, linear support vector machine, polynomial support vector machine, perceptron, regular histogram, and linear discriminant analysis. Three Gaussian-based models are considered: linear, nonlinear, and bimodal. In addition, real patient data from a large breast-cancer study are considered. To mitigate the combinatorial search for finding optimal feature sets, and to model the situation in which subsets of genes are co-regulated and correlation is internal to these subsets, we assume that the covariance matrix of the features is blocked, with each block corresponding to a group of correlated features. Altogether there are a large number of error surfaces for the many cases.
These are provided in full on a companion website, which is meant to serve as a resource for those working with small-sample classification.
ABSTRACT: Network motifs have been demonstrated to be the building blocks of many biological networks, such as transcriptional regulatory networks. Finding network motifs plays a key role in understanding system-level functions and the design principles of molecular interactions. In this paper, we present a novel definition of the neighborhood of a node. Based on this concept, we formally define and present an effective algorithm for finding network motifs. The method seeks a neighborhood assignment for each node such that the induced neighborhoods are partitioned with no overlap. We then present a parallel algorithm to find network motifs using a parallel cluster. The algorithm is applied to an E. coli transcriptional regulatory network to find motifs of size up to six. Compared with previous algorithms, our algorithm performs better in terms of running time and precision. Based on the motifs found in the network, we further analyze their topology and coverage. The results suggest that a small number of key motifs can form motifs of larger size. Also, some motifs exhibit a correlation with complex functions. This study presents a framework for detecting the most significant recurring subgraph patterns in transcriptional regulatory networks.
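For a concrete sense of what motif finding computes, here is a brute-force count of feed-forward loops (a classic three-node motif in transcriptional regulatory networks) on a toy directed graph. The paper's neighborhood-partition algorithm is designed to scale far beyond this O(n³) enumeration:

```python
from itertools import permutations

def count_feed_forward_loops(edges):
    """Count feed-forward loops (FFLs), the 3-node motif in which a
    regulator X controls a target Z both directly and via an
    intermediate Y: edges X->Y, Y->Z, and X->Z."""
    edge_set = set(edges)
    nodes = {n for e in edges for n in e}
    count = 0
    # Brute force: test every ordered node triple against the FFL pattern.
    for x, y, z in permutations(nodes, 3):
        if (x, y) in edge_set and (y, z) in edge_set and (x, z) in edge_set:
            count += 1
    return count

# Toy regulatory network with exactly one FFL:
# A regulates B and C, and B also regulates C.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
n_ffl = count_feed_forward_loops(edges)
```

A full motif-finding pipeline additionally compares such counts against randomized networks with the same degree sequence to decide which subgraph patterns are statistically significant.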