ABSTRACT: Accurate identification of the primary tumour in cancer of unknown primary (CUP) is required for effective treatment selection and improved patient outcomes. The aim of this study was to develop and validate a gene expression tumour classifier and integrate it with histopathology to identify the likely site of origin in CUP. RNA was extracted from 450 formalin-fixed, paraffin-embedded samples of known origin comprising 18 tumour groups. Whole genome expression analysis was performed using a bead-based array. Classification of the tumours made use of a binary support vector machine, together with recursive feature elimination. A hierarchical tumour classifier was developed and incorporated with conventional histopathology to identify the origins of metastatic tumours. The classifier demonstrated an accuracy of 88% for correctly predicting the tumour type on a validation set of known tumours (n = 94). For CUP samples (n = 49) having a final clinical diagnosis, the classifier improved the accuracy of histology alone for both single and multiple predictions. Furthermore, where histology alone could not suggest any specific diagnosis, the classifier was able to correctly predict the primary site of origin. We demonstrate the integration of gene expression profiling with conventional histopathology to aid the investigation of CUP.
ABSTRACT: Availability: http://bioinformatics.research.nicta.com.au/software/gwisfi/
Epistatic interactions between genes are believed to be a critical component in the genetic architecture of complex diseases. Genome Wide Association Studies (GWAS) may be able to detect such genetic interactions indirectly, via the identification of associated SNP markers. Major obstacles to progress in this area are: the unknown nature of epistatic interactions, little understanding of the capabilities of different filtering methods, and the computational difficulties for exhaustive analysis. A common platform enabling various detection methods is needed to avoid practical issues such as software compatibility and portability, incompatible input and output formats and varying demands on computational resources.
We developed a highly optimised GPU system capable of exhaustively analysing all SNP pairs in typical GWAS data (0.5M SNPs, 5K samples) in a few minutes on a standard desktop computer. A number of programming elements provided by a functional interface can be used to construct user-defined statistical tests that efficiently score every SNP pair. As a proof of principle, we have implemented 8 methods from the literature via our interface. We have applied all of them using a single GPU to exhaustively scan the 7 popular WTCCC case-control GWAS datasets. We present timing results for these methods, both in their original software implementations and using our platform. Significant improvements in timing are observed: up to 10,000 times for CPU implementations of the popular FastEpistasis in PLINK, and up to 2 orders of magnitude for some GPU implementations in the literature. As an initial finding, we show plots of the overlaps between the lists of SNP pairs selected by the 8 algorithms for Type 2 Diabetes in the WTCCC data.
IEEE International Conference on Bioinformatics and Biomedicine, Belfast, UK; 11/2014
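The idea of a user-defined test that scores every SNP pair can be illustrated with a small CPU sketch. This is not the GPU system or its functional interface described in the abstract: the function names `pair_chi_square` and `exhaustive_scan`, the 0/1/2 genotype coding, and the choice of a plain chi-square statistic are all assumptions made for illustration only.

```python
from itertools import combinations

def pair_chi_square(snp_a, snp_b, labels):
    """Chi-square statistic of the 9x2 table: joint genotype vs case/control."""
    # counts[cell][label], where cell = 3 * genotype_a + genotype_b
    counts = [[0, 0] for _ in range(9)]
    for ga, gb, y in zip(snp_a, snp_b, labels):
        counts[3 * ga + gb][y] += 1
    n = len(labels)
    col_totals = [sum(c[0] for c in counts), sum(c[1] for c in counts)]
    stat = 0.0
    for cell in counts:
        row_total = cell[0] + cell[1]
        if row_total == 0:
            continue  # unobserved genotype combination contributes nothing
        for y in (0, 1):
            expected = row_total * col_totals[y] / n
            if expected > 0:
                stat += (cell[y] - expected) ** 2 / expected
    return stat

def exhaustive_scan(genotypes, labels, score=pair_chi_square):
    """Score every unordered SNP pair; return (score, i, j) sorted best-first."""
    results = [
        (score(genotypes[i], genotypes[j], labels), i, j)
        for i, j in combinations(range(len(genotypes)), 2)
    ]
    return sorted(results, reverse=True)

# Toy data (hypothetical): 3 SNPs x 8 samples, labels 0 = control, 1 = case.
genotypes = [
    [0, 0, 1, 1, 0, 0, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
]
# Case status is the XOR of SNP 0 and SNP 1, so neither is informative alone.
labels = [0, 1, 1, 0, 0, 1, 1, 0]
ranked = exhaustive_scan(genotypes, labels)
print(ranked[0])  # the top-ranked pair is (SNP 0, SNP 1)
```

In the toy data the only pair with a non-zero score is the purely interacting one, which is the kind of signal a single-SNP scan would miss. Passing a different `score` function swaps in another pairwise test without changing the scan loop.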
ABSTRACT: Epistasis has long been thought to contribute to the genetic aetiology of complex diseases, yet few robust epistatic interactions in humans have been detected. We have conducted exhaustive genome-wide scans for pairwise epistasis in five independent celiac disease (CeD) case-control studies, using a rapid model-free approach to examine over 500 billion SNP pairs in total. We found extensive epistasis within the MHC region with 5,359 statistically significant pairs achieving stringent replication criteria across multiple studies. These robust epistatic pairs partially tagged CeD risk HLA haplotypes, and replicable evidence for epistatic SNPs outside the MHC was not observed. Both within and between European populations, we observed striking consistency of epistatic models and epistatic model distribution, thus providing empirical estimates of their frequencies in a complex disease. Within the UK population, models of CeD comprised of both epistatic and additive single-SNP effects increased explained CeD variance by approximately 1% over those of single SNPs. Further analysis showed that additive SNP effects tag epistatic effects (and vice versa), sometimes involving SNPs separated by a megabase or more. These findings show that the genetic architecture of CeD consists of overlapping additive and epistatic components, indicating that the genetic architecture of CeD, and potentially other common autoimmune diseases, is more complex than previously thought.
ABSTRACT: In many prediction problems, it is beneficial to obtain confidence estimates for the classification output. We consider the problem of estimating confidence sets in multiclass classification of real-life datasets. Building on the theory of conformal predictors, we derive a class-conditional conformal predictor. This allows us to calibrate the confidence estimates in a class-specific fashion, resulting in a more precise control of the prediction error rate for each class. We show that the class-conditional conformal predictor is asymptotically valid, and demonstrate that it indeed provides better calibration and efficiency on benchmark digit recognition datasets. In addition, we apply the class-conditional conformal predictor to a biological dataset for predicting localizations of proteins in order to demonstrate its performance in bioinformatics applications.
Proceedings of the 2013 12th International Conference on Machine Learning and Applications - Volume 01; 12/2013
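A minimal sketch of class-specific calibration in a conformal predictor follows. It uses the standard label-conditional construction, in which the p-value for a candidate label is computed only from calibration examples of that class. It is not taken from the paper: the function names, the nonconformity scores, and the toy numbers are all hypothetical.

```python
from collections import defaultdict

def calibrate(scores, labels):
    """Group calibration nonconformity scores by their true class."""
    by_class = defaultdict(list)
    for s, y in zip(scores, labels):
        by_class[y].append(s)
    return dict(by_class)

def p_value(by_class, label, score):
    """Conformal p-value using only calibration examples of the candidate class."""
    cal = by_class[label]
    at_least = sum(1 for s in cal if s >= score)
    return (at_least + 1) / (len(cal) + 1)

def prediction_set(by_class, candidate_scores, epsilon):
    """Keep every label whose class-conditional p-value exceeds epsilon."""
    return {
        y for y, s in candidate_scores.items()
        if p_value(by_class, y, s) > epsilon
    }

# Hypothetical calibration data: higher score = less typical of its class.
cal_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
cal_labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]
by_class = calibrate(cal_scores, cal_labels)

# Nonconformity of one test example under each candidate label.
test_scores = {"a": 0.25, "b": 0.95}
result = prediction_set(by_class, test_scores, epsilon=0.2)
print(result)
```

Because each class has its own calibration pool, the error rate is controlled per class rather than only on average, which is the motivation the abstract gives for the class-conditional variant.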
ABSTRACT: Background
It has been hypothesized that multivariate analysis and systematic detection of epistatic interactions between explanatory genotyping variables may help resolve the problem of "missing heritability" currently observed in genome-wide association studies (GWAS). However, even the simplest bivariate analysis is still held back by significant statistical and computational challenges that are often addressed by reducing the set of analysed markers. Theoretically, it has been shown that combinations of loci may exist that show weak or no effects individually, but show significant (even complete) explanatory power over phenotype when combined. Reducing the set of analysed SNPs before bivariate analysis could easily omit such critical loci.
We have developed an exhaustive bivariate GWAS analysis methodology that yields a manageable subset of candidate marker pairs for subsequent analysis using other, often more computationally expensive techniques. Our model-free filtering approach is based on classification using ROC curve analysis, an alternative to much slower regression-based modelling techniques. Exhaustive analysis of studies containing approximately 450,000 SNPs and 5,000 samples requires only 2 hours using a desktop CPU or 13 minutes using a GPU (Graphics Processing Unit). We validate our methodology with analysis of simulated datasets as well as the seven Wellcome Trust Case-Control Consortium datasets that represent a wide range of real life GWAS challenges. We have identified SNP pairs that have considerably stronger association with disease than their individual component SNPs that often show negligible effect univariately. When compared against previously reported results in the literature, our methods re-detect most significant SNP-pairs and additionally detect many pairs absent from the literature that show strong association with disease. The high overlap suggests that our fast analysis could substitute for some slower alternatives.
We demonstrate that the proposed methodology is robust, fast and capable of exhaustive search for epistatic interactions using a standard desktop computer. First, our implementation is significantly faster than timings for comparable algorithms reported in the literature, especially as our method allows simultaneous use of multiple statistical filters with low computing time overhead. Second, for some diseases, we have identified hundreds of SNP pairs that pass formal multiple test (Bonferroni) correction and could form a rich source of hypotheses for follow-up analysis.
A web-based version of the software used for this analysis is available at http://bioinformatics.research.nicta.com.au/gwis.
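The scale of the exhaustive search can be checked with a short calculation from the figures quoted in the abstract (approximately 450,000 SNPs, 2 hours on a CPU, 13 minutes on a GPU). The significance level `alpha = 0.05` is an assumption; the abstract states only that Bonferroni correction was applied.

```python
from math import comb

n_snps = 450_000                 # approximate SNP count from the abstract
n_pairs = comb(n_snps, 2)        # number of unordered SNP pairs

gpu_seconds = 13 * 60            # 13 minutes on a GPU
cpu_seconds = 2 * 60 * 60        # 2 hours on a desktop CPU
gpu_rate = n_pairs / gpu_seconds # pairs scored per second on the GPU
cpu_rate = n_pairs / cpu_seconds # pairs scored per second on the CPU

alpha = 0.05                     # assumed family-wise error rate
bonferroni = alpha / n_pairs     # per-pair p-value threshold

print(f"pairs:      {n_pairs:,}")
print(f"GPU rate:   {gpu_rate:.3g} pairs/s")
print(f"CPU rate:   {cpu_rate:.3g} pairs/s")
print(f"Bonferroni: {bonferroni:.3g}")
```

Roughly 10^11 pairs must be scored, which implies on the order of 10^8 pairs per second on the GPU and a per-pair threshold near 5 x 10^-13 under the assumed alpha.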
ABSTRACT: Many different microarray experiments are publicly available today. It is natural to ask whether different experiments for the same phenotypic conditions can be combined using meta-analysis, in order to increase the overall sample size. However, some genes are not measured in all experiments, hence they cannot be included or their statistical significance cannot be appropriately estimated in traditional meta-analysis. Nonetheless, these genes, which we refer to as incomplete genes, may also be informative and useful.
We propose a meta-analysis framework, called "Incomplete Gene Meta-analysis", which can include incomplete genes by imputing the significance of missing replicates, and computing a meta-score for every gene across all datasets. We demonstrate that the incomplete genes are worthy of being included and our method is able to appropriately estimate their significance in two groups of experiments. We first apply the Incomplete Gene Meta-analysis and several comparable methods to five breast cancer datasets with an identical set of probes. We simulate incomplete genes by randomly removing a subset of probes from each dataset and demonstrate that our method consistently outperforms two other methods in terms of their false discovery rate. We also apply the methods to three gastric cancer datasets for the purpose of discriminating diffuse and intestinal subtypes.
Meta-analysis is an effective approach that identifies more robust sets of differentially expressed genes from multiple studies. The incomplete genes, which mainly arise from the use of different platforms, may also have statistical and biological importance, but previous studies either ignore them or do not handle them appropriately. Our Incomplete Gene Meta-analysis is able to incorporate the incomplete genes by estimating their significance. The results on both breast and gastric cancer datasets suggest that the highly ranked genes and associated GO terms produced by our method are more significant and biologically meaningful according to the previous literature.
ABSTRACT: In the study of cancer genomics, gene expression microarrays, which measure thousands of genes in a single assay, provide abundant information for the investigation of interesting genes or biological pathways. However, in order to analyze the large number of noisy measurements in microarrays, effective and efficient bioinformatics techniques are needed to identify the associations between genes and relevant phenotypes. Moreover, systematic tests are needed to validate the statistical and biological significance of those discoveries.
In this paper, we develop a robust and efficient method for exploratory analysis of microarray data, which produces a number of different orderings (rankings) of both genes and samples (reflecting correlation among those genes and samples). The core algorithm is closely related to biclustering, and so we first compare its performance with several existing biclustering algorithms on two real datasets - gastric cancer and lymphoma datasets. We then show on the gastric cancer data that the sample orderings generated by our method are highly statistically significant with respect to the histological classification of samples by using the Jonckheere trend test, while the gene modules are biologically significant with respect to biological processes (from the Gene Ontology). In particular, some of the gene modules associated with biclusters are closely linked to gastric cancer tumorigenesis reported in previous literature, while others are potentially novel discoveries.
In conclusion, we have developed an effective and efficient method, Bi-Ordering Analysis, to detect informative patterns in gene expression microarrays by ranking genes and samples. In addition, a number of evaluation metrics were applied to assess both the statistical and biological significance of the resulting bi-orderings. The methodology was validated on gastric cancer and lymphoma datasets.
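The Jonckheere trend test mentioned above asks whether values tend to increase across ordered groups. A minimal sketch of the Jonckheere-Terpstra statistic with a normal approximation is given below. It assumes no tie correction in the variance, and the mapping of histological classes to groups and of sample-ordering positions to values is an assumption about how the test might be applied here, not the paper's procedure.

```python
from math import sqrt
from statistics import NormalDist

def jonckheere_terpstra(groups):
    """Jonckheere-Terpstra test for an increasing trend across ordered groups.

    groups: list of lists, ordered by the hypothesised trend.
    Returns (J, z, one-sided p) using the normal approximation
    without a tie correction in the variance.
    """
    j_stat = 0.0
    for a in range(len(groups)):
        for b in range(a + 1, len(groups)):
            for x in groups[a]:
                for y in groups[b]:
                    if x < y:
                        j_stat += 1.0   # pair agrees with the trend
                    elif x == y:
                        j_stat += 0.5   # ties count as half
    sizes = [len(g) for g in groups]
    n = sum(sizes)
    mean = (n * n - sum(m * m for m in sizes)) / 4
    var = (n * n * (2 * n + 3) - sum(m * m * (2 * m + 3) for m in sizes)) / 72
    z = (j_stat - mean) / sqrt(var)
    p = 1.0 - NormalDist().cdf(z)
    return j_stat, z, p

# Hypothetical example: positions of samples in an ordering,
# grouped by three ordered histological classes.
positions_by_class = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
j_stat, z, p = jonckheere_terpstra(positions_by_class)
print(j_stat, z, p)
```

A perfectly sorted ordering, as in the toy input, gives the maximum possible J; an ordering unrelated to the classes would give J near its null mean and a large p-value.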