Fan Shi

University of Melbourne, Melbourne, Victoria, Australia

Publications (5) · 13.1 Total impact

  • ABSTRACT: Accurate identification of the primary tumour in cancer of unknown primary (CUP) is required for effective treatment selection and improved patient outcomes. The aim of this study was to develop and validate a gene expression tumour classifier and integrate it with histopathology to identify the likely site of origin in CUP. RNA was extracted from 450 formalin fixed, paraffin embedded samples of known origin comprising 18 tumour groups. Whole genome expression analysis was performed using a bead-based array. Classification of the tumours made use of a binary support vector machine, together with recursive feature elimination. A hierarchical tumour classifier was developed and incorporated with conventional histopathology to identify the origins of metastatic tumours. The classifier demonstrated an accuracy of 88% for correctly predicting the tumour type on a validation set of known tumours (n = 94). For CUP samples (n = 49) having a final clinical diagnosis, the classifier improved the accuracy of histology alone for both single and multiple predictions. Furthermore, where histology alone could not suggest any specific diagnosis, the classifier was able to correctly predict the primary site of origin. We demonstrate the integration of gene expression profiling with conventional histopathology to aid the investigation of CUP.
    Pathology 12/2014; · 2.66 Impact Factor
  • ABSTRACT: In many prediction problems, it is beneficial to obtain confidence estimates for the classification output. We consider the problem of estimating confidence sets in multiclass classification of real life datasets. Building on the theory of conformal predictors, we derive a class-conditional conformal predictor. This allows us to calibrate the confidence estimates in a class specific fashion, resulting in a more precise control of the prediction error rate for each class. We show that the class-conditional conformal predictor is asymptotically valid, and demonstrate that it indeed provides better calibration and efficiency on benchmark digit recognition datasets. In addition, we apply the class-conditional conformal predictor to a biological dataset for predicting localizations of proteins in order to demonstrate its performance in bioinformatics applications.
    Proceedings of the 2013 12th International Conference on Machine Learning and Applications - Volume 01; 12/2013
  • ABSTRACT: Many different microarray experiments are publicly available today. It is natural to ask whether different experiments for the same phenotypic conditions can be combined using meta-analysis, in order to increase the overall sample size. However, some genes are not measured in all experiments, hence they cannot be included or their statistical significance cannot be appropriately estimated in traditional meta-analysis. Nonetheless, these genes, which we refer to as incomplete genes, may also be informative and useful. We propose a meta-analysis framework, called "Incomplete Gene Meta-analysis", which can include incomplete genes by imputing the significance of missing replicates, and computing a meta-score for every gene across all datasets. We demonstrate that the incomplete genes are worthy of being included and our method is able to appropriately estimate their significance in two groups of experiments. We first apply the Incomplete Gene Meta-analysis and several comparable methods to five breast cancer datasets with an identical set of probes. We simulate incomplete genes by randomly removing a subset of probes from each dataset and demonstrate that our method consistently outperforms two other methods in terms of their false discovery rate. We also apply the methods to three gastric cancer datasets for the purpose of discriminating diffuse and intestinal subtypes. Meta-analysis is an effective approach that identifies more robust sets of differentially expressed genes from multiple studies. The incomplete genes that mainly arise from the use of different platforms may also have statistical and biological importance but are ignored or are not appropriately involved by previous studies. Our Incomplete Gene Meta-analysis is able to incorporate the incomplete genes by estimating their significance. The results on both breast and gastric cancer datasets suggest that the highly ranked genes and associated GO terms produced by our method are more significant and biologically meaningful according to the previous literature.
    BMC Bioinformatics 03/2011; 12:84. · 3.02 Impact Factor
  • ABSTRACT: In the study of cancer genomics, gene expression microarrays, which measure thousands of genes in a single assay, provide abundant information for the investigation of interesting genes or biological pathways. However, in order to analyze the large number of noisy measurements in microarrays, effective and efficient bioinformatics techniques are needed to identify the associations between genes and relevant phenotypes. Moreover, systematic tests are needed to validate the statistical and biological significance of those discoveries. In this paper, we develop a robust and efficient method for exploratory analysis of microarray data, which produces a number of different orderings (rankings) of both genes and samples (reflecting correlation among those genes and samples). The core algorithm is closely related to biclustering, and so we first compare its performance with several existing biclustering algorithms on two real datasets - gastric cancer and lymphoma datasets. We then show on the gastric cancer data that the sample orderings generated by our method are highly statistically significant with respect to the histological classification of samples by using the Jonckheere trend test, while the gene modules are biologically significant with respect to biological processes (from the Gene Ontology). In particular, some of the gene modules associated with biclusters are closely linked to gastric cancer tumorigenesis reported in previous literature, while others are potentially novel discoveries. In conclusion, we have developed an effective and efficient method, Bi-Ordering Analysis, to detect informative patterns in gene expression microarrays by ranking genes and samples. In addition, a number of evaluation metrics were applied to assess both the statistical and biological significance of the resulting bi-orderings. The methodology was validated on gastric cancer and lymphoma datasets.
    BMC Bioinformatics 01/2010; 11:477. · 3.02 Impact Factor
  • ABSTRACT: Background: It has been hypothesized that multivariate analysis and systematic detection of epistatic interactions between explanatory genotyping variables may help resolve the problem of "missing heritability" currently observed in genome-wide association studies (GWAS). However, even the simplest bivariate analysis is still held back by significant statistical and computational challenges that are often addressed by reducing the set of analysed markers. Theoretically, it has been shown that combinations of loci may exist that show weak or no effects individually, but show significant (even complete) explanatory power over phenotype when combined. Reducing the set of analysed SNPs before bivariate analysis could easily omit such critical loci. Results: We have developed an exhaustive bivariate GWAS analysis methodology that yields a manageable subset of candidate marker pairs for subsequent analysis using other, often more computationally expensive techniques. Our model-free filtering approach is based on classification using ROC curve analysis, an alternative to much slower regression-based modelling techniques. Exhaustive analysis of studies containing approximately 450,000 SNPs and 5,000 samples requires only 2 hours using a desktop CPU or 13 minutes using a GPU (Graphics Processing Unit). We validate our methodology with analysis of simulated datasets as well as the seven Wellcome Trust Case-Control Consortium datasets that represent a wide range of real life GWAS challenges. We have identified SNP pairs that have considerably stronger association with disease than their individual component SNPs that often show negligible effect univariately. When compared against previously reported results in the literature, our methods re-detect most significant SNP-pairs and additionally detect many pairs absent from the literature that show strong association with disease. The high overlap suggests that our fast analysis could substitute for some slower alternatives. Conclusions: We demonstrate that the proposed methodology is robust, fast and capable of exhaustive search for epistatic interactions using a standard desktop computer. First, our implementation is significantly faster than timings for comparable algorithms reported in the literature, especially as our method allows simultaneous use of multiple statistical filters with low computing time overhead. Second, for some diseases, we have identified hundreds of SNP pairs that pass formal multiple test (Bonferroni) correction and could form a rich source of hypotheses for follow-up analysis. Availability: A web-based version of the software used for this analysis is available at http://bioinformatics.research.nicta.com.au/gwis.
    BMC Genomics 14(3). · 4.40 Impact Factor

Publication Stats

4 Citations
13.10 Total Impact Points

Institutions

  • 2011
    • University of Melbourne
      Melbourne, Victoria, Australia
    • Victoria University Melbourne
      Melbourne, Victoria, Australia
  • 2010
    • National ICT Australia Ltd
      Sydney, New South Wales, Australia