ABSTRACT: Selecting differentially expressed genes (DEGs) is one of the most important tasks in microarray applications for studying multi-factor diseases including cancers. However, the small samples typically used in current microarray studies may only partially reflect the widely altered gene expressions in complex diseases, which would introduce low reproducibility of gene lists selected by statistical methods. Here, by analyzing seven cancer datasets, we showed that, in each cancer, a wide range of functional modules have altered gene expressions and thus have high disease classification abilities. The results also showed that seven modules are shared across diverse cancers, suggesting hints about the common mechanisms of cancers. Therefore, instead of relying on a few individual genes whose selection is hardly reproducible in current microarray experiments, we may use functional modules as functional signatures to study core mechanisms of cancers and build robust diagnostic classifiers.
Science China. Life sciences 02/2011; 54(2):189-93. · 1.51 Impact Factor
ABSTRACT: By high-throughput screens of somatic mutations of genes in cancer genomes, hundreds of cancer genes are being rapidly identified, providing us with abundant information for systematically deciphering the genetic changes underlying cancer mechanisms. However, the functional collaboration of mutated genes is often neglected in current studies. Here, using four genome-wide somatic mutation data sets and pathways defined in various databases, we showed that gene pairs significantly comutated in cancer samples tend to distribute between pathways rather than within pathways. At the basic functional level of motifs in the human protein-protein interaction network, we also found that comutated gene pairs were overrepresented between motifs but extremely depleted within motifs. Specifically, we showed that, by using Gene Ontology, which describes gene functions at various levels of specificity, we could tackle the pathway definition problem to some degree and study the functional collaboration of gene mutations in cancer genomes more efficiently. Then, by defining pairs of pathways frequently linked by comutated gene pairs as the between-pathway models, we showed they are also likely to be codisrupted by mutations of the interpathway hubs of the coupled pathways, suggesting new hints for understanding the heterogeneous mechanisms of cancers. Finally, we showed that some between-pathway models consisting of important pathways such as cell cycle checkpoint and cell proliferation were codisrupted in most cancer samples under this study, suggesting that their codisruptions might be functionally essential in inducing these cancers. Altogether, our results would provide a channel to disentangle the complex collaboration of the molecular processes underlying cancer mechanisms.
Molecular Cancer Therapeutics 08/2010; 9(8):2186-95. · 5.60 Impact Factor
ABSTRACT: MOTIVATION: Studying the evolutionary conservation of cancer genes can improve our understanding of the genetic basis of human cancers. Functionally related proteins encoded by genes tend to interact with each other in a modular fashion, which may affect both the mode and tempo of their evolution. RESULTS: In the human PPI network, we searched for subnetworks within each of which all proteins have evolved at similar rates since the human and mouse split. Identified at a given co-evolving level, the subnetworks with non-randomly large sizes were defined as co-evolving modules. We showed that proteins within modules tend to be conserved, evolutionarily old and enriched with housekeeping genes, while proteins outside modules tend to be less-conserved, evolutionarily younger and enriched with genes expressed in specific tissues. Viewing cancer genes from co-evolving modules showed that the overall conservation of cancer genes should be mainly attributed to the cancer proteins enriched in the conserved modules. Functional analysis further suggested that cancer proteins within and outside modules might play different roles in carcinogenesis, providing a new hint for studying the mechanism of cancer.
ABSTRACT: Although novel technologies are rapidly emerging, the accumulated cDNA microarray data are, and will remain, an important source for bioinformatics and biological studies. Thus, the reliability and applicability of the cDNA microarray data warrant further evaluation. In cDNA microarrays, multiple clones are measured for a transcript, which can be exploited to evaluate the consistency of microarray data. We show that even for pairs of RCs, the average Pearson correlation coefficient of their measurements is not high. However, this low consistency could largely be explained by random noise signals for a fraction of unexpressed genes and/or low signal-to-noise ratios for low abundance transcripts. Encouragingly, a large fraction of inconsistent data will be filtered out in the procedure of selecting differentially expressed genes (DEGs). Therefore, although cDNA microarray data are of low consistency, applications based on DEG selection could still reach correct biological results, especially at the functional module level.
Omics: a journal of integrative biology 09/2009; 13(6):493-9. · 2.29 Impact Factor
ABSTRACT: MOTIVATION: According to current consistency metrics such as percentage of overlapping genes (POG), lists of differentially expressed genes (DEGs) detected from different microarray studies for a complex disease are often highly inconsistent. This irreproducibility problem also exists in other high-throughput post-genomic areas such as proteomics and metabolism. A complex disease is often characterized by many coordinated molecular changes, which should be considered when evaluating the reproducibility of discovery lists from different studies. RESULTS: We proposed the metrics percentage of overlapping genes-related (POGR) and normalized POGR (nPOGR) to evaluate the consistency between two DEG lists for a complex disease, considering correlated molecular changes rather than only counting gene overlaps between the lists. Based on microarray datasets of three diseases, we showed that though the POG scores for DEG lists from different studies for each disease are extremely low, the POGR and nPOGR scores can be rather high, suggesting that the apparently inconsistent DEG lists may be highly reproducible in the sense that they are actually significantly correlated. Comparing different discovery results for a disease by the POGR and nPOGR scores will reduce the uncertainty of the microarray studies. The proposed metrics could also be applicable in many other high-throughput post-genomic areas.
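As a rough illustration of the overlap-counting baseline mentioned in the abstract above, the sketch below computes a POG score for two DEG lists. It assumes the simplest reading of the metric (the fraction of one list that also appears in the other); the function name `pog` and the toy gene lists are hypothetical, and the POGR and nPOGR metrics are not implemented because their definitions are not given here.

```python
def pog(list_a, list_b):
    """Fraction of genes in list_a that also appear in list_b.

    Assumed definition for illustration only: shared genes divided by
    the length of the first list, so the score is not symmetric.
    """
    set_a, set_b = set(list_a), set(list_b)
    if not set_a:
        return 0.0
    return len(set_a & set_b) / len(set_a)


# Toy DEG lists from two hypothetical studies of the same disease.
study_1 = ["TP53", "MYC", "EGFR", "BRCA1"]
study_2 = ["MYC", "KRAS", "PTEN", "TP53", "CDK4"]

pog_12 = pog(study_1, study_2)  # 2 shared genes out of 4
pog_21 = pog(study_2, study_1)  # 2 shared genes out of 5
print(pog_12, pog_21)
```

Because the denominator is the length of the first list, the score depends on the direction of the comparison, which is why both directions are computed.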
ABSTRACT: Selecting differentially expressed genes (DEGs) is one of the most important tasks in microarray applications. However, the sample sizes typically used in current cancer studies may only partially reflect the widely altered gene expressions in cancers. By analyzing three large cancer datasets, we show that, in each cancer, a wide range of functional modules are altered and have high disease classification abilities. The results also show that modules shared across diverse cancers cover a wide range of functions, suggesting hints about the common mechanisms of cancers. Therefore, instead of relying on a few consensus individual genes whose selection is hardly reproducible in current microarray experiments, we may use functional modules as functional signatures to build robust diagnostic classifiers.
ABSTRACT: MOTIVATION: Differentially expressed gene (DEG) lists detected from different microarray studies for the same disease are often highly inconsistent. Even in technical replicate tests using identical samples, DEG detection still shows very low reproducibility. It is often believed that current small microarray studies will largely introduce false discoveries. RESULTS: Based on a statistical model, we show that even in technical replicate tests using identical samples, it is highly likely that the selected DEG lists will be very inconsistent in the presence of small measurement variations. Therefore, the apparently low reproducibility of DEG detection from current technical replicate tests does not indicate low quality of microarray technology. We also demonstrate that heterogeneous biological variations existing in real cancer data will further reduce the overall reproducibility of DEG detection. Nevertheless, in small subsamples from both simulated and real data, the actual false discovery rate (FDR) for each DEG list tends to be low, suggesting that each separately determined list may comprise mostly true DEGs. Rather than simply counting the overlaps of the discovery lists from different studies for a complex disease, novel metrics are needed for evaluating the reproducibility of discoveries characterized by correlated molecular changes. Supplementary information: Supplementary data are available at Bioinformatics online.
ABSTRACT: It is of great importance to identify new cancer genes from the data of large-scale genome screenings of gene mutations in cancers. Considering that alterations of some essential functions are indispensable for oncogenesis, we define them as cancer functions and select, as their approximations, a group of detailed functions in GO (Gene Ontology) highly enriched with known cancer genes. To evaluate the efficiency of using cancer functions as features to identify cancer genes, we define, in the screened genes, the known protein kinase cancer genes as gold standard positives and the other kinase genes as gold standard negatives. The results show that cancer-associated functions are more efficient in identifying cancer genes than the selection pressure feature. Furthermore, combining cancer functions with the number of non-silent mutations can generate more reliable positive predictions. Finally, with precision 0.42, we suggest a list of 46 kinase genes as candidate cancer genes which are annotated to cancer functions and carry at least 3 non-silent mutations.
Science in China Series C Life Sciences 07/2008; 51(6):569-74. · 1.61 Impact Factor
ABSTRACT: Selecting feature genes for disease prediction is one of the most important applications of microarray technology. However, gene lists obtained in different studies for the same clinical type of patients often differ widely and have few genes in common. Recent research suggests that gene lists ranked by fold change are more reproducible than those ranked by t-test. Here, based on the resampling method, we use training sets of different sizes to select features as top-ranked by P-value of t-test, d-value of SAM, and fold change. Then, we evaluate the stability and the disease classification power of each top-ranked gene list. Our results suggest that, for disease classification, gene lists selected through d-value ranking are the most suitable with respect to both reproducibility and classification power.
BioMedical Engineering and Informatics, 2008. BMEI 2008. International Conference on; 06/2008
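The ranking step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes log-scale expression values, uses the absolute Welch t-statistic as a stand-in for ranking by t-test P-value, and omits the SAM d-value and the resampling loop entirely. The function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def fold_change(case, control):
    # On log-scale data, fold change is the difference of group means.
    return mean(case) - mean(control)


def welch_t(case, control):
    # Welch t-statistic: mean difference scaled by its standard error.
    se = sqrt(variance(case) / len(case) + variance(control) / len(control))
    return (mean(case) - mean(control)) / se


def top_k(scores, k):
    # Rank genes by absolute score, largest first, and keep the top k.
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)[:k]


# Toy log2 expression values: gene -> (case samples, control samples).
data = {
    "A": ([8.0, 8.2, 7.9], [6.0, 6.1, 5.9]),  # large, consistent shift
    "B": ([5.0, 9.0, 7.0], [4.0, 5.0, 3.0]),  # large but noisy shift
    "C": ([6.1, 6.0, 6.2], [5.8, 5.7, 5.9]),  # small, consistent shift
}

fc_scores = {g: fold_change(*groups) for g, groups in data.items()}
t_scores = {g: welch_t(*groups) for g, groups in data.items()}

top_by_fc = top_k(fc_scores, 2)
top_by_t = top_k(t_scores, 2)
print(top_by_fc, top_by_t)
```

In this toy data the two criteria agree only on gene A: fold change favors the noisy gene B, while the t-statistic favors the small but consistent gene C, which is the kind of disagreement a stability comparison would measure.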
ABSTRACT: In microarray studies, numerous tools are available for functional enrichment analysis based on GO categories. Most of these tools require a prior threshold for designating genes as differentially expressed genes (DEGs); such threshold-dependent methods are often criticized because their results change with the threshold.
In the present article, by considering the inherent correlation structure of the GO categories, a continuous measure based on semantic similarity of GO categories is proposed to investigate the functional consistency (or stability) of threshold-dependent methods. The results from several datasets show that, when simply counting overlapping categories between two groups, the significant category groups selected under different DEG thresholds are seemingly very different. However, based on the semantic similarity measure proposed in this article, the results are rather functionally consistent for a wide range of DEG thresholds. Moreover, we find that the functional consistency of gene lists ranked by the SAM metric is relatively robust to changing DEG thresholds.
Source code in R is available on request from the authors.
ABSTRACT: Based on high-throughput data, numerous algorithms have been designed to find functions of novel proteins. However, the effectiveness of such algorithms is currently limited by some fundamental factors, including (1) the low a-priori probability of novel proteins participating in a detailed function; (2) the large amount of false data present in high-throughput datasets; (3) the incomplete data coverage of functional classes; (4) the abundant but heterogeneous negative samples for training the algorithms; and (5) the lack of detailed functional knowledge for training algorithms. Here, for partially characterized proteins, we suggest an approach to finding their finer functions based on protein interaction sub-networks or gene expression patterns, defined in function-specific subspaces. The proposed approach can lessen the above-mentioned problems by properly defining the prediction range and functionally filtering the noisy data, and thus can efficiently find proteins' novel functions. For thousands of partially characterized yeast and human proteins, it is able to reliably find their finer functions (e.g., the translational functions) with more than 90% precision. The predicted finer functions are highly valuable both for guiding the follow-up wet-lab validation and for providing the necessary data for training algorithms to learn other proteins.
Chinese Science Bulletin 11/2007; 52(24):3363-3370. · 1.37 Impact Factor
ABSTRACT: Current high-throughput protein-protein interaction (PPI) data do not provide information about the condition(s) under which the interactions occur. Thus, the identification of condition-responsive PPI sub-networks is of great importance for investigating how a living cell adapts to changing environments.
In this article, we propose a novel edge-based scoring and searching approach to extract a PPI sub-network responsive to conditions related to some investigated gene expression profiles. Using this approach, we construct a sub-network connected by the selected edges (interactions), instead of only a set of vertices (proteins) as in previous works. Furthermore, we suggest a systematic approach to evaluate the biological relevance of the identified responsive sub-network by its ability to capture condition-relevant functional modules. We apply the proposed method to analyze a human prostate cancer dataset and a yeast cell cycle dataset. The results demonstrate that the edge-based method is able to efficiently capture relevant protein interaction behaviors under the investigated conditions.
Supplementary data are available at Bioinformatics online.
ABSTRACT: Based on high-throughput data, numerous algorithms have been designed for finding functions of novel proteins. However, the effectiveness of such algorithms is currently limited by some fundamental factors including the low a-priori probability of novel proteins participating in a detailed function and the lack of detailed functional knowledge for training algorithms. For such partially characterized proteins, we suggest an approach to find their finer functions based on protein-protein interaction sub-networks, which can efficiently find proteins' novel functions. As an application, we find that finer functions can be predicted for 18 and 15 proteins currently annotated in "protein biosynthesis" and "translation" with more than 90% precision, respectively. The predicted finer functions are highly valuable both for guiding the follow-up wet-lab validation and for providing the necessary data for training algorithms to learn other proteins.
Bioinformatics and Biomedical Engineering, 2007. ICBBE 2007. The 1st International Conference on; 08/2007
ABSTRACT: BACKGROUND: Rapid progress in high-throughput biotechnologies (e.g. microarrays) and exponential accumulation of gene functional knowledge make it promising to systematically understand complex human diseases at the functional module level. Based on Gene Ontology, a large number of automatic tools have been developed for the functional analysis and biological interpretation of high-throughput microarray data. RESULTS: Unlike existing tools such as Onto-Express and FatiGO, we develop a tool named GO-2D for identifying 2-dimensional functional modules based on combined GO categories. For example, it refines biological process categories by sorting their genes into different cellular component categories, and then extracts those combined categories enriched with the genes of interest (e.g., the differentially expressed genes) for identifying the cellular-localized functional modules. Applications of GO-2D to the analyses of two human cancer datasets show that very specific disease-relevant processes can be identified by using cellular location information. CONCLUSION: For studying complex human diseases, GO-2D can extract functionally compact and detailed modules such as the cellular-localized ones, characterizing disease-relevant modules in terms of both biological processes and cellular locations. The application results clearly demonstrate that the 2-dimensional approach, complementary to the current 1-dimensional approach, is powerful for finding modules highly relevant to diseases.
ABSTRACT: Motivation: Personalized medicine based on molecular aspects of diseases, such as gene expression profiling, has become increasingly popular. However, one faces multiple challenges when analyzing clinical gene expression data; most of the well-known ...
ABSTRACT: Identifying disease-relevant genes and functional modules, based on gene expression profiles and gene functional knowledge, is of high importance for studying disease mechanisms and subtyping disease phenotypes. Using gene categories of biological process and cellular component in Gene Ontology, we propose an approach to selecting functional modules enriched with differentially expressed genes, and identifying the feature functional modules of high disease discriminating abilities. Using the differentially expressed genes in each feature module as the feature genes, we reveal the relevance of the modules to the studied diseases. Using three data-sets for prostate cancer, gastric cancer, and leukemia, we have demonstrated that the proposed modular approach is of high power in identifying functionally integrated feature gene subsets that are highly relevant to the disease mechanisms. Our analysis has also shown that the critical disease-relevant genes might be better recognized from the gene regulation network, which is constructed using the characterized functional modules, giving important clues to the concerted mechanisms of the modules responding to complex disease states. In addition, the proposed approach to selecting the disease-relevant genes by jointly considering the gene functional knowledge suggests a new way for precisely classifying disease samples with clear biological interpretations, which is critical for the clinical diagnosis and the elucidation of the pathogenic basis of complex
Chinese Science Bulletin 08/2006; 51(15):1848-1856. · 1.37 Impact Factor
ABSTRACT: Discovering molecular heterogeneities in phenotypically defined disease is of critical importance both for understanding pathogenic mechanisms of complex diseases and for finding efficient treatments. Recently, it has been recognized that cellular phenotypes are determined by the concerted actions of many functionally related genes in modular fashions. The underlying modular mechanisms should help the understanding of hidden genetic heterogeneities of complex diseases. We defined a putative disease module to be the functional gene groups in terms of both biological process and cellular localization, which are significantly enriched with genes highly variably expressed across the disease samples. As a validation, we used two large cancer datasets to evaluate the ability of the modules for correctly partitioning samples. Then, we sought the subtypes of complex diffuse large B-cell lymphoma (DLBCL) using a public dataset. Finally, the clinical significance of the identified subtypes was verified by survival analysis. In two validation datasets, we achieved highly accurate partitions that best fit the clinical cancer phenotypes. Then, for the notoriously heterogeneous DLBCL, we demonstrated that two partitioned subtypes using an identified module ("cellular response to stress") had very different 5-year overall rates (65% vs. 14%) and were highly significantly (P < 0.007) correlated with the clinical survival rate. Finally, we built a multivariate Cox proportional-hazard prediction model that included 4 genes as risk predictors for survival over DLBCL. The proposed modular approach is a promising computational strategy for peeling off genetic heterogeneities and understanding the modular mechanisms of human diseases such as cancers.
Molecular Medicine 12(1-3):25-33. · 4.82 Impact Factor