ABSTRACT: Multiple myeloma (MM) is a malignant proliferation of plasma B cells. Based on recurrent aneuploidy such as copy number alterations (CNAs), myeloma is divided into two subtypes with different CNA patterns and patient survival outcomes. How aneuploidy events arise, and whether they contribute to cancer cell evolution, are actively studied. The large number of transcriptomic changes resulting from CNAs (the dosage effect) poses a major challenge for identifying the functional consequences of CNAs in myeloma in terms of specific driver genes and pathways. In this study, we hypothesize that the gene-wise dosage effect varies as a result of complex regulatory networks that translate the impact of CNAs to gene expression, and that studying this variation can provide insights into the functional effects of CNAs.
We propose a gene-wise dosage effect score and a genome-wide karyotype plot as tools to measure and visualize concordant copy number and expression changes across cancer samples. We find that the dosage effect in myeloma is widespread yet variable, and that it is correlated with gene expression level and CNA frequencies in different chromosomes. Our analysis suggests that despite the enrichment of differentially expressed genes between hyperdiploid MM and non-hyperdiploid MM in the trisomy chromosomes, the chromosomal proportion of dosage-sensitive genes is higher in the non-trisomy chromosomes. Dosage-sensitive genes are enriched for genes with protein translation and localization functions, and dosage-resistant genes are enriched for apoptosis genes. These results point to future studies on differential dosage sensitivity and resistance of pro- and anti-proliferation pathways and their variation across patients as therapeutic targets and prognosis markers.
Our findings support the hypothesis that recurrent CNAs in myeloma are selected by their functional consequences. The novel dosage effect score defined in this work will facilitate integration of copy number and expression data for identifying driver genes in cancer genomics studies. The accompanying R code is available at http://www.canevolve.org/dosageEffect/.
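The abstract above does not give the formula for its dosage effect score, so the following is only a minimal sketch of the general idea: measuring how concordantly copy number and expression change across samples for each gene. It assumes a plain Pearson correlation as a stand-in for the score. The function names, gene names, and toy values are hypothetical and are not taken from the authors' R code.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation in one variable: report no measurable concordance
    return sxy / sqrt(sxx * syy)

def dosage_scores(copy_number, expression):
    """Map each gene to the correlation of its copy number and expression across samples.

    Assumption: correlation is used here as an illustrative stand-in score.
    """
    return {
        gene: pearson(copy_number[gene], expression[gene])
        for gene in copy_number
        if gene in expression
    }

# Hypothetical toy data: five samples, two genes.
copy_number = {"GENE_A": [2, 2, 3, 3, 4], "GENE_B": [2, 3, 3, 2, 4]}
expression = {"GENE_A": [1.0, 1.1, 1.5, 1.6, 2.0], "GENE_B": [1.0, 1.0, 1.0, 1.0, 1.0]}
scores = dosage_scores(copy_number, expression)
```

In this toy data GENE_A tracks its copy number closely, while GENE_B shows flat expression despite copy number changes, which is the kind of contrast a dosage-sensitive versus dosage-resistant distinction would capture.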
ABSTRACT: Multiple myeloma (MM) is a cancer of antibody-making plasma cells. It frequently harbors alterations in DNA and chromosome copy numbers, and can be divided into two major subtypes, hyperdiploid (HMM) and non-hyperdiploid multiple myeloma (NHMM). The two subtypes have different survival prognoses, possibly due to different but converging paths to oncogenesis. Existing methods for identifying the two subtypes are fluorescence in situ hybridization (FISH) and copy number microarrays, with increased cost and sample requirements. We hypothesize that chromosome alterations have their imprint in gene expression through dosage effect. Using five MM expression datasets that have HMM status measured by FISH and copy number microarrays, we have developed and validated a K-nearest-neighbor method to classify MM into HMM and NHMM based on gene expression profiles. Classification accuracy for test datasets ranges from 0.83 to 0.88. This classification will enable researchers to study differences and commonalities of the two MM subtypes in disease biology and prognosis using expression datasets without need for additional subtype measurements. Our study also supports the advantages of using cancer-specific characteristics in feature design and pooling multiple rounds of classification results to improve accuracy. We provide R source code and processed datasets at www.ChengLiLab.org/software.
PLoS ONE 03/2013; 8(3):e58809. · 3.53 Impact Factor
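The K-nearest-neighbor classification mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each sample has already been reduced to a small numeric feature vector (here, two hypothetical per-chromosome mean expression values) and uses Euclidean distance with a majority vote. The feature design, the value of k, and the toy data are assumptions for illustration, not the published method.

```python
from collections import Counter
from math import dist

def knn_predict(train_x, train_y, query, k=3):
    """Label a query vector by majority vote among its k nearest training vectors."""
    if k < 1 or k > len(train_x):
        raise ValueError("k must be between 1 and the number of training samples")
    # Rank training samples by Euclidean distance to the query.
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical features: mean expression over genes on two trisomy-prone chromosomes.
train_x = [(1.8, 1.7), (1.9, 1.6), (1.7, 1.8), (1.0, 1.1), (0.9, 1.0), (1.1, 0.9)]
train_y = ["HMM", "HMM", "HMM", "NHMM", "NHMM", "NHMM"]

label_high = knn_predict(train_x, train_y, (1.75, 1.65))
label_low = knn_predict(train_x, train_y, (1.0, 1.0))
```

With two classes, an odd k avoids tied votes.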
ABSTRACT: Genome-wide profiles of tumors obtained using functional genomics platforms are being deposited to the public repositories at an astronomical scale, as a result of focused efforts by individual laboratories and large projects such as the Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium. Consequently, there is an urgent need for reliable tools that integrate and interpret these data in light of current knowledge and disseminate results to biomedical researchers in a user-friendly manner. We have built the canEvolve web portal to meet this need.
canEvolve query functionalities are designed to fulfill the most frequent analysis needs of cancer researchers, with a view to generating novel hypotheses. canEvolve stores gene, microRNA (miRNA) and protein expression profiles, copy number alterations for multiple cancer types, and protein-protein interaction information. canEvolve allows querying of results of primary analysis, integrative analysis and network analysis of oncogenomics data. The querying for primary analysis includes differential gene and miRNA expression as well as changes in gene copy number measured with SNP microarrays. canEvolve provides results of integrative analysis of gene expression profiles with copy number alterations and with miRNA profiles as well as generalized integrative analysis using gene set enrichment analysis. The network analysis capability includes storage and visualization of gene co-expression, inferred gene regulatory networks and protein-protein interaction information. Finally, canEvolve provides correlations between gene expression and clinical outcomes in terms of univariate survival analysis.
At present, canEvolve provides different types of information extracted from 90 cancer genomics studies comprising more than 10,000 patients. The presence of multiple data types, novel integrative analysis for identifying regulators of oncogenesis, network analysis and the ability to query gene lists/pathways are distinctive features of canEvolve. canEvolve will facilitate integrative and meta-analysis of oncogenomics datasets.
The canEvolve web portal is available at http://www.canevolve.org/.
PLoS ONE 02/2013; 8(2):e56228. · 3.53 Impact Factor
ABSTRACT: Bortezomib therapy has proven successful for the treatment of relapsed/refractory, relapsed, and newly diagnosed multiple myeloma (MM); however, dose-limiting toxicities and the development of resistance limit its long-term utility. Here, we show that P5091 is an inhibitor of deubiquitylating enzyme USP7, which induces apoptosis in MM cells resistant to conventional and bortezomib therapies. Biochemical and genetic studies show that blockade of HDM2 and p21 abrogates P5091-induced cytotoxicity. In animal tumor model studies, P5091 is well tolerated, inhibits tumor growth, and prolongs survival. Combining P5091 with lenalidomide, HDAC inhibitor SAHA, or dexamethasone triggers synergistic anti-MM activity. Our preclinical study therefore supports clinical evaluation of USP7 inhibitor, alone or in combination, as a potential MM therapy.
Cancer cell 09/2012; 22(3):345-58. · 25.29 Impact Factor
ABSTRACT: We describe here a novel method for integrating gene and miRNA expression profiles in cancer using feed-forward loops (FFLs) consisting of transcription factors (TFs), miRNAs and their common target genes. The dChip-GemiNI (Gene and miRNA Network-based Integration) method statistically ranks computationally predicted FFLs by their explanatory power to account for differential gene and miRNA expression between two biological conditions such as normal and cancer. GemiNI integrates not only gene and miRNA expression data but also computationally derived information about TF-target gene and miRNA-mRNA interactions. Literature validation shows that the integrated modeling of expression data and FFLs better identifies cancer-related TFs and miRNAs compared to existing approaches. We have utilized GemiNI for analyzing six data sets of solid cancers (liver, kidney, prostate, lung and germ cell) and found that top-ranked FFLs account for ∼20% of transcriptome changes between normal and cancer. We have identified common FFL regulators across multiple cancer types, such as known FFLs consisting of MYC and miR-15/miR-17 families, and novel FFLs consisting of ARNT, CREB1 and their miRNA partners. The results and analysis web server are available at http://www.canevolve.org/dChip-GemiNi.
Nucleic Acids Research 05/2012; 40(17):e135. · 8.81 Impact Factor
ABSTRACT: Over the last decade, multiple functional genomic datasets studying chromosomal aberrations and their downstream effects on gene expression have accumulated for several cancer types. The vast majority of them are in the form of paired gene expression profiles and somatic copy number alteration (CNA) information on the same patients, identified using microarray platforms. In response, many algorithms and software packages are available for integrating these paired data. Surprisingly, there has been no serious attempt to review the currently available methodologies or the novel insights brought using them. In this work, we discuss the quantitative relationships observed between CNA and gene expression in multiple cancer types and biological milestones achieved using the available methodologies. We discuss the conceptual evolution of both the step-wise and the joint data integration methodologies over the last decade. We conclude by providing suggestions for building efficient data integration methodologies and asking further biological questions.
Briefings in Bioinformatics 09/2011; 13(3):305-16. · 5.30 Impact Factor
ABSTRACT: Target-specific antibodies are pivotal for the design of vaccines, immunodiagnostic tests, studies on proteomics for cancer biomarker discovery, identification of protein-DNA and other interactions, and small and large biochemical assays. Therefore, it is important to understand the properties of protein sequences that are important for antigenicity and to identify small peptide epitopes and large regions in the linear sequence of the proteins whose utilization results in specific antibodies.
Our analysis using protein properties suggested that sequence composition combined with evolutionary information and predicted secondary structure, as well as solvent accessibility, is sufficient to predict successful peptide epitopes. The antigenicity and the specificity in immune response were also found to depend on the epitope length. We trained the B-Cell Epitope Oracle (BEOracle), a support vector machine (SVM) classifier, for the identification of continuous B-Cell epitopes with these protein properties as learning features. The BEOracle achieved an F1-measure of 81.37% on a large validation set. The BEOracle classifier outperformed the classical methods based on propensity and sophisticated methods like BCPred and Bepipred for B-Cell epitope prediction. The BEOracle classifier also identified peptides for the ChIP-grade antibodies from the modENCODE/ENCODE projects with 96.88% accuracy. High BEOracle scores for peptides showed some correlation with the antibody intensity in immunofluorescence studies done on fly embryos. Finally, a second SVM classifier, the B-Cell Region Oracle (BROracle), was trained with the BEOracle scores as features to predict the performance of antibodies generated with large protein regions with high accuracy. The BROracle classifier achieved accuracies of 75.26-63.88% on a validation set with immunofluorescence, immunohistochemistry, protein arrays and western blot results from the Protein Atlas database.
Together our results suggest that antigenicity is a local property of the protein sequences and that protein sequence properties of composition, secondary structure, solvent accessibility and evolutionary conservation are the determinants of antigenicity and specificity in immune response. Moreover, specificity in immune response could also be accurately predicted for large protein regions without the knowledge of the protein tertiary structure or the presence of discontinuous epitopes. The dataset prepared in this work and the classifier models are available for download at https://sites.google.com/site/oracleclassifiers/.
ABSTRACT: Genome-wide expression signatures are emerging as potential markers for overall survival and disease recurrence risk, as evidenced by the recent commercialization of gene expression based biomarkers in breast cancer. Similar predictions have recently been carried out using genome-wide copy number alterations and microRNAs. Existing software packages for microarray data analysis provide functions to define expression-based survival gene signatures. However, there is no software that can perform survival analysis using SNP array data or draw survival curves interactively for expression-based sample clusters.
We have developed the survival analysis module in the dChip software that performs survival analysis across the genome for gene expression and copy number microarray data. Built on the current dChip software's microarray analysis functions such as chromosome display and clustering, the new survival functions include interactive exploring of Kaplan-Meier (K-M) plots using expression or copy number data, computing survival p-values from the log-rank test and Cox models, and using permutation to identify significant chromosome regions associated with survival.
The dChip survival module provides a user-friendly way to perform survival analysis and visualize the results in the context of genes and cytobands. It requires no coding expertise and only a minimal learning curve for thousands of existing dChip users. The implementation in Visual C++ also enables fast computation. The software and demonstration data are freely available at http://dchip-surv.chenglilab.org.
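The Kaplan-Meier curves mentioned in the abstract above rest on a simple product-limit estimate, sketched here. This is a minimal, generic illustration of the estimator and is not code from dChip; the function name and toy follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times[i]  : follow-up time of sample i
    events[i] : True if the event (e.g. death) was observed, False if censored
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all samples sharing this time point.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths:
            # Samples censored at time t still count as at risk at t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up data (months); False marks a censored sample.
times = [5, 8, 8, 12, 15]
events = [True, True, False, True, False]
curve = kaplan_meier(times, events)
```

Comparing two such curves, for example between samples with and without a given copy number gain, is what a log-rank test or a Cox model then quantifies.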
ABSTRACT: Systematic annotation of gene regulatory elements is a major challenge in genome science. Direct mapping of chromatin modification marks and transcriptional factor binding sites genome-wide has successfully identified specific subtypes of regulatory elements. In Drosophila several pioneering studies have provided genome-wide identification of Polycomb response elements, chromatin states, transcription factor binding sites, RNA polymerase II regulation and insulator elements; however, comprehensive annotation of the regulatory genome remains a significant challenge. Here we describe results from the modENCODE cis-regulatory annotation project. We produced a map of the Drosophila melanogaster regulatory genome on the basis of more than 300 chromatin immunoprecipitation data sets for eight chromatin features, five histone deacetylases and thirty-eight site-specific transcription factors at different stages of development. Using these data we inferred more than 20,000 candidate regulatory elements and validated a subset of predictions for promoters, enhancers and insulators in vivo. We also identified nearly 2,000 genomic regions of dense transcription factor binding associated with chromatin activity and accessibility. We discovered hundreds of new transcription factor co-binding relationships and defined a transcription factor network with over 800 potential regulatory relationships.
ABSTRACT: To gain insight into how genomic information is translated into cellular and developmental programs, the Drosophila model organism Encyclopedia of DNA Elements (modENCODE) project is comprehensively mapping transcripts, histone modifications, chromosomal proteins, transcription factors, replication proteins and intermediates, and nucleosome properties across a developmental time course and in multiple cell lines. We have generated more than 700 data sets and discovered protein-coding, noncoding, RNA regulatory, replication, and chromatin elements, more than tripling the annotated portion of the Drosophila genome. Correlated activity patterns of these elements reveal a functional regulatory network, which predicts putative new functions for genes, reveals stage- and tissue-specific regulators, and enables gene-expression prediction. Our results provide a foundation for directed experimental and computational studies in Drosophila and related species and also a model for systematic data integration toward comprehensive genomic and functional annotation.
ABSTRACT: Insulators are DNA sequences that control the interactions among genomic regulatory elements and act as chromatin boundaries. A thorough understanding of their location and function is necessary to address the complexities of metazoan gene regulation. We studied by ChIP-chip the genome-wide binding sites of 6 insulator-associated proteins (dCTCF, CP190, BEAF-32, Su(Hw), Mod(mdg4), and GAF) to obtain the first comprehensive map of insulator elements in Drosophila embryos. We identify over 14,000 putative insulators, including all classically defined insulators. We find two major classes of insulators defined by dCTCF/CP190/BEAF-32 and Su(Hw), respectively. Distributional analyses of insulators revealed that particular sub-classes of insulator elements are excluded between cis-regulatory elements and their target promoters; divide differentially expressed, alternative, and divergent promoters; act as chromatin boundaries; are associated with chromosomal breakpoints among species; and are embedded within active chromatin domains. Together, these results provide a map demarcating the boundaries of gene regulatory units and a framework for understanding insulator function during the development and evolution of Drosophila.
ABSTRACT: The highly coordinated expression of thousands of genes in an organism is regulated by the concerted action of transcription factors, chromatin proteins and epigenetic mechanisms. High-throughput experimental data for genome-wide in vivo protein-DNA interactions and epigenetic marks are becoming available from large projects, such as the model organism ENCyclopedia Of DNA Elements (modENCODE) and from individual labs. Dissemination and visualization of these datasets in an explorable form is an important challenge.
To support research on Drosophila melanogaster transcription regulation and make the genome wide in vivo protein-DNA interactions data available to the scientific community as a whole, we have developed a system called Flynet. Currently, Flynet contains 101 datasets for 38 transcription factors and chromatin regulator proteins in different experimental conditions. These factors exhibit different types of binding profiles ranging from sharp localized peaks to broad binding regions. The protein-DNA interaction data in Flynet was obtained from the analysis of chromatin immunoprecipitation experiments on one color and two color genomic tiling arrays as well as chromatin immunoprecipitation followed by massively parallel sequencing. A web-based interface, integrated with an AJAX based genome browser, has been built for queries and presenting analysis results. Flynet also makes available the cis-regulatory modules reported in literature, known and de novo identified sequence motifs across the genome, and other resources to study gene regulation.
Flynet is available at https://www.cistrack.org/flynet/.
ABSTRACT: Systematically annotating the function of enzymes that belong to large protein families encoded in a single eukaryotic genome is a very challenging task. We carried out such an exercise to annotate function for the serine-protease family of the trypsin fold in Drosophila melanogaster, with an emphasis on annotating serine-protease homologues (SPHs) that may have lost their catalytic function. Our approach involves data mining and data integration to provide function annotations for 190 Drosophila gene products containing serine-protease-like domains, of which 35 are SPHs. This was accomplished by analysis of structure-function relationships, gene-expression profiles, large-scale protein-protein interaction data, literature mining and bioinformatic tools. We introduce functional residue clustering (FRC), a method that performs hierarchical clustering of sequences using properties of functionally important residues and utilizes the correlation coefficient as a quantitative similarity measure to transfer in vivo substrate specificities to proteases. We show that the efficiency of transfer of substrate-specificity information using this method is generally high. FRC was also applied to Drosophila proteases to assign putative competitive inhibitor relationships (CIRs). Microarray gene-expression data were utilized to uncover a large-scale and dual involvement of proteases in development and in immune response. We found specific recruitment of SPHs and proteases with CLIP domains in immune response, suggesting evolution of a new function for SPHs. We also suggest the existence of separate downstream protease cascades for immune response against bacterial/fungal infections and parasite/parasitoid infections. We verify the quality of our annotations using information from RNAi screens and other evidence types.
Utilization of such multi-fold approaches results in a 10-fold increase of function annotation for Drosophila serine proteases and demonstrates value in increasing annotations in multiple genomes.
ABSTRACT: We demonstrate an integrated approach to the study of a transcriptional regulatory cascade involved in the progression of breast cancer and we identify a protein associated with disease progression. Using chromatin immunoprecipitation and genome tiling arrays, whole genome mapping of transcription factor-binding sites was combined with gene expression profiling to identify genes involved in the proliferative response to estrogen (E2). Using RNA interference, selected ERalpha and c-MYC gene targets were knocked down to identify mediators of E2-stimulated cell proliferation. Tissue microarray screening revealed that high expression of an epigenetic factor, the E2-inducible histone variant H2A.Z, is significantly associated with lymph node metastasis and decreased breast cancer survival. Detection of H2A.Z levels independently increased the prognostic power of biomarkers currently in clinical use. This integrated approach has accelerated the identification of a molecule linked to breast cancer progression, has implications for diagnostic and therapeutic interventions, and can be applied to a wide range of cancers.
Molecular Systems Biology 02/2008; 4:188. · 14.10 Impact Factor
ABSTRACT: MOTIVATION: Availability of large volumes of genomic and enzymatic data for taxonomically and phenotypically diverse organisms allows for exploration of the adaptive mechanisms that led to diversification of enzymatic functions. We present Chisel, a computational framework and a pipeline for an automated, high-resolution analysis of evolutionary variations of enzymes. Chisel allows automatic as well as interactive identification, and characterization of enzymatic sequences. Such knowledge can be utilized for comparative genomics, microbial diagnostics, metabolic engineering, drug design and analysis of metagenomes. RESULTS: Chisel is a comprehensive resource that contains 8575 clusters and subsequent computational models specific for 939 distinct enzymatic functions and, when data is sufficient, their taxonomic variations. Application of Chisel to identification of enzymatic sequences in newly sequenced genomes, analysis of organism-specific metabolic networks, 'binning' of metagenomes and other biological problems are presented. We also provide a thorough analysis of Chisel performance with other similar resources and manual annotations on Shewanella oneidensis MR1 genome.
ABSTRACT: In a genome-wide analysis, we have identified 85 human genes encoding 103 protein isoforms that resemble retroviral Gag proteins. These genes were domesticated from retrotransposons in at least five independent events during vertebrate evolution and were subsequently duplicated further in mammals. Structural insights into the mammalian proteins can be inferred by homology to Gag from viruses such as HIV; in turn, the cellular roles of the mammalian Gag homologs, such as apoptosis-related functions and binding to ubiquitin ligases, might hint at further functionality of viral Gag itself.
Trends in Genetics 12/2006; 22(11):585-9. · 11.60 Impact Factor
ABSTRACT: Generation of alternative transcripts from the same gene is an important biological event because of their contribution to functional diversity in eukaryotes. In this work, we address the task of extracting information on this complex topic using a two-step procedure involving machine learning and information extraction.
In the first step, we trained a classifier that inductively learns to identify sentences about physiological transcript diversity from the MEDLINE abstracts. Using a large hand-built corpus, we compared the sentence classification performance of various text categorization methods. Support vector machines (SVMs) followed by the maximum entropy classifier outperformed other methods for the sentence classification task. The SVM with the radial basis function kernel and optimized parameters achieved an Fbeta-measure of 91% during the 4-fold cross validation and of 74% when applied to all sentences in more than 12 million abstracts of MEDLINE. In the second step, we identified eight frequently present semantic categories in the sentences and performed a limited amount of semantic role labeling. The role labeling step also achieved a very high Fbeta-measure for all eight categories.
The results of our two-step procedure are summarized in the LSAT database of alternative transcripts. LSAT is available at http://www.bork.embl.de/LSAT
CONTACT: email@example.com
Supplementary data are available at Bioinformatics online.
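The Fbeta-measure reported in the abstract above combines precision and recall into a single number. The sketch below shows the standard definition computed from confusion counts; the abstract does not state which beta was used, so the default of 1 and the example counts are assumptions for illustration only.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from true positives, false positives and false negatives.

    beta > 1 weights recall more heavily; beta < 1 weights precision more heavily.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical sentence-classification counts.
f1 = f_beta(tp=80, fp=20, fn=10)          # balanced precision and recall
f2 = f_beta(tp=80, fp=20, fn=10, beta=2)  # recall weighted more heavily
```

Because this example has higher recall than precision, the recall-weighted score comes out above the balanced one.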
ABSTRACT: Transcript diversity generated by alternative splicing and associated mechanisms contributes heavily to the functional complexity of biological systems. The numerous examples of the mechanisms and functional implications of these events are scattered throughout the scientific literature. Thus, it is crucial to have a tool that can automatically extract the relevant facts and collect them in a knowledge base that can aid the interpretation of data from high-throughput methods. We have developed and applied a composite text-mining method for extracting information on transcript diversity from the entire MEDLINE database in order to create a database of genes with alternative transcripts. It contains information on tissue specificity, number of isoforms, causative mechanisms, functional implications, and experimental methods used for detection. We have mined this resource to identify 959 instances of tissue-specific splicing. Our results in combination with those from EST-based methods suggest that alternative splicing is the preferred mechanism for generating transcript diversity in the nervous system. We provide new annotations for 1,860 genes with the potential for generating transcript diversity. We assign the MeSH term "alternative splicing" to 1,536 additional abstracts in the MEDLINE database and suggest new MeSH terms for other events. We have successfully extracted information about transcript diversity and semiautomatically generated a database, LSAT, that can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression. LSAT (Literature Support for Alternative Transcripts) is publicly available at http://www.bork.embl.de/LSAT/.
ABSTRACT: Structures for protein domains have increased rapidly in recent years owing to advances in structural biology and structural genomics projects. New structures are often similar to those solved previously, and such similarities can give insights into function by linking poorly understood families to those that are better characterized. They also allow the possibility of combining information to find still more proteins adopting a similar structure and sometimes a similar function, and to reprioritize families in structural genomics pipelines. We explore this possibility here by preparing merged profiles for pairs of structurally similar, but not necessarily sequence-similar, domains within the SMART and Pfam databases by way of the Structural Classification of Proteins (SCOP). We show that such profiles are often able to successfully identify further members of the same superfamily and thus can be used to increase the sensitivity of database searching methods like HMMer and PSI-BLAST. We perform detailed benchmarks using the SMART and Pfam databases with four complete genomes frequently used as annotation benchmarks. We quantify the associated increase in structural information in Swissprot and discuss examples illustrating the applicability of this approach to understand functional and evolutionary relationships between protein families.
Protein Science 06/2005; 14(5):1305-14. · 2.86 Impact Factor