ABSTRACT: Adjuvant therapy following breast cancer surgery generally consists of either a course of chemotherapy, if the cancer lacks hormone receptors, or a course of hormonal therapy otherwise. Here we report a correlation between adjuvant strategy and mutated pathway patterns. In particular, we find that for breast cancer patients, pathways enriched in non-synonymous mutations in the chemotherapy group are distinct from those of the hormonal therapy group. We apply a recently developed method that identifies collaborative pathway groups for hormone and chemotherapy patients. A collaborative group of pathways is one in which each member is altered in the same (generally large) number of samples. We find the following: (i) A chemotherapy group consisting of 3 pathways and a hormone therapy group consisting of 20, the members of the two groups being mutually exclusive. (ii) Each group is highly enriched in breast cancer drivers. (iii) The pathway groups are correlates of subtype-based therapeutic recommendations. These results suggest that patient profiling using these pathway groups can potentially enable the development of personalized treatment plans that may be more accurate and specific than those currently available.
Preview · Article · Dec 2015 · Molecular Cancer Therapeutics
ABSTRACT: The number of mutated genes in cancer cells is far larger than the number of mutations that drive cancer. The difficulty this creates for identifying relevant alterations has stimulated the development of various computational approaches to distinguishing drivers from bystanders. We develop an ensemble classifier (EC) machine learning method, which integrates 10 publicly available classifiers, and apply it to breast and ovarian cancer. In particular we find the following: (1) Using both standard and non-standard metrics, EC almost always outperforms single method classifiers, often by wide margins. (2) Of the 50 highest ranked genes for breast (ovarian) cancer, 34 (30) are associated with other cancers in either the OMIM, CGC or NCG database (P < 10^-22). (3) Another 10, for both breast and ovarian cancer, have been identified by GWAS studies. (4) Several of the remaining genes are biologically plausible, including a protein kinase that regulates the Fra-1 transcription factor, which is overexpressed in ER negative breast cancer cells, and Fyn, which is overexpressed in pancreatic and prostate cancer, among others. Biological implications are briefly discussed. Source codes and detailed results are available at http://www.visantnet.org/misi/driver-integration.zip.
ABSTRACT: Experimental data exist for only a vanishingly small fraction of sequenced microbial genes. This community page discusses the progress made by the COMBREX project to address this important issue using both computational and experimental resources.
ABSTRACT: With the rapid accumulation of our knowledge on diseases, disease-related genes and drug targets, network-based analysis plays an increasingly important role in systems biology, systems pharmacology and translational science. The new release of VisANT aims to provide new functions to facilitate the convenient network analysis of diseases, therapies, genes and drugs. With improved understanding of the mechanisms of complex diseases and drug actions through network analysis, novel drug methods (e.g., drug repositioning, multi-target drug and combination therapy) can be designed. More specifically, the new update includes (i) integrated search and navigation of disease and drug hierarchies; (ii) integrated disease–gene, therapy–drug and drug–target association to aid the network construction and filtering; (iii) annotation of genes/drugs using disease/therapy information; (iv) prediction of associated diseases/therapies for a given set of genes/drugs using enrichment analysis; (v) network transformation to support construction of versatile networks of drugs, genes, diseases and therapies; (vi) enhanced user interface using docking windows to allow easy customization of node and edge properties, with a built-in legend node to distinguish different node types. VisANT is freely available at: http://visant.bu.edu.
Preview · Article · May 2013 · Nucleic Acids Research
ABSTRACT: A host of data on genetic variation from the Human Genome and International HapMap projects, and advances in high-throughput genotyping technologies, have made genome-wide association (GWA) studies technically feasible. GWA studies help in the discovery and quantification of the genetic components of disease risks, many of which have not been unveiled before, and have opened a new avenue to understanding disease, treatment, and prevention. This chapter presents an overview of GWA, an important tool in the post-Human Genome Project era for discovering regions of the genome that harbor common genetic variants conferring susceptibility to various diseases or health outcomes. A tutorial on how to conduct a GWA study and some practical challenges specifically related to the GWA design is presented, followed by a detailed GWA case study involving the identification of loci associated with glioma as an example and an illustration of current technologies.
No preview · Article · Jan 2013 · Methods in molecular biology (Clifton, N.J.)
ABSTRACT: We demonstrate an accurate, quantitative, and label-free optical technology for high-throughput studies of receptor-ligand interactions, and apply it to TATA binding protein (TBP) interactions with oligonucleotides. We present a simple method to prepare single-stranded and double-stranded DNA microarrays with comparable surface density, ensuring an accurate comparison of TBP activity with both types of DNA. In particular, we find that TBP binds tightly to single-stranded DNA, especially to stretches of polythymine (poly-T), as well as to the traditional TATA box. We further investigate the correlation of TBP activity with various lengths of DNA and find that the number of TBPs bound to DNA increases >7-fold as the oligomer length increases from 9 to 40. Finally, we perform a full human genome analysis and discover that 35.5% of human promoters have poly-T stretches. In summary, we report, for the first time to our knowledge, the activity of TBP with poly-T stretches by presenting a stepwise analysis using multiple techniques: discovery by a novel quantitative detection of microarrays, confirmation by traditional gel electrophoresis, and a full genome prediction with computational analyses.
Preview · Article · Oct 2012 · Biophysical Journal
ABSTRACT: Background
Molecular markers based on gene expression profiles have been used in experimental and clinical settings to distinguish cancerous tumors in stage, grade, survival time, metastasis, and drug sensitivity. However, most significant gene markers are unstable (not reproducible) among data sets. We introduce a standardized method for representing cancer markers as 2-level hierarchical feature vectors, with a basic gene level as well as a second level of (more stable) pathway markers, for the purpose of discriminating cancer subtypes. This extends standard gene expression arrays with new pathway-level activation features obtained directly from off-the-shelf gene set enrichment algorithms such as GSEA. Such so-called pathway-based expression arrays are significantly more reproducible across datasets. Such reproducibility will be important for clinical usefulness of genomic markers, and augment currently accepted cancer classification protocols.
The present method produced more stable (reproducible) pathway-based markers for discriminating breast cancer metastasis and ovarian cancer survival time. Between two datasets for breast cancer metastasis, the intersection of standard significant gene biomarkers totaled 7.47% of selected genes, compared to 17.65% using pathway-based markers; the corresponding percentages for ovarian cancer datasets were 20.65% and 33.33% respectively. Three pathways, consisting of Type_1_diabetes mellitus, Cytokine-cytokine_receptor_interaction and Hedgehog_signaling (all previously implicated in cancer), are enriched in both the ovarian long survival and breast non-metastasis groups. In addition, integrating pathway and gene information, we identified five (ID4, ANXA4, CXCL9, MYLK, FBXL7) and six (SQLE, E2F1, PTTG1, TSTA3, BUB1B, MAD2L1) known cancer genes significant for ovarian and breast cancer respectively.
Standardizing the analysis of genomic data in the process of cancer staging, classification and analysis is important as it has implications for both pre-clinical as well as clinical studies. The paradigm of diagnosis and prediction using pathway-based biomarkers as features can be an important part of the process of biomarker-based cancer analysis, and the resulting canonical (clinically reproducible) biomarkers can be important in standardizing genomic data. We expect that identification of such canonical biomarkers will improve clinical utility of high-throughput datasets for diagnostic and prognostic applications.
This article was reviewed by John McDonald (nominated by I. King Jordon), Eugene Koonin, Nathan Bowen (nominated by I. King Jordon), and Ekaterina Kotelnikova (nominated by Mikhail Gelfand).
ABSTRACT: Identification of active causal regulators is a crucial problem in understanding mechanisms of disease or finding drug targets. Methods that infer causal regulators directly from primary data have been proposed and successfully validated in some cases. These methods necessarily require very large sample sizes or a mix of different data types. Recent studies have shown that prior biological knowledge can successfully boost a method's ability to find regulators.
We present a simple data-driven method, Correlation Set Analysis (CSA), for comprehensively detecting active regulators in disease populations by integrating co-expression analysis and a specific type of literature-derived causal relationships. Instead of investigating the co-expression level between regulators and their regulatees, we focus on coherence of regulatees of a regulator. Using simulated datasets we show that our method performs very well at recovering even weak regulatory relationships with a low false discovery rate. Using three separate real biological datasets we were able to recover well-known, as well as previously undescribed, active regulators for each disease population. The results are represented as a rank-ordered list of regulators, and reveal both single and higher-order regulatory relationships.
CSA is an intuitive data-driven way of selecting directed perturbation experiments that are relevant to a disease population of interest and represent a starting point for further investigation. Our findings demonstrate that combining co-expression analysis on regulatee sets with a literature-derived network can successfully identify causal regulators and help develop possible hypotheses to explain disease progression.
Full-text · Article · Mar 2012 · BMC Bioinformatics
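The idea in the CSA abstract above, scoring a regulator by how coherently its regulatees are expressed rather than by its own expression, might be sketched as follows. This is only an illustration under stated assumptions: coherence is taken here to be the mean pairwise Pearson correlation among regulatees, and the function names, the toy expression values and the regulator-to-regulatee map are all hypothetical. The abstract does not give the actual CSA statistic or its significance test.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def coherence(expression, regulatees):
    """ASSUMPTION: coherence = mean pairwise correlation among the regulatees."""
    pairs = list(combinations(regulatees, 2))
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)


def rank_regulators(expression, network):
    """Return (regulator, coherence) pairs, most coherent regulatee set first."""
    scores = {
        regulator: coherence(expression, targets)
        for regulator, targets in network.items()
        if len(targets) >= 2  # a single regulatee has no pairs to correlate
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: gene -> expression across four samples.
expression = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8],
    "g3": [1, 3, 2, 4],
    "g4": [4, 1, 3, 2],
}
# Hypothetical literature-derived map: regulator -> regulatees.
network = {"R1": ["g1", "g2"], "R2": ["g3", "g4"]}

ranked = rank_regulators(expression, network)
```

In this toy case the regulatees of R1 move together across samples while those of R2 do not, so R1 ranks first. A real analysis would also need a null model, for example random gene sets of the same size, to decide which coherence values are significant.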
ABSTRACT: The cost and time to develop a drug continue to be a major barrier to widespread distribution of medication. Although the genomic revolution appears to have had little impact on this problem, and might even have exacerbated it because of the flood of additional and usually ineffective leads, the emergence of high throughput resources promises the possibility of rapid, reliable and systematic identification of approved drugs for originally unintended uses. In this paper we develop and apply a method for identifying such repositioned drug candidates against breast cancer, myelogenous leukemia and prostate cancer by looking for inverse correlations between the most perturbed gene expression levels in human cancer tissue and the most perturbed expression levels induced by bioactive compounds. The method uses variable gene signatures to identify bioactive compounds that modulate a given disease. This is in contrast to previous methods that use small and fixed signatures. This strategy is based on the observation that diseases stem from failed/modified cellular functions, irrespective of the particular genes that contribute to the function, i.e., this strategy targets the functional signatures for a given cancer. This function-based strategy broadens the search space for effective drugs with an impressive hit rate. Among the 79, 94 and 88 candidate drugs for breast cancer, myelogenous leukemia and prostate cancer, 32%, 13% and 17% respectively are either FDA-approved/in-clinical-trial drugs, or drugs with suggestive literature evidence, with an FDR of 0.01. These findings indicate that the method presented here could lead to a substantial increase in efficiency in drug discovery and development, and has potential application for personalized medicine.
Full-text · Article · Feb 2012 · PLoS Computational Biology
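The inverse-correlation search described in the repositioning abstract above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the signature is taken to be the `top_k` genes with the largest absolute change in the disease profile, the score is a plain Pearson correlation, and the gene and compound names are hypothetical. The published method's variable, function-based signatures and its FDR control are not reproduced.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def reversal_score(disease, compound, top_k):
    """Correlate disease and compound changes over the most perturbed shared genes."""
    shared = [gene for gene in disease if gene in compound]
    # ASSUMPTION: "most perturbed" = largest absolute change in the disease profile.
    signature = sorted(shared, key=lambda g: abs(disease[g]), reverse=True)[:top_k]
    return pearson([disease[g] for g in signature], [compound[g] for g in signature])


def rank_compounds(disease, compounds, top_k=3):
    """Most negative correlation first: those compounds best oppose the disease profile."""
    scores = {name: reversal_score(disease, profile, top_k)
              for name, profile in compounds.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical log fold changes (disease vs. normal, treated vs. untreated).
disease = {"A": 2.0, "B": -1.5, "C": 1.0, "D": 0.1}
compounds = {
    "drugX": {"A": -2.0, "B": 1.5, "C": -1.0, "D": 0.3},
    "drugY": {"A": 1.0, "B": -0.5, "C": 0.8, "D": -0.2},
}

ranked = rank_compounds(disease, compounds)
```

Here `drugX` reverses the disease changes exactly and so ranks first, while `drugY` mimics them and ranks last.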
ABSTRACT: The National Center for Integrative and Biomedical Informatics (NCIBI) is one of the eight NCBCs. NCIBI supports information access and data analysis for biomedical researchers, enabling them to build computational and knowledge models of biological systems to address the Driving Biological Problems (DBPs). The NCIBI DBPs have included prostate cancer progression, organ-specific complications of type 1 and 2 diabetes, bipolar disorder, and metabolic analysis of obesity syndrome. Collaborating with these and other partners, NCIBI has developed a series of software tools for exploratory analysis, concept visualization, and literature searches, as well as core database and web services resources. Many of our training and outreach initiatives have been in collaboration with the Research Centers at Minority Institutions (RCMI), integrating NCIBI and RCMI faculty and students, culminating each year in an annual workshop. Our future directions include focusing on the TranSMART data sharing and analysis initiative.
Full-text · Article · Nov 2011 · Journal of the American Medical Informatics Association
ABSTRACT: A central goal of biology is understanding and describing the molecular basis of plasticity: the sets of genes that are combinatorially selected by exogenous and endogenous environmental changes, and the relations among the genes. The most viable current approach to this problem consists of determining whether sets of genes are connected by some common theme, e.g. genes from the same pathway are overrepresented among those whose differential expression in response to a perturbation is most pronounced. There are many approaches to this problem, and the results they produce show a fair amount of dispersion, but they all fall within a common framework consisting of a few basic components. We critically review these components, suggest best practices for carrying out each step, and propose a voting method for meeting the challenge of assessing different methods on a large number of experimental data sets in the absence of a gold standard.
Full-text · Article · Sep 2011 · Briefings in Bioinformatics
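The over-representation question raised in the abstract above (are genes from one pathway unusually frequent among the most differentially expressed genes?) is often answered with a one-sided hypergeometric test. The sketch below shows that test as one common choice; it is not necessarily the procedure the review recommends, and the function names and toy gene sets are hypothetical.

```python
from math import comb


def hypergeom_tail(population, successes, draws, observed):
    """P(X >= observed) for X ~ Hypergeometric(population, successes, draws)."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return favourable / total


def overrepresentation_p(selected, pathway, background):
    """P-value that `pathway` is over-represented among `selected` genes."""
    # Restrict both sets to the background so all counts share one universe.
    selected = set(selected) & set(background)
    pathway = set(pathway) & set(background)
    return hypergeom_tail(
        population=len(set(background)),
        successes=len(pathway),
        draws=len(selected),
        observed=len(selected & pathway),
    )


# Hypothetical example: 10 measured genes, a 3-gene pathway, 3 selected genes.
background = {f"g{i}" for i in range(10)}
pathway = {"g0", "g1", "g2"}
selected = {"g0", "g1", "g2"}

p_value = overrepresentation_p(selected, pathway, background)
```

With all three selected genes falling in the three-gene pathway, only one of the C(10, 3) = 120 possible selections is this extreme, so the p-value is 1/120. Testing many pathways at once would additionally require a multiple-testing correction.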
ABSTRACT: Glioblastoma multiforme (GBM) tends to occur between the ages of 45 and 70. This relatively early onset and its poor prognosis make the impact of GBM on public health far greater than would be suggested by its relatively low frequency. Tissue and blood samples have now been collected for a number of populations, and predisposing alleles have been sought by several different genome-wide association (GWA) studies. The Cancer Genome Atlas (TCGA) at NIH has also collected a considerable amount of data. Because of the low concordance between the results obtained using different populations, only 14 predisposing single nucleotide polymorphism (SNP) candidates in five genomic regions have been replicated in two or more studies. The purpose of this paper is to present an improved approach to biomarker identification.
Association analysis was performed with control of population stratifications using the EIGENSTRAT package, under the null hypothesis of "no association between GBM and control SNP genotypes," based on an additive inheritance model. Genes that are strongly correlated with identified SNPs were determined by linkage disequilibrium (LD) or expression quantitative trait locus (eQTL) analysis. A new approach that combines meta-analysis and pathway enrichment analysis identified additional genes.
(i) A meta-analysis of SNP data from TCGA and the Adult Glioma Study identifies 12 predisposing SNP candidates, seven of which are reported for the first time. These SNPs fall in five genomic regions (5p15.33, 9p21.3, 1p21.2, 3q26.2 and 7p15.3), three of which have not been previously reported. (ii) 25 genes are strongly correlated with these 12 SNPs, eight of which are known to be cancer-associated. (iii) The relative risk for GBM is highest for risk allele combinations on chromosomes 1 and 9. (iv) A combined meta-analysis/pathway analysis identified an additional four genes. All of these have been identified as cancer-related, but have not been previously associated with glioma. (v) Some SNPs that do not occur reproducibly across populations are in reproducible (invariant) pathways, suggesting that they affect the same biological process, and that population discordance can be partially resolved by evaluating processes rather than genes.
We have uncovered 29 glioma-associated gene candidates, 12 of which are known to be cancer related (p = 1.4 × 10^-6), providing additional statistical support for the relevance of the new candidates. This additional information on risk loci is potentially important for identifying Caucasian individuals at risk for glioma, and for assessing relative risk.
Full-text · Article · Aug 2011 · BMC Medical Genomics
ABSTRACT: Computational methods for identifying functional properties of proteins are briefly discussed. The methods lead to the concept of structure-function motif. A specific example is alpha amphipathicity as an indicator of antigenicity. This motif, though useful for planning experiments, is not sufficiently reliable to provide the basis for vaccine design. Recent progress on docking strategies based on structural analyses may provide methods that will be useful for both protein and nucleic acid
ABSTRACT: Gene expression (microarray) data have been used widely in bioinformatics. The expression data of a large number of genes from small numbers of subjects are used to identify informative biomarkers that may predict or help in diagnosing some disorders. More recently, increasing amounts of information from underlying relationships of the expressed genes have become available, and workers have started to investigate algorithms which can use such a priori information to improve classification or regression based on gene expression. In this paper, we describe three novel machine learning algorithms for regularizing (smoothing) microarray expression values defined on gene sets with known prior network or metric structures, and which exploit this gene interaction information. These regularized expression values can be used with any machine classifier with the goal of better classification. Standard smoothing (denoising) techniques previously developed for functions on Euclidean spaces are extended to allow smoothing of microarray expression feature vectors using distance measures defined by biological networks. Such a priori smoothing (denoising) of the feature vectors using metrics on the index space (here the space of genes) yields better signal to noise ratios in the data. When tested on two breast cancer datasets, support vector machine classifiers trained on the smoothed expression values obtain better areas under ROC curves.
ABSTRACT: COMBREX (http://combrex.bu.edu) is a project to increase the speed of the functional annotation of new bacterial and archaeal genomes. It consists of a database of functional predictions produced by computational biologists and a mechanism for experimental biochemists to bid for the validation of those predictions. Small grants are available to support successful bids.
Full-text · Article · Jan 2011 · Nucleic Acids Research
ABSTRACT: We develop a general method to identify gene networks from pair-wise correlations between genes in a microarray data set and apply it to a public prostate cancer gene expression data set from 69 primary prostate tumors. We define the degree of a node as the number of genes significantly associated with the node and identify hub genes as those with the highest degree. The correlation network was pruned using transcription factor binding information in VisANT (http://visant.bu.edu/) as a biological filter. The reliability of hub genes was determined using a strict permutation test. Separate networks for normal prostate samples, and prostate cancer samples from African Americans (AA) and European Americans (EA) were generated and compared. We found that the same hubs control disease progression in AA and EA networks. Combining AA and EA samples, we generated networks for low (<7) and high (≥7) Gleason grade tumors. A comparison of their major hubs with those of the network for normal samples identified two types of changes associated with disease: (i) Some hub genes increased their degree in the tumor network compared to their degree in the normal network, suggesting that these genes are associated with gain of regulatory control in cancer (e.g. possible turning on of oncogenes). (ii) Some hubs reduced their degree in the tumor network compared to their degree in the normal network, suggesting that these genes are associated with loss of regulatory control in cancer (e.g. possible loss of tumor suppressor genes). A striking result was that for both AA and EA tumor samples, STAT5a, CEBPB and EGR1 are major hubs that gain neighbors compared to the normal prostate network. Conversely, HIF-1α is a major hub that loses connections in the prostate cancer network compared to the normal prostate network.
We also find that the degree of these hubs changes progressively from normal to low grade to high grade disease, suggesting that these hubs are master regulators of prostate cancer and mark disease progression. STAT5a was identified as a central hub, with ~120 neighbors in the prostate cancer network and only 81 neighbors in the normal prostate network. Of the 120 neighbors of STAT5a, 57 are known cancer related genes, known to be involved in functional pathways associated with tumorigenesis. Our method is general and can easily be extended to identify and study networks associated with any two phenotypes.
No preview · Article · Jul 2010 · Genome informatics. International Conference on Genome Informatics
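The hub comparison in the abstract above rests on two simple quantities: a node's degree (how many genes it is significantly associated with) and the change in that degree between the normal and tumor networks. A minimal sketch follows, assuming the significant gene pairs have already been determined by some upstream test; the edge lists, gene labels and function names are hypothetical, and the permutation test and transcription-factor filter from the abstract are not implemented.

```python
from collections import Counter


def node_degrees(edges):
    """Count, for each gene, the number of significant associations it takes part in."""
    degree = Counter()
    for gene_a, gene_b in edges:
        degree[gene_a] += 1
        degree[gene_b] += 1
    return degree


def degree_changes(normal_edges, tumor_edges):
    """Tumor degree minus normal degree for every gene seen in either network."""
    normal = node_degrees(normal_edges)
    tumor = node_degrees(tumor_edges)
    # Counter returns 0 for genes missing from one network.
    return {gene: tumor[gene] - normal[gene] for gene in set(normal) | set(tumor)}


def top_hubs(edges, count):
    """Genes with the highest degree, as (gene, degree) pairs."""
    return node_degrees(edges).most_common(count)


# Hypothetical edge lists: each pair is an association judged significant upstream.
normal_edges = [("hubA", "g1"), ("hubB", "g1"), ("hubB", "g2"), ("hubB", "g3")]
tumor_edges = [("hubA", "g1"), ("hubA", "g2"), ("hubA", "g3"), ("hubB", "g1")]

changes = degree_changes(normal_edges, tumor_edges)
```

In this toy example `hubA` gains two neighbors in the tumor network and `hubB` loses two, mirroring the two kinds of change the abstract describes.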
ABSTRACT: Surprising correlations between human disease phenotypes are emerging. Recent work now reveals startling phenotype connections between species, which could provide new disease models.
ABSTRACT: To identify a robust panel of microRNA signatures that can classify tumor from normal kidney using microRNA expression levels. Mounting evidence suggests that microRNAs are key players in essential cellular processes and that their expression pattern can serve as diagnostic biomarkers for cancerous tissues.
We selected 28 clear-cell type human renal cell carcinoma (ccRCC) samples from patient-matched specimens to perform high-throughput, quantitative real-time polymerase chain reaction analysis of microRNA expression levels. The data were subjected to rigorous statistical analyses and hierarchical clustering to produce a discrete set of microRNAs that can robustly distinguish ccRCC from their patient-matched normal kidney tissue samples with high confidence.
Thirty-five microRNAs were found that can robustly distinguish ccRCC from their patient-matched normal kidney tissue samples with high confidence. Among this set of 35 signature microRNAs, 26 were found to be consistently downregulated and 9 consistently upregulated in ccRCC relative to normal kidney samples. Two microRNAs, namely miR-155 and miR-21, commonly found to be upregulated in other cancers, and miR-210, induced by hypoxia, were also identified as overexpressed in ccRCC in our study. MicroRNAs identified as downregulated in our study can be correlated to common chromosome deletions in ccRCC.
Our analysis is a comprehensive, statistically relevant study that identifies the microRNAs dysregulated in ccRCC, which can serve as the basis of molecular markers for diagnosis.
ABSTRACT: A novel method is proposed for direct detection of DNA hybridization on microarrays. Optical interferometry is used for label-free sensing of biomolecular accumulation on glass surfaces, enabling dynamic detection of interactions. Capabilities of the presented method are demonstrated by high-throughput sensing of solid-phase hybridization of oligonucleotides. Hybridization of surface immobilized probes with 20 base pair-long target oligonucleotides was detected by comparing the label-free microarray images taken before and after hybridization. Through dynamic data acquisition during denaturation by washing the sample with low ionic concentration buffer, melting of duplexes with a single-nucleotide mismatch was distinguished from perfectly matching duplexes with high confidence (>97%). The presented technique is simple, robust, and accurate, and eliminates the need for labels or secondary reagents to monitor the oligonucleotide hybridization.
Full-text · Article · Mar 2010 · Biosensors & Bioelectronics