ABSTRACT: The COMBREX database (COMBREX-DB; combrex.bu.edu) is an online repository of information related to (i) experimentally determined protein function, (ii) predicted protein
function, and (iii) relationships among proteins of unknown function and various types of experimental data, including molecular
function, protein structure, and associated phenotypes. The database was created as part of the novel COMBREX (COMputational
BRidges to EXperiments) effort aimed at accelerating the rate of gene function validation. It currently holds information
on ∼3.3 million known and predicted proteins from over 1000 completely sequenced bacterial and archaeal genomes. The database
also contains a prototype recommendation system for helping users identify those proteins whose experimental determination
of function would be most informative for predicting function for other proteins within protein families. The emphasis on
documenting experimental evidence for function predictions, and the prioritization of uncharacterized proteins for experimental
testing distinguish COMBREX from other publicly available microbial genomics resources. This article describes updates to
COMBREX-DB since an initial description in the 2011 NAR Database Issue.
Preview · Article · Dec 2015 · Nucleic Acids Research
ABSTRACT: Background
Copy number variations (CNVs) are increasingly recognized as significant disease susceptibility markers in many complex disorders including cancer. The availability of a large number of chromosomal copy number profiles in both malignant and normal tissues in cancer patients presents an opportunity to characterize not only somatic alterations but also germline CNVs, which may confer increased risk for cancer.
Results
We explored the germline CNVs in five cancer cohorts from the Cancer Genome Atlas (TCGA) consisting of 351 brain, 336 breast, 342 colorectal, 370 renal, and 314 ovarian cancers, genotyped on Affymetrix SNP6.0 arrays. Comparing these to ~3000 normal controls from another study, our case-control association study revealed 39 genomic loci (9 brain, 3 breast, 4 colorectal, 11 renal, and 12 ovarian cancers) as potential candidates of tumor susceptibility loci. Many of these loci are new and in some cases are associated with a substantial increase in disease risk. The majority of the observed loci do not overlap with coding sequences; however, several observed genomic loci overlap with known cancer genes including RET in brain cancers, ERBB2 in renal cell carcinomas, and DCC in ovarian cancers, all of which have not been previously associated with germline changes in cancer.
Conclusions
This large-scale genome-wide association study for CNVs across multiple cancer types identified several novel rare germline CNVs as cancer predisposing genomic loci. These loci can potentially serve as clinically useful markers conferring increased cancer risk.
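As a rough illustration of what a per-locus case-control test can look like, the sketch below computes an odds ratio and a two-sided Fisher exact p-value for a single CNV locus. The abstract does not say which statistical test the authors used, so the choice of Fisher's exact test, the function names, and the carrier counts in the usage lines are all assumptions made for illustration only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    a = cases carrying the CNV, b = cases without it,
    c = controls carrying the CNV, d = controls without it.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers among cases, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def odds_ratio(a, b, c, d):
    """Odds ratio with a 0.5 added to every cell so empty cells do not divide by zero."""
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))


# Hypothetical counts (not from the study): 12 of 351 cases and 8 of 3000
# controls carry a deletion at some locus.
p_value = fisher_exact_two_sided(12, 351 - 12, 8, 3000 - 8)
effect = odds_ratio(12, 351 - 12, 8, 3000 - 8)
print(f"odds ratio = {effect:.1f}, p = {p_value:.2e}")
```

In a genome-wide scan a test like this would be repeated at every candidate locus, so the resulting p-values would need a multiple-testing correction before any locus is called significant.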
ABSTRACT: Dysregulated muscle metabolism is a cardinal feature of human insulin resistance (IR) and associated diseases, including type 2 diabetes (T2D). However, specific reactions contributing to abnormal energetics and metabolic inflexibility in IR are unknown.
We utilize flux balance computational modeling to develop the first systems-level analysis of IR metabolism in fasted and fed states, and varying nutrient conditions. We systematically perturb the metabolic network to identify reactions that reproduce key features of IR-linked metabolism.
While reduced glucose uptake is a major hallmark of IR, model-based reductions in either extracellular glucose availability or uptake do not alter metabolic flexibility, and thus are not sufficient to fully recapitulate IR-linked metabolism. Moreover, experimentally-reduced flux through single reactions does not reproduce key features of IR-linked metabolism. However, dual knockdowns of pyruvate dehydrogenase (PDH), in combination with reduced lipid uptake or lipid/amino acid oxidation (ETFDH), do reduce ATP synthesis, TCA cycle flux, and metabolic flexibility. Experimental validation demonstrates robust impact of dual knockdowns in PDH/ETFDH on cellular energetics and TCA cycle flux in cultured myocytes. Parallel analysis of transcriptomic and metabolomics data in humans with IR and T2D demonstrates downregulation of PDH subunits and upregulation of its inhibitory kinase PDK4, both of which would be predicted to decrease PDH flux, concordant with the model.
Our results indicate that complex interactions between multiple biochemical reactions contribute to metabolic perturbations observed in human IR, and that the PDH complex plays a key role in these metabolic phenotypes.
Full-text · Article · Jan 2015 · Molecular Metabolism
ABSTRACT: Intratumor genetic heterogeneity reflects the evolutionary history of a cancer and is thought to influence treatment outcomes. Here we report that a simple PCR-based assay interrogating somatic variation in hypermutable polyguanine (poly-G) repeats can provide a rapid and reliable assessment of mitotic history and clonal architecture in human cancer. We use poly-G repeat genotyping to study the evolution of colon carcinoma. In a cohort of 22 patients, we detect poly-G variants in 91% of tumors. Patient age is positively correlated with somatic mutation frequency, suggesting that some poly-G variants accumulate before the onset of carcinogenesis during normal division in colonic stem cells. Poorly differentiated tumors have fewer mutations than well-differentiated tumors, possibly indicating a shorter mitotic history of the founder cell in these cancers. We generate poly-G mutation profiles of spatially separated samples from primary carcinomas and matched metastases to build well-supported phylogenetic trees that illuminate individual patients' path of metastatic progression. Our results show varying degrees of intratumor heterogeneity among patients. Finally, we show that poly-G mutations can be found in cancers other than colon carcinoma. Our approach can generate reliable maps of intratumor heterogeneity in large numbers of patients with minimal time and cost expenditure.
Preview · Article · Apr 2014 · Proceedings of the National Academy of Sciences
ABSTRACT: Developmental fate decisions are dictated by master transcription factors (TFs) that interact with cis-regulatory elements to direct transcriptional programs. Certain malignant tumors may also depend on cellular hierarchies reminiscent of normal development but superimposed on underlying genetic aberrations. In glioblastoma (GBM), a subset of stem-like tumor-propagating cells (TPCs) appears to drive tumor progression and underlie therapeutic resistance yet remain poorly understood. Here, we identify a core set of neurodevelopmental TFs (POU3F2, SOX2, SALL2, and OLIG2) essential for GBM propagation. These TFs coordinately bind and activate TPC-specific regulatory elements and are sufficient to fully reprogram differentiated GBM cells to "induced" TPCs, recapitulating the epigenetic landscape and phenotype of native TPCs. We reconstruct a network model that highlights critical interactions and identifies candidate therapeutic targets for eliminating TPCs. Our study establishes the epigenetic basis of a developmental hierarchy in GBM, provides detailed insight into underlying gene regulatory programs, and suggests attendant therapeutic strategies.
ABSTRACT: Flux balance analysis and constraint based modeling have been successfully used in the past to elucidate the metabolism of single-celled organisms. However, limited work has been done with multicellular organisms and even less with humans. The focus of this paper is to present a novel use of this technique by investigating human nutrition, a challenging field of study. Specifically, we present a steady state constraint based model of skeletal muscle tissue to investigate amino acid supplementation's effect on protein synthesis. We implement several in silico supplementation strategies to study whether amino acid supplementation might be beneficial for increasing muscle contractile protein synthesis. Consistent with published data on amino acid supplementation's effect on protein synthesis in a post resistance exercise state, our results suggest that increasing bioavailability of methionine, arginine, and the branched-chain amino acids can increase the flux of contractile protein synthesis. The study also suggests that a common commercial supplement, glutamine, is not an effective supplement in the context of increasing protein synthesis and thus, muscle mass. Similar to any study in a model organism, the computational modeling of this research has some limitations. Thus, this paper introduces the prospect of using systems biology as a framework to formally investigate how supplementation and nutrition can affect human metabolism and physiology.
ABSTRACT: Experimental data exist for only a vanishingly small fraction of sequenced microbial genes. This community page discusses the progress made by the COMBREX project to address this important issue using both computational and experimental resources.
ABSTRACT: Tissue engineering and molecular systems biology are inherently interdisciplinary fields that have been developed independently so far. In this review, we first provide a brief introduction to tissue engineering and to molecular systems biology. Next, we highlight some prominent applications of systems biology techniques in tissue engineering. Finally, we outline research directions that can successfully blend these two fields. Through these examples, we propose that experimental and computational advances in molecular systems biology can lead to predictive models of bioengineered tissues that enhance our understanding of bioengineered systems. In turn, the unique challenges posed by tissue engineering will usher in new experimental techniques and computational advances in systems biology.
No preview · Article · Jul 2013 · Annual review of biomedical engineering
ABSTRACT: The functional characterization of Open Reading Frames (ORFs) from sequenced genomes remains a bottleneck in our effort to understand microbial biology. In particular, the functional characterization of proteins with only remote sequence homology to known proteins can be challenging, as there may be few clues to guide initial experiments. Affinity enrichment of proteins from cell lysates, and a global perspective of protein function as provided by COMBREX, affords an approach to this problem. We present here the biochemical analysis of six proteins from Helicobacter pylori ATCC 26695, a focus organism in COMBREX. Initial hypotheses were based upon affinity capture of proteins from total cellular lysate using derivatized nano-particles, and subsequent identification by mass spectrometry. Candidate genes encoding these proteins were cloned and expressed in Escherichia coli, and the recombinant proteins were purified and characterized biochemically and their biochemical parameters compared with the native ones. These proteins include a guanosine triphosphate (GTP) cyclohydrolase (HP0959), an ATPase (HP1079), an adenosine deaminase (HP0267), a phosphodiesterase (HP1042), and an aminopeptidase (HP1037); in addition, new substrates were characterized for a peptidoglycan deacetylase (HP0310). Generally, characterized enzymes were active at acidic to neutral pH (4.0-7.5) with temperature optima ranging from 35 to 55°C, although some exhibited outstanding characteristics.
ABSTRACT: Glioblastoma (GBM) is thought to be driven by a subpopulation of cancer stem cells (CSCs) that self-renew and recapitulate tumor heterogeneity yet remain poorly understood. Here, we present a comparative analysis of chromatin state in GBM CSCs that reveals widespread activation of genes normally held in check by Polycomb repressors. These activated targets include a large set of developmental transcription factors (TFs) whose coordinated activation is unique to the CSCs. We demonstrate that a critical factor in the set, ASCL1, activates Wnt signaling by repressing the negative regulator DKK1. We show that ASCL1 is essential for the maintenance and in vivo tumorigenicity of GBM CSCs. Genome-wide binding profiles for ASCL1 and the Wnt effector LEF-1 provide mechanistic insight and suggest widespread interactions between the TF module and the signaling pathway. Our findings demonstrate regulatory connections among ASCL1, Wnt signaling, and collaborating TFs that are essential for the maintenance and tumorigenicity of GBM CSCs.
ABSTRACT: Background
The dramatic reduction in the cost of sequencing has allowed many researchers to join in the effort of sequencing and annotating prokaryotic genomes. Annotation methods vary considerably and may fail to identify some genes. Here we draw attention to a large number of likely genes missing from annotations using common tools such as Glimmer and BLAST.
By analyzing 1,474 prokaryotic genome annotations in GenBank, we identify 13,602 likely missed genes that are homologs to non-hypothetical proteins, and 11,792 likely missed genes that are homologs only to hypothetical proteins, yet have supporting evidence of their protein-coding nature from COMBREX, a newly created gene function database. We also estimate the likelihood that each potential missing gene found is a genuine protein-coding gene using COMBREX.
Our analysis of the causes of missed genes suggests that larger annotation centers tend to produce annotations with fewer missed genes than smaller centers, and many of the missed genes are short genes <300 bp. Over 1,000 of the likely missed genes could be associated with phenotype information available in COMBREX. 359 of these genes, found in pathogenic organisms, may be potential targets for pharmaceutical research. The newly identified genes are available on COMBREX’s website.
This article was reviewed by Daniel Haft, Arcady Mushegian, and M. Pilar Francino (nominated by David Ardell).
ABSTRACT: The oral microbiome, the complex ecosystem of microbes inhabiting the human mouth, harbors several thousands of bacterial types. The proliferation of pathogenic bacteria within the mouth gives rise to periodontitis, an inflammatory disease known to also constitute a risk factor for cardiovascular disease. While much is known about individual species associated with pathogenesis, the system-level mechanisms underlying the transition from health to disease are still poorly understood. Through the sequencing of the 16S rRNA gene and of whole community DNA we provide a glimpse at the global genetic, metabolic, and ecological changes associated with periodontitis in 15 subgingival plaque samples, four from each of two periodontitis patients, and the remaining samples from three healthy individuals. We also demonstrate the power of whole-metagenome sequencing approaches in characterizing the genomes of key players in the oral microbiome, including an unculturable TM7 organism. We reveal the disease microbiome to be enriched in virulence factors, and adapted to a parasitic lifestyle that takes advantage of the disrupted host homeostasis. Furthermore, diseased samples share a common structure that was not found in completely healthy samples, suggesting that the disease state may occupy a narrow region within the space of possible configurations of the oral microbiome. Our pilot study demonstrates the power of high-throughput sequencing as a tool for understanding the role of the oral microbiome in periodontal disease. Despite a modest level of sequencing (~2 lanes Illumina 76 bp PE) and high human DNA contamination (up to ~90%) we were able to partially reconstruct several oral microbes and to preliminarily characterize some systems-level differences between the healthy and diseased oral microbiomes.
ABSTRACT: COMBREX (http://combrex.bu.edu) is a project to increase the speed of the functional annotation of new bacterial and archaeal genomes. It consists of a database of functional predictions produced by computational biologists and a mechanism for experimental biochemists to bid for the validation of those predictions. Small grants are available to support successful bids.
Full-text · Article · Jan 2011 · Nucleic Acids Research
ABSTRACT: Type 2 diabetes and obesity are increasingly affecting human populations around the world. Our goal was to identify early molecular signatures predicting genetic risk to these metabolic diseases using two strains of mice that differ greatly in disease susceptibility.
We integrated metabolic characterization, gene expression, protein-protein interaction networks, RT-PCR, and flow cytometry analyses of adipose, skeletal muscle, and liver tissue of diabetes-prone C57BL/6NTac (B6) mice and diabetes-resistant 129S6/SvEvTac (129) mice at 6 weeks and 6 months of age.
At 6 weeks of age, B6 mice were metabolically indistinguishable from 129 mice; however, adipose tissue showed a consistent gene expression signature that differentiated between the strains. In particular, immune system gene networks and inflammatory biomarkers were upregulated in adipose tissue of B6 mice, despite a low normal fat mass. This was accompanied by increased T-cell and macrophage infiltration. The expression of the same networks and biomarkers, particularly those related to T-cells, further increased in adipose tissue of B6 mice, but only minimally in 129 mice, in response to weight gain promoted by age or high-fat diet, further exacerbating the differences between strains.
Insulin resistance in mice with differential susceptibility to diabetes and metabolic syndrome is preceded by differences in the inflammatory response of adipose tissue. This phenomenon may serve as an early indicator of disease and contribute to disease susceptibility and progression.
ABSTRACT: Methylthiotransferases (MTTases) are a closely related family of proteins that perform both radical-S-adenosylmethionine (SAM) mediated sulfur insertion and SAM-dependent methylation to modify nucleic acid or protein targets
with a methyl thioether group (–SCH3). Members of two of the four known subgroups of MTTases have been characterized, typified by MiaB, which modifies N6-isopentenyladenosine (i6A) to 2-methylthio-N6-isopentenyladenosine (ms2i6A) in tRNA, and RimO, which modifies a specific aspartate residue in ribosomal protein S12. In this work, we have characterized
the two MTTases encoded by Bacillus subtilis 168 and find that, consistent with bioinformatic predictions, ymcB is required for ms2i6A formation (MiaB activity), and yqeV is required for modification of N6-threonylcarbamoyladenosine (t6A) to 2-methylthio-N6-threonylcarbamoyladenosine (ms2t6A) in tRNA. The enzyme responsible for the latter activity belongs to a third MTTase subgroup, no member of which has previously
been characterized. We performed domain-swapping experiments between YmcB and YqeV to narrow down the protein domain(s) responsible
for distinguishing i6A from t6A and found that the C-terminal TRAM domain, putatively involved with RNA binding, is likely not involved with this discrimination.
Finally, we performed a computational analysis to identify candidate residues outside the TRAM domain that may be involved
with substrate recognition. These residues represent interesting targets for further analysis.
Full-text · Article · Oct 2010 · Nucleic Acids Research
ABSTRACT: Even though a vaccine for malaria infections has been under intense study for many years, it has resisted several different lines of attack attempted by biologists. More than half of Plasmodium proteins still remain uncharacterized and therefore cannot be used in clinical trials. The task is further complicated by the metamorphic life-cycle of the parasite, which allows for rapid evolutionary changes and diversity among related strains, thus making precise targeting of the appropriate proteins for vaccination a technical challenge. We propose an automated method for predicting functions for the malaria parasite, which capitalizes on the importance of the intraerythrocytic developmental cycle data and expression changes during its five phases, as determined computationally by our segmentation algorithm.
Our method combines temporal gene expression profiles with protein-protein interaction data, sequence similarity scores, and metabolic pathway information to produce a set of predicted protein functions that can be used as targets for vaccine development. We use a Bayesian approach, which assigns a probability of having (or not having) a particular function to each protein, given the various sources of evidence. In our method, each data source is represented by either a functional linkage graph or a categorical feature vector.
The methods are tested on Plasmodium falciparum, the species responsible for the deadliest malaria infections. The algorithm was able to assign meaningful functions to 628 out of 1439 previously unannotated proteins, which are first-choice candidates for experimental vaccine research. We conclude that analyzing time-course gene expression profiles in separate phases leads to much higher prediction accuracy when compared with Pearson correlation coefficients computed across the time course as a whole. Additionally, we demonstrate that temporal expression profiles alone are able to improve the predictive power of the integrated data.
Preview · Article · Jul 2010 · Artificial intelligence in medicine
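The idea of assigning each protein a probability of having a function given several evidence sources can be sketched with a naive Bayes update, in which the prior odds are multiplied by one likelihood ratio per source. This is only a minimal illustration under a conditional-independence assumption; the abstract does not specify the model at this level of detail, and the function name, source names, and all numbers below are hypothetical.

```python
from math import prod


def posterior_probability(prior, likelihood_ratios):
    """Return P(function | evidence) from a prior and per-source likelihood ratios.

    Each ratio is P(evidence | has function) / P(evidence | lacks function).
    Sources are assumed conditionally independent (an illustrative assumption).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical likelihood ratios for one protein and one candidate function.
evidence = {
    "expression_profile": 4.0,    # phase-specific co-expression with annotated genes
    "protein_interaction": 2.5,   # interacts with proteins that have the function
    "sequence_similarity": 1.5,   # weak homology to an annotated protein
    "pathway_membership": 0.8,    # mild evidence against
}
p = posterior_probability(0.05, evidence.values())
print(f"posterior probability = {p:.3f}")
```

A ratio above 1 raises the probability and a ratio below 1 lowers it, so adding an informative source such as temporal expression profiles can shift predictions that the other sources leave ambiguous.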
ABSTRACT: Aberrant activation of signaling pathways drives many of the fundamental biological processes that accompany tumor initiation and progression. Inappropriate phosphorylation of intermediates in these signaling pathways is a frequently observed molecular lesion that accompanies the undesirable activation or repression of pro- and anti-oncogenic pathways. Therefore, methods which directly query signaling pathway activation via phosphorylation assays in individual cancer biopsies are expected to provide important insights into the molecular "logic" that distinguishes cancer and normal tissue on one hand, and enables personalized intervention strategies on the other.
We first document the largest available set of tyrosine phosphorylation sites that are, individually, differentially phosphorylated in lung cancer, thus providing an immediate set of drug targets. Next, we develop a novel computational methodology to identify pathways whose phosphorylation activity is strongly correlated with the lung cancer phenotype. Finally, we demonstrate the feasibility of classifying lung cancers based on multi-variate phosphorylation signatures.
Highly predictive and biologically transparent phosphorylation signatures of lung cancer provide evidence for the existence of a robust set of phosphorylation mechanisms (captured by the signatures) present in the majority of lung cancers, and that reliably distinguish each lung cancer from normal. This approach should improve our understanding of cancer and help guide its treatment, since the phosphorylation signatures highlight proteins and pathways whose phosphorylation should be inhibited in order to prevent unregulated proliferation.
ABSTRACT: Length variation in short tandem repeats (STRs) is an important family of DNA polymorphisms with numerous applications in genetics, medicine, forensics, and evolutionary analysis. Several major diseases have been associated with length variation of trinucleotide (triplet) repeats including Huntington's disease, hereditary ataxias and spinobulbar muscular atrophy. Using the reference human genome, we have catalogued all triplet repeats in genic regions. This data revealed a bias in noncoding DNA repeat lengths. It also enabled a survey of repeat-length polymorphisms (RLPs) in human genomes and a comparison of the rate of polymorphism in humans versus divergence from chimpanzee. For short repeats, this analysis of three human genomes reveals a relatively low RLP rate in exons and, somewhat surprisingly, in introns. All short RLPs observed in multiple genomes are biallelic (at least in this small sample). In contrast, long repeats are highly polymorphic and some long RLPs are multiallelic. For long repeats, the chimpanzee sequence frequently differs from all observed human alleles. This suggests a high expansion/contraction rate in all long repeats. Expansions and contractions are not, however, affected by natural selection discernable from our comparison of human-chimpanzee divergence with human RLPs. Our catalog of human triplet repeats and their surrounding flanking regions can be used to produce a cost-effective whole-genome assay to test individuals. This repeat assay could someday complement SNP arrays for producing tests that assess the risk of an individual to develop a disease, or become part of personalized genomic strategy that provides therapeutic guidance with respect to drug response.
Full-text · Article · Oct 2009 · Proceedings of the National Academy of Sciences
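Cataloguing triplet repeats in a sequence can be sketched with a regular-expression scan for a three-base unit repeated in tandem. The abstract does not describe the authors' actual pipeline, so the function name, the minimum of four repeat units, and the decision to skip single-base runs such as AAAAAA are assumptions chosen for illustration.

```python
import re


def find_triplet_repeats(sequence, min_units=4):
    """Return (start, unit, n_units) for each perfect tandem triplet repeat.

    start is the 0-based position of the repeat, unit is the repeated
    three-base motif, and n_units is how many times it occurs in a row.
    """
    # One captured triplet followed by at least (min_units - 1) more copies.
    pattern = re.compile(r"([ACGT]{3})\1{%d,}" % (min_units - 1))
    repeats = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # a run of one base is a mononucleotide repeat, not a triplet
        n_units = (match.end() - match.start()) // 3
        repeats.append((match.start(), unit, n_units))
    return repeats


# Toy sequence with a CAG repeat of four units starting at position 2.
print(find_triplet_repeats("TTCAGCAGCAGCAGTT"))
```

Comparing `n_units` at the same locus across several genomes would give a simple repeat-length polymorphism call; imperfect or interrupted repeats would need a more tolerant matcher than this exact-match pattern.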