ABSTRACT: Experimental data exist for only a vanishingly small fraction of sequenced microbial genes. This community page discusses the progress made by the COMBREX project to address this important issue using both computational and experimental resources.
ABSTRACT: Tissue engineering and molecular systems biology are inherently interdisciplinary fields that have been developed independently so far. In this review, we first provide a brief introduction to tissue engineering and to molecular systems biology. Next, we highlight some prominent applications of systems biology techniques in tissue engineering. Finally, we outline research directions that can successfully blend these two fields. Through these examples, we propose that experimental and computational advances in molecular systems biology can lead to predictive models of bioengineered tissues that enhance our understanding of bioengineered systems. In turn, the unique challenges posed by tissue engineering will usher in new experimental techniques and computational advances in systems biology.
ABSTRACT: Glioblastoma (GBM) is thought to be driven by a subpopulation of cancer stem cells (CSCs) that self-renew and recapitulate tumor heterogeneity yet remain poorly understood. Here, we present a comparative analysis of chromatin state in GBM CSCs that reveals widespread activation of genes normally held in check by Polycomb repressors. These activated targets include a large set of developmental transcription factors (TFs) whose coordinated activation is unique to the CSCs. We demonstrate that a critical factor in the set, ASCL1, activates Wnt signaling by repressing the negative regulator DKK1. We show that ASCL1 is essential for the maintenance and in vivo tumorigenicity of GBM CSCs. Genome-wide binding profiles for ASCL1 and the Wnt effector LEF-1 provide mechanistic insight and suggest widespread interactions between the TF module and the signaling pathway. Our findings demonstrate regulatory connections among ASCL1, Wnt signaling, and collaborating TFs that are essential for the maintenance and tumorigenicity of GBM CSCs.
ABSTRACT: Flux balance analysis and constraint-based modeling have been used successfully to elucidate the metabolism of single-celled organisms. However, limited work has been done with multicellular organisms and even less with humans. The focus of this paper is to present a novel use of this technique by investigating human nutrition, a challenging field of study. Specifically, we present a steady-state constraint-based model of skeletal muscle tissue to investigate the effect of amino acid supplementation on protein synthesis. We implement several in silico supplementation strategies to study whether amino acid supplementation might be beneficial for increasing muscle contractile protein synthesis. Consistent with published data on the effect of amino acid supplementation on protein synthesis in a post-resistance-exercise state, our results suggest that increasing bioavailability of methionine, arginine, and the branched-chain amino acids can increase the flux of contractile protein synthesis. The study also suggests that a common commercial supplement, glutamine, is not an effective supplement in the context of increasing protein synthesis and thus muscle mass. As with any study in a model organism, the computational modeling in this research has some limitations. Thus, this paper introduces the prospect of using systems biology as a framework to formally investigate how supplementation and nutrition can affect human metabolism and physiology.
PLoS ONE 01/2013; 8(8):e68751. · 3.73 Impact Factor
ABSTRACT: The functional characterization of Open Reading Frames (ORFs) from sequenced genomes remains a bottleneck in our effort to understand microbial biology. In particular, the functional characterization of proteins with only remote sequence homology to known proteins can be challenging, as there may be few clues to guide initial experiments. Affinity enrichment of proteins from cell lysates, and a global perspective of protein function as provided by COMBREX, affords an approach to this problem. We present here the biochemical analysis of six proteins from Helicobacter pylori ATCC 26695, a focus organism in COMBREX. Initial hypotheses were based upon affinity capture of proteins from total cellular lysate using derivatized nano-particles, and subsequent identification by mass spectrometry. Candidate genes encoding these proteins were cloned and expressed in Escherichia coli, and the recombinant proteins were purified and characterized biochemically and their biochemical parameters compared with the native ones. These proteins include a guanosine triphosphate (GTP) cyclohydrolase (HP0959), an ATPase (HP1079), an adenosine deaminase (HP0267), a phosphodiesterase (HP1042), an aminopeptidase (HP1037), and new substrates were characterized for a peptidoglycan deacetylase (HP0310). Generally, characterized enzymes were active at acidic to neutral pH (4.0-7.5) with temperature optima ranging from 35 to 55°C, although some exhibited outstanding characteristics.
PLoS ONE 01/2013; 8(6):e66605. · 3.73 Impact Factor
ABSTRACT: BACKGROUND: The dramatic reduction in the cost of sequencing has allowed many researchers to join in the effort of sequencing and annotating prokaryotic genomes. Annotation methods vary considerably and may fail to identify some genes. Here we draw attention to a large number of likely genes missing from annotations using common tools such as Glimmer and BLAST. RESULTS: By analyzing 1,474 prokaryotic genome annotations in GenBank, we identify 13,602 likely missed genes that are homologs to non-hypothetical proteins, and 11,792 likely missed genes that are homologs only to hypothetical proteins, yet have supporting evidence of their protein-coding nature from COMBREX, a newly created gene function database. We also estimate the likelihood that each potential missing gene found is a genuine protein-coding gene using COMBREX. CONCLUSIONS: Our analysis of the causes of missed genes suggests that larger annotation centers tend to produce annotations with fewer missed genes than smaller centers, and many of the missed genes are short genes <300 bp. Over 1,000 of the likely missed genes could be associated with phenotype information available in COMBREX. 359 of these genes, found in pathogenic organisms, may be potential targets for pharmaceutical research. The newly identified genes are available on COMBREX's website. REVIEWERS: This article was reviewed by Daniel Haft, Arcady Mushegian, and M. Pilar Francino (nominated by David Ardell).
Biology Direct 10/2012; 7(1):37. · 2.72 Impact Factor
ABSTRACT: The oral microbiome, the complex ecosystem of microbes inhabiting the human mouth, harbors several thousands of bacterial types. The proliferation of pathogenic bacteria within the mouth gives rise to periodontitis, an inflammatory disease known to also constitute a risk factor for cardiovascular disease. While much is known about individual species associated with pathogenesis, the system-level mechanisms underlying the transition from health to disease are still poorly understood. Through the sequencing of the 16S rRNA gene and of whole community DNA we provide a glimpse at the global genetic, metabolic, and ecological changes associated with periodontitis in 15 subgingival plaque samples, four from each of two periodontitis patients, and the remaining samples from three healthy individuals. We also demonstrate the power of whole-metagenome sequencing approaches in characterizing the genomes of key players in the oral microbiome, including an unculturable TM7 organism. We reveal the disease microbiome to be enriched in virulence factors, and adapted to a parasitic lifestyle that takes advantage of the disrupted host homeostasis. Furthermore, diseased samples share a common structure that was not found in completely healthy samples, suggesting that the disease state may occupy a narrow region within the space of possible configurations of the oral microbiome. Our pilot study demonstrates the power of high-throughput sequencing as a tool for understanding the role of the oral microbiome in periodontal disease. Despite a modest level of sequencing (~2 lanes Illumina 76 bp PE) and high human DNA contamination (up to ~90%) we were able to partially reconstruct several oral microbes and to preliminarily characterize some systems-level differences between the healthy and diseased oral microbiomes.
PLoS ONE 01/2012; 7(6):e37919. · 3.73 Impact Factor
ABSTRACT: COMBREX (http://combrex.bu.edu) is a project to increase the speed of the functional annotation of new bacterial and archaeal genomes. It consists of a database of functional predictions produced by computational biologists and a mechanism for experimental biochemists to bid for the validation of those predictions. Small grants are available to support successful bids.
Nucleic Acids Research 01/2011; 39(Database issue):D11-4. · 8.28 Impact Factor
ABSTRACT: Type 2 diabetes and obesity are increasingly affecting human populations around the world. Our goal was to identify early molecular signatures predicting genetic risk to these metabolic diseases using two strains of mice that differ greatly in disease susceptibility.
We integrated metabolic characterization, gene expression, protein-protein interaction networks, RT-PCR, and flow cytometry analyses of adipose, skeletal muscle, and liver tissue of diabetes-prone C57BL/6NTac (B6) mice and diabetes-resistant 129S6/SvEvTac (129) mice at 6 weeks and 6 months of age.
At 6 weeks of age, B6 mice were metabolically indistinguishable from 129 mice; however, adipose tissue showed a consistent gene expression signature that differentiated between the strains. In particular, immune system gene networks and inflammatory biomarkers were upregulated in adipose tissue of B6 mice, despite a low normal fat mass. This was accompanied by increased T-cell and macrophage infiltration. The expression of the same networks and biomarkers, particularly those related to T-cells, further increased in adipose tissue of B6 mice, but only minimally in 129 mice, in response to weight gain promoted by age or high-fat diet, further exacerbating the differences between strains.
Insulin resistance in mice with differential susceptibility to diabetes and metabolic syndrome is preceded by differences in the inflammatory response of adipose tissue. This phenomenon may serve as an early indicator of disease and contribute to disease susceptibility and progression.
ABSTRACT: Methylthiotransferases (MTTases) are a closely related family of proteins that perform both radical-S-adenosylmethionine (SAM) mediated sulfur insertion and SAM-dependent methylation to modify nucleic acid or protein targets with a methyl thioether group (-SCH3). Members of two of the four known subgroups of MTTases have been characterized, typified by MiaB, which modifies N6-isopentenyladenosine (i6A) to 2-methylthio-N6-isopentenyladenosine (ms2i6A) in tRNA, and RimO, which modifies a specific aspartate residue in ribosomal protein S12. In this work, we have characterized the two MTTases encoded by Bacillus subtilis 168 and find that, consistent with bioinformatic predictions, ymcB is required for ms2i6A formation (MiaB activity), and yqeV is required for modification of N6-threonylcarbamoyladenosine (t6A) to 2-methylthio-N6-threonylcarbamoyladenosine (ms2t6A) in tRNA. The enzyme responsible for the latter activity belongs to a third MTTase subgroup, no member of which has previously been characterized. We performed domain-swapping experiments between YmcB and YqeV to narrow down the protein domain(s) responsible for distinguishing i6A from t6A and found that the C-terminal TRAM domain, putatively involved with RNA binding, is likely not involved with this discrimination. Finally, we performed a computational analysis to identify candidate residues outside the TRAM domain that may be involved with substrate recognition. These residues represent interesting targets for further analysis.
Nucleic Acids Research 10/2010; 38(18):6195-205. · 8.28 Impact Factor
ABSTRACT: Even though a vaccine for malaria infections has been under intense study for many years, it has resisted several different lines of attack attempted by biologists. More than half of Plasmodium proteins still remain uncharacterized and therefore cannot be used in clinical trials. The task is further complicated by the metamorphic life-cycle of the parasite, which allows for rapid evolutionary changes and diversity among related strains, thus making precise targeting of the appropriate proteins for vaccination a technical challenge. We propose an automated method for predicting functions for the malaria parasite, which capitalizes on the importance of the intraerythrocytic developmental cycle data and expression changes during its five phases, as determined computationally by our segmentation algorithm.
Our method combines temporal gene expression profiles with protein-protein interaction data, sequence similarity scores, and metabolic pathway information to produce a set of predicted protein functions that can be used as targets for vaccine development. We use a Bayesian approach, which assigns a probability of having (or not having) a particular function to each protein, given the various sources of evidence. In our method, each data source is represented by either a functional linkage graph or a categorical feature vector.
The methods are tested on Plasmodium falciparum, the species responsible for the deadliest malaria infections. The algorithm was able to assign meaningful functions to 628 out of 1439 previously unannotated proteins, which are first-choice candidates for experimental vaccine research. We conclude that analyzing time-course gene expression profiles in separate phases leads to much higher prediction accuracy when compared with Pearson correlation coefficients computed across the time course as a whole. Additionally, we demonstrate that temporal expression profiles alone are able to improve the predictive power of the integrated data.
Artificial intelligence in medicine 07/2010; 49(3):167-76. · 1.65 Impact Factor
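The comparison drawn in the abstract above, between correlation computed within separate phases and correlation computed over the whole time course, can be illustrated with a minimal sketch. It assumes the phase boundaries are already known; the segmentation algorithm itself is not described on this page and is not reproduced here. The function names `pearson` and `phase_correlations` and the toy profiles are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def phase_correlations(x, y, boundaries):
    """Correlate two profiles separately within each (start, end) phase, end exclusive.

    The boundaries are assumed to be given; they stand in for the output of a
    segmentation step that is not shown here.
    """
    return [pearson(x[s:e], y[s:e]) for s, e in boundaries]

# Toy profiles: co-expressed in the first phase, anti-correlated in the second.
gene_a = [1, 2, 3, 3, 2, 1]
gene_b = [2, 4, 6, 1, 2, 3]
whole = pearson(gene_a, gene_b)
per_phase = phase_correlations(gene_a, gene_b, [(0, 3), (3, 6)])
```

In this toy case the whole-course coefficient is weak while each phase shows a strong relationship, which is the kind of signal a phase-wise analysis is meant to expose.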
ABSTRACT: Length variation in short tandem repeats (STRs) is an important family of DNA polymorphisms with numerous applications in genetics, medicine, forensics, and evolutionary analysis. Several major diseases have been associated with length variation of trinucleotide (triplet) repeats including Huntington's disease, hereditary ataxias and spinobulbar muscular atrophy. Using the reference human genome, we have catalogued all triplet repeats in genic regions. These data revealed a bias in noncoding DNA repeat lengths. They also enabled a survey of repeat-length polymorphisms (RLPs) in human genomes and a comparison of the rate of polymorphism in humans versus divergence from chimpanzee. For short repeats, this analysis of three human genomes reveals a relatively low RLP rate in exons and, somewhat surprisingly, in introns. All short RLPs observed in multiple genomes are biallelic (at least in this small sample). In contrast, long repeats are highly polymorphic and some long RLPs are multiallelic. For long repeats, the chimpanzee sequence frequently differs from all observed human alleles. This suggests a high expansion/contraction rate in all long repeats. Expansions and contractions are not, however, affected by natural selection discernible from our comparison of human-chimpanzee divergence with human RLPs. Our catalog of human triplet repeats and their surrounding flanking regions can be used to produce a cost-effective whole-genome assay to test individuals. This repeat assay could someday complement SNP arrays for producing tests that assess the risk of an individual to develop a disease, or become part of personalized genomic strategy that provides therapeutic guidance with respect to drug response.
Proceedings of the National Academy of Sciences 10/2009; 106(40):17095-100. · 9.74 Impact Factor
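As a rough illustration of what cataloguing triplet repeats involves, the sketch below scans a DNA string for perfect tandem repeats of a three-base unit. It is not the pipeline used in the study: the minimum repeat count, the exclusion of single-base runs, and the name `find_triplet_repeats` are all assumptions made for the example, and imperfect or interrupted repeats are ignored.

```python
import re

# A three-base unit followed by one or more exact copies of itself.
TRIPLET = re.compile(r"([ACGT]{3})\1+")

def find_triplet_repeats(seq, min_units=4):
    """Return (start, unit, copies) for each perfect triplet repeat in seq.

    min_units is an illustrative threshold, not a value taken from the study.
    Runs of a single base (e.g. AAAAAA) are skipped as homopolymers.
    """
    hits = []
    for match in TRIPLET.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue
        copies = len(match.group(0)) // 3
        if copies >= min_units:
            hits.append((match.start(), unit, copies))
    return hits

# Toy sequence: five CAG units flanked by two bases on each side.
example = find_triplet_repeats("TT" + "CAG" * 5 + "TT")
```

Comparing the `copies` value found at the same locus in different genomes would then give a simple measure of repeat-length polymorphism.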
ABSTRACT: MOTIVATION: Type 2 diabetes is a chronic metabolic disease that involves both environmental and genetic factors. To understand the genetics of type 2 diabetes and insulin resistance, the Diabetes Genome Anatomy Project (DGAP) was launched to profile gene expression in a variety of related animal models and human subjects. We asked whether these heterogeneous models can be integrated to provide consistent and robust insights into the biology of insulin resistance. RESULTS: We perform integrative analysis of the 16 DGAP data sets that span multiple tissues, conditions, array types, laboratories, species, genetic backgrounds and study designs. For each data set, we identify differentially expressed genes compared with control. Then, for the combined data, we rank genes according to the frequency with which they were found to be statistically significant across data sets. This analysis reveals RetSat as a widely shared component of mechanisms involved in insulin resistance and sensitivity and adds to the growing importance of the retinol pathway in diabetes, adipogenesis and insulin resistance. Top candidates obtained from our analysis have been confirmed in recent laboratory studies.
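The ranking step described in the abstract above, ordering genes by how often they are significant across data sets, can be sketched in a few lines. The example assumes each data set has already been reduced to a set of significant gene identifiers; the function name `rank_by_frequency`, the data set labels, and the gene labels are placeholders rather than anything taken from the DGAP analysis.

```python
from collections import Counter

def rank_by_frequency(significant_by_dataset):
    """Rank genes by the number of data sets in which they were called significant.

    significant_by_dataset maps a data set label to an iterable of gene identifiers.
    Ties are broken alphabetically so the output order is deterministic.
    """
    counts = Counter()
    for genes in significant_by_dataset.values():
        counts.update(set(genes))  # count each gene at most once per data set
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Toy input with placeholder labels.
toy = {
    "dataset_1": ["g1", "g2"],
    "dataset_2": ["g1"],
    "dataset_3": ["g1", "g2", "g3"],
}
ranking = rank_by_frequency(toy)
```

A frequency count like this deliberately ignores effect sizes, which is one way to compare data sets from different array types and species without normalizing them against each other.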
ABSTRACT: MOTIVATION: There is a growing interest in improving the cluster analysis of expression data by incorporating into it prior knowledge, such as the Gene Ontology (GO) annotations of genes, in order to improve the biological relevance of the clusters that are subjected to subsequent scrutiny. The structure of the GO is another source of background knowledge that can be exploited through the use of semantic similarity. RESULTS: We propose here a novel algorithm that integrates semantic similarities (derived from the ontology structure) into the procedure of deriving clusters from the dendrogram constructed during expression-based hierarchical clustering. Our approach can handle the multiple annotations, from different levels of the GO hierarchy, which most genes have. Moreover, it treats annotated and unannotated genes in a uniform manner. Consequently, the clusters obtained by our algorithm are characterized by significantly enriched annotations. In both cross-validation tests and when using an external index such as protein-protein interactions, our algorithm performs better than previous approaches. When applied to human cancer expression data, our algorithm identifies, among others, clusters of genes related to immune response and glucose metabolism. These clusters are also supported by protein-protein interaction data.
ABSTRACT: To characterize the hormonal milieu and adipose gene expression in response to catch-up growth (CUG), a growth pattern associated with obesity and diabetes risk, in a mouse model of low birth weight (LBW).
ICR mice were food restricted by 50% from gestational days 12.5-18.5, reducing offspring birth weight by 25%. During the suckling period, dams were either fed ad libitum, permitting CUG in offspring, or food restricted, preventing CUG. Offspring were killed at age 3 weeks, and gonadal fat was removed for RNA extraction, array analysis, RT-PCR, and evaluation of cell size and number. Serum insulin, thyroxine (T4), corticosterone, and adipokines were measured.
At age 3 weeks, LBW mice with CUG (designated U-C) had body weight comparable with controls (designated C-C); weight was reduced by 49% in LBW mice without CUG (designated U-U). Adiposity was altered by postnatal nutrition, with gonadal fat increased by 50% in U-C and decreased by 58% in U-U mice (P < 0.05 vs. C-C mice). Adipose expression of the lipogenic genes Fasn, AccI, Lpin1, and Srebf1 was significantly increased in U-C compared with both C-C and U-U mice (P < 0.05). Mitochondrial DNA copy number was reduced by >50% in U-C versus U-U mice (P = 0.014). Although cell numbers did not differ, mean adipocyte diameter was increased in U-C and reduced in U-U mice (P < 0.01).
CUG results in increased adipose tissue lipogenic gene expression and adipocyte diameter but not increased cellularity, suggesting that catch-up fat is primarily associated with lipogenesis rather than adipogenesis in this murine model.
ABSTRACT: The traditional approach to studying complex biological networks is based on the identification of interactions between internal components of signaling or metabolic pathways. By comparison, little is known about interactions between higher order biological systems, such as biological pathways and processes. We propose a methodology for gleaning patterns of interactions between biological processes by analyzing protein-protein interactions, transcriptional co-expression and genetic interactions. At the heart of the methodology are the concept of Linked Processes and the resultant network of biological processes, the Process Linkage Network (PLN).
We construct, catalogue, and analyze different types of PLNs derived from different data sources and different species. When applied to the Gene Ontology, many of the resulting links connect processes that are distant from each other in the hierarchy, even though the connection makes eminent sense biologically. Some others, however, carry an element of surprise and may reflect mechanisms that are unique to the organism under investigation. In this aspect our method complements the link structure between processes inherent in the Gene Ontology, which by its very nature is species-independent. As a practical application of the linkage of processes we demonstrate that it can be effectively used in protein function prediction, having the power to increase both the coverage and the accuracy of predictions, when carefully integrated into prediction methods.
Our approach constitutes a promising new direction towards understanding the higher levels of organization of the cell as a system which should help current efforts to re-engineer ontologies and improve our ability to predict which proteins are involved in specific biological processes.
PLoS ONE 02/2009; 4(4):e5313. · 3.73 Impact Factor
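To make the idea of linking processes through protein-level data concrete, the sketch below counts, for each pair of processes, how many protein-protein interactions connect a protein annotated to one with a protein annotated to the other. This is only a raw co-occurrence count under assumed inputs; the criterion the paper actually uses to decide that two processes are linked is not given on this page, and the name `count_process_links` is hypothetical.

```python
from collections import Counter
from itertools import product

def count_process_links(edges, annotations):
    """Count interactions that bridge each unordered pair of distinct processes.

    edges is an iterable of (protein, protein) pairs; annotations maps a protein
    to the set of processes it is annotated with. Unannotated proteins are skipped.
    """
    links = Counter()
    for a, b in edges:
        for p, q in product(annotations.get(a, ()), annotations.get(b, ())):
            if p != q:
                links[tuple(sorted((p, q)))] += 1
    return links

# Toy network with placeholder protein and process labels.
toy_edges = [("A", "B"), ("B", "C")]
toy_annotations = {"A": {"p1"}, "B": {"p2"}, "C": {"p1", "p2"}}
toy_links = count_process_links(toy_edges, toy_annotations)
```

A real analysis would need to compare such counts against what is expected by chance before calling two processes linked, since heavily annotated processes accumulate bridging interactions trivially.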
ABSTRACT: Aberrant activation of signaling pathways drives many of the fundamental biological processes that accompany tumor initiation and progression. Inappropriate phosphorylation of intermediates in these signaling pathways is a frequently observed molecular lesion that accompanies the undesirable activation or repression of pro- and anti-oncogenic pathways. Therefore, methods that directly query signaling pathway activation via phosphorylation assays in individual cancer biopsies are expected to provide important insights into the molecular "logic" that distinguishes cancer and normal tissue on one hand, and enables personalized intervention strategies on the other.
We first document the largest available set of tyrosine phosphorylation sites that are, individually, differentially phosphorylated in lung cancer, thus providing an immediate set of drug targets. Next, we develop a novel computational methodology to identify pathways whose phosphorylation activity is strongly correlated with the lung cancer phenotype. Finally, we demonstrate the feasibility of classifying lung cancers based on multi-variate phosphorylation signatures.
Highly predictive and biologically transparent phosphorylation signatures of lung cancer provide evidence for the existence of a robust set of phosphorylation mechanisms (captured by the signatures) present in the majority of lung cancers, and that reliably distinguish each lung cancer from normal. This approach should improve our understanding of cancer and help guide its treatment, since the phosphorylation signatures highlight proteins and pathways whose phosphorylation should be inhibited in order to prevent unregulated proliferation.
PLoS ONE 01/2009; 4(11):e7994. · 3.73 Impact Factor
ABSTRACT: The study of gene function is critical in various genomic and proteomic fields. Due to the availability of tremendous amounts of different types of protein data, integrating these datasets to predict function has become a significant opportunity in computational biology. In this paper, to predict protein function we (i) develop a novel Bayesian framework combining relational, hierarchical and structural information with improvement in data usage efficiency over similar methods, and (ii) propose to use it in conjunction with an integrative protein-protein association network, STRING (Search Tool for the Retrieval of INteracting Genes/proteins), which combines information from seven different sources. At the heart of our work is accomplishing protein data integration in a concerted fashion with respect to algorithm and data source. Method performance is assessed by a 5-fold cross-validation in yeast on selected terms from the Molecular Function ontology in the Gene Ontology database. Results show that our combined use of the proposed computational framework and the protein network from STRING offers substantial improvements in prediction. The benefits of using an aggressively integrative network, such as STRING, may derive from the fact that although it is likely that the ultimate gene interaction matrix (including but not limited to protein-protein, genetic, or regulatory interactions) will be sparse, presently it is still known only incompletely in most organisms, and thus the use of multiple distinct data sources is rewarded.
Bioinformatics and Biomedicine, 2008. BIBM '08. IEEE International Conference on; 12/2008
ABSTRACT: In embryonic stem (ES) cells, bivalent chromatin domains with overlapping repressive (H3 lysine 27 tri-methylation) and activating (H3 lysine 4 tri-methylation) histone modifications mark the promoters of more than 2,000 genes. To gain insight into the structure and function of bivalent domains, we mapped key histone modifications and subunits of Polycomb-repressive complexes 1 and 2 (PRC1 and PRC2) genomewide in human and mouse ES cells by chromatin immunoprecipitation, followed by ultra high-throughput sequencing. We find that bivalent domains can be segregated into two classes -- the first occupied by both PRC2 and PRC1 (PRC1-positive) and the second specifically bound by PRC2 (PRC2-only). PRC1-positive bivalent domains appear functionally distinct as they more efficiently retain lysine 27 tri-methylation upon differentiation, show stringent conservation of chromatin state, and associate with an overwhelming number of developmental regulator gene promoters. We also used computational genomics to search for sequence determinants of Polycomb binding. This analysis revealed that the genomewide locations of PRC2 and PRC1 can be largely predicted from the locations, sizes, and underlying motif contents of CpG islands. We propose that large CpG islands depleted of activating motifs confer epigenetic memory by recruiting the full repertoire of Polycomb complexes in pluripotent cells.