ABSTRACT: The use of pathways and gene interaction networks for the analysis of differential expression experiments has allowed us to highlight the differences in gene expression profiles between samples from a systems biology perspective. The usefulness and accuracy of pathway analysis critically depend on our understanding of how genes interact with one another. That knowledge is continuously improving due to advances in next generation sequencing technologies and in computational methods. While most approaches treat pathways as independent entities, pathways actually coordinate to perform essential functions in a cell. In this work, we propose a methodology based on a sparse regression approach to find genes that act as intermediaries between, and interact with, two pathways. We model each gene in a pathway using a set of predictor genes, and a connection is formed between the pathway gene and a predictor gene if the sparse regression coefficient corresponding to the predictor gene is non-zero. A predictor gene is a shared neighbor gene of two pathways if it is connected to at least one gene in each pathway. We compare the sparse regression approach to Weighted Correlation Network Analysis and a correlation distance based approach using time-course RNA-Seq data for dendritic cells from wild type, MyD88-knockout, and TRIF-knockout mice, and a set of RNA-Seq data from 60 Caucasian individuals. For the sparse regression approach, we found overrepresented functions for shared neighbor genes between the TLR-signaling pathway and the antigen processing and presentation, apoptosis, and Jak-Stat pathways that are supported by prior research. The approach compares favorably to Weighted Correlation Network Analysis in cases where the gene association signals are weak.
PLoS ONE 09/2015; 10(9):e0137222. DOI:10.1371/journal.pone.0137222 · 3.23 Impact Factor
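The shared-neighbor rule stated in the abstract above can be sketched in a few lines. This is a minimal illustration, assuming the sparse regression has already been fit elsewhere and its coefficients are stored in a nested dictionary; the function name `shared_neighbors`, the gene labels, and the coefficient values are all hypothetical and are not taken from the paper.

```python
def shared_neighbors(coefs, pathway_a, pathway_b, tol=1e-12):
    """Return predictor genes connected to at least one gene in each pathway.

    coefs[target][predictor] holds the sparse regression coefficient of
    `predictor` in the model for pathway gene `target` (assumed layout).
    A connection exists when the coefficient is non-zero (|beta| > tol).
    """
    def neighbors(pathway):
        found = set()
        for gene in pathway:
            for predictor, beta in coefs.get(gene, {}).items():
                if abs(beta) > tol:
                    found.add(predictor)
        return found

    return neighbors(pathway_a) & neighbors(pathway_b)


# Made-up coefficients for two toy pathways (illustration only).
coefs = {
    "a1": {"G1": 0.8, "G2": 0.0, "G3": -0.3},
    "a2": {"G1": 0.0, "G2": 0.0, "G3": 0.1},
    "b1": {"G1": 0.0, "G2": 0.5, "G3": 0.2},
    "b2": {"G1": 0.4, "G2": 0.0, "G3": 0.0},
}
pathway_a = {"a1", "a2"}
pathway_b = {"b1", "b2"}
shared = shared_neighbors(coefs, pathway_a, pathway_b)
print(sorted(shared))
```

Here G2 is excluded because it has non-zero coefficients only for genes in the second toy pathway.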
ABSTRACT: Time-course gene expression profiles are frequently used to provide insight into the changes in cellular state over time and to infer the molecular pathways involved. When combined with large-scale molecular interaction networks, such data can provide information about the dynamics of cellular response to stimulus. However, few tools are currently available to predict a single active gene sub-network from time-course gene expression profiles.
We introduce a tool, TimeXNet, which identifies active gene sub-networks with temporal paths using time-course gene expression profiles in the context of a weighted gene regulatory and protein-protein interaction network. TimeXNet uses a specialized form of the network flow optimization approach to identify the most probable paths connecting the genes with significant changes in expression at consecutive time intervals. TimeXNet has been extensively evaluated for its ability to predict novel regulators and their associated pathways within active gene sub-networks in the mouse innate immune response and the yeast osmotic stress response. Compared to other similar methods, TimeXNet identified up to 50% more novel regulators from independent experimental datasets. It predicted paths within a greater number of known pathways with longer overlaps (up to 7 consecutive edges) within these pathways. TimeXNet was also shown to be robust in the presence of varying amounts of noise in the molecular interaction network.
TimeXNet is a reliable tool that can be used to study cellular response to stimuli through the identification of time-dependent active gene sub-networks in diverse biological systems. It is significantly better than other similar tools. TimeXNet is implemented in Java as a stand-alone application and supported on Linux, MS Windows and Macintosh. The output of TimeXNet can be directly viewed in Cytoscape. TimeXNet is freely available for non-commercial users.
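The idea of a "most probable path" in a weighted network can be illustrated with a small sketch. Note that this is not TimeXNet's network flow optimization; it swaps in plain Dijkstra shortest-path search on negative log weights, which finds the single path maximizing the product of edge reliabilities. The function name, the toy network, and the assumption that weights lie in (0, 1] are all illustrative.

```python
import heapq
import math


def most_probable_path(edges, source, target):
    """Find the path maximizing the product of edge weights.

    edges maps a node to a list of (neighbor, weight) pairs, with each
    weight assumed to be a reliability in (0, 1]. Maximizing the product
    is equivalent to minimizing the sum of -log(weight).
    """
    best = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for nbr, weight in edges.get(node, []):
            new_cost = cost - math.log(weight)
            if new_cost < best.get(nbr, math.inf):
                best[nbr] = new_cost
                prev[nbr] = node
                heapq.heappush(heap, (new_cost, nbr))
    if target not in best:
        return None, 0.0
    # Walk back from the target to recover the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, math.exp(-best[target])


# Toy directed network with made-up reliabilities.
network = {
    "A": [("B", 0.9), ("C", 0.5)],
    "B": [("D", 0.8)],
    "C": [("D", 0.9)],
}
path, prob = most_probable_path(network, "A", "D")
print(path, round(prob, 2))
```

A flow-based formulation, as described in the abstract, can select many such paths jointly across consecutive time intervals; the sketch covers only the single-path case.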
ABSTRACT: With the exponential increase in the number of sequenced organisms, automated annotation of proteins is becoming increasingly important. Intrinsically disordered regions are known to play a significant role in protein function. Despite their abundance, especially in eukaryotes, they are rarely used to inform function prediction systems. In this study, we extracted seven sequence features in intrinsically disordered regions and developed a scheme to use them to predict Gene Ontology Slim terms associated with proteins. We evaluated the function prediction performance of each feature. Our results indicate that the residue composition based features have the highest precision while bigram probabilities, based on sequence profiles of intrinsically disordered regions obtained from PSI-BLAST, have the highest recall. Amino acid bigrams and features based on secondary structure show an intermediate level of precision and recall. Almost all features showed a high prediction performance for GO Slim terms related to extracellular matrix, nucleus, RNA and DNA binding. However, feature performance varied significantly for different GO Slim terms, emphasizing the need for a unique classifier optimized for the prediction of each functional term. These findings provide a first comprehensive and quantitative evaluation of sequence features in intrinsically disordered regions and will help in the development of a more informative protein function predictor.
PLoS ONE 02/2014; 9(2):e89890. DOI:10.1371/journal.pone.0089890 · 3.23 Impact Factor
ABSTRACT: Innate immune response involves protein-protein interactions, deoxyribonucleic acid (DNA)-protein interactions and signaling cascades. So far, thousands of protein-protein interactions have been curated as a static interaction map. However, protein-protein interactions involved in innate immune response are dynamic. We recorded the dynamics in the interactome during innate immune response by combining gene expression data of lipopolysaccharide (LPS)-stimulated dendritic cells with protein-protein interaction data. We identified the differences in the interactome during innate immune response by constructing differential networks and identifying protein modules that were up-/down-regulated at each stage during the innate immune response. For each protein complex, we identified enriched biological processes and pathways. In addition, we identified core interactions that are conserved throughout the innate immune response and their enriched gene ontology terms and pathways. We defined two novel measures to assess the differences between network maps at different time points. We found that the protein interaction network at 1 hour after LPS stimulation has the highest interactions protein ratio, which indicates a role for proteins with a large number of interactions in innate immune response. A pairwise differential matrix allows for the global visualization of the differences between different networks. We investigated the toll-like receptor subnetwork and found that S100A8 is down-regulated in dendritic cells after LPS stimulation. The identified protein complexes have a crucial role not only in innate immunity, but also in circadian rhythms, pathways involved in cancer, and p53 pathways. The study confirmed previous work that reported a strong correlation between cancer and immunity.
Gene regulation and systems biology 01/2014; 8(8):1-15. DOI:10.4137/GRSB.S12850
ABSTRACT: The innate immune response is primarily mediated by the Toll-like receptors functioning through the MyD88-dependent and TRIF-dependent pathways. Despite being widely studied, it is not yet completely understood and systems-level analyses have been lacking. In this study, we identified a high-probability network of genes activated during the innate immune response using a novel approach to analyze time-course gene expression profiles of activated immune cells in combination with a large gene regulatory and protein-protein interaction network. We classified the immune response into three consecutive time-dependent stages and identified the most probable paths between genes showing a significant change in expression at each stage. The resultant network contained several novel and known regulators of the innate immune response, many of which did not show any observable change in expression at the sampled time points. The response network shows the dominance of genes from specific functional classes during different stages of the immune response. It also suggests a role for the protein phosphatase 2a catalytic subunit α in the regulation of the immunoproteasome during the late phase of the response. In order to clarify the differences between the MyD88-dependent and TRIF-dependent pathways in the innate immune response, time-course gene expression profiles from MyD88-knockout and TRIF-knockout dendritic cells were analyzed. Their response networks suggest the dominance of the MyD88-dependent pathway in the innate immune response, and an association of the circadian regulators and immunoproteasomal degradation with the TRIF-dependent pathway. The response network presented here provides the most probable associations between genes expressed in the early and the late phases of the innate immune response, while taking into account the intermediate regulators.
We propose that the method described here can also be used in the identification of time-dependent gene sub-networks in other biological systems.
ABSTRACT: The understanding of the mechanisms of transcriptional regulation remains a challenge for molecular biologists in the post-genome era. It is hypothesized that the regulatory regions of genes expressed in the same tissue or cell type share a similar structure. Though several studies have analyzed the promoters of genes expressed in specific metazoan tissues or cells, little research has been done in plants. Hence, finding specific patterns of motifs to explain the promoter architecture of co-expressed genes in plants could shed light on their transcription mechanism.
We identified novel patterns of sets of motifs in promoters of genes co-expressed in four different plant structures (PSs) and in the entire plant in Arabidopsis thaliana. Sets of genes expressed in four PSs (flower, seed, root, shoot) and housekeeping genes expressed in the entire plant were taken from a database of co-expressed genes in A. thaliana. PS-specific motifs were predicted using three motif-discovery algorithms; to the best of our knowledge, 8 of these motifs are novel. A support vector machine was trained using the average upstream distance of the identified motifs from the translation start site on both strands of binding sites. The correctly classified promoters per PS were used to construct specific patterns of sets of motifs to describe the promoter architecture of those co-expressed genes. The discovered PS-specific patterns were tested in the entire A. thaliana genome, correctly identifying 77.8%, 81.2%, 70.8% and 53.7% of genes expressed in petal differentiation, synergid cells, root hair and trichome, respectively, as well as 88.4% of housekeeping genes.
We present five patterns of sets of motifs which describe the promoter architecture of co-expressed genes in five PSs with the ability to predict them from the entire A. thaliana genome. Based on these findings, we conclude that the positioning and orientation of transcription factor binding sites at specific distances from the translation start site are a reliable measure to differentiate promoters of genes expressed in different A. thaliana structures from background genomic promoters. Our method can be used to predict novel motifs and decipher a similar promoter architecture for genes co-expressed in A. thaliana under different conditions.
BMC Systems Biology 10/2013; 7(3):S10. DOI:10.1186/1752-0509-7-S3-S10 · 2.44 Impact Factor
ABSTRACT: Intrinsically disordered regions in proteins are known to evolve rapidly while maintaining their function. However, given their lack of structure and sequence conservation, the means through which they stay functional is not clear. Poor sequence conservation also hampers the classification of these regions into functional groups. We studied the sequence conservation of a large number of predicted and experimentally determined intrinsically disordered regions from the human proteome in 7 other eukaryotes. We determined the chemical composition of disordered regions by calculating the fraction of positive, negative, polar, hydrophobic and special (Pro, Gly) residues, and studied its maintenance in orthologous proteins. A significant number of disordered regions with low sequence conservation showed considerable similarity in their chemical composition between orthologs. Clustering disordered regions based on their chemical composition resulted in functionally distinct groups. Finally, disordered regions showed location preference within the proteins that was dependent on their chemical composition. We conclude that preserving the overall chemical composition is one of the ways through which intrinsically disordered regions maintain their flexibility and function through evolution. We propose that the chemical composition of disordered regions can be used to classify them into functional groups and, together with conservation and location, may be used to define a general classification scheme.
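The chemical-composition calculation described in the abstract above can be sketched as follows. The five class names come from the abstract, but apart from the special class (Pro, Gly) the assignment of residues to classes is an assumption made for illustration, as are the function names, the Euclidean distance, and the two toy sequences.

```python
import math

# Assumed residue classes; only "special" (Pro, Gly) is given in the abstract.
CLASSES = {
    "positive": set("KR"),
    "negative": set("DE"),
    "polar": set("STNQHCY"),
    "hydrophobic": set("AVLIMFW"),
    "special": set("PG"),
}


def chemical_composition(seq):
    """Return the fraction of residues falling in each chemical class."""
    residues = [aa for aa in seq.upper()
                if any(aa in members for members in CLASSES.values())]
    total = len(residues)
    if total == 0:
        return {name: 0.0 for name in CLASSES}
    return {name: sum(aa in members for aa in residues) / total
            for name, members in CLASSES.items()}


def composition_distance(comp_a, comp_b):
    """Euclidean distance between two composition vectors (assumed metric)."""
    return math.sqrt(sum((comp_a[k] - comp_b[k]) ** 2 for k in CLASSES))


# Two toy regions with different sequences but identical class fractions.
comp_1 = chemical_composition("KKDEPGSA")
comp_2 = chemical_composition("RRDEGPTV")
print(comp_1)
print(composition_distance(comp_1, comp_2))
```

The two toy sequences share only two aligned residues yet have the same composition vector, which is the kind of similarity a sequence-alignment score would miss.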
ABSTRACT: The proteasome is the degradation machine at the center of the ubiquitin-proteasome system and controls the concentrations of many proteins in eukaryotes. It is highly processive so that substrates are degraded completely into small peptides, avoiding the formation of potentially toxic fragments. Nonetheless, some proteins are incompletely degraded, indicating the existence of factors that influence proteasomal processivity. We have quantified proteasomal processivity and determined the underlying rates of substrate degradation and release. We find that processivity increases with species complexity over a 5-fold range between yeast and mammalian proteasomes, and the effect is due to slower but more persistent degradation by proteasomes from more complex organisms. A sequence stretch that has been implicated in causing incomplete degradation, the glycine-rich region of the NFκB subunit p105, reduces the proteasome's ability to unfold its substrate, and polyglutamine repeats such as those found in Huntington's disease reduce the processivity of the proteasome in a length-dependent manner.
ACS Chemical Biology 06/2012; 7(8):1444-53. DOI:10.1021/cb3001155 · 5.33 Impact Factor
ABSTRACT: Function prediction of intrinsically disordered domains (IDDs) using sequence similarity methods is limited by their high mutability and prevalence of low complexity regions. We describe a novel method for identifying similar IDDs by a similarity metric based on amino acid composition and identify significantly overrepresented Gene Ontology (GO) and Pfam domain annotations within highly similar IDDs. Applications and extensions of the proposed method are discussed, in particular with respect to protein functional annotation. We test the predicted annotations in a large-scale survey of IDDs in mouse and find that the proposed method provides significantly greater protein coverage in terms of function prediction than traditional sequence alignment methods like BLAST. As a proof of concept we examined several disorder-containing proteins: GRA15 and ROP16, both encoded in the parasitic protozoa T. gondii; Cyclon, a mostly uncharacterized protein involved in the regulation of immune cell death; STIM1, a protein essential for regulating calcium levels in the endoplasmic reticulum. We show that the overrepresented GO terms are consistent with recently-reported biological functions. We
* The authors wish it to be known that, in their opinion, the first two authors contributed equally to this work.
Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing 05/2012; 17:164-175. DOI:10.1142/9789814366496_0016
ABSTRACT: Gene co-expression, in the form of a correlation coefficient, has been valuable in the analysis, classification and prediction of protein-protein interactions. However, it is susceptible to bias from a few samples having a large effect on the correlation coefficient. Gene co-expression stability is a means of quantifying this bias, with high stability indicating robust, unbiased co-expression correlation coefficients. We assess the utility of gene co-expression stability as an additional measure to support the co-expression correlation in the analysis of protein-protein interaction networks.
We studied the patterns of co-expression correlation and stability in interacting proteins with respect to their interaction promiscuity, levels of intrinsic disorder, and essentiality or disease-relatedness. Co-expression stability, along with co-expression correlation, acts as a better classifier of hub proteins in interaction networks, than co-expression correlation alone, enabling the identification of a class of hubs that are functionally distinct from the widely accepted transient (date) and obligate (party) hubs. Proteins with high levels of intrinsic disorder have low co-expression correlation and high stability with their interaction partners suggesting their involvement in transient interactions, except for a small group that have high co-expression correlation and are typically subunits of stable complexes. Similar behavior was seen for disease-related and essential genes. Interacting proteins that are both disordered have higher co-expression stability than ordered protein pairs. Using co-expression correlation and stability, we found that transient interactions are more likely to occur between an ordered and a disordered protein while obligate interactions primarily occur between proteins that are either both ordered, or disordered.
We observe that co-expression stability shows distinct patterns in structurally and functionally different groups of proteins and interactions. We conclude that it is a useful and important measure to be used in concert with gene co-expression correlation for further insights into the characteristics of proteins in the context of their interaction network.
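The bias described above, where a few samples dominate a correlation coefficient, can be illustrated with a leave-one-out check. The abstract does not give the formula for co-expression stability, so the measure below (the spread of leave-one-out Pearson correlations, where a small spread suggests a stable estimate) is an assumed stand-in, and the expression values are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if degenerate)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def leave_one_out_spread(x, y):
    """Max minus min correlation when each sample is dropped in turn."""
    values = [pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
              for i in range(len(x))]
    return max(values) - min(values)


# Pair 1: correlation supported by every sample.
stable_x, stable_y = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
# Pair 2: high correlation driven almost entirely by the last sample.
biased_x = [1.0, 1.1, 0.9, 1.0, 10.0]
biased_y = [1.0, 0.9, 1.1, 1.0, 10.0]

print(pearson(stable_x, stable_y), leave_one_out_spread(stable_x, stable_y))
print(pearson(biased_x, biased_y), leave_one_out_spread(biased_x, biased_y))
```

Both pairs have a correlation near 1, but only the second changes sign when a single sample is removed, which is the situation a stability measure is meant to flag.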
ABSTRACT: In order to characterize mammalian intrinsically disordered domains (IDDs) we examined the patterns in their amino acid abundance as well as overrepresented local sequence motifs. We considered IDDs from mouse proteins associated with innate immune responses as well as a set of generic human genes. These sets were compared with artificially generated random sequences with the same overall amino acid abundance and length distributions. IDDs were then clustered by amino acid abundance, and further analyzed in terms of co-occurrence of clusters with functionally characterized Pfam domains.
Overall, IDDs were very different from randomly generated sequences. The deviation from random distributions was at least as great as that for ordered domains, for which the deviation can be rationalized in terms of strong evolutionary pressure for structure and function. The co-occurrence of certain Pfam domains with specific IDD clusters was found to be significant (p-value < 0.01). Local sequence motifs that were over-represented in the innate immune set consisted mostly of low complexity fragments, primarily characterized by amino acid repeats, and could not be assigned an obvious functional role.
Our results suggest that IDDs are constrained within a narrow subset of possible sequences. This is most likely a result of biophysical restraints that have yet to be elucidated. More detailed examination of the functional relationship between the IDDs and associated Pfam domains is one possible avenue of investigation.
ABSTRACT: Despite the availability of a large number of protein-protein interactions (PPIs) in several species, researchers are often limited to using very small subsets in a few organisms due to the high prevalence of spurious interactions. In spite of the importance of quality assessment of experimentally determined PPIs, a surprisingly small number of databases provide interactions with scores and confidence levels. We introduce HitPredict (http://hintdb.hgc.jp/htp/), a database with quality-assessed PPIs in nine species. HitPredict assigns a confidence level to interactions based on a reliability score that is computed using evidence from sequence, structure and functional annotations of the interacting proteins. HitPredict was first released in 2005 and is updated annually. The current release contains 36,930 proteins with 176,983 non-redundant, physical interactions, of which 116,198 (66%) are predicted to be of high confidence.
ABSTRACT: Intrinsic disorder and distributed surface charge have been previously identified as some of the characteristics that differentiate hubs (proteins with a large number of interactions) from non-hubs in protein-protein interaction networks. In this study, we investigated the differences in the quantity, diversity, and functional nature of Pfam domains, and their relationship with intrinsic disorder, in hubs and non-hubs. We found that proteins with a more diverse domain composition were over-represented in hubs when compared with non-hubs, with the number of interactions in hubs increasing with domain diversity. Conversely, the fraction of intrinsic disorder in hubs decreased with increasing number of ordered domains. The difference in the levels of disorder was more prominent in hubs and non-hubs with fewer domains. Functional analysis showed that hubs were enriched in kinase and adaptor domains acting primarily in signal transduction and transcription regulation, whereas non-hubs had more DNA-binding domains and were involved in catalytic activity. Consistent with the differences in the functional nature of their domains, hubs with two or more domains were more likely to connect distinct functional modules in the interaction network when compared with single domain hubs. We conclude that the availability of greater number and diversity of ordered domains, in addition to the tendency to have promiscuous domains, differentiates hubs from non-hubs and provides an additional means of achieving interaction promiscuity. Further, hubs with fewer domains use greater levels of intrinsic disorder to facilitate interaction promiscuity with the prevalence of disorder decreasing with increasing number of ordered domains.
Protein Science 08/2010; 19(8):1461-8. DOI:10.1002/pro.425 · 2.85 Impact Factor
ABSTRACT: Hubs are proteins with a large number of interactions in a protein-protein interaction network. They are the principal agents in the interaction network and affect its function and stability. Their specific recognition of many different protein partners is of great interest from the structural viewpoint. Over the last few years, the structural properties of hubs have been extensively studied. We review the currently known features that are particular to hubs, possibly affecting their binding ability. Specifically, we look at the levels of intrinsic disorder, surface charge and domain distribution in hubs, as compared to non-hubs, along with differences in their functional domains.
International Journal of Molecular Sciences 04/2010; 11(4):1930-43. DOI:10.3390/ijms11041930 · 2.86 Impact Factor
ABSTRACT: Hubs are highly connected proteins in a protein-protein interaction network. Previous work has implicated disordered domains and high surface charge as the properties significant in the ability of hubs to bind multiple proteins. While conformational flexibility of disordered domains plays an important role in the binding ability of large hubs, high surface charge is the dominant property in small hubs. In this study, we further investigate the role of the high surface charge in the binding ability of small hubs in the absence of disordered domains. Using multipole expansion, we find that the charges are highly distributed over the hub surfaces. Residue enrichment studies show that the charged residues in hubs are more prevalent on the exposed surface, with the exception of Arg, which is predominantly found at the interface, as compared to non-hubs. This suggests that the charged residues act primarily from the exposed surface rather than the interface to affect the binding ability of small hubs. They do this through (i) enhanced intra-molecular electrostatic interactions to lower the desolvation penalty, (ii) indirect long-range intermolecular interactions with charged residues on the partner proteins for better complementarity and electrostatic steering, and (iii) increased solubility for enhanced diffusion-controlled rate of binding. Along with Arg, we also find a high prevalence of polar residues Tyr, Gln and His and the hydrophobic residue Met at the interfaces of hubs, all of which have the ability to form multiple types of interactions, indicating that the interfaces of hubs are optimized to participate in multiple interactions.
ABSTRACT: We investigate the structural properties of hubs that enable them to interact with several partners in protein-protein interaction networks. We find that hubs have more observed and predicted disordered residues with fewer loops/coils, and more charged residues on the surface as compared to non-hubs. Smaller hubs have fewer disordered residues and more charged residues on the surface than larger hubs. We conclude that the global flexibility provided by disordered domains, and high surface charge are complementary factors that play a significant role in the binding ability of hubs.
ABSTRACT: Protein-protein interaction data used in the creation or prediction of molecular networks are usually obtained from large scale or high-throughput experiments. These experimental data are liable to contain a large number of spurious interactions. Hence, there is a need to validate the interactions and filter out the incorrect data before using them in prediction studies.
In this study, we use a combination of 3 genomic features -- structurally known interacting Pfam domains, Gene Ontology annotations and sequence homology -- as a means to assign reliability to the protein-protein interactions in Saccharomyces cerevisiae determined by high-throughput experiments. Using Bayesian network approaches, we show that protein-protein interactions from high-throughput data supported by one or more genomic features have a higher likelihood ratio and hence are more likely to be real interactions. Our method has a high sensitivity (90%) and good specificity (63%). We show that 56% of the interactions from high-throughput experiments in Saccharomyces cerevisiae have high reliability. We use the method to estimate the number of true interactions in the high-throughput protein-protein interaction data sets in Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens to be 27%, 18% and 68% respectively. Our results are available for searching and downloading at http://helix.protein.osaka-u.ac.jp/htp/.
A combination of genomic features that include sequence, structure and annotation information is a good predictor of true interactions in large and noisy high-throughput data sets. The method has a very high sensitivity and good specificity and can be used to assign a likelihood ratio, corresponding to the reliability, to each interaction.
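The likelihood ratio assigned to an interaction can be sketched as below. This is a minimal illustration assuming a naive Bayes combination, in which features are treated as independent and their ratios multiply; the abstract says only that Bayesian network approaches were used, so the independence assumption, the function names, and the counts are all hypothetical.

```python
def likelihood_ratio(pos_with, pos_total, neg_with, neg_total):
    """LR = P(feature | true interaction) / P(feature | false interaction).

    Counts come from reference sets of true (positive) and false
    (negative) interactions; these sets are assumed to exist.
    """
    if pos_total <= 0 or neg_total <= 0 or neg_with <= 0:
        raise ValueError("counts must allow a finite likelihood ratio")
    return (pos_with / pos_total) / (neg_with / neg_total)


def combined_likelihood_ratio(ratios):
    """Multiply per-feature ratios (naive independence assumption)."""
    product = 1.0
    for ratio in ratios:
        product *= ratio
    return product


# Made-up counts for two features supporting one interaction.
lr_domains = likelihood_ratio(80, 100, 20, 100)   # feature seen in 80% vs 20%
lr_homology = likelihood_ratio(60, 100, 30, 100)  # feature seen in 60% vs 30%
combined = combined_likelihood_ratio([lr_domains, lr_homology])
print(lr_domains, lr_homology, combined)
```

An interaction supported by more features accumulates a larger combined ratio, which matches the abstract's statement that supported interactions are more likely to be real.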