Detailing regulatory networks through large scale data integration.
ABSTRACT Much of a cell's regulatory response to changing environments occurs at the transcriptional level. Particularly in higher organisms, transcription factors (TFs), microRNAs and epigenetic modifications can combine to form a complex regulatory network. Part of this system can be modeled as a collection of regulatory modules: co-regulated genes, the conditions under which they are co-regulated and sequence-level regulatory motifs.
We present the Combinatorial Algorithm for Expression and Sequence-based Cluster Extraction (COALESCE) system for regulatory module prediction. The algorithm is efficient enough to discover expression biclusters and putative regulatory motifs in metazoan genomes (>20,000 genes) and very large microarray compendia (>10,000 conditions). Using Bayesian data integration, it can also include diverse supporting data types such as evolutionary conservation or nucleosome placement. We validate its performance using a functional evaluation of co-clustered genes, known yeast and Escherichea coli TF targets, synthetic data and various metazoan data compendia. In all cases, COALESCE performs as well or better than current biclustering and motif prediction tools, with high accuracy in functional and TF/target assignments and zero false positives on synthetic data. COALESCE provides an efficient and flexible platform within which large, diverse data collections can be integrated to predict metazoan regulatory networks.
Source code (C++) is available at http://function.princeton.edu/sleipnir, and supporting data and a web interface are provided at http://function.princeton.edu/coalesce.
Supplementary data are available at Bioinformatics online.
[show abstract] [hide abstract]
ABSTRACT: We describe an algorithm, multi-species cMonkey, for the simultaneous biclustering of heterogeneous multiple-species data collections and apply the algorithm to a group of bacteria containing Bacillus subtilis, Bacillus anthracis, and Listeria monocytogenes. The algorithm reveals evolutionary insights into the surprisingly high degree of conservation of regulatory modules across these three species and allows data and insights from well-studied organisms to complement the analysis of related but less well studied organisms.Genome biology 09/2010; 11(9):R96. · 6.63 Impact Factor
[show abstract] [hide abstract]
ABSTRACT: The Biological General Repository for Interaction Datasets (BioGRID) is a public database that archives and disseminates genetic and protein interaction data from model organisms and humans (http://www.thebiogrid.org). BioGRID currently holds 347,966 interactions (170,162 genetic, 177,804 protein) curated from both high-throughput data sets and individual focused studies, as derived from over 23,000 publications in the primary literature. Complete coverage of the entire literature is maintained for budding yeast (Saccharomyces cerevisiae), fission yeast (Schizosaccharomyces pombe) and thale cress (Arabidopsis thaliana), and efforts to expand curation across multiple metazoan species are underway. The BioGRID houses 48,831 human protein interactions that have been curated from 10,247 publications. Current curation drives are focused on particular areas of biology to enable insights into conserved networks and pathways that are relevant to human health. The BioGRID 3.0 web interface contains new search and display features that enable rapid queries across multiple data types and sources. An automated Interaction Management System (IMS) is used to prioritize, coordinate and track curation across international sites and projects. BioGRID provides interaction data to several model organism databases, resources such as Entrez-Gene and other interaction meta-databases. The entire BioGRID 3.0 data collection may be downloaded in multiple file formats, including PSI MI XML. Source code for BioGRID 3.0 is freely available without any restrictions.Nucleic Acids Research 11/2010; 39(Database issue):D698-704. · 8.03 Impact Factor