ABSTRACT: The mitotic spindle is an essential molecular machine involved in cell division, whose composition has been studied extensively by detailed cellular biology, high-throughput proteomics, and RNA interference experiments. However, because of its dynamic organization and complex regulation, it is difficult to obtain a complete description of its molecular composition. We have implemented an integrated computational approach to characterize novel human spindle components and have analysed in detail the individual candidates predicted to be spindle proteins, as well as the network of predicted relations connecting known and putative spindle proteins. The subsequent experimental validation of a number of predicted novel proteins confirmed not only their association with the spindle apparatus but also their role in mitosis. We found that 75% of our tested proteins localize to the spindle apparatus, compared to a success rate of 35% when expert knowledge alone was used. We compare our results to the previously published MitoCheck study and see that our approach validates some of the findings of this consortium. Further, we predict so-called "hidden spindle hubs", proteins whose network of interactions is still poorly characterised by experimental means and which are thought to influence the functionality of the mitotic spindle on a large scale. Our analyses suggest that we are still far from knowing the complete repertoire of functionally important components of the human spindle network. Combining integrated bio-computational approaches and single gene experimental follow-ups could be key to exploring the still hidden regions of the human spindle system.
ABSTRACT: Validation of the LM, NNI and DGC methods. Test of the performance of the pair-wise combination of methods using the text mined, manually curated gold standard dataset - EXPERT.
ABSTRACT: CATH version 3.3 (class, architecture, topology, homology) contains 128,688 domains, 2386 homologous superfamilies and 1233 fold groups, and reflects a major focus on classifying structural genomics (SG) structures and transmembrane proteins, both of which are likely to add structural novelty to the database and therefore increase the coverage of protein fold space within CATH. For CATH version 3.4 we have significantly improved the presentation of sequence information and associated functional information for CATH superfamilies. The CATH superfamily pages now reflect both the functional and structural diversity within the superfamily and include structural alignments of close and distant relatives within the superfamily, annotated with functional information and details of conserved residues. A significantly more efficient search function for CATH has been established by implementing the search server Solr (http://lucene.apache.org/solr/). The CATH v3.4 webpages have been built using the Catalyst web framework.
Full-text · Article · Jan 2011 · Nucleic Acids Research
ABSTRACT: Accurate modelling of biological systems requires a deeper and more complete knowledge about the molecular components and their functional associations than we currently have. Traditionally, new knowledge on protein associations generated by experiments has played a central role in systems modelling, in contrast to generally less trusted bio-computational predictions. However, we will not achieve realistic modelling of complex molecular systems if the current experimental designs lead to biased screenings of real protein networks and leave large, functionally important areas poorly characterised. To assess the likelihood of this, we have built comprehensive network models of the yeast and human proteomes by using a meta-statistical integration of diverse computationally predicted protein association datasets. We have compared these predicted networks against combined experimental datasets from seven biological resources at different levels of statistical significance. These eukaryotic predicted networks resemble all the topological and noise features of the experimentally inferred networks in both species, and we also show that this observation is not due to random behaviour. In addition, the topology of the predicted networks contains information on true protein associations, beyond the constitutive first order binary predictions. We also observe that most of the reliable predicted protein associations are experimentally uncharacterised in our models, constituting the hidden or "dark matter" of networks by analogy to astronomical systems. Some of this dark matter shows enrichment of particular functions and contains key functional elements of protein networks, such as hubs associated with important functional areas like the regulation of Ras protein signal transduction in human cells. Thus, characterising this large and functionally important dark matter, elusive to established experimental designs, may be crucial for modelling biological systems. In any case, these predictions provide a valuable guide to these experimentally elusive regions.
ABSTRACT: Performance of CODA using CATH domains, with and without subfamilies. CATH domains showed lower performance than Pfam domains in detecting functional relationships between proteins using CODA. This could have been due to low coverage of CATH domains relative to Pfam or because CATH has larger families, causing low scores for many hits. CATH superfamilies were clustered at varying sequence identity cut-offs (30, 35, 40, 50, 60, 70, 80, 90, 95 and 100%) using an in-house implementation of directed multi-linkage clustering. The domain counts used in the CODA score were then adjusted using these clusters. Suppose that there are two proteins in yeast, each with one domain. The first protein contains domain A and the second domain B. A protein is found in E. coli which is a fusion of these two domains: A′B′. Suppose that A and A′ are in the same 50% cluster but not the same 60% cluster, i.e. they share 50% sequence identity. The counts for ngA in the CODA score (equation 1) then only include the number of members of the same 50% cluster that belong to yeast. ngB is the number of members of that 50% cluster which belong to E. coli. Likewise, if B and B′ are in the same 70% cluster but not the same 80% cluster, then the counts are taken from that 70% cluster. Using subfamilies slightly improves performance at high enrichment, but only where there are few hits. We therefore concluded that the reduced performance of CATH relative to Pfam was related to a mixture of lower coverage of genomes and the size and functional specificity of the families.
(0.52 MB EPS)
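The cluster-based count adjustment described in this caption can be sketched as follows. This is a minimal illustration and not the authors' implementation: the data layout (`clusters`, `genome_of`), the function names and the toy domains are assumptions made for the example, and the CODA score itself (equation 1) is not reproduced because it does not appear in this text. The sketch only shows how the tightest shared identity cluster might be selected and how members could be counted per genome.

```python
from collections import Counter

# Sequence identity cut-offs (%) listed in the caption, in ascending order.
CUTOFFS = [30, 35, 40, 50, 60, 70, 80, 90, 95, 100]


def tightest_shared_cutoff(clusters, dom_a, dom_b):
    """Return the highest cut-off at which both domains share a cluster.

    `clusters` maps cut-off -> {domain id: cluster id} (hypothetical layout).
    Returns None if the two domains never fall in the same cluster.
    """
    shared = None
    for cutoff in CUTOFFS:
        assignment = clusters[cutoff]
        if dom_a in assignment and assignment[dom_a] == assignment.get(dom_b):
            shared = cutoff
    return shared


def genome_counts(clusters, genome_of, domain, cutoff):
    """Count, per genome, the members of the cluster holding `domain` at `cutoff`."""
    assignment = clusters[cutoff]
    target = assignment[domain]
    return Counter(genome_of[d] for d, c in assignment.items() if c == target)


def example_clusters():
    """Toy data: A and A' share a cluster up to 50% identity but not at 60%."""
    loose = {"A": "c1", "A2": "c1", "A'": "c1", "A3'": "c1"}
    tight = {"A": "c1a", "A2": "c1a", "A'": "c1b", "A3'": "c1b"}
    return {c: (loose if c <= 50 else tight) for c in CUTOFFS}


if __name__ == "__main__":
    clusters = example_clusters()
    genome_of = {"A": "yeast", "A2": "yeast", "A'": "ecoli", "A3'": "ecoli"}
    cutoff = tightest_shared_cutoff(clusters, "A", "A'")
    print(cutoff, dict(genome_counts(clusters, genome_of, "A", cutoff)))
```

In this toy case the counts for the pair A and A′ would be drawn from the 50% cluster, mirroring the worked example in the caption; how those counts enter the score depends on equation 1 in the original article.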
ABSTRACT: Comparative performance of Pfam, Pfam-CATH, CATH and CATH-Pfam MDA datasets on the yeast genome. Enrichment is the ratio of true positives achieved by CODA to the number expected by chance. Points are plotted at successive score cut-offs. At an enrichment of 10, Pfam-CATH performed best with 1791 hits; Pfam achieved 1663 hits, CATH-Pfam 792 and CATH 296. At higher enrichment (e.g. 15), the Pfam dataset outperforms all others and finds ∼500 hits. Datasets based principally on CATH domains (CATH-Pfam and CATH) performed less well than those based on Pfam domains. This may be because CATH superfamilies tend to be broader than Pfam families, including more functional subfamilies. This could result in generally reduced scores for hits involving these larger families. However, additional analysis using CATH subfamilies did not improve performance (Figure S1), and thus it is more likely that CATH had lower performance due to lower coverage of the genomes. Pfam MDA datasets were chosen over Pfam-CATH due to a similar performance at moderate enrichment and superior performance at higher enrichment. CODA should be used with a score cut-off of 0.56 to achieve an enrichment of 10 on this dataset and 0.65 for an enrichment of 15.
(0.60 MB EPS)
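The enrichment measure defined in this caption (true positives divided by the number expected by chance, evaluated at successive score cut-offs) can be illustrated with a short sketch. The caption does not say how the chance expectation was computed, so the version below assumes it equals the number of hits above the cut-off multiplied by a background rate of true functional pairs. That assumption, the function name and the toy scores are all illustrative rather than taken from the original study.

```python
def enrichment_curve(hits, background_rate, cutoffs):
    """Compute enrichment at each score cut-off.

    hits: list of (score, is_true_positive) pairs (hypothetical input format).
    background_rate: assumed fraction of random protein pairs that are true
        positives; used here as the chance expectation (an assumption).
    Returns a list of (cutoff, number_of_hits, enrichment) tuples, skipping
    cut-offs at which no hits remain.
    """
    if background_rate <= 0:
        raise ValueError("background_rate must be positive")
    curve = []
    for cutoff in cutoffs:
        kept = [is_tp for score, is_tp in hits if score >= cutoff]
        if not kept:
            continue
        true_positives = sum(kept)
        expected_by_chance = len(kept) * background_rate
        curve.append((cutoff, len(kept), true_positives / expected_by_chance))
    return curve


if __name__ == "__main__":
    # Toy scores only; these are not values from the study.
    toy_hits = [(0.9, True), (0.8, True), (0.7, False),
                (0.6, True), (0.3, False), (0.2, False)]
    for cutoff, n_hits, enrichment in enrichment_curve(toy_hits, 0.1, [0.5, 0.75]):
        print(f"cut-off {cutoff}: {n_hits} hits, enrichment {enrichment:.1f}")
```

Raising the cut-off typically trades hits for enrichment, which is the pattern the caption reports when it compares hit counts at enrichments of 10 and 15.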