[Show abstract][Hide abstract] ABSTRACT: The mitotic spindle is an essential molecular machine involved in cell division, whose composition has been studied extensively by detailed cellular biology, high-throughput proteomics, and RNA interference experiments. However, because of its dynamic organization and complex regulation it is difficult to obtain a complete description of its molecular composition. We have implemented an integrated computational approach to characterize novel human spindle components and have analysed in detail the individual candidates predicted to be spindle proteins, as well as the network of predicted relations connecting known and putative spindle proteins. The subsequent experimental validation of a number of predicted novel proteins confirmed not only their association with the spindle apparatus but also their role in mitosis. We found that 75% of our tested proteins are localizing to the spindle apparatus compared to a success rate of 35% when expert knowledge alone was used. We compare our results to the previously published MitoCheck study and see that our approach does validate some findings by this consortium. Further, we predict so-called "hidden spindle hub", proteins whose network of interactions is still poorly characterised by experimental means and which are thought to influence the functionality of the mitotic spindle on a large scale. Our analyses suggest that we are still far from knowing the complete repertoire of functionally important components of the human spindle network. Combining integrated bio-computational approaches and single gene experimental follow-ups could be key to exploring the still hidden regions of the human spindle system.
PLoS ONE 03/2012; 7(3):e31813. · 3.53 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: CATH version 3.3 (class, architecture, topology, homology) contains 128,688 domains, 2386 homologous superfamilies and 1233 fold groups, and reflects a major focus on classifying structural genomics (SG) structures and transmembrane proteins, both of which are likely to add structural novelty to the database and therefore increase the coverage of protein fold space within CATH. For CATH version 3.4 we have significantly improved the presentation of sequence information and associated functional information for CATH superfamilies. The CATH superfamily pages now reflect both the functional and structural diversity within the superfamily and include structural alignments of close and distant relatives within the superfamily, annotated with functional information and details of conserved residues. A significantly more efficient search function for CATH has been established by implementing the search server Solr (http://lucene.apache.org/solr/). The CATH v3.4 webpages have been built using the Catalyst web framework.
Nucleic Acids Research 01/2011; 39(Database issue):D420-6. · 8.81 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Accurate modelling of biological systems requires a deeper and more complete knowledge about the molecular components and their functional associations than we currently have. Traditionally, new knowledge on protein associations generated by experiments has played a central role in systems modelling, in contrast to generally less trusted bio-computational predictions. However, we will not achieve realistic modelling of complex molecular systems if the current experimental designs lead to biased screenings of real protein networks and leave large, functionally important areas poorly characterised. To assess the likelihood of this, we have built comprehensive network models of the yeast and human proteomes by using a meta-statistical integration of diverse computationally predicted protein association datasets. We have compared these predicted networks against combined experimental datasets from seven biological resources at different level of statistical significance. These eukaryotic predicted networks resemble all the topological and noise features of the experimentally inferred networks in both species, and we also show that this observation is not due to random behaviour. In addition, the topology of the predicted networks contains information on true protein associations, beyond the constitutive first order binary predictions. We also observe that most of the reliable predicted protein associations are experimentally uncharacterised in our models, constituting the hidden or "dark matter" of networks by analogy to astronomical systems. Some of this dark matter shows enrichment of particular functions and contains key functional elements of protein networks, such as hubs associated with important functional areas like the regulation of Ras protein signal transduction in human cells. Thus, characterising this large and functionally important dark matter, elusive to established experimental designs, may be crucial for modelling biological systems. In any case, these predictions provide a valuable guide to these experimentally elusive regions.
[Show abstract][Hide abstract] ABSTRACT: In order to understand how biological systems function it is necessary to determine the interactions and associations between proteins. Gene fusion prediction is one approach to detection of such functional relationships. Its use is however known to be problematic in higher eukaryotic genomes due to the presence of large homologous domain families. Here we introduce CODA (Co-Occurrence of Domains Analysis), a method to predict functional associations based on the gene fusion idiom.
We apply a novel scoring scheme which takes account of the genome-specific size of homologous domain families involved in fusion to improve accuracy in predicting functional associations. We show that CODA is able to accurately predict functional similarities in human with comparison to state-of-the-art methods and show that different methods can be complementary. CODA is used to produce evidence that a currently uncharacterised human protein may be involved in pathways related to depression and that another is involved in DNA replication.
The relative performance of different gene fusion methodologies has not previously been explored. We find that they are largely complementary, with different methods being more or less appropriate in different genomes. Our method is the only one currently available for download and can be run on an arbitrary dataset by the user. The CODA software and datasets are freely available from ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/v6.1.0/CODA/. Predictions are also available via web services from http://funcnet.eu/.
PLoS ONE 06/2010; 5(6):e10908. · 3.53 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Integration of biological data of various types and the development of adapted bioinformatics tools represent critical objectives to enable research at the systems level. The European Network of Excellence ENFIN is engaged in developing an adapted infrastructure to connect databases, and platforms to enable both the generation of new bioinformatics tools and the experimental validation of computational predictions. With the aim of bridging the gap existing between standard wet laboratories and bioinformatics, the ENFIN Network runs integrative research projects to bring the latest computational techniques to bear directly on questions dedicated to systems biology in the wet laboratory environment. The Network maintains internally close collaboration between experimental and computational research, enabling a permanent cycling of experimental validation and improvement of computational prediction methods. The computational work includes the development of a database infrastructure (EnCORE), bioinformatics analysis methods and a novel platform for protein function analysis FuncNet.
[Show abstract][Hide abstract] ABSTRACT: Over the last 2 years the Gene3D resource has been significantly improved, and is now more accurate and with a much richer interactive display via the Gene3D website (http://gene3d.biochem.ucl.ac.uk/). Gene3D provides accurate structural domain family assignments for over 1100 genomes and nearly 10,000,000 proteins. A hidden Markov model library, constructed from the manually curated CATH structural domain hierarchy, is used to search UniProt, RefSeq and Ensembl protein sequences. The resulting matches are refined into simple multi-domain architectures using a recently developed in-house algorithm, DomainFinder 3 (available at: ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/DomainFinder3/). The domain assignments are integrated with multiple external protein function descriptions (e.g. Gene Ontology and KEGG), structural annotations (e.g. coiled coils, disordered regions and sequence polymorphisms) and family resources (e.g. Pfam and eggNog) and displayed on the Gene3D website. The website allows users to view descriptions for both single proteins and genes and large protein sets, such as superfamilies or genomes. Subsets can then be selected for detailed investigation or associated functions and interactions can be used to expand explorations to new proteins. Gene3D also provides a set of services, including an interactive genome coverage graph visualizer, DAS annotation resources, sequence search facilities and SOAP services.
Nucleic Acids Research 11/2009; 38(Database issue):D296-300. · 8.81 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: The phenotypic effects of sequence variations in protein-coding regions come about primarily via their effects on the resulting structures, for example by disrupting active sites or affecting structural stability. In order better to understand the mechanisms behind known mutant phenotypes, and predict the effects of novel variations, biologists need tools to gauge the impacts of DNA mutations in terms of their structural manifestation. Although many mutations occur within domains whose structure has been solved, many more occur within genes whose protein products have not been structurally characterized.
Here we present 3DSim (3D Structural Implication of Mutations), a database and web application facilitating the localization and visualization of single amino acid polymorphisms (SAAPs) mapped to protein structures even where the structure of the protein of interest is unknown. The server displays information on 6514 point mutations, 4865 of them known to be associated with disease. These polymorphisms are drawn from SAAPdb, which aggregates data from various sources including dbSNP and several pathogenic mutation databases. While the SAAPdb interface displays mutations on known structures, 3DSim projects mutations onto known sequence domains in Gene3D. This resource contains sequences annotated with domains predicted to belong to structural families in the CATH database. Mappings between domain sequences in Gene3D and known structures in CATH are obtained using a MUSCLE alignment. 1210 three-dimensional structures corresponding to CATH structural domains are currently included in 3DSim; these domains are distributed across 396 CATH superfamilies, and provide a comprehensive overview of the distribution of mutations in structural space.
The server is publicly available at http://3DSim.bioinfo.cnio.es/. In addition, the database containing the mapping between SAAPdb, Gene3D and CATH is available on request and most of the functionality is available through programmatic web service access.
[Show abstract][Hide abstract] ABSTRACT: The automated extraction of gene and/or protein interactions from the literature is one of the most important targets of biomedical text mining research. In this paper we present a realistic evaluation of gene/protein interaction mining relevant to potential non-specialist users. Hence we have specifically avoided methods that are complex to install or require reimplementation, and we coupled our chosen extraction methods with a state-of-the-art biomedical named entity tagger.
Our results show: that performance across different evaluation corpora is extremely variable; that the use of tagged (as opposed to gold standard) gene and protein names has a significant impact on performance, with a drop in F-score of over 20 percentage points being commonplace; and that a simple keyword-based benchmark algorithm when coupled with a named entity tagger outperforms two of the tools most widely used to extract gene/protein interactions.
In terms of availability, ease of use and performance, the potential non-specialist user community interested in automatically extracting gene and/or protein interactions from free text is poorly served by current tools and systems. The public release of extraction tools that are easy to install and use, and that achieve state-of-art levels of performance should be treated as a high priority by the biomedical text mining community.
[Show abstract][Hide abstract] ABSTRACT: One of the fastest-growing fields in bioinformatics is text mining: the application of natural language processing techniques to problems of knowledge management and discovery, using large collections of biological or biomedical text such as MEDLINE. The techniques used in text mining range from the very simple (e.g., the inference of relationships between genes from frequent proximity in documents) to the complex and computationally intensive (e.g., the analysis of sentence structures with parsers in order to extract facts about protein-protein interactions from statements in the text). This chapter presents a general introduction to some of the key principles and challenges of natural language processing, and introduces some of the tools available to end-users and developers. A case study describes the construction and testing of a simple tool designed to tackle a task that is crucial to almost any application of text mining in bioinformatics--identifying gene/protein names in text and mapping them onto records in an external database.
Methods in Molecular Biology 02/2008; 453:471-91. · 1.29 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Interest is growing in the application of syntactic parsers to natural language processing problems in biology, but assessing their performance is difficult because differences in linguistic convention can falsely appear to be errors. We present a method for evaluating their accuracy using an intermediate representation based on dependency graphs, in which the semantic relationships important in most information extraction tasks are closer to the surface. We also demonstrate how this method can be easily tailored to various application-driven criteria.
Using the GENIA corpus as a gold standard, we tested four open-source parsers which have been used in bioinformatics projects. We first present overall performance measures, and test the two leading tools, the Charniak-Lease and Bikel parsers, on subtasks tailored to reflect the requirements of a system for extracting gene expression relationships. These two tools clearly outperform the other parsers in the evaluation, and achieve accuracy levels comparable to or exceeding native dependency parsers on similar tasks in previous biological evaluations.
Evaluating using dependency graphs allows parsers to be tested easily on criteria chosen according to the semantics of particular biological applications, drawing attention to important mistakes and soaking up many insignificant differences that would otherwise be reported as errors. Generating high-accuracy dependency graphs from the output of phrase-structure parsers also provides access to the more detailed syntax trees that are used in several natural-language processing techniques.
[Show abstract][Hide abstract] ABSTRACT: It is not clear a priori how well parsers trained on the Penn Treebank will parse significantly different corpora without retraining. We carried out a compet-itive evaluation of three leading tree-bank parsers on an annotated corpus from the human molecular biology do-main, and on an extract from the Penn Treebank for comparison, performing a detailed analysis of the kinds of errors each parser made, along with a quan-titative comparison of syntax usage be-tween the two corpora. Our results sug-gest that these tools are becoming some-what over-specialised on their training domain at the expense of portability, but also indicate that some of the errors en-countered are of doubtful importance for information extraction tasks. Furthermore, our inital experiments with unsupervised parse combination techniques showed that integrating the output of several parsers can ameliorate some of the performance problems they encounter on unfamiliar text, providing accuracy and coverage improvements, and a novel measure of trustworthiness.
[Show abstract][Hide abstract] ABSTRACT: We present MPL (Metapattern Language), a new formalism for defining patterns over dependency-parsed text, and GraphSpider, a matching engine for extracting depen- dency subgraphs which match against MPL patterns. Using a regexp-like syntax, MPL allows the definition of subgraphs matching user-specified patterns which can be con- strained by word or word class, part-of- speech tag, dependency type and direction, and presence of named variables in partic- ular locations. Although MPL and Graph- Spider are general-purpose, we developed a set of patterns to capture biomolecular interactions which achieved very high pre- cision results (92.6% at 31.2% recall) on the LLL Challenge corpus. MPL specifi- cations and pattern sets, and the GraphSpi- der software, are available on SourceForge: http://graphspider.sf.net/