-
Ruihua Fang,
Gary Schindelman,
Kimberly Van Auken,
Jolene Fernandes,
Wen Chen,
Xiaodong Wang,
Paul Davis,
Mary Ann Tuli,
Steven J Marygold,
Gillian Millburn,
Beverley Matthews,
Haiyan Zhang,
Nick Brown,
William M Gelbart,
Paul W Sternberg
ABSTRACT: Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify, from all published literature, the papers that contain results for the specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest, and is thus usually time-consuming. We developed an automatic method, based on the Support Vector Machine (SVM) machine learning method, for identifying papers containing these curation data types among a large pool of published scientific papers. This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in production use for automatic categorization of 10 different experimental data types in the biocuration process at WormBase for the past two years, and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community, thereby greatly reducing time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure that uses training papers of similar data types from different bodies of literature, such as C. elegans and D. melanogaster, to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance.
We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genome Informatics (MGI). It is being used in the curation workflow at WormBase for automatic association of newly published papers with ten data types: RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction.
Our methods are applicable to a variety of data types with training sets containing several hundred to a few thousand documents. They are completely automatic and thus can be readily incorporated into different workflows at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.
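The idea in the abstract above (a linear SVM text classifier for one data type, trained on papers pooled from more than one corpus) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a bag-of-words representation and a Pegasos-style sub-gradient trainer written in plain Python, and the tiny "worm" and "fly" example sentences, the function names, and the parameter values are all invented for illustration.

```python
import random
import re
from collections import Counter


def featurize(text):
    """Bag-of-words vector with L2 normalization (an assumed representation)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = sum(c * c for c in counts.values()) ** 0.5 or 1.0
    return {word: c / norm for word, c in counts.items()}


def score(weights, text):
    """Signed distance-like score of a document under the linear model."""
    return sum(weights.get(k, 0.0) * v for k, v in featurize(text).items())


def train_linear_svm(examples, lam=0.01, epochs=200, seed=0):
    """Pegasos-style stochastic sub-gradient descent on the hinge loss.

    `examples` is a list of (text, label) pairs with label +1 or -1.
    """
    rng = random.Random(seed)
    data = [(featurize(text), label) for text, label in examples]
    weights = {}
    step = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            step += 1
            eta = 1.0 / (lam * step)
            margin = y * sum(weights.get(k, 0.0) * v for k, v in x.items())
            # Regularization shrink, applied to every existing weight.
            shrink = 1.0 - eta * lam
            for k in weights:
                weights[k] *= shrink
            # Hinge-loss update only for examples inside the margin.
            if margin < 1.0:
                for k, v in x.items():
                    weights[k] = weights.get(k, 0.0) + eta * y * v
    return weights


def predict(weights, text):
    return 1 if score(weights, text) > 0 else -1


# Hypothetical training snippets for the "RNAi" data type (+1 = contains RNAi data).
worm_examples = [
    ("RNAi by feeding of unc-22 dsRNA caused a twitching phenotype in C. elegans", 1),
    ("Crystal structure of a nematode collagen domain at high resolution", -1),
    ("Genome sequence assembly and gene prediction for a nematode", -1),
]
fly_examples = [
    ("dsRNA mediated RNAi knockdown in Drosophila S2 cells reduced transcript levels", 1),
    ("RNAi screen in Drosophila identifies regulators of wing growth", 1),
    ("Antibody staining shows protein localization in Drosophila wing discs", -1),
]

# Pool training papers from both corpora, as the abstract describes for
# data types where a single corpus has too few positive examples.
model = train_linear_svm(worm_examples + fly_examples)

print(predict(model, "RNAi knockdown with dsRNA"))
print(predict(model, "crystal structure, genome sequence, antibody staining"))
```

A real deployment would use full-text papers, hundreds to thousands of training documents per data type as the abstract states, and a tested SVM library rather than this hand-written trainer.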
BMC Bioinformatics 01/2012; 13:16. · 2.75 Impact Factor
-
Karen Yook,
Todd W Harris,
Tamberlyn Bieri,
Abigail Cabunoc,
Juancarlos Chan,
Wen J Chen,
Paul Davis,
Norie de la Cruz,
Adrian Duong,
Ruihua Fang, [......],
Daniel Wang,
Xiaodong Wang,
Gary Williams,
Jonathan Hodgkin,
Matthew Berriman,
Richard Durbin,
Paul Kersey,
John Spieth,
Lincoln Stein,
Paul W Sternberg
ABSTRACT: Since its release in 2000, WormBase (http://www.wormbase.org) has grown from a small resource focusing on a single species and serving a dedicated research community, to one now spanning 15 species essential to the broader biomedical and agricultural research fields. To enhance the rate of curation, we have automated the identification of key data in the scientific literature and use similar methodology for data extraction. To ease access to the data, we are collaborating with journals to link entities in research publications to their report pages at WormBase. To facilitate discovery, we have added new views of the data, integrated large-scale datasets and expanded descriptions of models for human disease. Finally, we have introduced a dramatic overhaul of the WormBase website for public beta testing. Designed to balance complexity and usability, the new site is species-agnostic, highly customizable, and interactive. Casual users and developers alike will be able to leverage the public RESTful application programming interface (API) to generate custom data mining solutions and extensions to the site. We report on the growth of our database and on our work in keeping pace with the growing demand for data, efforts to anticipate the requirements of users and new collaborations with the larger science community.
Nucleic Acids Research 11/2011; 40(Database issue):D735-41. · 8.03 Impact Factor
-
Chris Stark,
Bobby-Joe Breitkreutz,
Andrew Chatr-Aryamontri,
Lorrie Boucher,
Rose Oughtred,
Michael S Livstone,
Julie Nixon,
Kimberly Van Auken,
Xiaodong Wang,
Xiaoqi Shi,
Teresa Reguly,
Jennifer M Rust,
Andrew Winter,
Kara Dolinski,
Mike Tyers
ABSTRACT: The Biological General Repository for Interaction Datasets (BioGRID) is a public database that archives and disseminates genetic and protein interaction data from model organisms and humans (http://www.thebiogrid.org). BioGRID currently holds 347,966 interactions (170,162 genetic, 177,804 protein) curated from both high-throughput data sets and individual focused studies, as derived from over 23,000 publications in the primary literature. Complete coverage of the entire literature is maintained for budding yeast (Saccharomyces cerevisiae), fission yeast (Schizosaccharomyces pombe) and thale cress (Arabidopsis thaliana), and efforts to expand curation across multiple metazoan species are underway. The BioGRID houses 48,831 human protein interactions that have been curated from 10,247 publications. Current curation drives are focused on particular areas of biology to enable insights into conserved networks and pathways that are relevant to human health. The BioGRID 3.0 web interface contains new search and display features that enable rapid queries across multiple data types and sources. An automated Interaction Management System (IMS) is used to prioritize, coordinate and track curation across international sites and projects. BioGRID provides interaction data to several model organism databases, resources such as Entrez-Gene and other interaction meta-databases. The entire BioGRID 3.0 data collection may be downloaded in multiple file formats, including PSI MI XML. Source code for BioGRID 3.0 is freely available without any restrictions.
Nucleic Acids Research 11/2010; 39(Database issue):D698-704. · 8.03 Impact Factor
-
Todd W Harris,
Igor Antoshechkin,
Tamberlyn Bieri,
Darin Blasiar,
Juancarlos Chan,
Wen J Chen,
Norie De La Cruz,
Paul Davis,
Margaret Duesbury,
Ruihua Fang, [......],
Mary Ann Tuli,
Kimberly Van Auken,
Daniel Wang,
Xiaodong Wang,
Gary Williams,
Karen Yook,
Richard Durbin,
Lincoln D Stein,
John Spieth,
Paul W Sternberg
ABSTRACT: WormBase (http://www.wormbase.org) is a central data repository for nematode biology. Initially created as a service to the Caenorhabditis elegans research field, WormBase has evolved into a powerful research tool in its own right. In the past 2 years, we expanded WormBase to include the complete genomic sequence, gene predictions and orthology assignments from a range of related nematodes. These comparative data enrich the C. elegans data with improved gene predictions and a better understanding of gene function. In turn, they bring the wealth of experimental knowledge of C. elegans to other systems of medical and agricultural importance. Here, we describe new species and data types now available at WormBase. In addition, we detail enhancements to our curatorial pipeline and website infrastructure to accommodate new genomes and an extensive user base.
Nucleic Acids Research 11/2009; 38(Database issue):D463-7. · 8.03 Impact Factor
-
Anthony Rogers,
Igor Antoshechkin,
Tamberlyn Bieri,
Darin Blasiar,
Carol Bastiani,
Payan Canaran,
Juancarlos Chan,
Wen J Chen,
Paul Davis,
Jolene Fernandes, [......],
Mary Ann Tuli,
Kimberly Van Auken,
Daniel Wang,
Xiaodong Wang,
Gary Williams,
Karen Yook,
Richard Durbin,
Lincoln D Stein,
John Spieth,
Paul W Sternberg
ABSTRACT: WormBase (www.wormbase.org) is the major publicly available database of information about Caenorhabditis elegans, an important system for basic biological and biomedical research. Derived from the initial ACeDB database of C. elegans genetic and sequence information, WormBase now includes genomic, anatomical and functional information about C. elegans, other Caenorhabditis species and other nematodes. As such, it is a crucial resource not only for C. elegans biologists but also for the larger biomedical and bioinformatics communities. Coverage of core areas of C. elegans biology will allow the biomedical community to make full use of the results of intensive molecular genetic analysis and functional genomic studies of this organism. Improved search and display tools, wider cross-species comparisons and extended ontologies are some of the features that will help scientists extend their research and take advantage of other nematode species genome sequences.
Nucleic Acids Research 02/2008; 36(Database issue):D612-7. · 8.03 Impact Factor
-
Tamberlyn Bieri,
Darin Blasiar,
Philip Ozersky,
Igor Antoshechkin,
Carol Bastiani,
Payan Canaran,
Juancarlos Chan,
Nansheng Chen,
Wen J Chen,
Paul Davis, [......],
Will Spooner,
Mary Ann Tuli,
Kimberly Van Auken,
Daniel Wang,
Xiaodong Wang,
Gary Williams,
Richard Durbin,
Lincoln D Stein,
Paul W Sternberg,
John Spieth
ABSTRACT: WormBase (http://wormbase.org), a model organism database for Caenorhabditis elegans and other related nematodes, continues to evolve and expand. Over the past year WormBase has added new data on C. elegans, including data on classical genetics, cell biology and functional genomics; expanded the annotation of closely related nematodes with a new genome browser for Caenorhabditis remanei; and deployed new hardware for stronger performance. Several existing datasets, including phenotype descriptions and RNAi experiments, have seen a large increase in new content. New datasets such as the C. remanei draft assembly and annotations, the Vancouver Fosmid library and TEC-RED 5' end sites are now available as well. Accessing and searching WormBase have become more dependable and flexible via multiple mirror sites and indexing through Google.
Nucleic Acids Research 02/2007; 35(Database issue):D506-10. · 8.03 Impact Factor