A Chado case study: an ontology-based modular schema for representing genome-associated biological information

Harvard University, Cambridge, Massachusetts, United States
Bioinformatics (Impact Factor: 4.98). 08/2007; 23(13):i337-46. DOI: 10.1093/bioinformatics/btm189
Source: PubMed


Motivation: A few years ago, FlyBase undertook to design a new database schema to store Drosophila data. It would fully integrate genomic sequence and annotation data with bibliographic, genetic, phenotypic and molecular data from the literature representing a distillation of the first 100 years of research on this major animal model system. In developing this new integrated schema, FlyBase also made a commitment to ensure that its design was generic, extensible and available as open source, so that it could be employed as the core schema of any model organism data repository, thereby avoiding redundant software development and potentially increasing interoperability. Our question was whether we could create a relational database schema that would be successfully reused.

Results: Chado is a relational database schema now being used to manage biological knowledge for a wide variety of organisms, from human to pathogens, especially the classes of information that directly or indirectly can be associated with genome sequences or the primary RNA and protein products encoded by a genome. Biological databases that conform to this schema can interoperate with one another, and with application software from the Generic Model Organism Database (GMOD) toolkit. Chado is distinctive because its design is driven by ontologies. The use of ontologies (or controlled vocabularies) is ubiquitous across the schema, as they are used as a means of typing entities. The Chado schema is partitioned into integrated subschemas (modules), each encapsulating a different biological domain, and each described using representations in appropriate ontologies. To illustrate this methodology, we describe here the Chado modules used for describing genomic sequences.
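The idea of typing entities through an ontology can be illustrated with a small sketch. The example below is not the Chado DDL: it is a heavily simplified, assumed subset of four tables (`cv`, `cvterm`, `feature`, `featureloc`) with only a few columns each, loaded into an in-memory SQLite database. The row values (`chr_demo`, `gene_demo`, `mrna_demo`) and the helper names `build_demo` and `features_of_type` are hypothetical and exist only for illustration.

```python
import sqlite3

# Simplified sketch loosely modeled on Chado's cv, cvterm, feature and
# featureloc tables. Column lists are deliberately cut down; this is an
# assumption-laden illustration, not the real schema.
SCHEMA = """
CREATE TABLE cv (
    cv_id INTEGER PRIMARY KEY,
    name  TEXT NOT NULL UNIQUE
);
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    cv_id     INTEGER NOT NULL REFERENCES cv(cv_id),
    name      TEXT NOT NULL
);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type_id    INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
CREATE TABLE featureloc (
    featureloc_id INTEGER PRIMARY KEY,
    feature_id    INTEGER NOT NULL REFERENCES feature(feature_id),
    srcfeature_id INTEGER NOT NULL REFERENCES feature(feature_id),
    fmin          INTEGER NOT NULL,
    fmax          INTEGER NOT NULL,
    strand        INTEGER
);
"""


def build_demo():
    """Create an in-memory database with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO cv (cv_id, name) VALUES (1, 'sequence')")
    # Every kind of sequence feature is a term in a controlled vocabulary,
    # not a separate table.
    conn.executemany(
        "INSERT INTO cvterm (cvterm_id, cv_id, name) VALUES (?, 1, ?)",
        [(1, "chromosome"), (2, "gene"), (3, "mRNA")],
    )
    conn.executemany(
        "INSERT INTO feature (feature_id, uniquename, type_id) VALUES (?, ?, ?)",
        [(1, "chr_demo", 1), (2, "gene_demo", 2), (3, "mrna_demo", 3)],
    )
    # Locate the gene and its transcript on the chromosome feature.
    conn.executemany(
        "INSERT INTO featureloc (feature_id, srcfeature_id, fmin, fmax, strand) "
        "VALUES (?, ?, ?, ?, ?)",
        [(2, 1, 100, 900, 1), (3, 1, 100, 900, 1)],
    )
    return conn


def features_of_type(conn, type_name):
    """Return (feature, source feature, fmin, fmax) for one ontology term name."""
    rows = conn.execute(
        """
        SELECT f.uniquename, src.uniquename, fl.fmin, fl.fmax
        FROM feature f
        JOIN cvterm t      ON t.cvterm_id = f.type_id
        JOIN featureloc fl ON fl.feature_id = f.feature_id
        JOIN feature src   ON src.feature_id = fl.srcfeature_id
        WHERE t.name = ?
        ORDER BY f.uniquename
        """,
        (type_name,),
    )
    return rows.fetchall()
```

In this sketch, adding a new kind of feature means inserting a vocabulary term rather than altering the schema, which is one way to read the abstract's claim that ontologies are used "as a means of typing entities". Calling `features_of_type(build_demo(), "gene")` returns the located gene row together with the chromosome it sits on.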

  • Source
    • "To annotate a gene, curators commonly proceed by: (1) locating the region of interest; (2) inspecting all available gene predictions and biological evidence aligned to the region; (3) creating a gene model; (4) if necessary, modifying these gene models using the editing functions; (5) corroborating the accuracy of the annotation by comparing the resulting annotation with available homologs; and (6) ensuring that correct naming conventions and relevant comments have been added, utilizing available literature as needed. Importing genomic data: Using server-side middleware, the system can load data tracks from a variety of sources, including the UCSC genome database [23], Chado databases [24], Ensembl DAS [25], and GenBank XML [26]. In our recent experience, however, the most common sources of genomic information are the laboratories of individual researchers themselves and therefore we focused our attention on direct loading of genomic data files. "
    Full-text · Dataset · Apr 2015
  • Source
    • "Data models devised in this study cover all the data manipulations and data types supported by the UGENE platform. However, other projects aimed at the development of a comprehensive relational data model for genomic data exist, for example, GMOD Chado [13], BioSQL [14] and BioMart [15]. Although, they may provide access to specific data types that are not supported by UGENE, certain interfaces to such external systems can be integrated into future versions of UGENE in order to achieve interoperability with existing data storages. "
    ABSTRACT: Unipro UGENE is an open-source bioinformatics toolkit that integrates popular tools along with original instruments for molecular biologists within a unified user interface. Nowadays, most bioinformatics desktop applications, including UGENE, make use of a local data model while processing different types of data. Such an approach causes an inconvenience for scientists working cooperatively and relying on the same data. This refers to the need of making multiple copies of certain files for every workplace and maintaining synchronization between them in case of modifications. Therefore, we focused on delivering a collaborative work into the UGENE user experience. Currently, several UGENE installations can be connected to a designated shared database and users can interact with it simultaneously. Such databases can be created by UGENE users and be used at their discretion. Objects of each data type, supported by UGENE such as sequences, annotations, multiple alignments, etc., can now be easily imported from or exported to a remote storage. One of the main advantages of this system, compared to existing ones, is the almost simultaneous access of client applications to shared data regardless of their volume. Moreover, the system is capable of storing millions of objects. The storage itself is a regular database server so even an inexpert user is able to deploy it. Thus, UGENE may provide access to shared data for users located, for example, in the same laboratory or institution. UGENE is available at:
    Preview · Article · Jan 2015
  • Source
    • "In addition, the GDR has been rebuilt using Tripal, a toolkit for construction of online biological databases (4,5). Tripal uses the generic, modular, ontology-driven and open-source database schema called Chado (6). In addition to storage of genomic and genetic data, Chado also enables storage of large-scale phenotypic and genotypic data using the recently added Natural Diversity tables (7). "
    ABSTRACT: The Genome Database for Rosaceae (GDR, http:/, the long-standing central repository and data mining resource for Rosaceae research, has been enhanced with new genomic, genetic and breeding data, and improved functionality. Whole genome sequences of apple, peach and strawberry are available to browse or download with a range of annotations, including gene model predictions, aligned transcripts, repetitive elements, polymorphisms, mapped genetic markers, mapped NCBI Rosaceae genes, gene homologs and association of InterPro protein domains, GO terms and Kyoto Encyclopedia of Genes and Genomes pathway terms. Annotated sequences can be queried using search interfaces and visualized using GBrowse. New expressed sequence tag unigene sets are available for major genera, and Pathway data are available through FragariaCyc, AppleCyc and PeachCyc databases. Synteny among the three sequenced genomes can be viewed using GBrowse_Syn. New markers, genetic maps and extensively curated qualitative/Mendelian and quantitative trait loci are available. Phenotype and genotype data from breeding projects and genetic diversity projects are also included. Improved search pages are available for marker, trait locus, genetic diversity and publication data. New search tools for breeders enable selection comparison and assistance with breeding decision making.
    Full-text · Article · Nov 2013 · Nucleic Acids Research