Vol. 23 ISMB/ECCB 2007, pages i337–i346
A Chado case study: an ontology-based modular schema for
representing genome-associated biological information
Christopher J. Mungall1,*,†, David B. Emmert2,† and The FlyBase Consortium
1Lawrence Berkeley National Laboratory, Lawrence Berkeley National Lab, Mail Stop 64R0121, Berkeley, CA 94720
and 2Harvard University, Molecular and Cell Biology: FlyBase, 16 Divinity Avenue, Cambridge, MA 02138, USA
Motivation: A few years ago, FlyBase undertook to design a new
database schema to store Drosophila data. It would fully integrate
genomic sequence and annotation data with bibliographic, genetic,
phenotypic and molecular data from the literature representing
a distillation of the first 100 years of research on this major animal
model system. In developing this new integrated schema, FlyBase
also made a commitment to ensure that its design was generic,
extensible and available as open source, so that it could be
employed as the core schema of any model organism data
repository, thereby avoiding redundant software development and
potentially increasing interoperability. Our question was whether we
could create a relational database schema that would be success-
Results: Chado is a relational database schema now being used to
manage biological knowledge for a wide variety of organisms, from
human to pathogens, especially the classes of information that
directly or indirectly can be associated with genome sequences or
the primary RNA and protein products encoded by a genome.
Biological databases that conform to this schema can interoperate
with one another, and with application software from the Generic
Model Organism Database (GMOD) toolkit. Chado is distinctive
because its design is driven by ontologies. The use of ontologies
(or controlled vocabularies) is ubiquitous across the schema, as they
are used as a means of typing entities. The Chado schema
is partitioned into integrated subschemas (modules), each encapsu-
lating a different biological domain, and each described using
representations in appropriate ontologies. To illustrate this metho-
dology, we describe here the Chado modules used for describing
Availability: GMOD is a collaboration of several model organism
database groups, including FlyBase, to develop a set of open-source
software for managing model organism data. The Chado schema is
freely distributed under the terms of the Artistic License (http://
Contact: firstname.lastname@example.org or email@example.com.
1.1 On the need for standardized database schemas
Organism-specific genome databases are expertly curated
repositories of data and knowledge concerning a particular
biological species, or a collection of closely related similar
species. These biological databases are typically (but not
always) implemented as relational databases that encode their
domain model using the relational model. A relational database
requires a database management system (DBMS) to access and
update data. Data housed in a database must be modeled
according to a database schema, a computable description of
the data domain, expressed mainly as table definitions. Data
modelers, in conjunction with domain experts, design database
schemas. Users interact with the database via software
applications and user interfaces (often via another layer of
indirection, i.e. an intermediate, middleware layer). The design
of database schemas is time-consuming and labor-intensive. Furthermore, when database
applications are constructed to work with a particular schema
or set of schemas, changes to the database schema may dictate
reciprocal changes to this software. All of this makes schema
evolution a costly affair. From this it would seem to follow that
a small number of stable schemas would be favored over
a plethora of rapidly evolving schemas, and yet the latter is
more common in bioinformatics. Why is this the case?
New knowledge: Schemas must evolve to cope with
changes in requirements. Most critical are the changes in the
nature of the underlying data, which is constantly accruing
and evolving. The nature of biological data has expanded
tremendously over time, ranging from classical genetic studies
performed a century ago (Morgan, 1907), to present-day
genome-scale molecular knowledge. For example, a database
schema built around the one-time central dogma of ‘one gene
codes for one enzyme’ (Beadle and Tatum, 1941) would be
considerably simpler than a schema that accurately represents
our present understanding of the complexities of genetic
information transfer. As our understanding of the natural
world changes over time, the requirements must necessarily
change as well.
New experimental techniques: Concomitant with our
accrual of biological knowledge are the advances in the
methods and materials we use to gain this understanding.
These rapid technological changes place additional requirements
on the schema. During the short time that genetic
*To whom correspondence should be addressed.
†The authors wish it to be known that, in their opinion, the first two
authors should be regarded as joint First Authors.
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
by guest on August 21, 2015
databases have been in existence, we have seen experimental
techniques expand from physical mapping and PCR, to
sequencing of whole genomes, to modern high-throughput
technologies for microarray and proteomics analysis, all of
which place increasing demands on the database schema that
must represent them.
Different organisms: A wide variety of species are used
in research because each offers unique leverage to explore
certain aspects of life. The taxonomic variance of biological
properties, along with the different experimental methods
utilized in these species, adds another dimension to the
requirements. Any given organism is selected for a research
project based on its utility in answering different questions, and
this has made it historically difficult to create a species-blind
schema.
Acceptability: Coupled with the innate complexity of
the data; the changing requirements as science and technologies
progress; and the variability between research projects, the
design of stable, shared schemas that are acceptable to a wide
variety of different projects is a challenging task. Even within
the realm of model organism database projects, the historical
tendency has been for each project to design their own schema
de novo, or in some cases to start with an existing schema
and customize it to satisfy a different set of requirements.
Such customizations inevitably lead to divergence, loss of
interoperability and duplication of effort.
Because these factors make it difficult to create schemas that
are stable, schemas are constantly evolving with concomitant
high costs in software maintenance. The challenge of biological
database design is how to keep pace with a moving target.
1.2 Existing approaches to biological schemas
As might be expected, there are a wide variety of approaches in
designing schemas for biological databases. Some of the more
notable schemas, with which we have direct personal experience,
include: ACeDB, ArkDB, Ensembl, Genomics Unified
Schema (GUS) and GadFly (Mungall et al., 2002). ACeDB
(A C. elegans Database) was one of the first model organism
databases. It was built for Caenorhabditis elegans (Durbin and
Thierry-Mieg, 1994) and is actually a DBMS that follows a
hierarchical rather than relational model. ACeDB was adapted
for use in a number of model organism projects (as well as
projects not related to biology at all, testimony to its flexibility).
The ArkDB schema (Hu et al., 2001) was created to serve the
needs of the subset of the model organism community
interested in agriculturally important animals. It has been
successfully used across different species by different commu-
nities, but is rarely used outside the agricultural community.
The Ensembl database system and schema was initially
constructed to analyze the newly sequenced human genome
(Hubbard, 2002) and serve the results to the scientific
community. It has since been adopted by other groups and is
used for a large variety of (primarily chordate) species. Its focus
has also expanded, and now Ensembl includes a variety of
federated databases accessible through EnsMart. Like the
other databases, GUS was built for a specific purpose, in this
case transcript analysis and the needs of the Plasmodium research
community, and has since been extended to serve additional
communities. GadFly was designed to serve as a repository of
Drosophila genomic annotations, but was also used to hold
honeybee and Anopheles annotations.
1.3 Ontologies and terminologies
Chado differs from the schemas mentioned above by the
centrality of ontologies and terminologies as a core component.
Chado uses ontologies not just for annotation of biological
entities, but also as a schema-wide entity-typing and entity-
relationship-typing mechanism. This methodology of ontology-
driven design is explored in this article. We contend that it is the
key to the success and flexibility of Chado and of its adoption
by a wide variety of research projects. In other schemas with
which we have experience, typing of the data is enforced at the
relational layer. In Chado, in contrast, data typing is driven
by ontologies in the controlled vocabularies module, and this
makes it possible for the same schema and application to be
reused and to evolve over time.
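The contrast between typing at the relational layer and typing through a controlled vocabulary can be sketched in a few lines. The example below is only an illustration under stated assumptions: the table and column names (`cvterm`, `feature`, `type_id`) and the helper functions are simplified and hypothetical, chosen for this sketch, and should not be read as the actual Chado DDL. It uses SQLite through Python purely for convenience. The point it tries to show is that a new kind of entity is introduced by adding a row to the vocabulary table rather than by adding a new table to the schema.

```python
import sqlite3

# Simplified, hypothetical tables for illustration only; this is not the
# actual Chado DDL. Entity types live in a vocabulary table, and every
# entity row points at its type.
SCHEMA = """
CREATE TABLE cvterm (
    cvterm_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL UNIQUE
);
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type_id    INTEGER NOT NULL REFERENCES cvterm(cvterm_id)
);
"""


def build_demo_db():
    """Create an in-memory database holding the two demo tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def add_term(conn, name):
    """Register a new type in the vocabulary and return its id."""
    cur = conn.execute("INSERT INTO cvterm (name) VALUES (?)", (name,))
    return cur.lastrowid


def add_feature(conn, uniquename, type_name):
    """Store an entity whose type is looked up in the vocabulary."""
    row = conn.execute(
        "SELECT cvterm_id FROM cvterm WHERE name = ?", (type_name,)
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown type: {type_name}")
    conn.execute(
        "INSERT INTO feature (uniquename, type_id) VALUES (?, ?)",
        (uniquename, row[0]),
    )


def features_of_type(conn, type_name):
    """Return the names of all entities typed with the given term."""
    rows = conn.execute(
        "SELECT f.uniquename FROM feature f "
        "JOIN cvterm t ON t.cvterm_id = f.type_id "
        "WHERE t.name = ? ORDER BY f.uniquename",
        (type_name,),
    ).fetchall()
    return [r[0] for r in rows]


conn = build_demo_db()
for term in ("gene", "mRNA"):  # illustrative type names
    add_term(conn, term)
add_feature(conn, "gene-1", "gene")
add_feature(conn, "transcript-1", "mRNA")

# A new kind of entity needs only a new vocabulary row, not a new table.
add_term(conn, "ncRNA")
add_feature(conn, "transcript-2", "ncRNA")
```

In a schema typed at the relational layer, the last two statements would instead require a new table and, most likely, changes to any application code written against the old set of tables.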
2 SYSTEM AND METHODS
The Chado package uses PostgreSQL and Perl. In addition to the
Chado DDL (Data Definition Language), installation requires three
additional Perl packages: bioperl-live, go-perl and DBIx::DBStag.
To install on Fedora Core 1-5, OS X or CentOS 4 you may use the
RPM packages for installing Chado, and its prerequisites, provided by
Allen Day (http://biopackages.net). Otherwise installation requires
checking out the Chado package via anonymous CVS and performing
a series of command line operations. Instantiations of Chado in Oracle
or MySQL idiom are also available.
3 DESIGN APPROACH
Because Chado makes extensive use of ontologies (also known
as controlled vocabularies1) as a means of typing entities in the
schema, and as metadata for extensible data properties,
an appreciation of the fundamentals of ontologies and how
they are coded in the Chado schema is required. The rationale
for this approach is twofold. It both addresses the significant
issue of constantly evolving requirements and provides support
for reasoning. An ontology is a representation of the different
types of entity that exist in the world, and the relationships that
hold between these entities. Examples would be the anatomical
type ‘eye’ or the process type ‘cysteine biosynthesis’. These
types stand in certain relationships to one another; for example,
‘eye is_a sense organ’ or ‘ommatidium part_of compound eye’.
The relationships in an ontology can be represented as a graph
(often, but not always a directed acyclic graph, or DAG).
Smith et al. (2005) provide a
detailed treatment of relationship types in biological ontologies.
Of particular interest to Chado is the is_a relation, which specifies
a subtyping relationship between two terms or classes. It is the
relations that exist between the types in the ontology that
supply a means of supporting reasoning.
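As a minimal sketch of the kind of reasoning that subtyping relations support, the example below stores relationships as (subject, relation, object) triples and follows only is_a edges to collect every supertype of a term. The first two triples are the examples given in the text. The third, 'compound eye is_a eye', is an assumption added here purely so that the chain has more than one step. The function names are likewise hypothetical and are not part of Chado.

```python
from collections import deque

# (subject, relation, object) triples. The first two come from the text;
# the third is an illustrative assumption.
TRIPLES = [
    ("eye", "is_a", "sense organ"),
    ("ommatidium", "part_of", "compound eye"),
    ("compound eye", "is_a", "eye"),  # assumption, for illustration only
]


def is_a_ancestors(term, triples):
    """Return every type reachable from `term` by following is_a edges."""
    parents = {}
    for subj, rel, obj in triples:
        if rel == "is_a":
            parents.setdefault(subj, set()).add(obj)

    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:  # guards against revisiting shared parents
                seen.add(parent)
                queue.append(parent)
    return seen


def is_subtype_of(term, candidate, triples):
    """True if `candidate` is a direct or inferred is_a ancestor of `term`."""
    return candidate in is_a_ancestors(term, triples)
```

Under these assumptions, a query for everything typed as 'sense organ' could also return entities typed as 'compound eye', even though that link is never stated directly. The part_of edge is deliberately not followed, since it does not express subtyping.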
1 In fact there are crucial differences between ontologies and vocabularies.
However, not everyone agrees on what these are. For the
purposes of this article it is simpler to gloss over these differences.
C.J.Mungall et al.
ACKNOWLEDGEMENTS
The authors gratefully acknowledge the support and advice
from all of our colleagues in the FlyBase Consortium, whose
expertise and initiative led to Chado’s development. We also
gratefully acknowledge the input from everyone in the
GMOD Consortium, especially Scott Cain, Brian Osborne
and Guanming Wu for carrying this work forward for
implementation by other database groups. FlyBase is supported
by a grant from the Public Health Services (NIH grant 5P41
HG000739, through the National Human Genome Research
Institute, W. Gelbart, PI) with additional support for Chado
development from the HHMI (G. Rubin, PI). Finally, we are
extremely appreciative to Chado’s users whose feedback and
support continues to improve Chado and its associated
FlyBase consortium contributions: Current and former FlyBase
Consortium members making notable contributions to this
project are William M. Gelbart, Aubrey de Grey, Stan
Letovsky, Suzanna E. Lewis, Gerald M. Rubin, ShengQiang
Shu, Colin Wiel, Peili Zhang and Pinglei Zhou.
Conflict of Interest: none declared.
REFERENCES
Arnaiz,O. et al. (2007) ParameciumDB: a community resource that integrates
the Paramecium tetraurelia genome sequence with genetic data. Nucleic Acids
Res., 35, D439–D444.
Ashburner,M. et al. (2000) Gene ontology: tool for the unification of biology.
The gene ontology consortium. Nat. Genet., 25, 25–29.
Bard,J. et al. (2005) An ontology for cell types. Genome Biol., 6, R21.
Beadle,G.W. and Tatum,E.L. (1941) Genetic control of biochemical reactions in
neurospora. Proc. Natl Acad. Sci., 27, 499–506.
Brickley,D. and Guha,RV. (2000) Resource description framework (RDF)
schema specification 1.0, W3C Candidate Recommendation.
Clark,T. et al. (2004) Globally distributed object identification for biological
knowledgebases. Brief Bioinform., 5, 59–70.
Durbin,R. and Thierry-Mieg,J. (1994) ACeDB. Computational Methods in
Genome Research. Plenum, New York.
Eilbeck,K. and Lewis,S. (2004) Sequence ontology annotation guide.
Comp. Funct. Genomics, 5, 642–647.
Eilbeck,K. et al. (2005) The sequence ontology: a tool for the unification of
genome annotations. Genome Biol., 6, R44.
Harris,MA. et al. (2004) The Gene Ontology (GO) database and informatics
resource. Nucleic Acids Res., 32, D258–D261.
Higgins,D. et al. (1994) CLUSTALW: improving the sensitivity of progressive
multiple sequence alignment through sequence weighting, position-specific
gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
Hoskins,RA. et al. (2002) Heterochromatic sequences in a Drosophila whole-
genome shotgun assembly. Genome Biol., 3, RESEARCH0085.
Hu,J. et al. (2001) The ARKdb: genome databases for farmed and other animals.
Nucleic Acids Res., 29, 106–110.
Hubbard,T. et al. (2005) Ensembl. Nucleic Acids Res., 33, D447–D453.
Lewis,SE. et al. (2002) Apollo: a sequence annotation editor. Genome Biol., 3,
Morgan,TH. (1907) The cause of gynandromorphism in insects. Am. Nat., 41,
Mungall,CJ. et al. (2002) An integrated computational pipeline and
database to support whole-genome sequence annotation. Genome Biol.,
Smith,B. et al. (2005) Relations in biomedical ontologies. Genome Biol., 6, R46.
Stajich,JE. et al. (2002) The Bioperl toolkit: Perl modules for the life sciences.
Genome Res., 12, 1611–1618.
Stajich,JE. and Lapp,H. (2006) Open source tools and toolkits for bioinformatics:
significance, and where are we? Brief Bioinform., 7, 287–296.
Stein,LD. et al. (2002) The generic genome browser: a building block for a model
organism system database. Genome Res., 12, 1599–1610.
Wang,L. et al. (2007) BeetleBase: the model organism database for Tribolium
castaneum. Nucleic Acids Res., 35, D476–D479.
Yandell,M. et al. (2006) Large-scale trends in the evolution of gene structures
within 11 animal genomes. PLoS Comput. Biol., 2.