
Database Development in Toxicogenomics: Issues and Efforts


Genomics and Risk Assessment
Toxicology, the study of poisons, focuses on substances and treatments that cause adverse effects in living things. A critical part of this study is the characterization of the adverse effects at the level of the organism, the tissue, the cell, and the molecular makeup of the cell. Thus, studies in toxicology measure effects on body weight and food consumption of an organism, on individual organ weights, on microscopic histopathology of tissues, and on cell viability, necrosis, and apoptosis. Recently added to the arsenal of end points that such toxicological studies can use is the measurement of levels of the thousands of proteins and mRNAs present in the cell. The former measurement was made possible with the advent of two-dimensional gel electrophoresis and forms the basis of the field of proteomics. The latter measurement was made possible with the advent of whole genomic sequencing and the subsequent development of microarrays capable of measuring thousands of transcripts at once and is best described as transcript profiling, although it has often been referred to as genomics or transcriptomics.

The application of these technologies to toxicology is based on the assumption that the sequelae of events leading to adverse effects at the cellular and organism levels will include critical changes in certain mRNAs and proteins. Consequently, these changes may give insight into the molecular mechanisms of toxicity and/or may be diagnostic for a given mode of toxicity. Thus the number of toxicology studies incorporating either proteomics or transcript profiling has been increasing exponentially for several years.
Although both proteomics and transcript profiling measure molecular events at global and cellular levels, the two are dramatically different in both technology and readout. Proteomics relies on the physical separation of all the proteins of a sample, usually by means of two separate characteristics such as charge and molecular weight, followed by detection of the proteins with a dye and, finally, identification by means of mass spectrometry. Transcript profiling with microarrays makes use of hundreds to thousands of defined probes, each of which is intended to detect a single mRNA molecule. The mRNA sample is labeled and hybridized to the microarray such that the signal at a given probe is related to the amount of that particular mRNA in the sample. This readout characteristic makes microarray-based transcript profiling particularly appealing because the identities of the signals are predetermined. In this sense, data generated in transcript profiling experiments are rather straightforward. However, because of the relatively poor annotation of expressed genes and sequence tags, particularly in the dog and rat, the interpretation of transcript profiling experiments is challenging.
Database Development in Toxicogenomics: Issues and Efforts

William B. Mattes, Syril D. Pettit, Susanna-Assunta Sansone, Pierre R. Bushel, and Michael D. Waters

Pfizer Inc, Groton, Connecticut, USA; ILSI Health and Environmental Sciences Institute, Washington, DC, USA; European Molecular Biology Laboratory–European Bioinformatics Institute, Hinxton, United Kingdom; National Center for Toxicogenomics, National Institute of Environmental Health Sciences, National Institutes of Health, Department of Health and Human Services, Research Triangle Park, North Carolina, USA

Environmental Health Perspectives | VOLUME 112 | NUMBER 4 | March 2004

The marriage of toxicology and genomics has created not only opportunities but also novel informatics challenges. As with the larger field of gene expression analysis, toxicogenomics faces the problems of probe annotation and data comparison across different array platforms. Toxicogenomics studies are generally built on standard toxicology studies generating biological end point data, and as such, one goal of toxicogenomics is to detect relationships between changes in gene expression and in those biological parameters. These challenges are best addressed through data collection into a well-designed toxicogenomics database. A successful publicly accessible toxicogenomics database will serve as a repository for data sharing and as a resource for analysis, data mining, and discussion. It will offer a vehicle for harmonizing nomenclature and analytical approaches and serve as a reference for regulatory organizations to evaluate toxicogenomics data submitted as part of registrations. Such a database would capture the experimental context of in vivo studies with great fidelity such that the dynamics of the dose response could be probed statistically with confidence. This review presents the collaborative efforts among the European Molecular Biology Laboratory–European Bioinformatics Institute ArrayExpress, the International Life Sciences Institute Health and Environmental Sciences Institute, and the National Institute of Environmental Health Sciences National Center for Toxicogenomics Chemical Effects in Biological Systems knowledge base, whose goal is to establish public infrastructure on an international scale, and examines other developments aimed at establishing toxicogenomics databases. In this review we discuss several issues common to such databases: the requirement for identifying minimal descriptors to represent the experiment, the demand for standardizing data storage and exchange formats, the challenge of creating standardized nomenclature and ontologies to describe biological data, the technical problems involved in data upload, the necessity of defining parameters that assess and record data quality, and the development of standardized analytical approaches. Key words: ArrayExpress, bioinformatics, CEBS, database, EBI, HESI, MIAME, NCT, toxicogenomics. Environ Health Perspect 112:495–505 (2004). doi:10.1289/txg.6697 [Online 15 January 2004]

This article is part of the mini-monograph “Application of Genomics to Mechanism-Based Risk Assessment.”

Address correspondence to W.B. Mattes, GeneLogic, Inc., 610 Professional Dr., Gaithersburg, MD 20879 USA. Telephone: (240) 364-6238. Fax: (240) 364-6262. E-mail:

We thank A. Brazma, Microarray Informatics (EMBL–EBI); C. Bradfield, McArdle Laboratory for Cancer Research, University of Wisconsin, Madison, WI; W. Tong, National Center for Toxicological Research, Jefferson, AR; and W. Eastin, National Toxicology Program, National Institute of Environmental Health Sciences, Research Triangle Park, NC, for their review of this manuscript prior to submission. We also thank the microarray informatics team at EMBL–EBI, the Expression Profiler developers, and the ArrayExpress curation and development teams. We especially thank S. Contrino for his contribution to Tox-MIAMExpress. The ArrayExpress project is funded by EMBL, the European Commission [TEMBLOR (The European Molecular Biology Linked Original Resources) grant], the EBI Industry Programme (Biostandards), the CAGE (Compendium of Arabidopsis Gene Expression) consortium, and the Health and Environmental Sciences Institute (HESI) Toxicogenomics Database grant.

The authors declare they have no competing financial interests.

Received 25 August 2003; accepted 12 January 2004.

The field of toxicogenomics integrates the data-rich science of transcript profiling with traditional toxicological end point evaluation. If successfully implemented, this integration has the potential to serve as a powerful synergistic tool for understanding the relationship between gross toxicology and genome-level effects. From its inception the field of transcript profiling using microarrays has, through the sheer volume
of data involved, required incorporation of resources for bioinformatics, data management, and statistical analysis (Bassett et al. 1999; Eisen et al. 1998; Ermolaeva et al. 1998). The addition of toxicology information to these data poses additional and unique informatics challenges. A typical toxicogenomics study might involve an animal study with three dose groups (one vehicle group, one low-dose group, and one high-dose group), two to three sacrifice times, and four to five animals per group. Even if only one tissue is examined per animal, this represents 36–45 arrays per study, not including replicates. In addition, each animal will have associated data on total body and organ weight measurements, clinical chemistry measurements (often up to 25 parameters), and microscopic histopathology for several tissues. The challenges and opportunities for a rigorous toxicogenomics database are the capture, storage, and integration of a large volume of diverse data.
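The arithmetic behind that 36–45 range can be sketched in a few lines. The function name and the one-array-per-tissue-sample assumption are ours, and the quoted range corresponds to the three-sacrifice-time case:

```python
def arrays_per_study(dose_groups, sacrifice_times, animals_per_group, tissues=1):
    """Number of arrays for an in vivo study design, assuming one
    array per tissue sample per animal (no technical replicates)."""
    return dose_groups * sacrifice_times * animals_per_group * tissues

# Three dose groups (vehicle, low, high), three sacrifice times,
# four to five animals per group:
low_end = arrays_per_study(3, 3, 4)    # 36 arrays
high_end = arrays_per_study(3, 3, 5)   # 45 arrays
```

Examining a second tissue per animal would double these counts, which is why data volume grows so quickly in such studies.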
Several commercial ventures, including GeneLogic (Gaithersburg, MD; (Castle et al. 2002), Curagen (New Haven, CT; (Rininger et al. 2000), and Iconix Pharmaceuticals (Mountain View, CA), have developed proprietary databases of this type. In this article we focus on the development of public toxicogenomics databases and the application of international database standards in that process. The authors acknowledge that the review does not include all public databases of microarray toxicogenomics experiments.
Role of Public Toxicogenomics Databases
Although several reports have described software for managing genomic/transcript profiling data at the local or laboratory level (Bumm et al. 2002; Bushel et al. 2001; Ermolaeva et al. 1998; Liao et al. 2000; Stoeckert et al. 2001), there are compelling reasons for the establishment of public databases that house not only such transcript profiling data but also the associated toxicological end points. First and foremost is that such a public warehouse would provide a means for the scientific community to publish and share the data from such experiments to advance understanding of biological systems. These repositories would also serve as a resource for data mining and discovery of expression patterns common to certain experimental conditions. In addition, a public repository would offer the regulatory community a resource for comparison with toxicogenomics data submitted as part of the compound registration process (Petricoin et al. 2002). Deposition of data into public databases has already been proposed as a requirement for journal publication of standard genomics experiments (Anonymous 2002; Ball et al. 2002), and public databases for microarray data have been established (Anonymous 2002; Brazma et al. 2003; Edgar et al. 2002).
Another important function of some public repositories is the promotion of international standards in data organization and nomenclature (Anonymous 2002; Bassett et al. 1999; Brazma et al. 2001; Stoeckert et al. 2002). Particularly in the case of biological data, the establishment of standard ontologies allows uniform analysis of diverse data (Ashburner et al. 2000). Finally, public toxicogenomics databases would also offer the larger toxicology community common resources for comparing analytical tools and discussing experimental approaches. Thus a database that organizes results from diverse laboratories and platforms would allow the identification of experimental practices that introduce undesirable variability into toxicogenomics data. Although there are challenges for developing public databases that combine genomic and toxicological data, the example of an international infrastructure for nucleotide sequence data such as the GenBank/EMBL/DDBJ (European Molecular Biology Laboratory/DNA Data Bank of Japan) collaboration points to the vast benefit that the larger scientific community would reap from them.
Despite the obvious scientific benefits of public toxicogenomics resources, as described below, many technical and logistical issues challenge their implementation. These problems may be broadly classified as a) approaches addressing accuracy and specificity, b) standardization of data inputs, c) methods assuring data quality and comparability, and d) development and design of standardization experiments.
Accuracy and Specificity
The use of advanced data referencing and analysis tools is valuable only to the extent that the data employed by these tools have a high degree of internal accuracy. However, the inherent dynamics of hybridization coupled with the incomplete nature of genomic sequence information create the potential for imprecision and/or error in a transcript profiling experiment even before the assay is run.
Hybridization specificity. At the design stage, each element of a microarray, be it a cDNA clone or oligonucleotide sequence, must be selected from the thousands of entries in sequence databases on the basis of several characteristics (Lockhart et al. 1996; Schena et al. 1995). The first of these is specificity: for example, the mRNA for cytochrome P450 (Cyp) 3A4 (GenBank accession no. NM_017460; is 92% identical to the mRNA for Cyp3A7 (GenBank accession no. NM_000765), and thus a microarray element consisting of a cDNA sequence for Cyp3A7 would be expected to detect Cyp3A4 as well. Similarly, a microarray element may lack specificity because it corresponds to a sequence (e.g., a 3´ untranslated region) common to several alternatively spliced transcripts, for example, the UDP-glucuronyltransferase 1A family, where seven transcripts (UGT1A1, UGT1A3, UGT1A4, UGT1A6, UGT1A7, UGT1A8, and UGT1A9) all share the same 3´ sequences (Burchell et al. 1991). Commercial chip manufacturers have gradually recognized some of these problems and have refined their probe sets accordingly.
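The specificity concern can be illustrated with a toy check: for two sequences already aligned to equal length, a high percent identity flags likely cross-hybridization. The function names and the 90% cutoff are ours, and a real design pipeline would first align candidate sequences with a tool such as BLAST:

```python
def percent_identity(a, b):
    """Percent identity of two pre-aligned, equal-length sequences.
    (The alignment step itself is omitted from this sketch.)"""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def cross_hybridization_risk(a, b, threshold=90.0):
    """Flag probe/target pairs whose identity meets or exceeds a
    chosen threshold (90% here, an illustrative cutoff only)."""
    return percent_identity(a, b) >= threshold
```

By this kind of criterion, the 92%-identical Cyp3A4/Cyp3A7 pair described above would be flagged as a cross-hybridization risk.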
Accuracy of gene sequences in public databases. Many early sequence entries deposited in public sequence databases were the product of less advanced and less accurate sequencing techniques than are currently available. Thus, when multiple sequence entries for a single gene are available, they should be cross-checked against each other to determine the best consensus sequence to use. The formerly common practice of relating a given clone sequence to only one of several possible GenBank accession numbers must be avoided. Finally, sequences must be examined for hybridization characteristics, as abnormally high or low G + C content may skew the signal for that target relative to other targets.
Accuracy of annotation on a microarray platform. Commercial array manufacturers and custom array designers use GenBank, EMBL Nucleotide Sequence Database (, or DDBJ ( gene sequence accession numbers of either cDNA expressed sequence tags (ESTs) or mRNAs, IMAGE clone or RefSeq identifiers (IDs), UniGene cluster numbers, or proprietary/internal accessioning to identify gene features. As such, annotation of the elements present on any given microarray can be a potential confounder for microarray use in toxicology. This problem can arise because entries to sequence databases often predate the standardized gene names and descriptions found in curated resources such as LocusLink (; Pruitt and Maglott 2001) or may not be represented in such resources. Such inconsistency in annotation hampers toxicogenomics in two ways. First, it complicates mechanistic interpretation of transcript changes, as nonstandard
annotation of a sequence element (i.e., gene) may limit the effectiveness of a literature search. Second, differences in annotation both within and between different microarray platforms hamper the ability to compare results obtained with their use. One approach would be to adopt an automated client server–based system that incorporates annotation from several sources and allows those sources to contribute equally in real time to annotation content (Dowell et al. 2001). Another approach (described in this volume) cross-references the GenBank accession number for a given array sequence with UniGene ( and LocusLink (Mattes 2004). This process creates a single identifier number and annotation, making the assumption that LocusLink information (if available) represents the best (i.e., curated) annotation, with UniGene information as an alternative. The single identifier then allows intra- and interplatform comparisons, with the caveat that different probe sequences annotating to the same gene may still give different results based on the hybridization specificity noted above.
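The fallback logic of that cross-referencing step can be sketched as follows. The lookup tables, their contents, and the returned tuple format are hypothetical stand-ins for real annotation sources, not the actual pipeline of Mattes (2004):

```python
def unified_id(accession, locuslink, unigene):
    """Map a platform accession number to a single cross-platform
    identifier, preferring curated LocusLink annotation and falling
    back to UniGene, as described in the text."""
    if accession in locuslink:
        return ("LocusLink", locuslink[accession])
    if accession in unigene:
        return ("UniGene", unigene[accession])
    return ("accession", accession)  # no curated mapping available

# Illustrative (invented) annotation tables:
LOCUSLINK = {"NM_017460": "1576"}
UNIGENE = {"AA123456": "Hs.654391"}
```

A feature with no entry in either resource simply keeps its original accession, preserving the caveat noted above that the mapping is only as good as the curated sources behind it.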
Standardization of Data Inputs
Critical to the utility of a database is the breadth, depth, and uniformity of the information it contains. To address the last issue, the same standard nomenclature and numerical units must be used for different data sets so data may be compared across experiments. To address the first two issues, guidelines must be developed detailing what information must be included in a data set. Minimum Information About a Microarray Experiment (MIAME) guidelines (Brazma et al. 2001) allow sufficient and structured information to be recorded to correctly interpret and replicate the experiments or retrieve and analyze the data. Accordingly, guidelines for journal publication of microarray experiments have been proposed (Ball et al. 2002), along with submission of the data to either of the two existing public repositories: ArrayExpress (; Brazma et al. 2003) or Gene Expression Omnibus (GEO;; Edgar et al. 2002). Several journals now require an accession number (indicating that a data set has been submitted successfully to one of these two public repositories) to be supplied at or before acceptance of publication.
Although current MIAME guidelines address the information content for a variety of microarray experiments, a need for comparable guidelines for the toxicology component of some microarray experiments was identified. To address the additional information content in toxicology studies, the National Institute of Environmental Health Sciences National Center for Toxicogenomics (NIEHS NCT; Research Triangle Park, NC) has partnered with the EMBL–European Bioinformatics Institute (EBI) (Toxicogenomics at EBI,; Hinxton, U.K.); the International Life Sciences Institute (ILSI) Health and Environmental Sciences Institute (HESI) Technical Committee on the Application of Genomics to Mechanism-Based Risk Assessment (; Washington, DC); and, more recently, the National Center for Toxicological Research (NCTR) Center for Toxicoinformatics, U.S. Food and Drug Administration (U.S. FDA) (Jefferson, AR), to initiate the development of guidelines for describing toxicogenomics experiments—MIAME/Tox [see Microarray Gene Expression Data (MGED);]. MIAME/Tox extends MIAME to provide a structured annotation and framework for capturing information associated with the toxicology component of toxicogenomic experiments. MIAME/Tox includes some free-text fields along with controlled vocabularies or external ontologies, specifically regarding species taxonomy, cell types, anatomy terms, histopathology, clinical chemistry, toxicology, and chemical compound nomenclature. An additional objective of MIAME/Tox is to guide the development of toxicogenomics databases and data management software.
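The kind of structured annotation such guidelines call for can be sketched as a simple record type. The field names and the example values below are illustrative only and do not reproduce the actual MIAME/Tox schema:

```python
from dataclasses import dataclass, field

@dataclass
class ToxArrayExperiment:
    """Minimal sketch of a structured toxicogenomics annotation
    record: controlled fields plus free text, as MIAME/Tox combines
    controlled vocabularies with free-text fields."""
    species: str
    compound: str
    dose_mg_per_kg: float
    tissue: str
    histopathology: dict = field(default_factory=dict)      # organ -> coded finding
    clinical_chemistry: dict = field(default_factory=dict)  # analyte -> value
    free_text_notes: str = ""

# An invented example record:
exp = ToxArrayExperiment(
    species="Rattus norvegicus",
    compound="acetaminophen",
    dose_mg_per_kg=500.0,
    tissue="liver",
    histopathology={"liver": "centrilobular necrosis"},
    clinical_chemistry={"ALT (U/L)": 850.0},
)
```

Keeping the coded fields separate from the free text is what makes such records queryable across studies rather than merely archivable.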
One challenge that has arisen as part of the ongoing formulation of the MIAME/Tox structure is the harmonization of diverse ontologies. Pathology observations, both macro- and microscopic, are critical components of toxicological data. To this end pathologists have developed (and are continuing to develop) controlled vocabularies for both human clinical pathology [e.g., the Systematized Nomenclature of Medicine (SNOMED);] and veterinary pathology [e.g., the Society of Toxicologic Pathology (STP)/ILSI,; the National Toxicology Program Pathology Code Table (NTP PCT),]. These efforts obviously do not include gene expression terms or genomic and postgenomic information and may be contrasted with those efforts in the bioinformatics community to develop ontologies [e.g., MGED (; the Gene Ontology; the Human Genome Organisation/Proteomics Standards Initiative (HUGO/PSI;], where the clinical annotation of the samples (anatomy, pathology, clinical pathology) is pending. It is unlikely that a single terminology will cover all domains, and recent effort has been placed on the semantic mapping and the interoperability among terminologies [e.g., the Standards and Ontologies for Functional Genomics (SOFG) mouse anatomy effort;]. However, semantic mapping/interoperability can only be achieved among true ontologies. An ontology, a system for coding knowledge, is a key step toward integration: it is a formal and declarative representation that includes the vocabulary (or names) for referring to the terms in that subject area and the logical statements that describe what the terms are, how they are related to each other, and how they can or cannot be related to each other. An ontology has a definition for each term it includes and is built in a tool that allows export in a standard and machine-readable format. Developing pathology-controlled vocabularies as an ontology will facilitate data exchange with databases that use a different ontology, subject to a semantic mapping, but will require close collaboration between the pathology and bioinformatics communities.
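The distinction drawn above between a flat vocabulary and a true ontology (named terms, definitions, and logical relations among them) can be made concrete with a toy sketch. The term IDs, names, and the single "is_a" relation used here are invented:

```python
class OntologyTerm:
    """A toy ontology term: an identifier, a name, a textual
    definition, and typed relations to other terms (only 'is_a'
    is modeled here)."""
    def __init__(self, term_id, name, definition):
        self.term_id = term_id = name
        self.definition = definition
        self.parents = []  # 'is_a' relations

    def is_a(self, parent):
        self.parents.append(parent)

    def ancestors(self):
        """All terms reachable through 'is_a' relations."""
        seen, stack = [], list(self.parents)
        while stack:
            t = stack.pop()
            if t not in seen:
                seen.append(t)
                stack.extend(t.parents)
        return seen

# Illustrative pathology terms (IDs and definitions are invented):
lesion = OntologyTerm("PATH:0001", "lesion", "any pathological alteration")
necrosis = OntologyTerm("PATH:0002", "necrosis", "death of cells or tissue")
necrosis.is_a(lesion)
```

Because relations are explicit, a query for all "lesion" findings can retrieve "necrosis" automatically; a flat code table cannot support that inference, which is the point of building pathology vocabularies as ontologies.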
Data Quality and Comparability
The old adage “garbage in, garbage out” is constantly reiterated in the world of database development regarding population of data and information. Toxicogenomics databases are not immune to the pitfalls of a poorly guarded data storage system and may contain data of subpar quality and/or insufficient or incorrect biological information to describe or annotate experiments. Data quality metrics for microarray data have been extensively investigated in recent years without any clear consensus in the toxicological community as to which universal standard to adopt (Finkelstein et al. 2002; Gollub et al. 2003; Hessner et al. 2003; Model et al. 2002; Tseng et al. 2001). The difficulty in reaching consensus is due in part to the diversity of existing (and pending) array platforms, data acquisition methods, and normalization procedures. This systematic complexity makes the distillation of consensus data quality standards a significant challenge.
Some of this complexity can be circumvented by storing pixel intensity images (the rawest form of the microarray data) from the array scanners in a data repository. Image processing software could be archived and made available on request to permit reanalysis using a common data acquisition method. In addition, several microarray data analysis tools use the mean, median, or other measure of central tendency of acquired data. As such, the unnormalized or unadjusted data could also be captured in a database. By starting with unadjusted data, the same background subtraction procedure and a common normalization or transformation method could be applied to the data to make assessments of gene expression changes across different microarray platforms more comparable.
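As a minimal sketch of that idea, the pipeline below applies one shared set of steps (flat background subtraction, median scaling, log2 transform) to unadjusted intensities. The specific steps and the floor value are illustrative choices, not a recommended standard:

```python
import math

def normalize(raw, background):
    """Apply a common pipeline to unadjusted intensities: subtract
    a per-array background estimate, scale so the array median is
    1.0, then log2-transform. Starting from raw data, the same
    steps can be applied to output from different platforms."""
    corrected = [max(x - background, 1.0) for x in raw]  # floor at 1 so logs stay defined
    med = sorted(corrected)[len(corrected) // 2]
    return [math.log2(x / med) for x in corrected]

values = normalize([100.0, 200.0, 400.0], background=50.0)
```

Because every platform's data pass through the identical steps, differences that remain are more likely biological or technical in origin than artifacts of divergent preprocessing.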
Although they do not represent consensus standards, the statistical measures and data quality metrics associated with an individual technical platform can be useful tools. This information may be used to compare data quality between two samples run on the same type of technical platform. Alternatively, these measures can be used as qualitative tools for comparison across different technical platforms.
A major impediment to comparing gene expression data and subsequent data quality across platforms is resolving the annotation of the gene features arrayed on the chips. As discussed above (see “Accuracy of annotation on a microarray platform”), when comparing gene expression of features on separate arrays, the problem is ascertaining whether the probes for the genes with equivalent accessioning are actually derived from the same sequence region (start and end base position for oligos, or cDNA fragment in the case of ESTs) of the gene. At worst, the features with the same gene ID actually may be probing different alternative splice variants of the same gene (Murphy 2002; Wolfinger et al. 2001). Data integrity and usability within toxicogenomic databases can thus be improved by a) maintaining the DNA sequence of gene features on the arrays, b) regularly updating gene annotation and description by BLAST sequence analysis, and c) clustering similar gene sequences to reduce or identify redundancy.
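Step c) can be illustrated with a greedy single-pass clustering sketch. The crude positional identity measure stands in for a real alignment (e.g., BLAST, as suggested above), and the 0.9 threshold is ours:

```python
def identity(a, b):
    """Fraction of identical positions over the shorter length;
    a crude stand-in for a real sequence alignment."""
    n = min(len(a), len(b))
    return sum(a[i] == b[i] for i in range(n)) / n if n else 0.0

def cluster_redundant(seqs, threshold=0.95):
    """Greedy single-pass clustering: each sequence joins the first
    cluster whose representative it matches at >= threshold identity,
    otherwise it starts a new cluster. Illustrates flagging redundant
    array features; not a production method."""
    clusters = []  # list of (representative, members)
    for s in seqs:
        for rep, members in clusters:
            if identity(s, rep) >= threshold:
                members.append(s)
                break
        else:
            clusters.append((s, [s]))
    return clusters

groups = cluster_redundant(["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"], threshold=0.9)
```

Features that land in the same cluster can then be flagged for review: they may be redundant probes for one gene, or probes for distinct splice variants that should not be merged.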
In addition to the challenges associated with microarray data, the challenge in development of a toxicogenomics database is also to effectively capture clinical chemistry parameters and histopathological observations in a manner not only practical for relational or object-oriented database structuring but also intuitive for extracting informative association rules. Customary laboratory quality control measures and routine calibration parameters for clinical chemistry profiles must be stored in a toxicogenomics database to effectively assess the quality of the data when modeling gene expression data in conjunction with clinical pathology evaluations. For instance, checking the linearity of a response variable and collecting a control standard curve measurement are extremely useful in assessing the quality of clinical chemistry data and will prove to be imperative for correlating biochemical changes in biosamples with gene expression–level alterations in cell populations.
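The linearity check mentioned above can be sketched as a coefficient of determination for a least-squares line through a calibration series. Real laboratory QC involves more than a single R² value, and the calibration points below are invented:

```python
def r_squared(x, y):
    """Coefficient of determination for a simple least-squares
    line, used here to check the linearity of a standard curve."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    return (sxy ** 2) / (sxx * syy) if sxx and syy else 0.0

# A perfectly linear (invented) calibration series:
r2 = r_squared([0.0, 1.0, 2.0, 3.0], [0.1, 2.1, 4.1, 6.1])
```

Storing such a statistic alongside each clinical chemistry profile lets the database flag analyte measurements taken outside the assay's linear range before they are correlated with expression data.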
Histopathology data result largely from discrete observations drawn from standard nomenclatures or tables of gross and micropathology observations and thus lend themselves well to constructing indicator (class) variables in statistical modeling and data mining algorithms. Pathologists use these descriptors and conventions to describe the essential components of a specific target organ response to a toxicant. Historically, there has not been much agreement on, or standardization of, a common system for pathologists to describe conventional pathological interpretations. Therefore, compatibility among histopathology observations coded by pathologists is a challenge to resolve in a toxicogenomics database and makes the integration of the microarray, clinical chemistry, and histopathology data domains difficult to stage. The use of a sophisticated indexing system to reconcile differing pathological evaluations stored in the toxicogenomics database will theoretically improve the possibility of merging good-quality microarray data with equally precise toxicological information.
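The indicator-variable representation described above can be sketched directly: coded findings become 0/1 columns over a fixed controlled vocabulary. The vocabulary terms here are illustrative:

```python
def indicator_variables(observations, vocabulary):
    """Turn a set of coded histopathology findings into 0/1
    indicator (class) variables over a fixed controlled vocabulary,
    the representation suitable for statistical modeling and data
    mining described in the text."""
    return {term: int(term in observations) for term in vocabulary}

VOCAB = ["necrosis", "hypertrophy", "inflammation"]
row = indicator_variables({"necrosis", "inflammation"}, VOCAB)
```

Note that this only works if all contributing pathologists code against the same vocabulary; findings recorded under divergent terminologies must first be reconciled, which is exactly the harmonization problem discussed above.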
Standardization Experiments
In mid-1999 the membership of the HESI formed a project committee to develop a collaborative scientific program to address issues, challenges, and opportunities afforded by the emerging field of toxicogenomics. This committee, comprising corporate members from the pharmaceutical, agrochemical, chemical, and consumer products industries as well as advisors from academia and government, conducted a program in which common pools of RNA were analyzed in more than 30 different laboratories on both similar and different technical platforms. As reported in this volume, the considerable data set generated by the HESI Genomics Committee has been useful in increasing the understanding of sources of biological and technical variability, the alignment of toxicant-induced transcription changes with the accepted mechanism of action of these agents, and the challenges in the consistent analysis and sharing of the voluminous data sets generated by these approaches (Pennie et al. 2004). The experimental programs have shown that patterns of gene expression relating to biological pathways are robust enough to allow insight into mechanisms even across different platforms and analysis sites. Thus, toxicogenomics experiments within the broad fields of hepatotoxicity, nephrotoxicity, and genotoxicity have determined that known mechanisms and pathways of toxicity can be associated with characteristic gene expression profiles. This data set, including both genomic and toxicology data, is currently being deposited in EBI’s ArrayExpress (see discussion below).
In a parallel effort the NIEHS Division
of Extramural Research and Training
(DERT), under the auspices of the NIEHS
NCT, initiated the Toxicogenomics
Research Consortium (TRC; http://www. in
November 2000 to serve as the extramural
research arm of the NIEHS NCT. The
TRC consists of several academic institu-
tions, and its primary goal is to perform
investigator-initiated molecular toxicology
research using current gene expression
technologies. In addition to these indepen-
dent research projects, all centers partici-
pate along with the NIEHS Microarray
Group in two types of collaborative
research projects, standardization experi-
ments and collaborative toxicology, or
Science To Achieve Results (STAR), pro-
jects. The STAR projects involve investiga-
tors from two or more centers conducting
collaborative toxicology research using gene
expression profiling. Through a series of
experiments, researchers are evaluating
sources of technical variation in gene
expression experiments, with an eye toward
establishing standards for evaluating com-
petency and quality of gene expression data
across multiple technology platforms and
research centers. Findings from the first
standardization experiment indicate that
individual centers can identify differentially
expressed genes in standard RNA samples
with moderate to high correlation across a
variety of microarray platforms. Conversely,
the greatest variation is observed when gene
expression experiments are conducted
across multiple centers using one or more
microarray platforms (unpublished data).
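Cross-platform agreement of this kind is usually summarized as a correlation of per-gene measurements. A minimal sketch in pure Python, using a Pearson correlation on synthetic log-ratio values (all numbers invented for illustration):

```python
# Cross-platform agreement sketch: Pearson correlation of per-gene
# log2 ratios measured for the same RNA samples on two hypothetical
# microarray platforms. All values are synthetic.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# log2 ratios (treated vs. control) for the same five genes on two platforms
platform_a = [1.8, -0.4, 0.1, 2.3, -1.1]
platform_b = [1.5, -0.6, 0.3, 2.0, -0.9]

r = pearson(platform_a, platform_b)
print(f"cross-platform correlation r = {r:.2f}")
```

A high r indicates that the two platforms rank the same genes as induced or repressed, even if their absolute intensities differ.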
Gene expression data generated by the TRC
will support the field of toxicogenomics as a
whole as well as assist the NIEHS NCT in
developing the Chemical Effects in
Biological Systems (CEBS) knowledge base
described below.
Path Forward
The many informatics challenges encoun-
tered during the course of toxicogenomics
projects (as described above) are all
surmountable but are made far more
tractable if the data required for and gener-
ated by the project flow seamlessly in and
out of a well-designed database.

Mattes et al.
VOLUME 112 | NUMBER 4 | March 2004
Environmental Health Perspectives

Thus, array design information such as clone or
sequence ID, if properly stored, can be
verified, updated, and linked with current
annotation automatically. Similarly, such
information is more readily indexed across
platforms if stored in a single database.
Analysis of data from both a single experi-
ment and across experiments is simplified if
the data already have a structure that can
be readily accessed with standard query
tools and statistical routines. Integration
and correlation of microarray data with
biological data are made easier and are
often possible only if both are housed
within the same database. Certainly the
utility of such a database for toxicoge-
nomics is greatly enhanced if data from
microarray and other relevant global tech-
nologies and biological or toxicological
phases of the experiment are captured elec-
tronically and/or automatically loaded into
the database.
Existing and Emerging Efforts
As noted above, several public databases for
microarray data have been established
(Anonymous 2002; Brazma et al. 2003;
Edgar et al. 2002; Thomas et al. 2002). The
extension of efforts like these to incorporate
toxicological and biological end points criti-
cally defines and distinguishes toxicoge-
nomic databases, and several key initiatives
will be discussed here. It should be noted
that creation of an internationally compati-
ble informatics platform for toxicogenomics
data will enhance the impact of the individ-
ual data sets and provide the scientific com-
munity with easy access to integrated data
in a structured standard format that will
facilitate data comparison and data analysis.
Coordination of database structure develop-
ment and acceptance of common guidelines
will result in robust databases valuable to
many scientific communities.
One effort to build a public toxico-
genomics database focuses on structuring the
existing ArrayExpress database to include
toxicogenomics data and is under way at the
EMBL-EBI, in collaboration with the HESI
Technical Committee on the Application of
Genomics to Mechanism-Based Risk
Assessment. A parallel effort is the CEBS
knowledge base (Waters et al. 2003)
under development by the NIEHS NCT. Both
the EBI ArrayExpress and NIEHS NCT
CEBS database models are based on the
international standards developed by the
MGED Society,
including common minimal descriptors,
standard data storage and exchange format,
and harmonized nomenclature. These
similarities position both CEBS and
ArrayExpress as highly collaborative public
repositories for scientists internationally.
Other efforts include the public Comparative Toxicogenomics Database (CTD) at the Mount Desert Island Biological Laboratory (Salsbury Cove, ME), the dbZach System at the Molecular and Genomic Toxicology Laboratory at Michigan State University (East Lansing, MI), and the Toxicoinformatics Integrated System (TIS) at the NCTR.
European Bioinformatics Institute:
ArrayExpress and Tox-MIAMExpress
The ArrayExpress infrastructure for
microarray-based data (Brazma
et al. 2003) has been accepting submissions
since February 2002 and sees a rapidly
growing volume of data deposited. The
goals of the ArrayExpress infrastructure are
to a) provide the community with easy
access to high-quality data in a well-struc-
tured standard format; b) serve as a reposi-
tory for gene expression data and any
biological metadata correlated with the
experiments (e.g., toxicological or pharma-
cological end points) that support publica-
tions; c) allow data mining, data
comparison, and data analysis across differ-
ent technology platforms, associating gene
expression patterns with the biological
metadata; and d) facilitate the sharing and
reuse of array designs and experimental pro-
tocols. The meaningful exchange of infor-
mation is supported by the use of standard
contextual information, MIAME (Brazma
et al. 2001) (MIAME 1.1) and MIAME/Tox, and a common data exchange format, MAGE Markup Language (MAGE-ML) (Spellman et al. 2002), developed by the MGED Society. MAGE-ML is an extensible
markup language (XML)-based data
exchange format adopted as a standard by
the Object Management Group (OMG).
The ability to compare data obtained across
different platforms is facilitated by a set of
procedures for updating the array annota-
tion and formatting the design into a stan-
dard referencing system. A high level of
data annotation is ensured by a team of cura-
tors assisting the data producers in providing
the appropriate information. A MAGE-ML
pipeline for direct data submission has been
established or is under testing and construc-
tion with a number of companies and many
academic and governmental laboratories,
including Affymetrix, Agilent, Sanger
Institute, Stanford University, The Institute
for Genomics Research, NIEHS NCT,
NCTR, and the Natural Environment Research Council, United Kingdom.
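Because MAGE-ML is ordinary XML, such submission pipelines can build on any standard XML toolkit. A minimal sketch of reading a submission document follows; the element and attribute names here are invented for illustration and are not the actual MAGE-ML schema:

```python
# Parsing an XML submission document with the standard library.
# Element/attribute names below are ILLUSTRATIVE, not real MAGE-ML.
import xml.etree.ElementTree as ET

doc = """<Experiment name="HESI-hepatotox-01">
  <Hybridization id="hyb1" array="chipA"/>
  <Hybridization id="hyb2" array="chipA"/>
</Experiment>"""

root = ET.fromstring(doc)
print(root.get("name"))                      # experiment name
for hyb in root.findall("Hybridization"):    # one element per hybridization
    print(hyb.get("id"), hyb.get("array"))
```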
Currently, others (Xybion, Rosetta
Biosoftware, Silicon Genetics, National
Cancer Institute, Lund University) are
adopting the MAGE-ML format, as are recently developed analysis tools such as J-Express, ImaGene, and BioConductor.
New collaborations have been established
to populate ArrayExpress with high-quality
reference data sets, such as human and
mouse expression atlases (e.g., in collabora-
tion with Human Genome Mapping
Project-Medical Research Council, Cam-
bridge, U.K.), gene expression time courses
of basic biological processes in model organ-
isms, and expression profiles of toxic sub-
stances (HESI and NIEHS NCT). In the
long term, ArrayExpress aims to build a gene
expression atlas characterizing gene expres-
sion in different tissue and cell types by sys-
tematic added value annotation from the
team of curators. As of 17 December 2003,
the database contains more than 4,000
hybridizations (excluding the approximately
1,000 hybridizations from the HESI geno-
toxicity, hepatotoxicity, and nephrotoxicity
studies in the curation phase).
The ArrayExpress infrastructure consists
of two data submission routes, a core
repository, an online query interface, a
query-optimized data warehouse (under
development), and an online analysis tool,
Expression Profiler (
uk/EP/). The first data submission route (via
an ftp site) allows batch submission in
MAGE-ML format. The creation of
MIAME-compliant MAGE-ML files is a
demanding but necessary exercise for high-
throughput data transfer. For smaller data
sets, users can take advantage of a simpler
submission process via MIAMExpress.
MIAMExpress (
miamexpress/) is an online annotation and
submission tool presented in the form of a
MIAME-based questionnaire, where MGED
Ontology is used to structure inputs and
provide controlled vocabularies for entry.
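The effect of ontology-backed entry can be sketched as a simple lookup against controlled vocabularies; the field names and terms below are invented for illustration:

```python
# Sketch of how an ontology-backed questionnaire constrains free-text
# entry: each field accepts only terms from a controlled vocabulary.
# Field names and vocabulary contents are invented for illustration.
VOCABULARY = {
    "species": {"Homo sapiens", "Mus musculus", "Rattus norvegicus"},
    "sample_type": {"liver", "kidney", "whole blood"},
}

def validate(field, value):
    """Return True if value is an allowed term for the given field."""
    allowed = VOCABULARY.get(field)
    if allowed is None:
        raise KeyError(f"unknown field: {field}")
    return value in allowed

print(validate("species", "Rattus norvegicus"))  # allowed term
print(validate("sample_type", "plasma"))         # not in the vocabulary
```

Constraining entries this way is what later makes cross-experiment queries and comparisons reliable.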
As part of a collaborative undertaking
with the HESI Committee on Genomics
(, MIAMExpress
has undergone further development. These
modifications allow the incorporation of
standard microarray data in conjunction
with conventional toxicological end points
(e.g., clinical observations, histopathology
evaluation, and clinical pathology)
(Figure 1). The new annotation and submission tool Tox-MIAMExpress is tailored to
accommodate input of the results from the
HESI Committee on Genomics toxico-
genomics experimental program. Tox-
MIAMExpress allows three types of
submission: experiment, protocol, and array
design. The experiment (a set of related
hybridizations) contains the information
related to samples, descriptions, treatments,
and toxicological assessments data. The inte-
gration of toxicology and microarray end
points in this database will serve as a proto-
type for toxicogenomic database design and
execution and will allow for a more power-
ful analysis of the experimental data than is
possible in the absence of such a resource.
To ensure harmonization of the
toxicological end points, allowing successful
data mining, data evaluation, and data
comparison, Tox-MIAMExpress is designed
according to proposed MIAME/Tox stan-
dards. Tox-MIAMExpress uses MGED
Ontology concepts that point to established
controlled vocabularies for toxicological end
points: the International Union of Pure and
Applied Chemistry (
dhtml_home.html) for clinical pathology
and the NTP PCT (http://hazel.niehs.nih.
gov/user_spt/pct_terms.htm) for clinical
observations and pathological and
histopathological evaluations.
Tox-MIAMExpress is an open-source
project, consisting of a Perl CGI interface, a MySQL database, and a MAGE-ML export component implemented using the MAGE Software Toolkit. The system can also be installed locally and used as an electronic
notebook for toxicogenomics experiments,
potentially allowing one-click submissions
to ArrayExpress or to any other toxicoge-
nomics database or tool that accepts
MAGE-ML–formatted data. The first beta version of Tox-MIAMExpress
was launched in January 2003, and the
public online version has been accepting
submissions since September 2003.
The ArrayExpress core repository itself
accepts experiment, protocol, and array design submissions. An accession number
is assigned to each completed and curated
submission. Upon submission of array
designs, a set of procedures is provided to
the users to format the array into a stan-
dard referencing system. This format will
unambiguously locate each element on the
array and provide a consistent biological
annotation for data mining, data evalua-
tion, and data comparison across different
arrays and technology platforms. Another
set of tools allows the user to access the latest gene annotation and to reannotate or update their array via a link to another EBI database, EnsMart. Although
the array annotation is created on the basis
of the sequence information available at
the time of its release, drafts of the
genomes are continuously updated and
subsequent array annotation can always be improved and harmonized. EnsMart is
built on the data in EnsEMBL, the
genome database at EBI containing consis-
tent species-specific and interspecies anno-
tation (including Homo sapiens, Mus
musculus, and Rattus norvegicus). EnsMart
is the recipient of the latest updated drafts
of the genomes, with cross-references
between identifiers from a wide variety of
the public sequence repositories and inter-
nal EnsEMBL identifiers (LocusLink,
RefSeq, Swiss-Prot, Interpro, GO,
VEGA). The output of the query to the
EnsMart system can be downloaded in
various formats also for direct submission
to Tox-MIAMExpress.
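The reannotation step can be pictured as refreshing a probe-to-gene mapping from a newly exported cross-reference table; a minimal sketch with invented identifiers:

```python
# Re-annotation sketch: probes are keyed by a stable probe/clone ID, and
# gene symbols are refreshed from the latest cross-reference table (here
# a hard-coded stand-in for an EnsMart-style export; all IDs invented).
old_annotation = {"probe_001": "Cyp1a1", "probe_002": "unknown"}

# freshly exported ID -> gene-symbol cross-references
latest_xref = {"probe_001": "Cyp1a1", "probe_002": "Gstp1"}

updated = {
    probe: latest_xref.get(probe, symbol)   # keep old symbol if no update
    for probe, symbol in old_annotation.items()
}
print(updated)
```

Keying on the stable probe identifier rather than the gene symbol is what allows annotation to improve as genome drafts are revised.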
The ArrayExpress online query interface
allows simple queries and data retrieval.
The query parameters can be either general
experiment properties (e.g., accession num-
ber, author name, condition tested) or sam-
ple properties (e.g., species used). The query
results are provided as lists of experiments,
array designs, and protocols with associated
gene expression data and toxicological end
points. Users can also receive the data as a
MAGE-ML download for easy export into
any tool that supports the MAGE standard.
The data warehouse, under development, will allow gene-centric and data-centric queries where
queries can be expressed in terms of gene
properties (e.g., GO categories, accession
numbers) and expression value restrictions.
The query results can span multiple experi-
ments, combining MAGE patterns with toxi-
cological end points, and provide cross-array
platform analysis facilities. To facilitate gene-
based queries, a gene index will be estab-
lished. This index will link standard gene
identifiers, where they exist, to the elements
on the array. The gene index will be devel-
oped in collaboration with other groups at
the EBI and the established model organism databases, forming the basis of database interoperability.

Figure 1. EBI ArrayExpress toxicogenomics infrastructure.
The gene expression data from
ArrayExpress can also be exported into
Expression Profiler (
uk/EP/), a set of online tools for the analysis
and interpretation of gene expression and
other functional genomics data. It incorpo-
rates data subselection and transformation
components as well as hierarchical and
K-means clustering, principal component
analysis (PCA), between group analysis
(BGA), and several others, together with
facilities for the visualization of the data and
the analysis results. The data cross-linking
module augments the analysis by linking to
other tools and databases, for example,
metabolic pathway databases. Expression
Profiler allows the user to benefit from the
latest annotations of the genomes via
EnsMart. Third-party components and
algorithms, both those installed locally and
remote web services, can also be integrated
into the workflow mechanism, enabling the
platform to expand and develop with the
needs of the microarray data analysis community.
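As a toy illustration of one analysis Expression Profiler offers, a k-means clustering of genes by their expression profiles (pure Python, synthetic two-condition log ratios, k = 2):

```python
# Toy k-means clustering of genes by expression profile. Each point is a
# gene's (condition 1, condition 2) log2 ratio; all values are synthetic.
def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(group):
    n = len(group)
    return tuple(sum(c) / n for c in zip(*group))

def kmeans(points, k, iters=20):
    centroids = list(points[:k])            # simple initialization
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:                    # assign to nearest centroid
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            groups[i].append(p)
        centroids = [mean(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return groups

profiles = [(2.1, 1.9), (1.8, 2.2), (-1.5, -1.7), (-2.0, -1.4)]
up, down = kmeans(profiles, 2)
print("cluster 1:", up)    # consistently induced genes
print("cluster 2:", down)  # consistently repressed genes
```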
The ArrayExpress database will also be fully integrated with other relevant databases at the EBI as part of Integr8, a new data
integration project coordinated by the EBI
and funded by the European Union as part of
the TEMBLOR (The European Molecular
Biology Linked Original Resources) project. It aims to provide a new
integrated layer for the exploitation of
genomic and proteomic data by drawing on
databases maintained at major bioinformatics
centers throughout Europe.
The ArrayExpress infrastructure for
toxicogenomics provides the community
with easy access to highly curated, quality
integrated data in a structured standard
format, supporting publications, guiding
the harmonization process, and facilitating
data comparison and analysis.
The Chemical Effects in Biological
Systems Knowledge Base
The CEBS knowledge base is under development (Waters et al. 2003) by the NIEHS
NCT as a public toxicogenomics informa-
tion resource combining data sets from
transcriptomics, proteomics, metabonom-
ics, and conventional toxicology with path-
way and network information relevant to
environmental toxicology and human dis-
ease. The overall goal of CEBS is to
support hypothesis-driven and discovery
research in environmental toxicology and
the research needs of risk assessment.
Specific objectives are a) to compare
toxicogenomic effects of chemicals/stressors
across species—yielding signatures of
altered molecular expression; b) to pheno-
typically anchor these changes with con-
ventional toxicology data—classifying
biological effects as well as disease pheno-
types; and c) to delineate global changes as
adaptive, pharmacologic, or toxic out-
comes—defining early biomarkers, the
sequence of key events, and mechanisms of
toxicant action. CEBS is designed to meet
the information needs of systems toxicol-
ogy (Waters et al. 2003), and involves
study of chemical or stressor perturbations,
monitoring changes in molecular expres-
sion, and iteratively integrating biological
response data to describe the functioning
organism (Ideker et al. 2001). CEBS is a
dynamic concept for integrating large vol-
umes of transcriptomic, proteomic,
metabonomic, and toxicological knowledge
in a framework that serves as a continually
changing heuristic engine. The international data-capture guideline MIAME, the draft MIAME/Tox for toxicogenomics experiments, and the MAGE-ML data exchange format are used to assemble and exchange
high-quality data sets with the goal of cre-
ating a system of predictive toxicology.
Toxicogenomics experiments performed
using validated NIEHS NCT and NTP
methodologies will be captured in their
entirety via a unique microarray, proteomics, and metabonomics object model that has been extended from the OMG MAGE Object Model (MAGE-OM).
To phase the development of CEBS as
well as test the design and implementation of
the knowledge base components and system
information technology architecture, a proto-
typic database system has been constructed at
the NIEHS to explore the management, inte-
gration, mining, and analysis of microarray,
histopathology, and clinical chemistry data
(Figure 2). Microarray data assessed for dis-
tinct gene expression signatures (Bushel et al.
2002) are formatted in an abridged version of
MAGE-ML and loaded into a custom-designed Oracle database alongside a locally installed version of the EBI ArrayExpress database.
Clinical chemistry and histopathology data
on chemically exposed biological samples are
obtained from the NTP Clinical Pathology
and Toxicology Database Management
System (TDMS). Data from several domains
can be extracted by means of structured
queries and formatted for SAS MicroArray
Solution software (SAS Institute Inc., Cary,
NC), which can be used to conduct several
analyses, including a mixed linear model gene
selection method (Wolfinger et al. 2001) to
identify genes that are significantly differen-
tially expressed after chemical exposure. Data
can also be explored using SAS JMP client
software (SAS Institute Inc.) to assess the
quality of the MAGE data as well as visualize
patterns of gene expression associated with
toxicity phenotypes. CEBS itself will provide
“scripted analysis tool workflows” whereby a
series of statistical approaches will be linked
to a detailed description explaining in clear
terms how each approach works and in what
situation(s) each should be applied. The
CEBS online analysis tool suite will enable
researchers to save an analysis workflow and
the corresponding parameters to CEBS.
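A saved workflow of this kind can be pictured as a named registry of analysis routines and their parameters; a minimal sketch in which the saved routine is a simple fold-change filter (all names and values hypothetical):

```python
# Sketch of "scripted analysis tool workflows": routines are registered
# under a name together with saved parameters, so the same analysis can
# later be re-run by name. All names and thresholds are hypothetical.
WORKFLOWS = {}

def register(name, func, **params):
    """Save an analysis routine and its parameters under a name."""
    WORKFLOWS[name] = (func, params)

def run(name, data):
    """Re-run a saved workflow by name on a new data set."""
    func, params = WORKFLOWS[name]
    return func(data, **params)

def fold_change_filter(ratios, threshold):
    """Keep genes whose absolute log2 ratio meets the threshold."""
    return {g: r for g, r in ratios.items() if abs(r) >= threshold}

register("twofold-screen", fold_change_filter, threshold=1.0)

data = {"Cyp1a1": 2.4, "Actb": 0.1, "Gstp1": -1.6}
print(run("twofold-screen", data))
```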
Colleagues will then be able to apply the
same statistical analysis routine by searching
for the workflow by name, specifying the data
set, and submitting the request. CEBS has
leveraged the bioinformatics infrastructure
and annotation engine cancer Bioinformatics
Infrastructure Objects (caBIO), developed by
the National Cancer Institute Center for
Bioinformatics (NCICB, Bethesda, MD),
to provide automated full-array annotation
(Dowell et al. 2001) for CEBS. caBIO objects are a standards-based set of components that model genomic entities such as genes, chromosomes, sequences, libraries, clones, and ontologies. caBIO provides access
to a variety of genomic data sources includ-
ing GenBank, UniGene, LocusLink, HomoloGene, Ensembl, GoldenPath, BioCarta, dbSNP, and
the NCICB Cancer Genome Anatomy
Project data repositories. caBIO is open
source and provides access to genomic infor-
mation using a standardized tool set. CEBS
will feature automated pathway projection
of expressed genes onto BioCarta (Figure 3)
and Kyoto Encyclopedia of Genes and
Genomes (KEGG; pathways. Pathway visualization is
linked to gene annotation, enabling point-
and-click annotation of genes on these path-
ways. It will be possible to navigate the
dose–time relationships within a toxicoge-
nomics experiment and to retrieve clinical
chemistry profiles and histopathology
images at will to phenotypically anchor mol-
ecular expression profiles. Links to the
supporting literature also will be provided.
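The pathway-projection step can be sketched as intersecting a list of differentially expressed genes with pathway gene sets; the pathway contents below are invented for illustration, not actual BioCarta or KEGG data:

```python
# Sketch of pathway projection: differentially expressed genes are
# intersected with pathway gene sets to see which pathways are affected.
# Pathway names and memberships are invented, not real BioCarta/KEGG data.
PATHWAYS = {
    "xenobiotic metabolism": {"Cyp1a1", "Gstp1", "Nqo1"},
    "apoptosis": {"Casp3", "Bax", "Bcl2"},
}

def project(expressed_genes):
    """Return, for each pathway hit, the expressed genes it contains."""
    return {
        name: sorted(genes & expressed_genes)
        for name, genes in PATHWAYS.items()
        if genes & expressed_genes
    }

hits = project({"Cyp1a1", "Nqo1", "Actb"})
print(hits)
```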
The NIEHS NCT has released the
CEBS Systems Biology object model (CEBS
SysBio-OM) to capture MAGE, pro-
teomics, and metabolomics domain information.
The model is comprehensive and leverages
other open-source efforts, namely, the
MAGE-OM and the PEDRo (Proteomics
Experiment Data Repository) object model.
The NIEHS NCT contractor, Science
Applications International Corporation
(SAIC; San Diego, CA), in consultation
with the NIEHS NCT and The Scripps
Research Institute (La Jolla, CA), has
designed the CEBS SysBio-OM by
extending MAGE-OM to represent protein
expression data elements (including those
from PEDRo), protein–protein interaction
data, and metabonomic data. CEBS SysBio-
OM promotes the standardization of data
representation as well as the standardization
of the data quality by facilitating the capture
of the minimum annotation required for an
experiment so that the resulting data can be
interpreted accurately. The CEBS SysBio-
OM is open source and can be implemented
on varied computing platforms.
As a multigenome knowledge base,
CEBS allows characterization of the effects
of chemicals or stressors across species as a
function of dose, time, and phenotype sever-
ity. This permits classifying toxicological
effects and disease phenotypes, as well as
ultimately delineating biomarkers, sequences
of key molecular events responsible for bio-
logical response, and mechanisms of action
of a chemical or stressor on a biological
system. By analogy to GenBank, CEBS will
support global sequence-based query using,
for example, probe sequence of differentially
expressed genes or analytically determined
proteins. This will be possible because all
probe sets and analytically determined pro-
teins represented in the knowledge base will
be sequence-aligned to gene models for all
genes known to the knowledge base. As a
consequence of this design, reverse query of
phenotypic severity attributes (e.g., specific
histopathology) can provide entry into mol-
ecular expression profiles and associated
sequelae. Molecular expression profiles that
match a query data set of nucleic acid or
amino acid sequence for an experimentally
determined gene or protein expression pro-
file can be presented in rank order by quality
of match for all significant matches, together
with all contextually associated (e.g., dose,
time, phenotypic severity) experimental
data. In situations involving proprietary
chemicals or drugs, a sequence-based (DNA,
RNA, or amino acid sequence) query can be
performed without divulging the name or
chemical structure. CEBS also will support
conventional simple and complex query for
compound/structure/class, toxic/pathologic
effects, gene annotation, gene groups, path-
ways and phenotypes, etc. Because CEBS
will contain data from multiple species of
organisms, as the understanding of genetic
and biochemical pathways builds toward
congruence over time, the sequence-based
system facilitates more precise definition of
biological pathways and networks as well as
genetic variability and susceptibility to, for
example, environmental, chemical, or
biological insult among species.
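A sequence-based query of this kind can be sketched as ranking stored profiles by a naive shared-k-mer score between the query sequence and each profile's probe sequence; real systems would use proper alignment, and all sequences below are invented:

```python
# Naive sequence-based query sketch: stored expression profiles are keyed
# by probe sequence, and a query sequence is ranked against them by the
# fraction of shared k-mers. (A real system would use true alignment,
# e.g. BLAST; sequences and profile names here are invented.)
def kmers(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def score(query, target, k=4):
    q, t = kmers(query, k), kmers(target, k)
    return len(q & t) / max(len(q), 1)

# hypothetical probe sequences attached to stored expression profiles
profiles = {
    "profile_A": "ATGGCGTACGTTAGC",
    "profile_B": "TTTTCCCCGGGGAAA",
}
query = "ATGGCGTACGTTAGA"
ranked = sorted(profiles, key=lambda p: score(query, profiles[p]),
                reverse=True)
print(ranked)  # best match first
```

Note that only the sequence itself is needed for the query, which is why a proprietary compound's profile can be matched without revealing its identity.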
CEBS will leverage the NTP Clinical
Pathology and TDMS Oracle databases
in experimental design and interpretation
of phenotypes (Figure 3). In addition, the
NIEHS NCT CEBS is taking steps to
extend MAGE-ML to load toxicology
and pathology data sets compatible with
NTP and Xybion toxicology databases
(Cedar Knolls, NJ; This will facilitate pipelining of
toxicogenomics data sets from several
NTP contract laboratories that use NTP databases and from pharmaceutical laboratories that use the Xybion toxicology database.

Figure 2. CEBS knowledge base infrastructure as of 18 December 2003. CGAP, Cancer Genome Anatomy Project; LIMS, Laboratory Information Management System; NCBI, National Center for Biotechnology Information; UCSC, UCSC Genome Bioinformatics.

Figure 3. CEBS display of differentially expressed genes on a BioCarta pathway.

The NIEHS NCT CEBS is currently in the process of mapping the NTP
and Xybion toxicology databases to the
MIAME/Tox minimal toxicogenomics
data guidelines. The availability of NTP
toxicogenomics data sets in CEBS will be
announced in Environmental Health Perspectives.
Because CEBS will contain data on
global gene expression, protein expression,
metabolite profiles, and associated chemi-
cal/stressor-induced effects in multiple
species (e.g., from yeast to humans), it will
be possible to derive functional pathway and
network information based on cross-species
homology and pathway conservation. CEBS
will ultimately become a knowledge base for
both discovery and hypothesis-driven
research. CEBS version 1.0 (microarray) was
available for internal evaluation in August
2003 and will be available for public use by
the end of 2004. Completion of the knowl-
edge base is scheduled for 2012.
Comparative Toxicogenomics Database
The NIEHS DERT supports an inter-
national public database devoted primarily
to comparative toxicogenomics in aquatic
and mammalian species, the CTD. The Mount Desert
Island Biological Laboratory is developing
CTD as a community-supported genomic
resource devoted to genes of human toxi-
cological significance. CTD will be the
first publicly available database to a) pro-
vide annotated associations between genes,
references, and toxic agents; b) include
nucleotide and protein sequences from
diverse species with a focus on aquatic and
mammalian organisms; c) offer a range of
analytical tools for customized compara-
tive studies; and d) provide information to
investigators on available molecular
reagents. This combination of features will
facilitate cross-species comparisons of toxi-
cologically significant genes and proteins,
providing unique insights into the signifi-
cance of conserved sequences and poly-
morphisms, the genetic basis of variable
sensitivity, molecular evolution, and adap-
tation. The CTD was developed through a
collaboration of five NIEHS-funded
marine and freshwater biomedical sciences
centers. These centers include Mount
Desert Island Biological Laboratory,
Oregon State University (Corvallis, OR),
University of Wisconsin–Milwaukee
(Milwaukee, WI), University of Miami
(Miami, FL), and The Jackson Laboratory
(Bar Harbor, ME). The goal of the CTD
is to develop a comparative database that
links sequence information for genes rele-
vant to toxicology to information about
gene expression, toxicology, and biological
processes. The primary focus of the CTD is
on marine and aquatic organisms as model
systems for human diseases. The initial
focus is also on genes that have been identi-
fied through the NIEHS Environmental
Genome Project as important for toxicology
in model systems. However, the database
will eventually merge all gene sequence
information generated on all vertebrates
and invertebrates, including aquatic organ-
isms, worms, flies, rodents, and people. The
CTD provides information about genes
and annotation (gene synonyms, sets, and
functions) and links between gene
sequence and toxicity data published in
the scientific literature. These aspects of
the database represent an important
advancement for comparative toxico-
genomics. Such information will include
all the synonyms by which a gene is
known in different organisms, the toxicant
responses identified for specific genes in
different species, and a platform that pro-
motes comparisons of gene sequences and
toxicant activity among diverse organisms.
This data structure will provide compre-
hensive information about the mechanism
of action of toxicants. Understanding these
mechanisms will allow more informed
assessment of human risk by extrapolating
toxicity data from animal models to people
and will provide a mechanism by which
members of the research community can
share their data and promote fruitful
avenues for future toxicological research.
The Molecular and Genomic Toxicology
Laboratory, Michigan State University (East
Lansing, MI) has developed the dbZach
System (, a multi-
faceted toxicogenomics bioinformatics infra-
structure. The goal of the dbZach System is
to provide a) facilities for the modeling of
toxicogenomics data; b) a centralized source
of biological knowledge to facilitate data
mining and allow full knowledge-based
understanding of the toxicological mecha-
nisms; and c) an environment for developing bioinformatic algorithms and analysis tools. dbZach, designed in a modular structure to handle multispecies array-
based toxicogenomics information, is the
core database implemented in Oracle. This
includes several subsystems:
Clone Subsystem, containing information
concerning the cDNA/EST clones
Microarray Subsystem, containing
information concerning custom array
designs and the microarray data files
Gene Function Subsystem, cataloging all
the genes represented on the arrays and
their annotations
Sample Annotation Subsystem, collecting
MIAME-compliant information about
the samples and their treatments
Toxicology Subsystem, indexing end-point
toxicity measures to facilitate correlation
with gene expression analyses. The
Toxicology Subsystem stores clinical chem-
istry parameters and manages histopatho-
logical data, allowing the pathologist to
annotate the sample using the controlled
vocabulary from the Pathology Ontology
developed by the Pathbase database.
dbZach also stores the actual histopatho-
logical images in the database, allowing users
to mine large numbers of images, such as a
tissue microarray, with ease. By storing the
results of several toxicology assessments in
the database, dbZach facilitates comparisons
of chemical mechanisms of action and sup-
ports functional toxicogenomic and
chemoinformatic investigations of struc-
ture–activity relationships. In contrast to
ArrayExpress, CEBS, and CTD, dbZach is
not designed to be a public repository for
data sets contributed from diverse groups
but rather to serve as a standalone database
for a laboratory or institution.
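The modular layout described above can be sketched as a handful of relational tables joined through the sample; the schema below is entirely hypothetical and is not dbZach's actual design:

```python
# Illustrative sketch of a modular toxicogenomics schema in the spirit of
# dbZach's subsystems: expression data in one module, toxicology end
# points in another, joined through the sample. Table and column names
# are invented, not dbZach's actual design.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE sample       (id INTEGER PRIMARY KEY, treatment TEXT);
CREATE TABLE expression   (sample_id INTEGER, gene TEXT, log2_ratio REAL);
CREATE TABLE tox_endpoint (sample_id INTEGER, assay TEXT, value REAL);
""")
db.execute("INSERT INTO sample VALUES (1, 'compound X, high dose')")
db.execute("INSERT INTO expression VALUES (1, 'Cyp1a1', 2.4)")
db.execute("INSERT INTO tox_endpoint VALUES (1, 'ALT (U/L)', 310.0)")

# correlate an expression change with a clinical chemistry end point
row = db.execute("""
    SELECT e.gene, e.log2_ratio, t.assay, t.value
    FROM expression e JOIN tox_endpoint t ON e.sample_id = t.sample_id
""").fetchone()
print(row)
```

Housing both data types in one schema is what makes the gene-expression-to-end-point join a single query rather than a manual cross-referencing exercise.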
The NCTR (
is developing TIS to integrate genomics,
proteomics, and metabonomics data with
conventional in vivo and in vitro toxicology
data (Tong et al. 2003). TIS is designed to
meet the challenge of data management,
analysis, and interpretation through the
integration of toxicogenomics data, gene
function, and pathways to enable hypothesis
generation. To achieve this, TIS will provide
or interface with a large collection of tools
for data analysis and knowledge mining.
A first prototype of TIS has been
developed. ArrayTrack is a microarray data
management and analysis system compris-
ing three integrated components: a data-
base (Microarray DB) storing microarray
experiment information; a library (LIB)
mirroring critical data in public databases;
and a tool module (TOOL) providing
analysis capability on experimental and
public data for knowledge discovery.
ArrayTrack allows users to select an analy-
sis method from the TOOL, apply it to
the data stored in Microarray DB, and link
the result to gene information stored in
the LIB module. The Microarray DB com-
ponent is designed to be a rich resource for
cross-experiment and platform compari-
son, storing information according to the
MIAME requirements and deriving
toxicity-specific signatures from data
analysis. The LIB component contains
information on gene annotation, protein
function, and pathways from a diverse col-
lection of public biological databases
(GenBank, Swiss-Prot, LocusLink, KEGG,
GO), facilitating the annotation and the
interpretation of MAGE data. The TOOL
component provides a spectrum of algo-
rithmic tools for microarray data visualiza-
tion, quality control, normalization,
pattern discovery, and class prediction.
TOOL is also designed to provide
interoperability between the ArrayTrack system and
other analysis software.
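The three-part division of labor described for ArrayTrack can be illustrated with a short sketch. All class, method, and field names below are hypothetical stand-ins chosen for illustration, not ArrayTrack's actual schema or API:

```python
# Illustrative three-component layout in the spirit of the ArrayTrack
# description: a data store (Microarray DB), a mirrored annotation
# library (LIB), and an analysis module (TOOL). All names here are
# hypothetical, not ArrayTrack's actual schema or API.

class MicroarrayDB:
    """Holds experiment records: MIAME-style metadata plus expression values."""
    def __init__(self):
        self._experiments = {}

    def add(self, exp_id, metadata, expression):
        self._experiments[exp_id] = {"metadata": metadata, "expression": expression}

    def expression(self, exp_id):
        return self._experiments[exp_id]["expression"]


class Library:
    """Mirrors gene annotation drawn from public resources (e.g., GenBank, GO)."""
    def __init__(self, annotations):
        # probe_id -> {"symbol": ..., "pathway": ...}
        self._annotations = annotations

    def annotate(self, probe_id):
        return self._annotations.get(probe_id, {"symbol": probe_id, "pathway": None})


class ToolModule:
    """Analysis methods applied to DB data; results link back to the LIB."""
    @staticmethod
    def fold_change(treated, control):
        return {p: treated[p] / control[p] for p in treated}


def top_regulated(db, lib, treated_id, control_id, threshold=2.0):
    """Apply a TOOL method to DB data and annotate the hits via the LIB."""
    fc = ToolModule.fold_change(db.expression(treated_id), db.expression(control_id))
    return {lib.annotate(p)["symbol"]: round(v, 2)
            for p, v in fc.items() if v >= threshold}
```

The point of the sketch is the linkage pattern: an analysis selected from the tool module runs against stored experiment data, and its output is resolved against mirrored public annotation before it reaches the user.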
TIS will serve as a repository for
genomics, proteomics, metabonomics, and
conventional toxicology data management,
supporting data mining and analysis activi-
ties. Through cross-linking with a diverse
collection of public biological data, TIS
will serve as a robust system for exploring
toxicological mechanisms.
The EDGE database was
developed at the McArdle Laboratory for
Cancer Research, University of Wisconsin
(Madison, WI) as a resource for toxicology-
related gene expression information
(Thomas et al. 2002). It is based on experi-
ments conducted using custom cDNA
microarrays that include unique ESTs
identified as regulated under conditions of
toxicity. To a large extent the experiments
were conducted in mice using a variety of
agents known to produce toxicity. The
database is not designed as a public reposi-
tory of submitted data; rather, it serves as a
query reference and the basis for an algo-
rithm predictive of toxicological potential
for a chemical treatment. The ultimate goal
of the EDGE is to map transcriptional
changes from chemical exposure to predict
toxicity and provide valuable insights into
the basic molecular changes responsible.
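As a purely illustrative sketch of how a query reference might score a treatment's expression profile against stored signatures (an assumption for illustration, not the actual EDGE algorithm), consider a simple nearest-signature match by Pearson correlation:

```python
# Hypothetical sketch of signature matching such as a toxicogenomic query
# reference could support: score a query expression profile against stored
# class signatures by Pearson correlation. This illustrates the general
# idea only; it is not the actual EDGE algorithm.
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def predict_class(query, signatures):
    """Return (best_class, all_scores) for a query profile.

    query: {gene: log-ratio}; signatures: {class_name: {gene: log-ratio}}.
    Only genes shared between the query and a signature contribute to its score.
    """
    scores = {}
    for name, sig in signatures.items():
        shared = sorted(set(query) & set(sig))
        scores[name] = pearson([query[g] for g in shared],
                               [sig[g] for g in shared])
    return max(scores, key=scores.get), scores
```

Restricting each comparison to shared genes is what lets such a reference accept queries from arrays whose probe sets only partially overlap the stored signatures.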
“Ocean in view! O! the joy.” So wrote
William Clark in November 1805 on
sighting the Pacific Ocean after traveling
4,142 miles in his expedition across the
western North American continent
(Ambrose 1996). Yet it would be several
days before the expedition actually reached
the ocean, and a year filled with many
hardships before it returned home. So it is
with toxicogenomics—we can see the goal
more clearly than before, but a thorough
understanding of this landscape is yet to
come. Not all the sources of experimental and
analytical variability are yet known.
The nature of the data demands machine
handling and analysis. The path forward
must involve a rigorous meshing of biology
and toxicology with the computer science
of bioinformatics and statistics. To allow
the field to develop and to allow scientists
to view, assess, and mine each other’s data
demands the development of public databases
and standards for data storage and exchange.
The challenges in the path forward may
be seen as falling into three main cate-
gories: information quality, information
standardization, and analytical conventionalization.
Information quality challenges
include not only the need for
codified, standard metrics for assessing
data quality but also the accuracy,
specificity, and annotation of
microarray probe elements. Ideally, each
element would have associated with it a
wealth of information regarding exactly
what transcript it detects and how that
transcript fits into the overall program of a
particular cell type. Furthermore, not only
must this information be easily accessible
to the casual user (e.g., through a browser
interface) but it must also be in a machine-
readable format suitable for data mining
and statistical analysis. Approaches to
either automated or controlled data upload
can also be considered issues of informa-
tion quality. Information standardization,
on the other hand, focuses on issues cov-
ered in the MIAME and MIAME/Tox
efforts. Standardization of pathology terms,
which is an ambitious effort given the spec-
trum of conceivable gross and microscopic
observations and the diversity of opinions
as to how these observations may best be
described, is critical for toxicology.
Although a few independent groups have
attempted to solve this issue, only true
international harmonization of terms and
ontologies, resulting in uniform, machine-
readable data, will allow the field of toxi-
cogenomics to realize its potential.
Information standardization also encom-
passes the creation of standard data sets
that can be used as reference points; the
studies of the HESI Committee on the
Application of Genomics to Mechanism-
Based Risk Assessment can be cited as
efforts along these lines. Finally, the chal-
lenge of defining best practice analytical
approaches to toxicogenomic data may be
referred to as analytical conventionaliza-
tion. Simply put, there are not enough
public examples where the same data set
was analyzed by multiple approaches and
the results compared. Such comparisons
would allow the field of toxicogenomics to
move forward with confidence in specific
approaches. Together, the outstanding
challenges in the field of toxicogenomics
provide evidence that despite its rapid
growth, toxicogenomics remains a nascent field.
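One way to meet the dual requirement discussed above (annotation that is human-browsable yet machine-readable) is to serialize structured annotation records, for example as JSON. The field names and accession below are assumptions for illustration, not a published annotation standard:

```python
# Illustrative machine-readable annotation record for one array element,
# serialized to JSON so it can be rendered in a browser interface and
# also consumed by data-mining code. The field names and accession shown
# are assumptions for illustration, not a published annotation standard.
import json

def probe_record(probe_id, symbol, transcript_acc, go_terms, pathway):
    """Bundle what a probe detects and how that transcript fits a cell's program."""
    return {
        "probe_id": probe_id,
        "gene_symbol": symbol,
        "transcript": transcript_acc,  # e.g., a RefSeq-style accession
        "go_terms": sorted(go_terms),  # controlled-vocabulary terms
        "pathway": pathway,
    }

record = probe_record("AF12345_at", "Cyp1a1", "NM_012540",
                      {"GO:0005506", "GO:0016705"}, "xenobiotic metabolism")
serialized = json.dumps(record, indent=2)  # human-readable and parseable
```

Because the record uses controlled-vocabulary identifiers (such as GO terms) rather than free text, the same file can drive both a browser display for the casual user and downstream statistical mining.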
Clearly the key to many of these
challenges lies in creating well-structured
databases to house the results of toxico-
genomic experiments. The contents
of these databases allow data quality
parameters to be explored and analytical
approaches compared; in addition, the very
effort of creating these databases forces a
focus in the community on information
standardization. Only in such a database
can microarray annotation be kept current
and at a high level of sophistication. The
collaborative efforts described in this
review represent major steps along this
path and will yield additional benefits in
information standardization and knowledge
sharing. Success in these efforts will be
determined by the acceptance of and sup-
port for these databases in the toxico-
genomics community. Such acceptance will
also spur input from the broad scientific
community and lead to increasingly diverse
and extensive inputs to and use of these
public resources.
Toxicogenomics is an information- and
informatics-intensive field. To share experi-
mental results successfully between labs and
to address larger issues of data handling and
analysis, public databases designed to ware-
house toxicogenomic data must be devel-
oped and populated. The databases
mentioned in this review represent the
promise and future of toxicogenomics.
Ambrose SE 1996. Undaunted Courage: Meriwether
Lewis, Thomas Jefferson, and the Opening of the
American West. New York: Simon and Schuster.
Anonymous. 2002. Microarray standards at last
[Editorial]. Nature 419:323.
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H,
Cherry JM, et al. 2000. Gene ontology: tool for the
unification of biology. The Gene Ontology
Consortium. Nat Genet 25:25–29.
Ball CA, Sherlock G, Parkinson H, Rocca-Sera P,
Brooksbank C, Causton HC, et al. 2002. Standards
for microarray data. Science 298:539.
Bassett DE, Jr, Eisen MB, Boguski MS. 1999. Gene
expression informatics—it’s all in your mine. Nat
Genet 21:51–55.
Brazma A, Parkinson H, Sarkans U, Shojatalab M,
Vilo J, Abeygunawardena N, et al. 2003.
ArrayExpress—a public repository for microar-
ray gene expression data at the EBI. Nucleic
Acids Res 31:68–71.
Brazma A, Hingamp P, Quackenbush J, Sherlock G,
Spellman P, Stoeckert C, et al. 2001. Minimum
information about a microarray experiment
(MIAME)—toward standards for microarray
data. Nat Genet 29:365–371.
Bumm K, Zheng M, Bailey C, Zhan F, Chiriva-Internati
M, Eddlemon P, et al. 2002. CGO: utilizing and
integrating gene expression microarray data in
clinical research and data management.
Bioinformatics 18:327–328.
Burchell B, Nebert DW, Nelson DR, Bock KW,
Iyanagi T, Jansen PL, et al. 1991. The UDP glu-
curonosyltransferase gene superfamily: sug-
gested nomenclature based on evolutionary
divergence. DNA Cell Biol 10:487–494.
Bushel PR, Hamadeh HK, Bennett L, Green J,
Ableson A, Misener S, et al. 2002. Computational
selection of distinct class- and subclass-specific
gene expression signatures. J Biomed Inform
Bushel PR, Hamadeh H, Bennett L, Sieber S, Martin
K, Nuwaysir EF, et al. 2001. MAPS: a microarray
project system for gene expression experiment
information and data validation. Bioinformatics
Castle AL, Carver MP, Mendrick DL. 2002.
Toxicogenomics: a new revolution in drug safety.
Drug Discov Today 7:728–736.
Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L.
2001. The distributed annotation system. BMC
Bioinformatics 2:7.
Edgar R, Domrachev M, Lash AE. 2002. Gene expres-
sion omnibus: NCBI gene expression and
hybridization array data repository. Nucleic
Acids Res 30:207–210.
Eisen M, Spellman P, Brown P, Botstein D. 1998.
Cluster analysis and display of genome-wide
expression patterns. Proc Natl Acad Sci USA
Ermolaeva O, Rastogi M, Pruitt KD, Schuler GD,
Bittner ML, Chen Y, et al. 1998. Data management
and analysis for gene expression arrays. Nat
Genet 20:19–23.
Finkelstein D, Ewing R, Gollub J, Sterky F, Cherry JM,
Somerville S. 2002. Microarray data quality
analysis: lessons from the AFGC project.
Arabidopsis Functional Genomics Consortium.
Plant Mol Biol 48:119–131.
Gollub J, Ball CA, Binkley G, Demeter J, Finkelstein
DB, Hebert JM, et al. 2003. The Stanford
Microarray Database: data access and quality
assessment tools. Nucleic Acids Res 31:94–96.
Hessner MJ, Wang X, Hulse K, Meyer L, Wu Y, Nye S,
et al. 2003. Three color cDNA microarrays: quan-
titative assessment through the use of fluores-
cein-labeled probes. Nucleic Acids Res 31:e14.
Ideker T, Galitski T, Hood L. 2001. A new approach to
decoding life: systems biology. Annu Rev
Genomics Hum Genet 2:343–372.
Liao B, Hale W, Epstein CB, Butow RA, Garner HR.
2000. MAD: a suite of tools for microarray data
management and processing. Bioinformatics
Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo
MV, Chee MS, et al. 1996. Expression monitoring
by hybridization to high-density oligonucleotide
arrays. Nat Biotechnol 14:1675–1680.
Mattes WB. 2004. Annotation and cross-indexing of
array elements on multiple platforms. Environ
Health Perspect 112:506–510.
Model F, Konig T, Piepenbrock C, Adorjan P. 2002.
Statistical process control for large scale
microarray experiments. Bioinformatics
18(suppl 1):S155–S163.
Murphy D. 2002. Gene expression studies using
microarrays: principles, problems, and
prospects. Adv Physiol Educ 26:256–270.
Pennie WD, Pettit SD, Lord PG. 2004. Toxicogenomics
in risk assessment: an overview of an HESI col-
laborative research program. Environ Health
Perspect 112:417–419.
Petricoin EF III, Hackett JL, Lesko LJ, Puri RK,
Gutman SI, Chumakov K, et al. 2002. Medical
applications of microarray technologies: a regu-
latory science perspective. Nat Genet
Pruitt KD, Maglott DR. 2001. RefSeq and LocusLink:
NCBI gene-centered resources. Nucleic Acids
Res 29:137–140.
Rininger JA, DiPippo VA, Gould-Rothberg BE. 2000.
Differential gene expression technologies for
identifying surrogate markers of drug efficacy
and toxicity. Drug Discov Today 5:560–568.
Schena M, Shalon D, Davis RW, Brown PO. 1995.
Quantitative monitoring of gene expression pat-
terns with a complementary DNA microarray.
Science 270:467–470.
Spellman PT, Miller M, Stewart J, Troup C, Sarkans
U, Chervitz S, et al. 2002. Design and implementa-
tion of microarray gene expression markup lan-
guage (MAGE-ML). Genome Biol
Stoeckert C, Pizarro A, Manduchi E, Gibson M, Brunk
B, Crabtree J, et al. 2001. A relational schema for
both array-based and SAGE gene expression
experiments. Bioinformatics 17:300–308.
Stoeckert CJ Jr., Causton HC, Ball CA. 2002.
Microarray databases: standards and ontologies.
Nat Genet 32(suppl):469–473.
Thomas RS, Rank DR, Penn SG, Zastrow GM, Hayes
KR, Tianhua H, et al. 2002. Application of
genomics to toxicology research. Environ Health
Perspect 110:919–923.
Tong W, Cao X, Harris S, Sun H, Fang H, et al. 2003.
ArrayTrack—supporting toxicogenomic research
at the U.S. Food and Drug Administration
National Center for Toxicological Research.
Environ Health Perspect 111:1819–1826.
Tseng GC, Oh MK, Rohlin L, Liao JC, Wong WH. 2001.
Issues in cDNA microarray analysis: quality fil-
tering, channel normalization, models of varia-
tions and assessment of gene effects. Nucleic
Acids Res 29:2549–2557.
Waters MD, Boorman G, Bushel P, Cunningham M,
Irwin R, Merrick A, et al. 2003. Systems toxicol-
ogy and the chemical effects in biological sys-
tems knowledge base. Environ Health Perspect
Wolfinger RD, Gibson G, Wolfinger ED, Bennett L,
Hamadeh H, Bushel P, et al. 2001. Assessing gene
significance from cDNA microarray expression
data via mixed models. J Comput Biol 8:625–637.