Vol. 23 no. 6 2007, pages 783–784
BIOINFORMATICS APPLICATIONS NOTE
Databases and ontologies
AphidBase: a database for aphid genomic resources
Jean-Pierre Gauthier1, Fabrice Legeai2, Alain Zasadzinski3, Claude Rispe1 and
1INRA, Agrocampus Rennes, UMR 1099 BiO3P (Biology of Organisms and Populations applied to Plant Protection),
F-35653 LE RHEU, France, 2INRA, URGI - Genoplante Info, Infobiogen, 523 place des Terrasses, F-91000 Evry,
France and 3CNRS INIST, 2, allée du Parc de Brabois 54500 Vandoeuvre-lès-Nancy, France
Received and revised on October 13, 2006; accepted on January 5, 2007
Advance Access publication January 19, 2007
Associate Editor: Alvis Brazma
AphidBase aims to (i) store recently acquired genomic resources on
aphids and (ii) compare them to other insect resources as functional
annotation tools. To this end, the Drosophila melanogaster genome has
been loaded into the database using the GMOD open source software
for comparison with the 17069 pea aphid unique transcripts
(contigs) and the 13639 gene transcripts of Anopheles gambiae.
Links to FlyBase and A.gambiae Entrez databases allow a rapid
characterization of the putative functions of the aphid sequences.
Text mining of the D.melanogaster literature was performed to
construct a network of co-cited gene or protein names, which should
facilitate functional annotation of aphid homolog sequences.
AphidBase represents one of the first genomic databases for a
Since the sequencing of the Drosophila melanogaster genome, several other
insect genomes have been or are currently being sequenced.
These new genomic resources require the development of
databases not only to store and analyse each insect
genome separately, but also to support interspecies comparisons.
Aphids are sap-sucking insects that attack plants by feeding
on the fluids circulating in the phloem. These hemipterans
diverged from other insects 300 million years ago (Grimaldi and
Engels, 2005) and are responsible for considerable damage
worldwide to cultivated and ornamental plants. Genomic tools
have recently been developed for aphids, mainly for the pea
aphid Acyrthosiphon pisum. In the last 3 years, a large collection
of ESTs (more than 60000) has been developed, and the
AphidBase is a project of the International Aphid Genomic
Consortium, intended as a resource for users to analyse, compare,
retrieve and annotate aphid sequences in comparison with
D.melanogaster and Anopheles gambiae.
We used the Generic Model Organism Database (GMOD;
http://www.gmod.org/home) open source project for the
administration of genomic databases. Among GMOD tools, Chado
(database architecture on PostgreSQL; http://www.gmod.org/apollo-chado)
and Gbrowse (a genomic data browser; Stein
et al., 2002) were installed to set up and develop AphidBase.
The complete D.melanogaster chromosome sequences (FlyBase
version r4.2.1; Drysdale et al., 2005), 13639 A.gambiae gene
transcripts (MOZ2a) and 17069 A.pisum putative unique
transcribed sequences (or contigs; see Sabater-Muñoz et al.,
2006) were retrieved. These 17069 A.pisum contigs were
assembled from 53190 ESTs, forming the so-called updated
‘v5’ version of contigs (Sabater-Muñoz et al., 2006 and
unpublished data). In order to compare A.pisum and
A.gambiae to D.melanogaster sequences, tblastx was performed
between the A.pisum contigs or A.gambiae gene transcripts and
the D.melanogaster genomic sequence. A.pisum or A.gambiae
sequences matching fly sequences were displayed on the
D.melanogaster genome. For A.pisum, 12280 (72%) contigs
had no match to D.melanogaster sequences (e value = 10^-6) and
4789 (28%) were homologous to D.melanogaster sequences.
In addition, 5551 (32%) A.pisum contigs were homologous to A.gambiae
sequences, and 2922 (17%) matched both D.melanogaster
and A.gambiae sequences. This low level of homology reflects the
divergence between hemipterans and dipterans, as already
discussed in Sabater-Muñoz et al. (2006).
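As a rough illustration of the comparison described above, the sketch below filters similarity hits at the quoted e value threshold, keeps one hit per contig, and counts the overlap between fly and mosquito matches. The 12-column tab-separated hit layout, the choice of the lowest e value as the retained hit, the function names and the toy identifiers are all assumptions made for illustration; the scripts and output format actually used for AphidBase are not described here.

```python
E_VALUE_CUTOFF = 1e-6  # threshold quoted in the text for the fly comparison


def best_hits(tabular_text, cutoff=E_VALUE_CUTOFF):
    """Return {query_id: (subject_id, evalue)} with the lowest e value per query.

    Assumes 12-column tab-separated hits: query in column 1, subject in
    column 2 and e value in column 11 (a hypothetical input layout).
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        query, subject, evalue = fields[0], fields[1], float(fields[10])
        if evalue > cutoff:
            continue  # hit is weaker than the threshold
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return best


def summarize(all_contigs, fly_hits, mosquito_hits):
    """Count contigs matching the fly, the mosquito, both, either or neither."""
    contigs = set(all_contigs)
    fly = contigs & set(fly_hits)
    mosquito = contigs & set(mosquito_hits)
    return {
        "fly": len(fly),
        "mosquito": len(mosquito),
        "both": len(fly & mosquito),
        "either": len(fly | mosquito),
        "none": len(contigs - fly - mosquito),
    }


def percent(count, total):
    """Percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Toy hits (invented identifiers and values, for illustration only).
SAMPLE_FLY = "\n".join([
    "contig1\t2L\t45.0\t60\t33\t0\t1\t180\t1000\t1179\t1e-20\t95.0",
    "contig1\t3R\t40.0\t50\t30\t0\t1\t150\t500\t649\t1e-08\t60.0",
    "contig2\tX\t30.0\t40\t28\t0\t1\t120\t200\t319\t0.5\t25.0",
])
SAMPLE_MOSQUITO = "\n".join([
    "contig1\tAG0001\t50.0\t60\t30\t0\t1\t180\t1\t180\t1e-10\t70.0",
    "contig3\tAG0002\t48.0\t55\t29\t0\t1\t165\t1\t165\t1e-07\t55.0",
])

if __name__ == "__main__":
    fly = best_hits(SAMPLE_FLY)
    mosquito = best_hits(SAMPLE_MOSQUITO)
    print(summarize(["contig1", "contig2", "contig3", "contig4"], fly, mosquito))
```

Applied to the counts reported above, the same set arithmetic gives 4789 + 5551 - 2922 = 7418 contigs with a match in at least one of the two dipterans.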
3 FUNCTIONAL ANNOTATION TOOLBOX
In order to facilitate functional annotation of the pea aphid
sequences, for which very little data is available on gene
and protein functions, several descriptions are provided for the
D.melanogaster and/or A.gambiae sequences. A direct link to
the FlyBase report for each D.melanogaster gene was activated
in order to take advantage of the full molecular and genetic
description of fly genes homologous to aphid genes. In parallel,
a similar direct link was set up to the A.gambiae
Transcript Report. Finally, a contig report for each of
the A.pisum sequences was created, containing a list of
*To whom correspondence should be addressed.
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: email@example.com
several features. First, the sequence of each contig as well as
EST composition and cDNA libraries of origin were indicated.
Second, the translations in the six reading frames were
displayed, as well as the largest putative ORF identified
by FrameD analysis (Schiex et al., 2003). Third, as
aphid transcripts could have several matches to D.melanogaster
genomic sequences, only the first hit was displayed on Gbrowse
for the sake of clarity; the complete tblastx report was therefore
included in the contig report. A majority (2872) of the
4789 pea aphid contigs homologous to D.melanogaster
sequences had a single match (the one visualized by Gbrowse),
but 1827 contigs still had between 2 and 7 different matches on
the D.melanogaster genome (see “Global Report”). Fourth, the
result of the Uniprot annotation (e value = 10^-5, Sabater-Muñoz
et al., 2006) was displayed. Finally, a text mining analysis was
set up, based on the large D.melanogaster literature. For this,
the thesaurus of the D.melanogaster bibliography (about 37000
bibliographical records from the Medline database) was
constructed (i) by automatic extraction of gene names from
the abstracts, (ii) by automated pattern recognition and natural
language processing methods for names of genes and their
products, and (iii) by compiling literature annotations available
in various databases such as FlyBase, SwissProt or Entrez
Gene. Co-citation clusters and networks were also constructed
by automatic clustering approaches, and annotated with
biological and functional information from Medline records
(indexing keywords) and with available information on
D.melanogaster genes and their products in databases and
ontologies. Each of the 4487 pea aphid contigs homologous to a
D.melanogaster gene was thus linked to literature records, clusters
and networks by using the names of its D.melanogaster homologs.
This bibliographic network of co-occurrence of cited genes in the
literature is a quick and efficient tool to infer biological functions
in which a given pea aphid contig might be involved.
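The co-citation idea can be illustrated with a minimal sketch that counts how often two gene names occur in the same abstract and then lists the co-cited partners of the fly homolog of a contig. The dictionary-based whole-word matching, the toy abstracts, the placeholder gene names and the contig-to-homolog mapping are all hypothetical; the actual analysis relied on pattern recognition, natural language processing and automatic clustering whose details are not given here.

```python
import re
from collections import Counter
from itertools import combinations


def genes_in_abstract(abstract, gene_names):
    """Return the set of known gene names found as whole words in one abstract."""
    found = set()
    for name in gene_names:
        # Lookarounds prevent matching a name inside a longer identifier.
        if re.search(r"(?<!\w)" + re.escape(name) + r"(?!\w)", abstract):
            found.add(name)
    return found


def cocitation_counts(abstracts, gene_names):
    """Count, for each pair of gene names, the abstracts citing both."""
    counts = Counter()
    for abstract in abstracts:
        genes = sorted(genes_in_abstract(abstract, gene_names))
        counts.update(combinations(genes, 2))
    return counts


def neighbours(counts, gene, min_count=1):
    """Return {partner: count} for genes co-cited with `gene` at least min_count times."""
    result = {}
    for (a, b), n in counts.items():
        if n >= min_count and gene in (a, b):
            result[b if a == gene else a] = n
    return result


# Toy data (placeholder names and invented records, for illustration only).
GENE_NAMES = ["geneA", "geneB", "geneC", "geneD"]
TOY_ABSTRACTS = [
    "geneA acts upstream of geneB in this toy record.",
    "geneB, geneC and geneD are mentioned together.",
    "geneC and geneB again, plus an unrelated name geneAB.",
]
CONTIG_TO_FLY_HOMOLOG = {"contig_0001": "geneB"}  # hypothetical mapping

if __name__ == "__main__":
    counts = cocitation_counts(TOY_ABSTRACTS, GENE_NAMES)
    homolog = CONTIG_TO_FLY_HOMOLOG["contig_0001"]
    print(homolog, neighbours(counts, homolog, min_count=2))
```

In this sketch a contig reaches the network only through the name of its fly homolog, which mirrors the linking step described above; raising `min_count` keeps only the more frequently co-cited partners.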
As soon as the sequence and assembly of the 530 Mb genome
of A.pisum are available, gameXML files of specific regions
of interest will be easily extracted and loaded into
genome annotation editors. Comparisons with (i) other
aphid EST or cDNA sequences (e.g. Hunter et al., 2003),
(ii) the D.melanogaster and A.gambiae genomes or (iii) other insect
genomes under annotation (e.g. the honey bee) will support human
expert decisions. AphidBase will also be extended with new
functional annotation modules, mainly for transcript and
protein profiling. AphidBase will thus provide the necessary
infrastructure for curation, archiving and functional annotation
of an aphid genome, one of the first hemipteran sequenced
S. Cain (GMOD), O. Chenedé (INRA Rennes), J.P. Gaultier
J.C. Simon, C. Soster (INRA Rennes) and L. Stein (CSHL,
USA) are acknowledged for their technical support, advice
and discussions. Financial
Rennes Metropole and ANR Exdisum.
Conflict of Interest: none declared.
Drysdale,R. et al. (2005) FlyBase: genes and gene models. Nucleic Acids Res., 33,
Grimaldi,D. and Engels,M.S. (2005) Evolution of the Insects. Cambridge
University Press, New York.
Hunter,W.B. et al. (2003) Aphid biology: expressed genes from the alate
Toxoptera citricida, the brown citrus aphid. J. Ins. Sci., 3, 23.
Sabater-Muñoz,B. et al. (2006) Large-scale gene discovery in the pea aphid
Acyrthosiphon pisum (Hemiptera). Genome Biol., 7, R21.
Schiex,T. et al. (2003) FrameD: a flexible program for quality check and gene
prediction in prokaryotic genomes and noisy matured eukaryotic sequences.
Nucleic Acids Res., 31, 3738–3741.
Stein,L.D. et al. (2002) The generic genome browser: a building
block for a model organism system database. Genome Res., 12, 1599–1610.