Published online 11 December 2007
Nucleic Acids Research, 2008, Vol. 36, Database issue D25–D30
Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell and
David L. Wheeler*
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health,
Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
Received September 18, 2007; Accepted October 10, 2007
GenBank® is a comprehensive database that
contains publicly available nucleotide sequences
for more than 260000 named organisms, obtained
primarily through submissions from individual labo-
ratories and batch submissions from large-scale
sequencing projects. Most submissions are made
using the web-based BankIt or standalone Sequin
programs and accession numbers are assigned by
GenBank staff upon receipt. Daily data exchange
with the European Molecular Biology Laboratory
Nucleotide Sequence Database in Europe and the
DNA Data Bank of Japan ensures worldwide cover-
age. GenBank is accessible through NCBI’s retrieval
system, Entrez, which integrates data from the
major DNA and protein sequence databases along
with taxonomy, genome, mapping, protein structure
and domain information, and the biomedical journal
literature via PubMed. BLAST provides sequence
similarity searches of GenBank and other sequence
databases. Complete bimonthly releases and daily
updates of the GenBank database are available by
FTP. To access GenBank and its related retrieval
and analysis services, begin at the NCBI Homepage:
GenBank (1) is a comprehensive public database of
nucleotide sequences and supporting bibliographic and
biological annotation, built and distributed by the
National Center for Biotechnology Information (NCBI),
a division of the National Library of Medicine (NLM),
located on the campus of the US National Institutes of
Health (NIH) in Bethesda, MD, USA.
NCBI builds GenBank primarily from the submission
of sequence data from authors and from the bulk
submission of expressed sequence tag (EST), genome
survey sequence (GSS), and other high-throughput data
from sequencing centers. The US Office of Patents and
Trademarks also contributes sequences from issued
patents. GenBank, the European Molecular Biology
Laboratory Nucleotide Sequence Database (EMBL) (2)
in Europe, and the DNA Data Bank of Japan (DDBJ) (3)
comprise the International Nucleotide Sequence Database
Collaboration (INSDC), and are members of a long-
standing collaboration in which data is exchanged daily to
ensure a uniform and comprehensive collection of
sequence information. NCBI makes the GenBank data
available at no cost over the Internet, via FTP and via a
wide range of Web-based retrieval and analysis services
which operate on the GenBank data (4).
ORGANIZATION OF THE DATABASE
From its inception, GenBank has doubled in size about
every 18 months. The traditional GenBank divisions
contain over 80 billion nucleotide bases from more than
76 million individual sequences, with 15 million new
sequences added in the past year. Contributions from
Whole Genome Shotgun (WGS) projects supplement the
data in the traditional divisions to bring the total beyond
190 billion bases. Complete genomes
(www.ncbi.nlm.nih.gov/Genomes/index.html) continue to represent a
rapidly growing segment of the database, with some 200
of more than 570 complete microbial genomes in
GenBank deposited over the past year. The number of
eukaryote genomes for which coverage and assembly are
significant continues to increase as well, with over 190
assemblies now available, including that of the reference
Database sequences are classified and can be queried using
a comprehensive sequence-based taxonomy
(www.ncbi.nlm.nih.gov/sites/entrez?db=taxonomy) developed by
NCBI in collaboration with EMBL and DDBJ and with
the valuable assistance of external advisers and curators.
More than 260000 named species are represented in
GenBank and new species are being added at the rate of
over 1700 per month. About 12% of the sequences in
GenBank are of human origin and 8% of all sequences are
*To whom correspondence should be addressed. Tel: 301 435 5950; Fax: 301 480 9241; Email: email@example.com
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License
(http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
human expressed sequence tags (ESTs). The top species in
GenBank in terms of number of bases are Homo sapiens
(12.7 billion bases), Mus musculus (8.3 billion), Rattus
norvegicus (5.8 billion), Bos taurus (3.8 billion), Zea
mays (3.6 billion), Danio rerio (2.8 billion), Sus scrofa
(1.9 billion), Oryza sativa (1.5 billion), Strongylocentrotus
purpuratus (1.4 billion), Xenopus tropicalis (1.1 billion) and
Pan troglodytes (940 million).
GenBank records and divisions
Each GenBank entry includes a concise description of
the sequence, the scientific name and taxonomy of the
source organism, bibliographic references and a table of
features listing areas of biological significance, such as coding
regions and their protein translations, transcription units,
repeat regions and sites of mutations or modifications.
The files in the GenBank distribution have traditionally
been partitioned into ‘divisions’ that roughly correspond
to taxonomic groups such as bacteria (BCT), viruses
(VRL), primates (PRI) and rodents (ROD). In recent
years, divisions have been added to support specific
sequencing strategies. These include divisions for expres-
sed sequence tag (EST), genome survey (GSS), high-
throughput genomic (HTG), high-throughput cDNA
(HTC) and environmental sample (ENV) sequences,
making a total of 18 divisions. For convenience in file
transfer, the GenBank data is partitioned into multiple
files, currently more than 1300, for the bimonthly
GenBank releases on NCBI’s FTP site.
Expressed sequence tags (ESTs). ESTs continue to be
a major source of new sequence records and gene
sequences, comprising over 25 billion nucleotide bases in
GenBank release 161. Over the past year, the number of
ESTs has increased by over 19% to a total of 45.5 million
sequences representing more than 1370 different organ-
isms. The top organisms represented in the EST division
are Homo sapiens (8.1 million records), Mus musculus
(4.9 million), Bos taurus (1.5 million), Sus scrofa
(1.5 million), Danio rerio (1.4 million) and Arabidopsis
thaliana (1.3 million). As part of its daily processing of
GenBank EST data, NCBI identifies through BLAST
searches all homologies for new EST sequences and
incorporates that information into the companion data-
base, dbEST (www.ncbi.nlm.nih.gov/dbEST/index.html)
(5). The data in dbEST is processed further to produce the
UniGene database
(www.ncbi.nlm.nih.gov/sites/entrez?db=unigene) of more than 1.5 million gene-oriented
sequence clusters representing over 85 organisms and
described more fully in Ref. (4).
Sequence-tagged sites (STSs), genome survey sequences
(GSSs) and environmental sample sequences (ENV). The
STS division of GenBank (www.ncbi.nlm.nih.gov/dbSTS/index.html)
contains over 930000 sequences, including
anonymous STSs based on genomic sequence as well as
gene-based STSs derived from the 3′ ends of genes and
ESTs. These STS records usually include mapping
The GSS division of GenBank (www.ncbi.nlm.nih.gov/dbGSS/index.html)
has grown over the past year by 29%
to a total of 21 million records for over 670 organisms and
contributes over 13.5 billion nucleotide bases. GSS
sequences are the products of as many as 80 different
experimental techniques, including ‘metagenomic’ surveys
of sequences arising from
However, about half of all GSS records are single reads
from Bacterial Artificial Chromosomes (‘BAC-ends’) used
in a variety of genome sequencing projects. The most
highly represented species in the GSS division, including
metagenomic surveys, are marine metagenome (2.6 million
records), Zea mays (2.1 million), Mus musculus (1.8
million) and Homo sapiens (1.1 million). The human
data has been used (www.ncbi.nlm.nih.gov/projects/genome/clone/)
along with the STS records in tiling the
BACs for the Human Genome Project (6).
The ENV division of GenBank accommodates non-
WGS sequences obtained via environmental sampling
methods in which the source organism is unknown.
Records in the ENV division contain ‘ENV’ in the
keyword field and use an ‘/environmental_sample’ qualifier
in the source feature. As of GenBank release 161, the ENV
division of GenBank contained over 600000 sequences,
comprising 403 million base pairs.
High-throughput genomic (HTG) and high-throughput
cDNA (HTC) sequences. The HTG division of
GenBank (www.ncbi.nlm.nih.gov/HTGS/) contains unfinished
large-scale genomic records, which are in transition
to a finished state (7). These records are designated as
Phase 0–3 depending on the quality of the data. Upon
reaching Phase 3, the finished state, HTG records are
moved into the appropriate organism division of
GenBank. As of release 161 of GenBank, the HTG
division comprised 18 billion base pairs of sequence, an
increase of more than 2 billion bases over the past year.
The HTC division of GenBank accommodates high-
throughput cDNA sequences. HTCs are of draft quality
but may contain 5′ UTRs and 3′ UTRs, partial coding
regions and introns. HTC sequences which are finished
and of high quality are moved to the appropriate
organism GenBank division. GenBank release 161 con-
tained more than 429000 HTC sequences totaling 570
million bases. A project generating HTC data is described
in Ref. (8).
Whole Genome Shotgun (WGS) sequence. More than 101
billion bases of WGS sequence appear in GenBank as sets
of WGS contigs, many of them bearing annotations
originating from a single sequencing project. These
sequences are issued accession numbers consisting of a
4-letter project ID, followed by a two-digit version
number and a 6-digit contig ID. Hence, the WGS
accession number ‘AAAA01072744’ is assigned to contig
number ‘072744’ of the first version of project ‘AAAA’.
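
The accession layout described above can be checked mechanically. The following is a minimal Python sketch, assuming only the layout stated in the text (4 letters, 2 digits, 6 digits); the names `WgsAccession` and `parse_wgs_accession` are illustrative and are not part of any NCBI tool.

```python
import re
from typing import NamedTuple

# Layout taken from the text: 4-letter project ID, 2-digit version, 6-digit contig ID.
_WGS_PATTERN = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


class WgsAccession(NamedTuple):
    project: str  # e.g. "AAAA"
    version: int  # e.g. 1 for the first version of the project
    contig: str   # kept as text so that leading zeros are preserved


def parse_wgs_accession(accession: str) -> WgsAccession:
    """Split a WGS accession into project ID, version and contig ID."""
    match = _WGS_PATTERN.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a WGS accession in the expected layout: {accession!r}")
    project, version, contig = match.groups()
    return WgsAccession(project, int(version), contig)
```

Applied to the example above, `parse_wgs_accession("AAAA01072744")` yields project `AAAA`, version `1` and contig `072744`. Accessions that do not follow this exact layout are rejected rather than guessed at.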
Whole Genome Shotgun (WGS) sequencing projects have
contributed some 25 million contigs to GenBank, a 39%
increase over last year’s total. These primary sequences
have been used to construct 4.1 million large-scale
assemblies of scaffolds and chromosomes. WGS project
contigs for Homo sapiens, Pan troglodytes, Macaca
mulatta, Equus caballus, Canis familiaris, Drosophila,
Saccharomyces and 800 other organisms and environ-
mental samples are available. For a complete list of WGS
projects with links to the data, see (www.ncbi.nlm.
Although WGS project sequences may be annotated,
many low-coverage genome projects do not contain
annotation. Because these sequence projects are ongoing
and incomplete, these annotations may not be tracked
from one assembly version to the next and should be
Submitters of WGS sequences, and genomic sequences in
general, are urged to use a new set of evidence tags of
the form ‘/experimental=text’ and ‘/inference=TYPE:text’,
where ‘TYPE’ is one of a number of standard inference
types and ‘text’ is made up of structured text. These
new qualifiers replace ‘evidence=experimental’ and
‘evidence=non-experimental’, respectively, which are no
Special Record types
Third Party Annotation (TPA). Third Party Annotation
(TPA) records support the reporting of published
sequence annotation by a scientist other than the original
submitter of the primary sequence record in DDBJ/
EMBL/GenBank. TPA records fall into one of two
categories, ‘experimental’, in which case there is direct
experimental evidence for the existence of the annotated
molecule, and ‘inferential’, in which case the experimental
evidence is indirect. TPA sequences may be created by
assembling a number of primary sequences. The format of
a TPA record (e.g. BK000016) is similar to that of a
conventional GenBank record but includes the label
‘TPA:’ at the beginning of each Definition Line and the
keywords ‘Third Party Annotation; TPA’ in the Keywords
field. The Comment field of TPA records lists the primary
sequences used to assemble the TPA sequence; the
Primary field provides the base ranges of the primary
sequences that contribute to the TPA sequence.
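
The two markers of a TPA record described above (the ‘TPA:’ label on the Definition Line and the keywords in the Keywords field) lend themselves to a simple check. The sketch below is an illustration only: it assumes the two fields have already been extracted as plain strings, and the function name `is_tpa_record` is hypothetical.

```python
def is_tpa_record(definition: str, keywords: str) -> bool:
    """Return True if both TPA markers described in the text are present.

    `definition` is the text of the Definition Line and `keywords` the text
    of the Keywords field, with entries separated by semicolons.
    """
    has_label = definition.lstrip().startswith("TPA:")
    # Normalise the keyword list: split on ';', trim blanks and a trailing period.
    terms = {term.strip().rstrip(".") for term in keywords.split(";")}
    has_keywords = {"Third Party Annotation", "TPA"} <= terms
    return has_label and has_keywords
```

Requiring both markers is a deliberately strict design choice; a looser check on the label alone would also be defensible.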
Over 5500 TPA records are contained in GenBank
release 161, including 2170 for Drosophila melanogaster,
960 for Homo sapiens, 330 for Oryza sativa and 290 for
Mus musculus. TPA sequences are not released to the
public until their accession numbers or sequence data and
annotation appear in a peer-reviewed biological journal.
TPA submissions to GenBank may be made using either
BankIt or Sequin. For more information on TPA, see
records. Although many genomes, such as bacterial
genomes, are represented in GenBank as single sequences,
it is desirable from the standpoints of data transfer and
analysis to break some very long sequences, such as
portions of eukaryotic genomes, into smaller segments. In
these cases, CON division records for the entire sequence
are produced that contain assembly instructions to allow
the seamless display and download of the full sequence.
Many CON records also include annotations.
CON records for assemblies of smaller
BUILDING THE DATABASE
The data in GenBank, and the collaborating databases
EMBL and DDBJ, is submitted primarily by individual
authors to one of the three databases, or by sequencing
centers as batches of EST, STS, GSS, HTC, WGS or HTG
sequences. Data is exchanged daily with DDBJ and
EMBL so that the daily updates from NCBI servers
incorporate the most recently available sequence data
from all sources.
Direct electronic submission
Virtually all records enter GenBank as direct electronic
submissions (www.ncbi.nlm.nih.gov/Genbank/index.html),
with the majority of authors using the
BankIt or Sequin programs. Many journals require
authors with sequence data to submit the data to a
public database as a condition of publication.
GenBank staff can usually assign an accession number
to a sequence submission within two working days of
receipt, and do so at a rate of almost 1600 per day. The
accession number serves as confirmation that the sequence
has been submitted and allows readers of articles in which
the sequence is cited to retrieve the data. Direct
submissions receive a quality assurance review that
includes checks for vector contamination, proper transla-
tion of coding regions, correct taxonomy and correct
bibliographic citations. A draft of the GenBank record is
passed back to the author for review before it enters the
database. Authors may ask that their sequences be kept
confidential until the time of publication. Since GenBank
policy requires that the deposited sequence data be made
public when the sequence or accession number is
published, authors are instructed to inform GenBank
staff of the publication date of the article in which the
sequence is cited in order to ensure a timely release of the
data. Although only the submitting scientist is permitted
to modify sequence data or annotations, all users are
encouraged to report lags in releasing data or possible
errors or omissions to GenBank at (firstname.lastname@example.org.
NCBI works closely with sequencing centers to ensure
timely incorporation of bulk data into GenBank for public
release. GenBank offers special batch procedures for
large-scale sequencing groups to facilitate data sub-
mission, including the program ‘tbl2asn’, described at
Submission using BankIt. About a third of author
submissions are received through NCBI’s Web-based
data submission tool, BankIt (www.ncbi.nlm.nih.gov/BankIt).
Using BankIt, authors enter sequence information
directly into a form and add biological annotation
such as coding regions or mRNA features. Free-form text
boxes, list boxes and pull-down menus allow the submitter
to further describe the sequence without having to learn
formatting rules or restricted vocabularies. Before creating
a draft record in GenBank flat file format for the
submitter to review, BankIt validates submissions, flagging
many common errors, and checks for vector contamination
using a variant of BLAST called VecScreen.
BankIt is the tool of choice for simple submissions,
especially when only one or a small number of records is
to be submitted (7). BankIt can also be used by submitters
to update their existing GenBank records.
Submission using Sequin and tbl2asn. NCBI also offers a
standalone multi-platform submission program called
Sequin (www.ncbi.nlm.nih.gov/Sequin/index.html) that
can be used interactively with other NCBI sequence
retrieval and analysis tools. Sequin handles simple
sequences such as a cDNA, as well as segmented entries,
phylogenetic studies, population studies, mutation studies,
environmental samples and alignments for which BankIt
and other Web-based submission tools are not well-suited.
Sequin has convenient editing and complex annotation
capabilities and contains a number of built-in validation
functions for quality assurance. In addition, Sequin is
able to accommodate large sequences, such as that of the
5.6 Mb Escherichia coli genome, and read in a full
complement of annotations via simple tables. Versions for
Macintosh, PC and Unix computers are available via
anonymous FTP at (ftp.ncbi.nih.gov) in the ‘sequin’
directory. Once a submission is completed, submitters
can e-mail the Sequin file to the address (gb-sub@ncbi.
Submitters of large, heavily annotated genomes may
find it convenient to use ‘tbl2asn’, referenced above under
‘Direct submission’, to convert a table of annotations
generated via an annotation pipeline into an ASN.1
(Abstract Syntax Notation One) record suitable for
submission to GenBank.
Submission of barcode sequences. The Consortium for the
Barcode of Life (CBOL) is an international initiative to
develop DNA barcoding as a tool for characterizing
species of organisms using a short DNA sequence, usually
648 bp, derived from a portion of the cytochrome
oxidase subunit I gene. NCBI, in collaboration with
CBOL (www.barcoding.si.edu/index.htm), has created an
online tool for the bulk submission of barcode sequences
to GenBank (www.ncbi.nlm.nih.gov/BankIt/websub/?tool=barcode)
that allows users to upload files containing
a batch of sequences with associated source information.
It is anticipated that this tool will be used for other types
of bulk submissions in the near future.
Sequence identifiers and accession numbers
Accession.Version. Each GenBank record, consisting of
both a sequence and its annotations, is assigned a unique
identifier, the accession number, that is shared across the
three collaborating databases (GenBank, DDBJ, EMBL)
and remains constant over the lifetime of the record even
when there is a change to the sequence or annotation.
Each version of the DNA sequence within a GenBank
record is also assigned a unique NCBI identifier, called
a ‘gi’, that appears on the VERSION line of GenBank
flat file records following the accession number. A third
identifier of the form ‘Accession.version’, also displayed
on the VERSION line of flat file records, contains the
information present in both the gi and accession numbers.
An entry appearing in the database for the first time
has an ‘Accession.version’ identifier equivalent to the
ACCESSION number of the GenBank record followed by
‘.1’ to indicate the first version of the sequence for the
record:
VERSION     AF000001.1  GI:987654321
When a change is made to a sequence in a GenBank
record, a new gi number is issued to the sequence and the
version extension of the ‘Accession.version’ identifier is
incremented. The accession number for the record as a
whole remains unchanged and the older sequence remains
available under the old ‘Accession.version’ identifier.
A similar system tracks changes in the corresponding
protein translations. These identifiers appear as qualifiers
for CDS features in the FEATURES portion of a
GenBank entry, e.g. /protein_id="AAA00001.1". Protein
sequence translations also receive their own unique gi
number, which appears as a second qualifier on the CDS
feature:
/db_xref="GI:1233445"
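
The relationship between the accession number, the version suffix and the gi on the VERSION line can be made concrete with a small parser. This is a hedged sketch rather than an official reader of the flat file format: the names `SequenceId` and `parse_version_line` are illustrative, and treating the gi as optional is an assumption made here for robustness.

```python
import re
from typing import NamedTuple, Optional

# "VERSION", then Accession.version, then optionally "GI:<number>".
_VERSION_LINE = re.compile(
    r"^VERSION\s+(?P<accession>[A-Za-z0-9_]+)\.(?P<version>\d+)"
    r"(?:\s+GI\s*:\s*(?P<gi>\d+))?\s*$"
)


class SequenceId(NamedTuple):
    accession: str      # stable for the lifetime of the record
    version: int        # incremented whenever the sequence changes
    gi: Optional[int]   # identifier of this particular sequence version, if present


def parse_version_line(line: str) -> SequenceId:
    """Parse a flat file VERSION line into its three identifiers."""
    match = _VERSION_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognised VERSION line: {line!r}")
    gi = match.group("gi")
    return SequenceId(
        match.group("accession"),
        int(match.group("version")),
        int(gi) if gi is not None else None,
    )
```

For the illustrative line `VERSION     AF000001.1  GI:987654321`, the parser returns accession `AF000001`, version `1` and gi `987654321`. After a sequence change, the same accession would be expected to appear with version `2` and a different gi.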
Ensuring stable access to sequence data
A convenient way to share the data among a set of
collaborators is to post the data to a locally maintained
Web site. However, if original data and updates are not
simultaneously submitted to a central repository, signifi-
cant problems can arise.
The access lifetime of the data may be reduced. The
ephemeral nature of much of the content on the Web is
part of the common experience. In one attempt to
quantify content lifetime, 360 randomly selected web
pages were tracked for a period of four years, and a half-
life of only two years was measured for the set (9). While a
well-maintained web page can certainly persist for longer
than two years, the relatively short half-life reported for
this set of pages is worth noting.
The full biological context of the data may not be
realized. Even during the accessible lifetime of locally
posted sequence data, the full biological context of a
sequence may not be realized, if the sequence cannot be
conveniently compared to others—perhaps derived from
distantly related organisms that are beyond the scope of
the host web page.
Existing data in heavily used, centralized databases will
become outdated. If updates to sequences contained
within centralized databases are made to a local page,
but not also made to corresponding records in a central
database, the newer data will not reach the wider research
community and much of its impact will be lost.
Submission of sequence data to a centralized repository
solves these problems. Centralized databases, such as
GenBank and the other members of the INSDC, ensure
stable access to sequence data by providing versioned
releases available by FTP, Web interfaces to a uniform
data set and archival redundancy. Combining new data
with that of other researchers worldwide within a central
database provides a broad biological context that
stimulates discovery—keeping each sequence up to date
magnifies the utility of all the sequences in the database.
RETRIEVING GENBANK DATA
The Entrez system
The sequence records in GenBank are accessible via
Entrez (www.ncbi.nlm.nih.gov/sites/gquery), a flexible
database retrieval system that covers 35 biological
databases. Entrez databases contain DNA and protein
sequences derived from GenBank and other sources,
genome maps, population, phylogenetic and environmen-
tal sequence sets, gene expression data, the NCBI
taxonomy, protein domain information and protein
structures from the Molecular Modeling Database,
MMDB (10). Each database is linked to the scientific
literature via PubMed and PubMed Central.
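
For programmatic retrieval, NCBI also exposes Entrez through its E-utilities web service, which is not described in this article. The sketch below only constructs a request URL and performs no network access; the endpoint, the `nuccore` database name and the parameter names reflect the E-utilities interface as commonly documented and should be verified against current NCBI documentation, and the helper name `efetch_url` is illustrative.

```python
from urllib.parse import urlencode

# Assumed EFetch endpoint of the NCBI E-utilities service.
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession_version: str, rettype: str = "gb") -> str:
    """Build a URL requesting one nucleotide record as plain text.

    `rettype="gb"` asks for the GenBank flat file view of the record.
    """
    params = {
        "db": "nuccore",          # Entrez nucleotide database (assumption)
        "id": accession_version,  # e.g. an Accession.version identifier
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EUTILS_EFETCH}?{urlencode(params)}"
```

The resulting URL can be passed to any HTTP client. Using an ‘Accession.version’ identifier rather than a bare accession pins the request to one specific version of the sequence.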
Associating sequence records with sequencing projects
The ability to identify all GenBank records submitted by a
specific group or those with a particular focus, such as
metagenomic surveys, is essential for the analysis of large
volumes of sequence data. The use of organism or
submitter names as a means to define such a set of
sequences is unreliable. The Genome Project Database,
developed at NCBI and subsequently adopted across the
INSDC, allows sequencing centers to register projects
under a unique project identifier, enabling reliable linkage
between sequencing projects and the data they produce.
A new ‘PROJECT’ line appearing in GenBank flat files
identifies the sequencing projects with which a GenBank
sequence record is associated. The PROJECT line may
contain multiple identifiers of the form ‘type’ and ‘value’,
respectively, separated by a semicolon. As an example, the
PROJECT line below associates a GenBank sequence
record with Genome Project
(www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj) record ‘18787’.
PROJECT     GenomeProject:18787
Genome Project record ‘18787’ provides details of the
progress made in the effort to sequence Anolis carolinensis
(the green anole) (www.broad.mit.edu/models/anole/).
Within the Entrez system, such a sequence record is
linked directly to the appropriate Genome Project record;
conversely, Genome Project records link back to asso-
ciated sequence records.
BLAST sequence-similarity searching
Sequence-similarity searches are the most fundamental
and frequent type of analysis performed on the GenBank
data. NCBI offers the BLAST (www.ncbi.nlm.nih.gov/BLAST/)
family of programs to detect similarities between
a query sequence and database sequences (11,12). BLAST
searches may be performed on NCBI’s Web site (13), or
via a set of standalone programs distributed by FTP.
BLAST is discussed in a separate article in this issue (4).
Obtaining GenBank by FTP
NCBI distributes GenBank releases in the traditional flat
file format as well as in the ASN.1 format used for internal
maintenance. The full bimonthly GenBank release and the
daily updates, which also incorporate sequence data from
EMBL and DDBJ, are available by anonymous FTP from
NCBI at (ftp.ncbi.nih.gov) or (www.ncbi.nlm.nih.gov/Ftp/)
as well as from a mirror site at the University of
full release in flat file format is available as compressed
files in the directory ‘genbank’, with a non-cumulative set
of updates contained in ‘daily-nc’. A script is provided in
the ‘tools’ directory of the GenBank FTP site to convert a
set of daily updates into a cumulative update.
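
As an illustration of the directory layout described above, the following Python sketch lists the compressed flat files of a release using the standard `ftplib` module. The host name and the ‘genbank’ directory come from the text; the `.seq.gz` file suffix and the helper name `list_flatfiles` are assumptions, and the server layout may have changed since publication.

```python
from ftplib import FTP
from typing import List

GENBANK_FTP_HOST = "ftp.ncbi.nih.gov"  # host as given in the text


def list_flatfiles(ftp, directory: str = "genbank", suffix: str = ".seq.gz") -> List[str]:
    """Return the sorted names of compressed flat files in `directory`.

    `ftp` is a connected, logged-in ftplib.FTP instance (or any object
    offering the same cwd() and nlst() methods).
    """
    ftp.cwd(directory)
    return sorted(name for name in ftp.nlst() if name.endswith(suffix))


# Example use with anonymous login (requires network access):
#
#     with FTP(GENBANK_FTP_HOST) as ftp:
#         ftp.login()
#         for name in list_flatfiles(ftp)[:10]:
#             print(name)
```

Passing the connection in as an argument keeps the listing logic separate from connection handling, so the same function can be pointed at the ‘daily-nc’ directory by changing `directory`.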
Rockville Pike, Bethesda, MD 20894, USA. Tel: +1 301
496 2475; Fax: +1 301 480 9241.
email@example.com NCBI Home Page.
firstname.lastname@example.org Submission of sequence data
email@example.com Revisions to, or notification of
release of ‘confidential’ GenBank entries.
NCBI and services.
If you use the GenBank database in your published
research, we ask that this article be cited.
Funding to pay the Open Access publication charges for
this article was provided by the Intramural Research
Program of the National Institutes of Health, National
Library of Medicine.
Conflict of interest statement. None declared.
1. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J. and
Wheeler,D.L. (2007) GenBank. Nucleic Acids Res., 35(Database
2. Kulikova,T., Akhtar,R., Aldebert,P., Althorpe,N., Andersson,M.,
Baldwin,A., Bates,K., Bhattacharyya,S., Bower,L. et al. (2007)
EMBL Nucleotide Sequence Database in 2006. Nucleic Acids Res.,
35(Database issue), 16–20.
3. Sugawara,H., Abe,T., Gojobori,T. and Tateno,Y. (2007) DDBJ
working on evaluation and classification of bacterial genes in
INSDC. Nucleic Acids Res., 35(Database issue), 13–15.
4. Wheeler,D.L., Barrett,T., Benson,D.A., Bryant,S.H., Canese,K.,
Chetvernin,V., Church,D.M., DiCuccio,M., Edgar,R. et al. (2008)
Database resources of the National Center for Biotechnology
Information. Nucleic Acids Res., This issue (Database issue).
5. Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) dbEST –
database for ‘expressed sequence tags’. Nat. Genet., 4, 332–333.
6. Smith,M.W., Holmsen,A.L., Wei,Y.H., Peterson,M. and
Evans,G.A. (1994) Genomic sequence sampling: a strategy for high
resolution sequence-based physical mapping of complex genomes.
Nat. Genet., 7, 40–47.
7. Kans,J. and Ouellette,B. (2001) Bioinformatics: A Practical Guide to
the Analysis of Genes and Proteins chapter Submitting DNA
Sequences to the Databases, John Wiley and Sons, Inc.: New York,
NY, pp. 65–81.
8. Kawai,J., Shinagawa,A., Shibata,K., Yoshino,M., Itoh,M., Ishii,Y.,
Arakawa,T., Hara,A., Fukunishi,Y. et al. (2001) Functional annota-
tion of a full-length mouse cDNA collection. Nature, 409, 685–690.
9. Koehler,W. (2002) Web page change and persistence – a four-year
longitudinal study. J. Am. Soc. Inf. Sci. Technol., 53, 162–171.
10. Wang,Y., Addess,K.J., Chen,J., Geer,L.Y., He,J., He,S., Lu,S.,
Madej,T., Marchler-Bauer,A. et al. (2007) MMDB: annotating
protein sequences with Entrez’s 3D-structure database. Nucleic
Acids Res., 35(Database issue), 298–300.
11. Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z.,
Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-
BLAST: a new generation of protein database search programs.
Nucleic Acids Res., 25, 3389–3402.
12. Zhang,Z., Schäffer,A.A., Miller,W., Madden,T.L., Lipman,D.J.,
Koonin,E.V. and Altschul,S.F. (1998) Protein sequence
similarity searches using patterns as seeds. Nucleic Acids Res., 26,
13. Ye,J., McGinnis,S. and Madden,T.L. (2006) BLAST: improvements
for better sequence analysis. Nucleic Acids Res., 34(Web Server