National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA.
Nucleic Acids Research (Impact Factor: 9.11). 12/2011; 40(Database issue):D48-53. DOI: 10.1093/nar/gkr1202
Source: PubMed


GenBank® is a comprehensive database that contains publicly available nucleotide sequences for over 300 000 formally described species.
These sequences are obtained primarily through submissions from individual laboratories and batch submissions from large-scale
sequencing projects, including whole-genome shotgun and environmental sampling projects. Most submissions are made using the
web-based BankIt or standalone Sequin programs, and GenBank staff assign accession numbers upon data receipt. Daily data exchange
with the European Nucleotide Archive and the DNA Data Bank of Japan ensures worldwide coverage. GenBank is accessible through
the NCBI Entrez retrieval system, which integrates data from the major DNA and protein sequence databases along with taxonomy,
genome, mapping, protein structure and domain information, and the biomedical journal literature via PubMed. BLAST provides
sequence similarity searches of GenBank and other sequence databases. Complete bimonthly releases and daily updates of the
GenBank database are available by FTP.
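
As an illustration of the Entrez-based access described above, the following minimal sketch builds a request URL for the NCBI E-utilities EFetch service, which can return a nucleotide record in GenBank flat-file format. The abstract itself does not mention E-utilities; using that interface here, the helper name `genbank_flatfile_url`, and the example accession (taken from one of the citing excerpts on this page) are assumptions made for illustration. The code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch endpoint.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_flatfile_url(accession: str) -> str:
    """Return an EFetch URL for one nucleotide record (hypothetical helper).

    db=nuccore selects the nucleotide database, rettype=gb requests the
    GenBank flat-file format, and retmode=text requests plain text.
    """
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "gb",
        "retmode": "text",
    }
    return f"{EFETCH_BASE}?{urlencode(params)}"


if __name__ == "__main__":
    # Example accession chosen for illustration only.
    print(genbank_flatfile_url("KM243030"))
```

The resulting URL could then be fetched with any HTTP client; NCBI's usage guidelines on request rates would apply to real use.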


Available from: Ilene Karsch-Mizrachi
  • Source
    • "This dataset represented clinical samples collected in the Philippines from 22 provinces in the period between 1994 and 2005 (Additional file 1). In addition, a further 210 VP1 coding region sequences and representing isolates collected from Austria, China, Germany, Hong Kong SAR, Malaysia, Russia, Taiwan, Thailand and Vietnam [8,13-18] were retrieved from both GenBank at NCBI [19] and the WRLFMD sequence archive and, then, integrated with the Philippines collection to comprise a total dataset of 322 VP1 coding sequences (Additional file 2) These VP1 coding region sequences have been submitted to GenBank as have been assigned the following accession numbers: KM243030-KM243172. "
    ABSTRACT: Reconstructing the evolutionary history, demographic signal and dispersal processes from viral genome sequences contributes to our understanding of the epidemiological dynamics underlying epizootic events. In this study, a Bayesian phylogenetic framework was used to explore the phylodynamics and spatio-temporal dispersion of the O CATHAY topotype of foot-and-mouth disease virus (FMDV) that caused epidemics in the Philippines between 1994 and 2005. Sequences of the FMDV genome encoding the VP1 showed that the O CATHAY FMD epizootic in the Philippines resulted from a single introduction and was characterised by three main transmission hubs in Rizal, Bulacan and Manila Provinces. From a wider regional perspective, phylogenetic reconstruction of all available O CATHAY VP1 nucleotide sequences identified three distinct sub-lineages associated with country-based clusters originating in Hong Kong Special Administrative Region (SAR), the Philippines and Taiwan. The root of this phylogenetic tree was located in Hong Kong SAR, representing the most likely source for the introduction of this lineage into the Philippines and Taiwan. The reconstructed O CATHAY phylodynamics revealed three chronologically distinct evolutionary phases, culminating in a reduction in viral diversity over the final 10 years. The analysis suggests that viruses from the O CATHAY topotype have been continually maintained within swine industries close to Hong Kong SAR, following the extinction of virus lineages from the Philippines and the reduced number of FMD cases in Taiwan. Electronic supplementary material: The online version of this article (doi:10.1186/s13567-014-0090-y) contains supplementary material, which is available to authorized users.
    Full-text · Article · Aug 2014 · Veterinary Research
  • Source
    • "In addition, detailed information about a specific gene, regulatory sequence or intergenic region is shown after clicking on the corresponding name of the Genes table. More specifically, information about the official Human Genome Organisation (HUGO) name and symbol and other synonyms is provided as well as the chromosome and locus of the sequence, with links to the corresponding nucleotide sequence in NCBI GenBank [37], while a detailed description of the functionality of the gene and its role in inherited haemoglobinopathies is shown. Moreover, links to external databases are provided, such as NCBI Gene [35], UniProtKB [38], OMIM [11], HGNC [39] and PDB [40], as well as related publications hyperlinked to NCBI PubMed, while the corresponding locus is shown on the embedded NCBI Sequence Viewer [35]. Figure 4 shows the detailed description of the β-globin gene in IthaGenes. "
    ABSTRACT: Inherited haemoglobinopathies are the most common monogenic diseases, with millions of carriers and patients worldwide. At present, we know several hundred disease-causing mutations on the globin gene clusters, in addition to numerous clinically important trans-acting disease modifiers encoded elsewhere and a multitude of polymorphisms with relevance for advanced diagnostic approaches. Moreover, new disease-linked variations are discovered every year that are not included in traditional and often functionally limited locus-specific databases. This paper presents IthaGenes, a new interactive database of haemoglobin variations, which stores information about genes and variations affecting haemoglobin disorders. In addition, IthaGenes organises phenotype, relevant publications and external links, while embedding the NCBI Sequence Viewer for graphical representation of each variation. Finally, IthaGenes is integrated with the companion tool IthaMaps for the display of corresponding epidemiological data on distribution maps. IthaGenes is incorporated in the ITHANET community portal and is free and publicly available at
    Full-text · Article · Jul 2014 · PLoS ONE
  • Source
    • "While tests using randomly generated sequences are useful in providing a general picture of performance relative to data set size, they cannot accurately predict performance for real sequences that have significant similarity to each other. Therefore to provide a more realistic test case, STS was used to analyze a data set containing every E. coli genome available from GenBank [14] (62 genomes, June 2014), ranging in size from 3.9 to 5.7 million nt (Table  3). The traversal time for 62 genomes was approximately 2.5 minutes, whereas the construction time (import + build) was approximately 22 minutes. "
    ABSTRACT: Background: Large DNA sequence data sets require special bioinformatics tools to search and compare them. Such tools should be easy to use so that the data can be easily accessed by a wide array of researchers. In the past, the use of suffix trees for searching DNA sequences has been limited by a practical need to keep the trees in RAM. Newer algorithms solve this problem by using disk-based approaches. However, none of the fastest suffix tree algorithms have been implemented with a graphical user interface, preventing their incorporation into a feasible laboratory workflow. Results: Suffix Tree Searcher (STS) is designed as an easy-to-use tool to index, search, and analyze very large DNA sequence datasets. The program accommodates very large numbers of very large sequences, with aggregate size reaching tens of billions of nucleotides. The program makes use of pre-sorted persistent "building blocks" to reduce the time required to construct new trees. STS is comprised of a graphical user interface written in Java, and four C modules. All components are automatically downloaded when a web link is clicked. The underlying suffix tree data structure permits extremely fast searching for specific nucleotide strings, with wild cards or mismatches allowed. Complete tree traversals for detecting common substrings are also very fast. The graphical user interface allows the user to transition seamlessly between building, traversing, and searching the dataset. Conclusions: Thus, STS provides a new resource for the detection of substrings common to multiple DNA sequences or within a single sequence, for truly huge data sets. The re-searching of sequence hits, allowing wild card positions or mismatched nucleotides, together with the ability to rapidly retrieve large numbers of sequence hits from the DNA sequence files, provides the user with an efficient method of evaluating the similarity between nucleotide sequences by multiple alignment or use of Logos. The ability to re-use existing suffix tree pieces considerably shortens index generation time. The graphical user interface enables quick mastery of the analysis functions, easy access to the generated data, and seamless workflow integration.
    Full-text · Article · Jul 2014 · BMC Research Notes