The DNA Data Bank of Japan launches a new resource, the DDBJ Omics Archive of functional genomics experiments

Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Research Organization for Information and Systems, Yata, Mishima 411-8510, Japan.
Nucleic Acids Research, 11/2011; 40(Database issue):D38-42. DOI: 10.1093/nar/gkr994
Source: PubMed


The DNA Data Bank of Japan (DDBJ) maintains and provides archival, retrieval and analytical resources for biological information. The central DDBJ resource consists of public, open-access nucleotide sequence databases including raw sequence reads, assembly information and functional annotation. Database content is exchanged with EBI and NCBI within the framework of the International Nucleotide Sequence Database Collaboration (INSDC). In 2011, DDBJ launched two new resources: the 'DDBJ Omics Archive' (DOR) and BioProject. DOR is an archival database of functional genomics data generated by microarrays and highly parallel new-generation sequencers. Data are exchanged between ArrayExpress at EBI and DOR in the common MAGE-TAB format. BioProject provides an organizational framework for accessing metadata about research projects and the data from those projects that are deposited into different databases. In this article, we describe major changes and improvements introduced to the DDBJ services, and the launch of the two new resources, DOR and BioProject.
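The MAGE-TAB format used for the DOR/ArrayExpress exchange is a pair of tab-delimited files: an IDF (Investigation Description Format) describing the experiment, and an SDRF (Sample and Data Relationship Format) linking samples to assays and data files. As a rough illustration of that layout, here is a minimal Python sketch that writes a tiny IDF/SDRF pair; the field names follow the MAGE-TAB convention, but all values (titles, sample names, file names) are hypothetical, not a real submission:

```python
import csv
import io

# Minimal, illustrative IDF content: key-value rows, tab-separated.
idf_rows = [
    ["Investigation Title", "Example expression profiling experiment"],
    ["Experimental Design", "dose_response_design"],
    ["Public Release Date", "2011-11-01"],
    ["SDRF File", "example.sdrf.txt"],
]

# Minimal, illustrative SDRF content: one header row, one row per sample,
# linking each sample to an assay and a raw data file.
sdrf_header = ["Source Name", "Characteristics[organism]",
               "Assay Name", "Array Data File"]
sdrf_rows = [
    ["sample_1", "Homo sapiens", "assay_1", "sample_1.cel"],
    ["sample_2", "Homo sapiens", "assay_2", "sample_2.cel"],
]

def write_tab(rows):
    """Serialize rows as tab-delimited text, as MAGE-TAB files are."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()

idf_text = write_tab(idf_rows)
sdrf_text = write_tab([sdrf_header] + sdrf_rows)
print(idf_text)
print(sdrf_text)
```

The IDF points at its SDRF via the `SDRF File` row, which is how the two halves of a MAGE-TAB document are tied together.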



Available from: Yuichi Kodama, May 09, 2014
  • Source
    • "As a result of advancements in sequencing technologies, with increased output and decreased costs, the number of completed genomes will continue to rise, resulting in substantial amounts of data. These whole bacterial genome sequence data are housed in publicly available databases such as NCBI (Benson et al., 2015), the European Molecular Biology Laboratory–European Bioinformatics Institute (EMBL–EBI) (Amid et al., 2012), and the DNA Data Bank of Japan (DDBJ) (Kodama et al., 2012), which make up the International Nucleotide Sequence Database Collaboration (INSDC) (Nakamura et al., 2013). Additional databases with more specific microbial applications and bioinformatics programs include IMG (Markowitz et al., 2012) and PATRIC (Pathosystems Resource Integration Center) (Wattam et al., 2014). "
    ABSTRACT: Whole-genome data are invaluable for large-scale comparative genomic studies. Current sequencing technologies have made it feasible to sequence entire bacterial genomes with relative ease, in less time, and at a substantially reduced cost per nucleotide, hence cost per genome. More than 3,000 bacterial genomes have been sequenced and are available at the finished status. Publicly available genomes can be readily downloaded; however, there are challenges in verifying the specific supporting data contained within the download and in identifying errors and inconsistencies that may be present within the organizational data content and metadata. AutoCurE, an automated tool for bacterial genome database curation in Excel, was developed to facilitate local database curation of supporting data that accompany genomes downloaded from the National Center for Biotechnology Information. AutoCurE provides an automated approach to curating local genomic databases by flagging inconsistencies or errors: it compares the downloaded supporting data to the genome reports to verify genome name, RefSeq accession numbers, the presence of archaea, BioProject/UIDs, and sequence file descriptions. Flags are generated for nine metadata fields if there are inconsistencies between the downloaded genomes and genome reports, or if erroneous or missing data are evident. AutoCurE is an easy-to-use tool for local database curation of large-scale genome data prior to downstream analyses.
    Frontiers in Bioengineering and Biotechnology 08/2015; DOI:10.3389/fbioe.2015.00138
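The cross-checking strategy described in this abstract (compare metadata parsed from downloaded files against a genome report and emit a flag per disagreeing field) can be sketched in a few lines. AutoCurE itself is an Excel tool; the following Python sketch only illustrates the general pattern, and the field names, accession values, and record layout are hypothetical:

```python
def flag_inconsistencies(downloaded, report):
    """Return flag strings for fields that are missing or disagree
    between a downloaded record and the reference genome report."""
    flags = []
    for field in ("organism_name", "refseq_accession", "bioproject"):
        got = downloaded.get(field)
        expected = report.get(field)
        if got is None:
            flags.append(f"missing:{field}")
        elif got != expected:
            flags.append(f"mismatch:{field} ({got!r} != {expected!r})")
    return flags

# Illustrative records: the download carries an outdated RefSeq version
# and lacks a BioProject identifier.
downloaded = {"organism_name": "Escherichia coli K-12",
              "refseq_accession": "NC_000913.3"}
report = {"organism_name": "Escherichia coli K-12",
          "refseq_accession": "NC_000913.2",
          "bioproject": "PRJNA57779"}

print(flag_inconsistencies(downloaded, report))
# flags the RefSeq mismatch and the missing BioProject field
```

Extending the loop to all nine metadata fields the abstract mentions would follow the same shape: one comparison, one flag string per discrepancy.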
  • Source
    • "We used the version that was released in November 2011. DDBJ [39,40]: .rdf.gz format, 7,902,743,055 triples, 330 files, from_database/ddbj/. "
    ABSTRACT: Background Biological databases vary enormously in size and data complexity, from small databases that contain a few million Resource Description Framework (RDF) triples to large databases that contain billions of triples. In this paper, we evaluate whether RDF native stores can meet the needs of a biological database provider. Prior evaluations have used synthetic data of limited database size. For example, the largest BSBM benchmark uses 1 billion synthetic e-commerce RDF triples on a single node. However, real-world biological data differ greatly from such simple synthetic data, and it is difficult to determine whether synthetic e-commerce data are representative enough of biological databases. Therefore, for this evaluation, we used five real data sets from biological databases. Results We evaluated five triple stores, 4store, Bigdata, Mulgara, Virtuoso, and OWLIM-SE, with five biological data sets, Cell Cycle Ontology, Allie, PDBj, UniProt, and DDBJ, ranging in size from approximately 10 million to 8 billion triples. For each database, we loaded all the data onto our single node and prepared the database for use in a classical data-warehouse scenario. Then, we ran a series of SPARQL queries against each endpoint and recorded the execution time and the accuracy of the query responses. Conclusions Our paper shows that, with appropriate configuration, Virtuoso and OWLIM-SE can satisfy the basic requirements for loading and querying biological data of up to roughly 8 billion triples on a single node, under simultaneous access by 64 clients. OWLIM-SE performs best for databases with approximately 11 million triples; for data sets containing 94 million and 590 million triples, OWLIM-SE and Virtuoso perform best, with no overwhelming advantage over each other; for data sets over 4 billion triples, Virtuoso works best. 4store performs well on small data sets with limited features when the number of triples is less than 100 million, but our test shows its scalability is poor; Bigdata demonstrates average performance and is a good open-source triple store for middle-sized (around 500 million triples) data sets; Mulgara shows some fragility.
    Journal of Biomedical Semantics 07/2014; 5(1):32. DOI:10.1186/2041-1480-5-32
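The workload this benchmark measures (bulk-load triples, then answer SPARQL pattern queries against them) rests on a simple data model: a triple store is, at bottom, a collection of (subject, predicate, object) tuples plus pattern matching, which production stores such as Virtuoso or OWLIM-SE scale up with indexes and query planners. A toy, dependency-free Python sketch of that core idea, using hypothetical entries rather than any of the real data sets:

```python
# A minimal in-memory "triple store": a set of (s, p, o) tuples.
# All entries below are illustrative, not real database records.
triples = {
    ("entry1", "database", "DDBJ"),
    ("entry2", "database", "UniProt"),
    ("entry3", "database", "DDBJ"),
    ("entry1", "length", "1042"),
}

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard,
    like an unbound variable in a SPARQL basic graph pattern."""
    return [(ts, tp, to) for (ts, tp, to) in triples
            if (s is None or ts == s)
            and (p is None or tp == p)
            and (o is None or to == o)]

# Rough analogue of: SELECT ?s WHERE { ?s :database "DDBJ" }
ddbj_entries = sorted(ts for (ts, _, _) in match(p="database", o="DDBJ"))
print(ddbj_entries)  # ['entry1', 'entry3']
```

A real SPARQL engine joins many such patterns and chooses evaluation order carefully; that join planning, plus indexing over billions of tuples, is precisely where the stores compared above differ in performance.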
  • Source
    • "Our alternative system for recording and presenting strain-level annotation will be provided by the respective BioSample databases of the INSDC partner institutes [10-12]. BioSample records provide a single accessioned unit of information relating to a sample that has been assayed using sequencing or other platforms. "
    ABSTRACT: Microbial genome sequence submissions to the International Nucleotide Sequence Database Collaboration (INSDC) have been annotated with organism names that include the strain identifier. Each of these strain-level names has been assigned a unique 'taxid' in the NCBI Taxonomy Database. With the significant growth in genome sequencing, it is not possible to continue the curation of strain-level taxids. In January 2014, NCBI will cease assigning strain-level taxids. Instead, submitters are encouraged to provide strain information and rich metadata with their submission to the sequence database, BioProject and BioSample.
    Standards in Genomic Sciences 06/2014; 9(3):1275-7. DOI:10.4056/sigs.4851102