Rapid DNA barcoding analysis of large datasets using the composition vector method

Department of Biology, The Chinese University of Hong Kong, Hong Kong, PR China.
BMC Bioinformatics (Impact Factor: 3.02). 11/2009; 10(Suppl 14):S8. DOI: 10.1186/1471-2105-10-S14-S8

ABSTRACT Background: Sequence alignment is the rate-limiting step in constructing profile trees for DNA barcoding purposes. We recently demonstrated the feasibility of using unaligned rRNA sequences as barcodes based on a composition vector (CV) approach without sequence alignment (Bioinformatics 22:1690). Here, we further explored the grouping effectiveness of the CV method in large DNA barcode datasets (COI, 18S and 16S rRNA) from a variety of organisms, including birds, fishes, nematodes and crustaceans.
Results: Our results indicate that the grouping of taxa at the genus/species levels based on the CV/NJ approach is invariably consistent with the trees generated by traditional approaches, although in some cases the clustering among higher groups might differ. Furthermore, the CV method is always much faster than the K2P method routinely used in constructing profile trees for DNA barcoding. For instance, the alignment of 754 COI sequences (average length 649 bp) from fishes took more than ten hours to complete, while the whole tree construction process using the CV/NJ method required no more than five minutes on the same computer.
Conclusion: The CV method performs well in grouping DNA barcode sequences when compared with K2P analysis of aligned sequences. It also reduced the time required for analysis by over 15-fold, making it a far superior method for analyzing large datasets. We conclude that the CV method is a fast and reliable method for analyzing large datasets for DNA barcoding purposes.
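The alignment-free idea behind the abstract can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the authors' implementation: each sequence is reduced to a vector of k-mer frequencies and two vectors are compared through their cosine similarity. The function names, the default `k=3`, and the `(1 - cosine) / 2` scaling are illustrative choices, and any background correction the published CV method applies is omitted.

```python
from collections import Counter
from math import sqrt


def kmer_vector(seq, k=3):
    """Relative frequencies of the overlapping k-mers in seq (sparse dict)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def cv_distance(u, v):
    """(1 - cosine similarity) / 2 between two sparse k-mer vectors.

    Assumption: this scaling is an illustrative choice; it gives 0 for
    identical vectors and 0.5 for vectors that share no k-mer.
    """
    dot = sum(x * v.get(kmer, 0.0) for kmer, x in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.5  # an empty vector carries no signal: treat as uncorrelated
    return (1.0 - dot / (norm_u * norm_v)) / 2.0


def distance_matrix(seqs, k=3):
    """Pairwise distances for unaligned sequences; no alignment step needed."""
    vectors = [kmer_vector(s, k) for s in seqs]
    n = len(vectors)
    return [[cv_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
```

A matrix built this way could then be passed to any neighbour-joining implementation to obtain a tree. Each sequence is scanned once to build its vector, which is consistent with the speed advantage the abstract reports over alignment-based analysis.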

  • ABSTRACT: Species identification based on short sequences of DNA markers, i.e., DNA barcoding, has emerged as an integral part of modern taxonomy. However, software for the analysis of large and multi-locus barcoding datasets is scarce. The Basic Local Alignment Search Tool (BLAST) is currently the fastest tool capable of handling large databases (e.g., > 5,000 sequences), but its accuracy is a concern and it has been criticised for its local optimisation. However, more accurate current software requires sequence alignment or complex calculations, which are time-consuming when dealing with large datasets during data preprocessing or during the search stage. Therefore, it is imperative to develop a practical program for both accurate and scalable species identification for DNA barcoding. In this context, we present VIP Barcoding: user-friendly software with a graphical user interface for rapid DNA barcoding. It adopts a hybrid, two-stage algorithm. First, an alignment-free composition vector method is utilised to reduce the search space by screening a reference database. The alignment-based K2P distance nearest neighbour method is then employed to analyse the smaller dataset generated in the first stage. In comparison to other software, we demonstrate that VIP Barcoding has: (1) higher accuracy than Blastn and several alignment-free methods, and (2) higher scalability than alignment-based distance methods and character-based methods. These results suggest that this platform is able to deal with both large-scale and multi-locus barcoding data with accuracy, and can contribute to DNA barcoding for modern taxonomy. VIP Barcoding is free and available at:
    Molecular Ecology Resources 01/2014; · 7.43 Impact Factor
  • ABSTRACT: Bioinformatics has played an important role in the analysis of DNA barcoding data. DNA barcoding initially involves collecting the available data from existing databases. Many databases have been developed in recent years, e.g., MMDBD [Medicinal Materials DNA Barcode Database], BioBarcode, etc. If sequences are not available, sequencing has to be done in vitro, for which the recently developed software ecoPrimers can be helpful. This is followed by multiple sequence alignment. Further, basic sequence statistics computation and phylogenetic analysis can be performed with MEGA and PHYLIP/PAUP, respectively. Some of the recent tools for in silico and statistical analysis specifically designed for barcoding, viz. CAOS (Character Based DNA Barcoding), BRONX (DNA Barcode Sequence Identification Incorporating Taxonomic Hierarchy and within Taxon Variability), Spider (Analysis of species identity and evolution, particularly DNA barcoding), jMOTU and Taxonerator (Turning DNA Barcode Sequences into Annotated OTUs), OTUbase (Analysis of OTU data and taxonomic data), SAP (Statistical Assignment Package), etc., are discussed and analysed in this review. The paper presents a comprehensive overview of the various in silico methods, tools, software and databases used for DNA barcoding of plants.
    Molecular Phylogenetics and Evolution 03/2013; · 4.07 Impact Factor
  • ABSTRACT: We tested the performance of DNA barcoding in Acridoidea and attempted to solve species boundary delimitation problems in selected groups using COI barcodes. Three analysis methods were applied to reconstruct the phylogeny. K2P distances were used to assess the overlap range between intraspecific variation and interspecific divergence. "Best match (BM)", "best close match (BCM)", "all species barcodes (ASB)" and "back-propagation neural networks (BP-based method)" were utilized to test the success rate of species identification. The phylogenetic species concept and network analysis were employed to delimit species boundaries in eight selected species groups. The results demonstrated that the COI barcode region performed better in phylogenetic reconstruction at the genus and species levels than at higher levels, but showed a slight improvement in resolving the higher-level relationships when the third base data or both first and third base data were excluded. Most overlaps and incorrect identifications may be due to imperfect taxonomy, indicating the critical role of taxonomic revision in DNA barcoding studies. Species boundary delimitation confirmed the presence of oversplitting in six species groups and suggested that each group should be treated as a single species.
    PLoS ONE 01/2013; 8(12):e82400. · 3.53 Impact Factor
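
The K2P distance and the hybrid, two-stage search mentioned in the related abstracts above can be sketched as follows. This is a hedged illustration, not the VIP Barcoding code. The K2P function uses the standard Kimura two-parameter formula, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), with P and Q the proportions of transitions and transversions. The names `kmer_jaccard_distance` and `identify` and the `shortlist` parameter are hypothetical, the Jaccard screen is only a cheap stand-in for a composition vector screen, and the reference sequences are assumed to be already aligned to the query.

```python
from math import log, sqrt

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(a, b):
    """Kimura 2-parameter distance between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    term1, term2 = 1 - 2 * p - q, 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        return float("inf")  # too divergent for the formula to be defined
    return -0.5 * log(term1 * sqrt(term2))


def kmer_jaccard_distance(a, b, k=3):
    """Hypothetical alignment-free screen: 1 - Jaccard index of k-mer sets."""
    a, b = a.replace("-", "").upper(), b.replace("-", "").upper()
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = ka | kb
    return 1.0 - len(ka & kb) / len(union) if union else 1.0


def identify(query, references, shortlist=5):
    """Two-stage search over (species, aligned_sequence) pairs.

    Stage 1 ranks all references with the cheap screen and keeps the best
    `shortlist`; stage 2 returns the K2P nearest neighbour among those.
    """
    ranked = sorted(references, key=lambda r: kmer_jaccard_distance(query, r[1]))
    candidates = ranked[:shortlist]
    return min(candidates, key=lambda r: k2p_distance(query, r[1]))[0]
```

The design point is that the alignment-dependent K2P computation runs only on the shortlist, so its cost no longer grows with the size of the full reference database.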
