Rapid DNA barcoding analysis of large datasets using the composition vector method

Department of Biology, The Chinese University of Hong Kong, Hong Kong, PR China.
BMC Bioinformatics (Impact Factor: 2.67). 11/2009; 10(Suppl 14):S8. DOI: 10.1186/1471-2105-10-S14-S8

ABSTRACT Background: Sequence alignment is the rate-limiting step in constructing profile trees for DNA barcoding purposes. We recently demonstrated the feasibility of using unaligned rRNA sequences as barcodes based on a composition vector (CV) approach without sequence alignment (Bioinformatics 22:1690). Here, we further explored the grouping effectiveness of the CV method in large DNA barcode datasets (COI, 18S and 16S rRNA) from a variety of organisms, including birds, fishes, nematodes and crustaceans.
Results: Our results indicate that the grouping of taxa at the genus/species levels based on the CV/NJ approach is invariably consistent with the trees generated by traditional approaches, although in some cases the clustering among higher groups might differ. Furthermore, the CV method is always much faster than the K2P method routinely used in constructing profile trees for DNA barcoding. For instance, the alignment of 754 COI sequences (average length 649 bp) from fishes took more than ten hours to complete, while the whole tree construction process using the CV/NJ method required no more than five minutes on the same computer.
Conclusion: The CV method performs well in grouping effectiveness of DNA barcode sequences, as compared to K2P analysis of aligned sequences. It was also able to reduce the time required for analysis by over 15-fold, making it a far superior method for analyzing large datasets. We conclude that the CV method is a fast and reliable method for analyzing large datasets for DNA barcoding purposes.
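
To make the alignment-free idea concrete, the following is a minimal sketch of a composition-vector distance between two unaligned sequences. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of k = 3, and the handling of ambiguous bases are all hypothetical. It follows the commonly described form of the CV approach, in which each k-mer frequency is compared with a prediction built from the (k-1)-mer and (k-2)-mer frequencies, and the distance is derived from the cosine of the angle between the two resulting vectors.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_freqs(seq, k):
    """Relative frequencies of overlapping k-mers made only of A, C, G, T."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in ALPHABET for base in kmer):
            counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def composition_vector(seq, k=3):
    """Relative deviation of each k-mer frequency from a prediction built
    from the (k-1)-mer and (k-2)-mer frequencies. Requires k >= 3."""
    if k < 3:
        raise ValueError("k must be at least 3")
    seq = seq.upper()
    p_k = kmer_freqs(seq, k)
    p_k1 = kmer_freqs(seq, k - 1)
    p_k2 = kmer_freqs(seq, k - 2)
    vector = {}
    for letters in product(ALPHABET, repeat=k):
        kmer = "".join(letters)
        middle = p_k2.get(kmer[1:-1], 0.0)
        if middle == 0.0:
            continue
        # Predicted frequency: p(prefix) * p(suffix) / p(shared middle).
        predicted = p_k1.get(kmer[:-1], 0.0) * p_k1.get(kmer[1:], 0.0) / middle
        if predicted > 0.0:
            vector[kmer] = (p_k.get(kmer, 0.0) - predicted) / predicted
    return vector


def cv_distance(seq_a, seq_b, k=3):
    """Distance (1 - cosine similarity) / 2 between two composition vectors."""
    a = composition_vector(seq_a, k)
    b = composition_vector(seq_b, k)
    dot = sum(value * b.get(kmer, 0.0) for kmer, value in a.items())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("sequence too short or too uniform for this k")
    return (1.0 - dot / (norm_a * norm_b)) / 2.0
```

Computing `cv_distance` for every pair of sequences yields a distance matrix that can be passed to a neighbour-joining (NJ) routine, which is the role the CV/NJ combination plays in the abstract. No alignment step is needed, because each sequence is summarised independently before the pairwise comparison.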

  • ABSTRACT: Species identification based on short sequences of DNA markers, i.e., DNA barcoding, has emerged as an integral part of modern taxonomy. However, software for the analysis of large and multi-locus barcoding datasets is scarce. The Basic Local Alignment Search Tool (BLAST) is currently the fastest tool capable of handling large databases (e.g., > 5,000 sequences), but its accuracy is a concern, and it has been criticised for its local optimisation. More accurate software, however, requires sequence alignment or complex calculations, which are time-consuming when dealing with large datasets during data preprocessing or during the search stage. Therefore, it is imperative to develop a practical program for both accurate and scalable species identification for DNA barcoding. In this context, we present VIP Barcoding: user-friendly software with a graphical user interface for rapid DNA barcoding. It adopts a hybrid, two-stage algorithm. First, an alignment-free composition vector method is used to reduce the search space by screening a reference database. The alignment-based K2P distance nearest neighbour method is then employed to analyse the smaller dataset generated in the first stage. In comparison to other software, we demonstrate that VIP Barcoding has: (1) higher accuracy than Blastn and several alignment-free methods, and (2) higher scalability than alignment-based distance methods and character-based methods. These results suggest that this platform is able to deal with both large-scale and multi-locus barcoding data with accuracy, and can contribute to DNA barcoding for modern taxonomy. VIP Barcoding is freely available.
    Molecular Ecology Resources 01/2014; · 7.43 Impact Factor
  • ABSTRACT: Nucleotides and amino acids are the basic building units of RNA, DNA and protein. Although studies of how changes in these building blocks affect the phenotypes of these biopolymers are ever increasing, many popular alignment formats are generated by pair-wise comparison tools such as the Basic Local Alignment Search Tool (BLAST). These alignments are user-friendly to researchers but are not convenient for searching, filtering and storage, in particular when thousands of alignments are generated from highly conserved sequences. Here, we introduce a new alignment format, alns, to facilitate rapid and convenient association of genetic changes and similarity with other sources of information such as phenotypes, disease state, time, geography and taxonomy via simple spreadsheet functions. The format should assist biologists from a wide range of disciplines in knowledge discovery.
    International Journal of Data Mining and Bioinformatics 01/2013; 7(2):135-45. · 0.66 Impact Factor
  • ABSTRACT: We tested the performance of DNA barcoding in Acridoidea and attempted to solve species boundary delimitation problems in selected groups using COI barcodes. Three analysis methods were applied to reconstruct the phylogeny. K2P distances were used to assess the overlap range between intraspecific variation and interspecific divergence. "Best match (BM)", "best close match (BCM)", "all species barcodes (ASB)" and "back-propagation neural networks (BP-based method)" were used to test the success rate of species identification. The phylogenetic species concept and network analysis were employed to delimit species boundaries in eight selected species groups. The results demonstrated that the COI barcode region performed better in phylogenetic reconstruction at the genus and species levels than at higher levels, but showed slight improvement in resolving the higher-level relationships when the third base data or both first and third base data were excluded. Most overlaps and incorrect identifications may be due to imperfect taxonomy, indicating the critical role of taxonomic revision in DNA barcoding studies. Species boundary delimitation confirmed the presence of oversplitting in six species groups and suggested that each group should be treated as a single species.
    PLoS ONE 12/2013; 8(12):e82400. · 3.53 Impact Factor
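
The K2P (Kimura two-parameter) distance that recurs in the abstracts above can be sketched as follows. This is a minimal illustration, assuming the two sequences are already aligned and of equal length; the function names and the decision to skip gapped or ambiguous sites are hypothetical choices, not taken from any of the cited software. The formula used is the standard one, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions among the compared sites.

```python
from math import log, sqrt

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """K2P distance between two pre-aligned sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: sites with gaps or ambiguous bases are skipped.
        if x not in "ACGT" or y not in "ACGT":
            continue
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * log(a * sqrt(b))


def nearest_neighbour(query, references):
    """Return (label, distance) for the reference closest to the query.

    `references` maps a label to a sequence aligned with the query.
    """
    return min(
        ((label, k2p_distance(query, seq)) for label, seq in references.items()),
        key=lambda item: item[1],
    )
```

Unlike a composition-vector comparison, this calculation depends on a prior alignment, which is the step the abstracts identify as the bottleneck for large datasets. A two-stage design such as the one described for VIP Barcoding would apply a function like `nearest_neighbour` only to the reduced candidate set left after alignment-free screening.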
