Rapid DNA barcoding analysis of large datasets using the composition vector method

Department of Biology, The Chinese University of Hong Kong, Hong Kong, PR China.
BMC Bioinformatics (Impact Factor: 2.58). 11/2009; 10(Suppl 14):S8. DOI: 10.1186/1471-2105-10-S14-S8
Source: PubMed


Background: Sequence alignment is the rate-limiting step in constructing profile trees for DNA barcoding purposes. We recently demonstrated the feasibility of using unaligned rRNA sequences as barcodes based on a composition vector (CV) approach without sequence alignment (Bioinformatics 22:1690). Here, we further explored the grouping effectiveness of the CV method in large DNA barcode datasets (COI, 18S and 16S rRNA) from a variety of organisms, including birds, fishes, nematodes and crustaceans.
Results: Our results indicate that the grouping of taxa at the genus/species levels based on the CV/NJ approach is invariably consistent with the trees generated by traditional approaches, although in some cases the clustering among higher groups might differ. Furthermore, the CV method is always much faster than the K2P method routinely used in constructing profile trees for DNA barcoding. For instance, the alignment of 754 COI sequences (average length 649 bp) from fishes took more than ten hours to complete, while the whole tree construction process using the CV/NJ method required no more than five minutes on the same computer.
Conclusion: The CV method performs well in grouping effectiveness of DNA barcode sequences, as compared to K2P analysis of aligned sequences. It was also able to reduce the time required for analysis by over 15-fold, making it a far superior method for analyzing large datasets. We conclude that the CV method is a fast and reliable method for analyzing large datasets for DNA barcoding purposes.
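The abstract does not spell out how a composition vector is built. As a rough illustration, the sketch below follows the formulation commonly used in composition vector work: overlapping k-mer frequencies are compared against a prediction made from shorter words, and two vectors are then compared with a cosine-based distance. The function names, the choice of k = 3 and the handling of zero denominators are assumptions made for illustration, not details taken from this paper. The resulting distance matrix would then go to a neighbor-joining routine, which is not shown.

```python
from collections import Counter
from itertools import product
import math


def kmer_freqs(seq, k):
    """Relative frequencies of overlapping k-mers in seq."""
    total = len(seq) - k + 1
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + k] for i in range(total))
    return {word: n / total for word, n in counts.items()}


def composition_vector(seq, k=3):
    """Composition vector over all 4**k DNA words (assumed formulation).

    Each component is (f - f0) / f0, where f is the observed k-mer
    frequency and f0 is the value predicted from (k-1)- and (k-2)-mers.
    Components with f0 == 0 are set to 0.
    """
    if k < 3:
        raise ValueError("k must be at least 3")
    seq = seq.upper()
    f_k, f_k1, f_k2 = (kmer_freqs(seq, j) for j in (k, k - 1, k - 2))
    vector = []
    for word in ("".join(p) for p in product("ACGT", repeat=k)):
        middle = f_k2.get(word[1:-1], 0.0)
        if middle:
            predicted = f_k1.get(word[:-1], 0.0) * f_k1.get(word[1:], 0.0) / middle
        else:
            predicted = 0.0
        observed = f_k.get(word, 0.0)
        vector.append((observed - predicted) / predicted if predicted else 0.0)
    return vector


def cv_distance(a, b):
    """Distance (1 - cosine similarity) / 2 between two composition vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    cosine = dot / norm if norm else 0.0
    return (1.0 - cosine) / 2.0
```

Because no alignment is involved, each sequence is processed once and every pairwise distance is a single vector comparison, which is consistent with the speed advantage the abstract reports.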


Available from: Ka Hou Chu, Oct 04, 2015
  • Source
    • "In conclusion, despite the problems of sampling size [104]–[105] and the criticisms on methodological [106], theoretical [107] and empirical grounds [1], [108]–[110], the prospect of DNA barcoding is still promising if it is based on solid foundations of comprehensive taxonomy. The exploration on new analytical methods [59], [93], [99], [105], [111]–[117] and the use of nuclear genes as additional effective DNA barcodes [60], [103] will certainly promote the progress in DNA barcoding. "
    ABSTRACT: We tested the performance of DNA barcoding in Acridoidea and attempted to solve species boundary delimitation problems in selected groups using COI barcodes. Three analysis methods were applied to reconstruct the phylogeny. K2P distances were used to assess the overlap range between intraspecific variation and interspecific divergence. "Best match (BM)", "best close match (BCM)", "all species barcodes (ASB)" and "back-propagation neural networks (BP-based method)" were utilized to test the success rate of species identification. Phylogenetic species concept and network analysis were employed to delimitate the species boundary in eight selected species groups. The results demonstrated that the COI barcode region performed better in phylogenetic reconstruction at genus and species levels than at higher-levels, but showed a little improvement in resolving the higher-level relationships when the third base data or both first and third base data were excluded. Most overlaps and incorrect identifications may be due to imperfect taxonomy, indicating the critical role of taxonomic revision in DNA barcoding study. Species boundary delimitation confirmed the presence of oversplitting in six species groups and suggested that each group should be treated as a single species.
    PLoS ONE 12/2013; 8(12):e82400. DOI:10.1371/journal.pone.0082400 · 3.23 Impact Factor
  • Source
    • "This approach was taken, because genetic diversity between species is markedly greater than that within species [2]. Numerous computational analysis methods and systems have been introduced for this purpose [3-5]. The use of this system can provide rapid, accurate, cost-effective, and automatable process for species identification. "
    ABSTRACT: DNA barcoding has been widely used in species identification and biodiversity research. A short fragment of the mitochondrial cytochrome c oxidase subunit I (COI) sequence serves as a DNA bio-barcode. We collected DNA barcodes, based on COI sequences from 156 species (529 sequences) of fish, insects, and shellfish. We present results on phylogenetic relationships to assess biodiversity in the Korean peninsula. Average GC% contents of the 68 fish species (46.9%), the 59 shellfish species (38.0%), and the 29 insect species (33.2%) are reported. Using the Kimura 2-parameter model in all possible pairwise comparisons, the average interspecific distances were compared with the average intraspecific distances in fish (3.22 vs. 0.41), insects (2.06 vs. 0.25), and shellfish (3.58 vs. 0.14). Our results confirm that distance-based DNA barcoding provides sufficient information to identify and delineate fish, insect, and shellfish species by means of all possible pairwise comparisons. These results also confirm that the development of an effective molecular barcode identification system is possible. All DNA barcode sequences collected from our study will be useful for the interpretation of species-level identification and community-level patterns in fish, insects, and shellfish in Korea, although at the species level, the rate of correct identification in a diversified environment might be low.
    09/2012; 10(3):206-11. DOI:10.5808/GI.2012.10.3.206
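For readers unfamiliar with the Kimura 2-parameter (K2P) distance mentioned in this abstract, the following is a minimal sketch of the standard formula, d = -1/2 ln((1 - 2P - Q) sqrt(1 - 2Q)), where P and Q are the proportions of transitions and transversions between two aligned sequences. The function name and the choice to skip sites containing gaps or ambiguous bases are assumptions for illustration. The cited study's own pipeline is not described here.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(s1, s2):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Sites where either sequence has a gap or a non-ACGT symbol are skipped
    (an assumed convention). Raises ValueError for unequal lengths or when
    no comparable sites remain; math.log raises ValueError if the sequences
    are too divergent for the formula to be defined.
    """
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(s1.upper(), s2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

Unlike the alignment-free approaches discussed elsewhere on this page, this calculation presupposes that the two sequences have already been aligned.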
  • Source
    • "Consequently, classification based methods such as Support Vector Machine (SVM) have not been considered more effective than the nearest neighbor method. In addition, several studies have applied the alignment free methods on DNA barcode analyses using COI [14] or ITS2 [12] as the barcode markers, even though it is arguable whether these studies are necessary as COI has little indels and less difficulty in sequence alignment. It has not been determined whether or not they can be extended to other markers. "
    ABSTRACT: DNA barcoding technology, which uses a short piece of DNA sequence to identify species, has wide ranges of applications. Until today, a universal DNA barcode marker for plants remains elusive. The rbcL and matK regions have been proposed as the "core barcode" for plants and the ITS2 and psbA-trnH intergenic spacer (PTIGS) regions were later added as supplemental barcodes. The use of PTIGS region as a supplemental barcode has been limited by the lack of computational tools that can handle significant insertions and deletions in the PTIGS sequences. Here, we compared the most commonly used alignment-based and alignment-free methods and developed a web server to allow the biologists to carry out PTIGS-based DNA barcoding analyses. First, we compared several alignment-based methods such as BLAST and those calculating P distance and Edit distance, alignment-free methods Di-Nucleotide Frequency Profile (DNFP) and their combinations. We found that the DNFP and Edit-distance methods increased the identification success rate to ~80%, 20% higher than the most commonly used BLAST method. Second, the combined methods showed overall better success rate and performance. Last, we have developed a web server that allows (1) retrieving various sub-regions and the consensus sequences of PTIGS, (2) annotating novel PTIGS sequences, (3) determining species identity by PTIGS sequences using eight methods, and (4) examining identification efficiency and performance of the eight methods for various taxonomy groups. The Edit distance and the DNFP methods have the highest discrimination powers. Hybrid methods can be used to achieve significant improvement in performance. These methods can be extended to applications using the core barcodes and the other supplemental DNA barcode ITS2. To our knowledge, the web server developed here is the only one that allows species determination based on PTIGS sequences. The web server can be accessed at
    BMC Bioinformatics 11/2011; 12(Suppl 13):S4. DOI:10.1186/1471-2105-12-S13-S4 · 2.58 Impact Factor
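The abstract above names two of the compared measures, a di-nucleotide frequency profile and edit distance, without defining them. The sketch below shows one plausible reading: a 16-component vector of overlapping dinucleotide frequencies compared by Euclidean distance, and the standard Levenshtein edit distance. The function names and the choice of Euclidean distance for the profile are assumptions. The cited study may define its DNFP comparison differently.

```python
from itertools import product
import math

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_profile(seq):
    """Relative frequencies of the 16 overlapping dinucleotides in seq."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:          # skip pairs with gaps or ambiguous bases
            counts[pair] += 1
            total += 1
    return [counts[d] / total if total else 0.0 for d in DINUCLEOTIDES]


def profile_distance(p, q):
    """Euclidean distance between two profiles (an assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or match
        prev = cur
    return prev[-1]
```

Both measures work on unaligned sequences. Edit distance counts insertions and deletions directly, which is relevant to the indel-rich PTIGS region the abstract describes.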