January 2016
Genome Biology and Evolution
Orthologs are widely used for phylogenetic analysis of species; however, identifying genuine orthologs among distantly related species is challenging because genes acquired through horizontal gene transfer (HGT) and out-paralogs derived from gene duplication before speciation are often present among the predicted orthologs. We developed a program, “Ortholog-Finder,” to obtain ortholog datasets for phylogenetic analysis from the complete ORF data of the species under study. The program includes 5 processes for minimizing the effects of HGT and out-paralogs in phylogeny construction: (1) HGT filtering: Genes derived from HGT are detected and deleted from the initial sequence dataset by examining their base compositions. (2) Out-paralog filtering: Out-paralogs are detected and deleted from the dataset based on sequence similarity. (3) Classification of phylogenetic trees: Phylogenetic trees generated for ortholog candidates are classified as monophyletic or polyphyletic. (4) Tree splitting: Polyphyletic trees are bisected to obtain monophyletic trees and to remove HGT-derived genes and out-paralogs. (5) Threshold changing: Out-paralogs are further excluded from the dataset based on the difference between the similarity scores of genuine orthologs and those of out-paralogs. Using simulated data, we examined how out-paralogs and HGT affected the species trees constructed from ortholog datasets obtained by Ortholog-Finder, and we determined the effects of confounding factors. We then used Ortholog-Finder to construct a phylogeny for 12 Gram-positive bacteria from 2 phyla and validated each node of the constructed tree by comparison with individually constructed ortholog trees.
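As an illustration of the idea behind process (1), the following minimal sketch flags genes whose base composition is atypical for their genome. The abstract states only that base compositions are examined, so everything here is an assumption rather than the criterion used by Ortholog-Finder: the choice of GC content as the statistic, the z-score test, the cutoff of 2.0, and the names `gc_content` and `flag_atypical_genes` are all hypothetical.

```python
from __future__ import annotations

from statistics import mean, pstdev


def gc_content(seq: str) -> float:
    """Return the fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def flag_atypical_genes(genes: dict[str, str], z_cutoff: float = 2.0) -> set[str]:
    """Return IDs of genes whose GC content deviates from the genome-wide mean.

    ASSUMPTION: a simple z-score on GC content stands in for the
    base-composition test; the cutoff is illustrative, not from the paper.
    """
    if not genes:
        return set()
    gc = {gene_id: gc_content(seq) for gene_id, seq in genes.items()}
    mu = mean(gc.values())
    sigma = pstdev(gc.values())
    if sigma == 0:
        # All genes share the same composition, so nothing stands out.
        return set()
    return {gene_id for gene_id, value in gc.items() if abs(value - mu) / sigma > z_cutoff}


def remove_flagged(genes: dict[str, str], flagged: set[str]) -> dict[str, str]:
    """Drop the flagged genes from the initial sequence dataset."""
    return {gene_id: seq for gene_id, seq in genes.items() if gene_id not in flagged}
```

In this sketch, `flag_atypical_genes` would be run once per genome on its ORF sequences, and `remove_flagged` would then produce the filtered dataset passed to the later steps. A composition-based filter of this kind can miss transfers between genomes of similar composition, which is consistent with the abstract's description of later steps that also remove HGT-derived genes.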