Question
Asked 16th Oct, 2014

How do you establish the number of discrete gamma categories in ML tree construction?

Is it based on number of groups?

All Answers (8)

17th Oct, 2014
Heiko A Schmidt
University of Vienna
Dear Joao,
no, the number of categories is typically not related to the number of groups or species, but is often left at the software's default, which may not always be the best choice.
This number sets how many classes of sites you allow ranging from faster to slower evolving. The runtime of the programs would typically grow linearly with the number of categories you allow, because for each category a site-likelihood value is computed for each site. Thus, doubling the number of categories would (roughly) double the runtime.
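To make the idea of rate categories concrete, here is a minimal sketch (not taken from any particular program) of how k equal-probability category rates can be derived from a gamma distribution with mean 1 and shape alpha, using the mean rate of each category. The function names, the alpha value of 0.5, and the series/bisection numerics are illustrative assumptions; real packages use their own, more careful routines.

```python
import math


def reg_lower_gamma(a, x, tol=1e-14, max_iter=100000):
    """Regularized lower incomplete gamma P(a, x), from its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while term > tol * total and n < max_iter:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


def gamma_quantile(p, alpha):
    """Rate r with CDF(r) = p for a gamma with shape = rate = alpha (mean 1)."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(alpha, alpha * hi) < p:
        hi *= 2.0  # widen the bracket until it contains the quantile
    for _ in range(200):  # plain bisection
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(alpha, alpha * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discrete_gamma_rates(alpha, k):
    """Mean rate of each of k equal-probability categories (average is 1)."""
    cuts = [0.0] + [gamma_quantile(i / k, alpha) for i in range(1, k)]
    # Share of the total rate that lies below each cut point.
    below = [reg_lower_gamma(alpha + 1.0, alpha * c) for c in cuts] + [1.0]
    return [k * (below[i + 1] - below[i]) for i in range(k)]


def site_likelihood(per_rate_likelihoods):
    """Average one site's likelihood over equally weighted rate categories."""
    return sum(per_rate_likelihoods) / len(per_rate_likelihoods)


# Hypothetical example: alpha = 0.5 (strong rate heterogeneity), 4 categories.
rates = discrete_gamma_rates(alpha=0.5, k=4)
print([round(r, 4) for r in rates])
```

Because `site_likelihood` needs one likelihood per category for every site, going from 4 to 8 categories means roughly twice as many likelihood evaluations, which is where the linear growth in runtime comes from.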
While in the past the default number of categories was often 8, nowadays it is typically 4 in most software I have seen. I guess the reason is simply the speedup by a factor of 2.
While 4 categories might be enough when analyzing a single gene, it is questionable if 4 categories are also enough to reasonably cover the range of evolutionary speeds in a phylogenomic dataset (i.e. having many genes/proteins per species) as the genes can have very different evolutionary constraints.
To address this, so-called partition models are often used, which allow a separate model (and set of parameters) per partition (e.g. a gene or protein). In this case two genes may be described by the same model (e.g. GTR+G4), but the rate parameters and alpha are estimated for each gene separately. Hence, the categories are no longer the only means of reflecting the different evolutionary speeds along the genome.
Phylogenetic software packages which allow for partition models include IQ-Tree, Treefinder, RAxML, and Garli. Of those (to my knowledge) only the first two allow for all three classes of partition models, in which the branch lengths among the partitions are (a) all the same, (b) proportional to each other, or (c) all independent.
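As a rough illustration of what a per-gene partition definition looks like, the sketch below turns a list of gene lengths in a concatenated alignment into a NEXUS sets block with one `charset` per gene and a `charpartition` assigning a model to each. To my understanding this is the general form IQ-Tree accepts for partition files, but please check the manual of the program you use. The gene names, lengths, and the partition name `by_gene` are made up for the example.

```python
def nexus_partition_block(gene_lengths, model="GTR+G4"):
    """Build a NEXUS sets block from (name, length) pairs in alignment order."""
    lines = ["#nexus", "begin sets;"]
    start = 1  # alignment columns are numbered from 1
    for name, length in gene_lengths:
        end = start + length - 1
        lines.append(f"    charset {name} = {start}-{end};")
        start = end + 1
    # Same model name for every gene; parameters are still estimated per charset.
    assignments = ", ".join(f"{model}:{name}" for name, _ in gene_lengths)
    lines.append(f"    charpartition by_gene = {assignments};")
    lines.append("end;")
    return "\n".join(lines)


# Hypothetical genes: 500 and 700 aligned columns, concatenated in this order.
genes = [("geneA", 500), ("geneB", 700)]
block = nexus_partition_block(genes)
print(block)
```

Both genes are assigned GTR+G4 here, yet each charset gets its own rate parameters and alpha, which is the point made above.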
I hope that helped a bit,
Heiko
13 Recommendations
17th Oct, 2014
João Victor Leite Dias
Universidade Federal dos Vales do Jequitinhonha e Mucuri
Thank you very much, Dr Heiko! Your explanation made the concept clear for me!
Best regards
João
31st Dec, 2014
Leandro R. Jones
National Scientific and Technical Research Council
If you have doubts about the effect that the chosen number of categories could have on your results, you might want to perform a "sensitivity analysis". That is, run several independent analyses, each with a different number of categories (say 2, 3, 4, ..., 8), and then check the effect on the resulting trees. Be aware, however, that as pointed out by Heiko, the computational cost increases as you increase the number of rate categories.
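A sweep like this is easy to script. The sketch below only builds and prints one command per category count and does not run anything. It assumes an IQ-Tree 2 style command line (`-s` for the alignment, `-m` for the model, `--prefix` for output names, and model strings of the form GTR+G4); the executable name `iqtree2`, the file name `alignment.phy`, and the `sweep_G` prefixes are placeholders to adapt to your own setup.

```python
import shlex


def sweep_commands(alignment, categories, program="iqtree2", base_model="GTR"):
    """Return one command (as an argument list) per number of gamma categories."""
    commands = []
    for k in categories:
        model = f"{base_model}+G{k}"  # e.g. GTR+G4 for four categories
        commands.append(
            [program, "-s", alignment, "-m", model, "--prefix", f"sweep_G{k}"]
        )
    return commands


# Hypothetical sweep over 2 to 8 categories; commands are printed, not executed.
for cmd in sweep_commands("alignment.phy", range(2, 9)):
    print(shlex.join(cmd))
```

Giving each run its own prefix keeps the output trees side by side, so they can be compared once all runs have finished.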
1 Recommendation
6th Jan, 2015
Leandro R. Jones
National Scientific and Technical Research Council
João, another thing that you could consider is checking for the significance of likelihood values. As you add categories, it is expected for the likelihood to increase. To see up to which point these increments are significant you may use, for example, the approximately unbiased and/or the non-scaled bootstrap probability tests implemented in CONSEL (Bioinformatics 17: 1246–1247).
1 Recommendation
6th Jan, 2015
João Victor Leite Dias
Universidade Federal dos Vales do Jequitinhonha e Mucuri
Thanks a lot Dr Jones! Your answers are very useful!!! 
I'll consider all of them.
Best regards
7th Jan, 2015
Brian Thomas Foley
Los Alamos National Laboratory
A 2007 paper by Karl Kjer and Rodney Honeycutt does a nice job of explaining the use of partitions when analyzing the complete mitochondrial genomes of mammals.  The data is also easily available, thanks to an archive of data sets being maintained by Robert Lanfear. 
I am attaching a JPG image of a tree from that data using PhyML and no data partitioning.  It illustrates that simpler methods give the same branching order (primates are still one clade for example), but that the ratio of the distance from the root of the mammals to the root of the old world primates, vs the distance from the Pan/Homo/Gorilla common ancestor is very different.  
The issue in this data set is that the mitochondrial genomes of the mammals plus the marsupial outgroup are a bit beyond saturation of silent sites with mutations, so the distances deep in the tree are underestimated if all sites are treated more equally.
A "quick and dirty" neighbor-joining tree also gives similar results as far as getting most of the major subclades within the mammals grouped together. 
4 Recommendations
7th Jan, 2015
João Victor Leite Dias
Universidade Federal dos Vales do Jequitinhonha e Mucuri
Thank you Dr Foley! Your contribution is very significant.
9th Jan, 2015
Brian Thomas Foley
Los Alamos National Laboratory
Here is another excellent study you might want to look at:
A molecular phylogeny of living primates. Perelman P, Johnson WE, Roos C, Seuánez HN, Horvath JE, Moreira MA, Kessing B, Pontius J, Roelke M, Rumpler Y, Schneider MP, Silva A, O'Brien SJ, Pecon-Slattery J. PLoS Genet. 2011 Mar;7(3):e1001342. doi: 10.1371/journal.pgen.1001342. Epub 2011 Mar 17. PMID: 21436896
The final, post-GBLOCK, edited, annotated PAUP* nexus alignment of the 54 concatenated genes used for this study is publicly available.
2 Recommendations
