Question
Asked 16th Oct, 2014

How do you establish the number of discrete gamma categories in ML tree construction?

Is it based on number of groups?

Most recent answer

9th Jan, 2015
Brian Thomas Foley
Los Alamos National Laboratory
Here is another excellent study you might want to look at:
A molecular phylogeny of living primates. Perelman P, Johnson WE, Roos C, Seuánez HN, Horvath JE, Moreira MA, Kessing B, Pontius J, Roelke M, Rumpler Y, Schneider MP, Silva A, O'Brien SJ, Pecon-Slattery J. PLoS Genet. 2011 Mar;7(3):e1001342. doi: 10.1371/journal.pgen.1001342. Epub 2011 Mar 17. PMID: 21436896
The final, post-GBLOCK, edited, annotated PAUP* nexus alignment of the 54 concatenated genes used for this study is publicly available.
2 Recommendations

Popular Answers (1)

17th Oct, 2014
Heiko A Schmidt
University of Vienna
Dear Joao,
no, the number of categories is typically not related to the number of groups or species, but is often left at the software's default, which may not always be the best choice.
This number sets how many classes of sites you allow ranging from faster to slower evolving. The runtime of the programs would typically grow linearly with the number of categories you allow, because for each category a site-likelihood value is computed for each site. Thus, doubling the number of categories would (roughly) double the runtime.
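To make the idea of rate classes concrete, here is a minimal Python sketch of the mean-rate discrete gamma discretization usually attributed to Yang (1994), assuming a gamma distribution with shape alpha and mean 1. The function names are hypothetical and chosen for illustration only; actual programs may differ in detail (for example by using category medians instead of means).

```python
import math

def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma P(a, x), computed from its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while abs(term) > 1e-16 * abs(total) and n < 10000:
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))

def gamma_quantile(p, alpha):
    """Quantile of Gamma(shape=alpha, rate=alpha), i.e. mean 1, found by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(alpha, alpha * hi) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(alpha, alpha * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def discrete_gamma_rates(alpha, k):
    """Mean relative rate of each of k equally probable gamma categories."""
    cuts = [gamma_quantile(i / k, alpha) for i in range(1, k)]
    # The mean inside a category follows from the incomplete gamma with shape alpha + 1.
    cdf = [0.0] + [reg_lower_gamma(alpha + 1.0, alpha * c) for c in cuts] + [1.0]
    return [k * (cdf[i + 1] - cdf[i]) for i in range(k)]

if __name__ == "__main__":
    for k in (4, 8):
        print(k, [round(r, 4) for r in discrete_gamma_rates(0.5, k)])
```

Each site's likelihood is then averaged over the k rates, which is why the work per site scales with k.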
While in the past the default number of categories was often 8, nowadays it is typically 4 in most software I have seen. I guess the reason is simply the speedup by a factor of 2.
While 4 categories might be enough when analyzing a single gene, it is questionable if 4 categories are also enough to reasonably cover the range of evolutionary speeds in a phylogenomic dataset (i.e. having many genes/proteins per species) as the genes can have very different evolutionary constraints.
To solve this, so-called partition models are often used, which allow a separate model (and set of parameters) per partition (e.g. a gene or protein). In this case two genes may be modeled by the same model (e.g. GTR+G4), but the rate parameters and alpha are estimated for each gene separately. Hence, you do not need the categories as the only means to reflect all the different evolutionary speeds along the genome.
Phylogenetic software packages which allow for partition models are, e.g., IQ-Tree, Treefinder, RAxML, or Garli. From those (to my knowledge) only the first two allow for all three classes of partition models (branch lengths among the partitions are (a) all the same, (b) proportional to each other and (c) are all independent).
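As a rough illustration of how the three branch-length schemes (a) to (c) differ, the following Python sketch counts only the free branch-length parameters, assuming an unrooted binary tree with 2n - 3 branches. The function and scheme names are hypothetical and do not correspond to options of any of the programs mentioned.

```python
def branch_length_parameters(n_taxa, n_partitions, scheme):
    """Count free branch-length parameters under a partition scheme."""
    branches = 2 * n_taxa - 3  # branches of an unrooted binary tree
    if scheme == "linked":        # (a) one shared set of branch lengths
        return branches
    if scheme == "proportional":  # (b) shared lengths plus one rate scaler per
        return branches + (n_partitions - 1)  # partition, one of them fixed
    if scheme == "unlinked":      # (c) an independent set per partition
        return branches * n_partitions
    raise ValueError("unknown scheme: " + scheme)

if __name__ == "__main__":
    for scheme in ("linked", "proportional", "unlinked"):
        print(scheme, branch_length_parameters(10, 3, scheme))
    # linked 17, proportional 19, unlinked 51
```

The count grows quickly for scheme (c), which is one reason the proportional scheme is an attractive compromise for datasets with many genes.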
I hope that helped a bit,
Heiko
8 Recommendations

All Answers (8)

17th Oct, 2014
João Victor Leite Dias
Universidade Federal dos Vales do Jequitinhonha e Mucuri
Thank you very much, Dr Heiko! Your explanation made the concept clear for me!
Best regards
João
31st Dec, 2014
Leandro R. Jones
National Scientific and Technical Research Council
If you have doubts about the effect that the chosen number of categories could have on your results, you might want to perform a "sensitivity analysis". That is, run several independent analyses, each with a different number of categories (say 2, 3, 4, ..., 8), and then check the effect on the resulting trees. Be aware, however, that, as pointed out by Heiko, the computational cost increases as you increase the number of rate categories.
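One way to "check the effect on the trees" is to compare topologies with the Robinson-Foulds distance. Below is a minimal Python sketch, assuming the trees are already available as nested tuples of taxon names; the two example trees are made up for illustration and are not results from any real analysis.

```python
def splits(tree, all_taxa):
    """Collect the nontrivial bipartitions of a tree given as nested tuples."""
    ref = min(all_taxa)  # reference taxon used to pick a canonical side
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        s = frozenset().union(*(leaves(child) for child in node))
        if 1 < len(s) < len(all_taxa) - 1:
            found.add(s if ref not in s else all_taxa - s)
        return s

    leaves(tree)
    return found

def robinson_foulds(tree_a, tree_b, all_taxa):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree_a, all_taxa) ^ splits(tree_b, all_taxa))

if __name__ == "__main__":
    taxa = frozenset("ABCDE")
    # Hypothetical ML trees obtained with 4 and with 8 rate categories.
    tree_g4 = ((("A", "B"), "C"), ("D", "E"))
    tree_g8 = ((("A", "C"), "B"), ("D", "E"))
    print(robinson_foulds(tree_g4, tree_g8, taxa))  # 2
    print(robinson_foulds(tree_g4, tree_g4, taxa))  # 0
```

A distance of 0 across all category counts would suggest the topology is insensitive to this setting; branch lengths would still need to be compared separately.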
1 Recommendation
6th Jan, 2015
Leandro R. Jones
National Scientific and Technical Research Council
João, another thing that you could consider is checking the significance of the likelihood values. As you add categories, the likelihood is expected to increase. To see up to which point these increments are significant you may use, for example, the approximately unbiased and/or the non-scaled bootstrap probability tests implemented in CONSEL (Bioinformatics 17: 1246–1247).
1 Recommendation
6th Jan, 2015
João Victor Leite Dias
Universidade Federal dos Vales do Jequitinhonha e Mucuri
Thanks a lot Dr Jones! Your answers are very useful!!! 
I'll consider all of them.
Best regards
7th Jan, 2015
Brian Thomas Foley
Los Alamos National Laboratory
A 2007 paper by Karl Kjer and Rodney Honeycutt does a nice job of explaining the use of partitions when analyzing the complete mitochondrial genomes of mammals.  The data is also easily available, thanks to an archive of data sets being maintained by Robert Lanfear. 
I am attaching a JPG image of a tree from that data using PhyML and no data partitioning.  It illustrates that simpler methods give the same branching order (primates are still one clade for example), but that the ratio of the distance from the root of the mammals to the root of the old world primates, vs the distance from the Pan/Homo/Gorilla common ancestor is very different.  
The issue in this data set is that the mitochondrial genomes of the mammals plus the marsupial outgroup are a bit beyond saturation of silent sites with mutations, so the distances deep in the tree are underestimated if all sites are treated more equally.
A "quick and dirty" neighbor-joining tree also gives similar results as far as getting most of the major subclades within the mammals grouped together. 
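The underestimation of deep distances mentioned above can be illustrated with a deliberately simple case: the Jukes-Cantor distance with and without gamma-distributed rates across sites. This is only a sketch under the Jukes-Cantor model, not the model used for the tree described here, and the function name is hypothetical.

```python
import math

def jc_distance(p, alpha=None):
    """Jukes-Cantor distance from a proportion p of differing sites.

    With alpha=None all sites share one rate; otherwise rates across sites
    are assumed to follow a gamma distribution with shape alpha.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must lie in [0, 0.75)")
    base = 1.0 - 4.0 * p / 3.0
    if alpha is None:
        return -0.75 * math.log(base)
    return 0.75 * alpha * (base ** (-1.0 / alpha) - 1.0)

if __name__ == "__main__":
    for p in (0.1, 0.3, 0.5):
        print(p, round(jc_distance(p), 3), round(jc_distance(p, alpha=0.5), 3))
```

For small p the two estimates are close, but as p approaches saturation the equal-rates estimate falls far below the gamma-corrected one, which is the pattern seen in the deep branches.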
4 Recommendations
7th Jan, 2015
João Victor Leite Dias
Universidade Federal dos Vales do Jequitinhonha e Mucuri
Thank you Dr Foley! Your contribution is very significant.