IPB University
Question
Asked 9 January 2013
MEGA: How do I find and use a best-fit substitution model?
I know very little about building trees, but I want to build a phylogenetic tree of an alignment of a family of cellulase genes. The tree will be used to help me search for evidence of positive selection. I'm using MEGA (5.10) and I think I figured out how to determine the best substitution model to use; Models -> Find Best DNA/Protein Models (ML). Now I want to build a tree (neighbor joining) using the results of the find best model. However, I cannot figure out how to apply the results from the model for building the tree. I'm assuming the model with the lowest BIC score at the top of the list is the one I want to use.
Popular answers (1)
Los Alamos National Laboratory
Detecting positive selection of codons in a gene or amino acids in a protein is quite tricky. What we really do is measure the relative rate of evolution of the codons in a gene alignment. Some sites evolve faster than others, but whether the fast sites are being driven to change, or they are just under less pressure to remain invariant over time, is not always clear. Lack of negative pressure can look the same as positive pressure.
There is a really nice tool called the ConSurf server (http://consurf.tau.ac.il/overview.html). You give it a protein's 3D structure file from the PDB, a multiple sequence alignment (amino acids, so translate your genes to AA) and a tree, and it returns the 3D structure colored by rate of evolution. There are currently 68 eukaryote and 155 bacterial cellulases in the PDB (http://www.rcsb.org/pdb/results/results.do?qrid=BBC30436), so you need to choose the one that is most central to the family you are studying.
For most enzymes, the observation is that the catalytic core amino acids remain invariant over time, and the sites far from the core are allowed to vary because they have little impact on the core.
It can also be useful to check the distribution of rates among the sites. Is there a bell curve from invariant to slightly variable to highly variable? Most enzymes have a large peak of invariant sites plus a bell curve of moderately variable sites, plus a few sites that have an enormous rate of evolution. Those very rapidly changing sites are usually not under positive selection but instead are just flipping around between similar amino acids (Leu, Val, Ile for example).
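For a quick first look at how variability is spread across sites, before going to ConSurf, one rough option is the per-column Shannon entropy of the amino acid alignment. The sketch below is only a crude proxy: it is not the rate estimate that ConSurf computes, it ignores the tree entirely, and the toy alignment is made up for illustration.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps and unknowns."""
    residues = [c for c in column if c not in "-?X"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def site_entropies(alignment):
    """Entropy of every column in a list of equal-length aligned sequences."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all sequences must have the same aligned length")
    return [column_entropy(col) for col in zip(*alignment)]


# Hypothetical toy protein alignment, not real cellulase data.
toy_alignment = [
    "MKLVE",
    "MKIVD",
    "MKVVE",
    "MKLVQ",
]

entropies = site_entropies(toy_alignment)
for position, h in enumerate(entropies, start=1):
    print(f"site {position}: {h:.2f} bits")
```

A histogram of these values gives a first impression of whether the alignment has the large invariant peak described above. Note that a column flipping between Leu, Val and Ile scores as highly variable here, which is exactly the caveat raised in the answer.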
17 Recommendations
All Answers (32)
Bolu Abant İzzet Baysal University
You are right. See also the caption, in which all of the columns and abbreviations are explained. You should focus on the first five columns: Model, Parameters, AIC, BIC and lnL. The model with the lowest AIC score is considered to describe the substitution pattern best. Similarly, as you said, a lower BIC is better. lnL stands for log-likelihood, and higher is better, although it always increases as parameters are added, which is why AIC and BIC penalize the number of parameters. Good luck :)
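To make the relationship between these columns concrete, here is a minimal sketch using the textbook definitions AIC = 2k - 2lnL and BIC = k ln(n) - 2lnL. The log-likelihoods, parameter counts and alignment length are hypothetical, and MEGA's table may use a corrected AIC and its own definition of sample size, so do not expect the numbers to match its output exactly.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian information criterion: lower is better."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical results for 10 taxa (17 branch lengths) and 1200 aligned sites.
# Each entry is (lnL, number of free parameters including branch lengths).
candidates = {
    "JC": (-5200.0, 17),
    "HKY+G": (-5050.0, 22),
    "GTR+G": (-5040.0, 26),
}
n_sites = 1200

scores = {
    name: (aic(lnl, k), bic(lnl, k, n_sites))
    for name, (lnl, k) in candidates.items()
}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])

for name, (a, b) in scores.items():
    print(f"{name:6s} AIC={a:10.2f} BIC={b:10.2f}")
print("best by AIC:", best_by_aic)
print("best by BIC:", best_by_bic)
```

With these made-up numbers the two criteria disagree: GTR+G has the highest lnL and the lowest AIC, while BIC, which penalizes extra parameters more heavily, prefers HKY+G.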
1 Recommendation
University of Pennsylvania
Thanks for responding! How do I apply the substitution model to the settings for building a tree in MEGA? For example, the best model was GTR+G, but there's no GTR+G option when choosing a substitution model to use in the tree building settings.
Molsys Private Limited
Did you try the PAML package?
My guess is that its CodeML program would be a good fit for the type of analysis you are doing.
Bolu Abant İzzet Baysal University
Dear Caleb Radens, you probably selected the NJ tree option, but the GTR model is not available for NJ. Instead, you should go with ML and then select General Time Reversible.
For Rates and Patterns, choose Gamma Distributed. At the beginning you can try 100 bootstrap replicates to see your tree quickly. Good luck
3 Recommendations

Hi Caleb,
GTR+G simply means you have to note down the Gamma parameter that the model test gives you (it should be in the big results table) and use it when setting up the tree building process. That's what the "+G" stands for, simply put. The same goes if you ever get a "+I" model: note down the proportion of invariant sites and enter that value in the corresponding field at the setup stage. And yes, like Buhara says, you have to go with Maximum Likelihood (ML) for that. Hope it helps!
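To see what the Gamma parameter does, the sketch below approximates the relative rates of the equal-probability categories used by a discrete gamma model. It is a simulation for illustration only: tree-building programs typically compute these category rates analytically, and the shape value 0.5, the four categories and the function name are assumptions, not values from this thread.

```python
import random


def discrete_gamma_rates(alpha, n_categories=4, n_samples=100_000, seed=1):
    """Approximate mean relative rate of each equal-probability gamma category."""
    rng = random.Random(seed)
    # A gamma distribution with shape alpha and scale 1/alpha has mean 1.
    samples = sorted(
        rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_samples)
    )
    size = n_samples // n_categories
    rates = [
        sum(samples[i * size:(i + 1) * size]) / size
        for i in range(n_categories)
    ]
    # Rescale so that the average rate over categories is exactly 1.
    mean_rate = sum(rates) / n_categories
    return [r / mean_rate for r in rates]


rates = discrete_gamma_rates(alpha=0.5)
print([round(r, 3) for r in rates])
```

A small shape parameter (below 1) means most sites change very slowly while a few change quickly. A large one means the sites evolve at similar rates.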
14 Recommendations
University of Córdoba
Hi Caleb, the best option is to use jModelTest to find the exact DNA substitution model under different criteria. Software link: https://code.google.com/p/jmodeltest2/
3 Recommendations
Center for Research and Advanced Studies of the National Polytechnic Institute
Hi Caleb, in MEGA, use the Maximum Likelihood option to build the tree. There you will find the GTR+G option. You can also try building the tree with the PhyML package. And for the natural selection analysis I recommend PAML. Best!
3 Recommendations
University of Pennsylvania
Thank you everyone! To search for positive selection I am using HYPHY (http://www.datamonkey.org/dataupload.php), but I will try the PAML package too, thanks for the suggestion.
2 Recommendations
LGC Biosearch Technologies
The best way to do it is to use Modeltest (Posada and Crandall 1998). I believe it is called jmodeltest now. It will test 56 different models.
3 Recommendations
Quaid-i-Azam University
The approach you used to get the best-fit substitution model is fine, but in MEGA you cannot generate a neighbor-joining tree on the basis of these models. The best-fit models are meant for generating a maximum likelihood tree rather than a neighbor-joining tree.
7 Recommendations
University of Copenhagen
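Neighbor joining is a distance-based method: it builds the tree from a matrix of pairwise distances, so a substitution model can only enter as a correction applied to those distances. To use the full substitution model chosen by the model test, you need to use maximum likelihood (ML) or Bayesian methods.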
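To illustrate what "distance-based" means in practice, here is a minimal sketch of the kind of number neighbor joining starts from: the proportion of differing sites between two aligned sequences (p-distance) and its Jukes-Cantor correction, d = -3/4 ln(1 - 4p/3). Jukes-Cantor is used only because it is the simplest correction. The two sequences are made up.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping gaps and ambiguous bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance for an observed p-distance."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical aligned sequences with 2 differences in 10 sites.
seq_1 = "ACGTACGTAC"
seq_2 = "ACGTTCGTAA"

p = p_distance(seq_1, seq_2)
d = jukes_cantor_distance(p)
print(f"p-distance = {p:.3f}, Jukes-Cantor distance = {d:.4f}")
```

The corrected distance is larger than the observed one because the correction accounts for multiple substitutions at the same site. A matrix of such pairwise values, rather than the alignment itself, is what the neighbor-joining algorithm works on.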
4 Recommendations
Miami University
Hi Caleb,
I agree with Diego that Modeltest, or jModelTest, is one of the best programs out there that I've used. Here's what I do: I download the nt sequences, translate them and align the protein sequences using my favourite alignment tool, then check by visual inspection (you can do this in MEGA). Then I reverse translate them back to nt. Now I have an alignment that makes biological sense (in-frame codons). Next, I take that alignment and use MrModeltest and PAUP*4.0b10 to pick my model, and then I use MrBayes to do the actual tree building. After the tree is built, I import the Newick tree file into MEGA, arrange it how I like, export the EPS into Illustrator and finish the tree for publication.
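The reverse-translation step in that workflow can be sketched as follows: each residue in the gapped protein alignment is replaced by its original codon, and each gap by three gap characters. This is a minimal illustration with a hypothetical function name and made-up sequences. It assumes each coding sequence is in frame and is the exact source of its protein, and it does not check the translation.

```python
def thread_codons(aligned_protein, nucleotide_cds):
    """Map an unaligned CDS onto its gapped protein alignment, codon by codon."""
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(nucleotide_cds) < 3 * n_residues:
        raise ValueError("CDS is too short for the aligned protein")
    codons = []
    position = 0
    for aa in aligned_protein:
        if aa == "-":
            # A gap in the protein alignment becomes a three-base gap.
            codons.append("---")
        else:
            codons.append(nucleotide_cds[position:position + 3])
            position += 3
    return "".join(codons)


# Hypothetical example: two proteins aligned with one gap.
protein_alignment = {"seqA": "MK-L", "seqB": "MKAL"}
coding_sequences = {"seqA": "ATGAAACTG", "seqB": "ATGAAAGCTCTG"}

codon_alignment = {
    name: thread_codons(protein_alignment[name], coding_sequences[name])
    for name in protein_alignment
}
for name, seq in codon_alignment.items():
    print(name, seq)
```

Because gaps are always inserted in multiples of three, the resulting nucleotide alignment keeps every codon in frame, which is what codon-based selection tests require.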
5 Recommendations
Universidade da Coruña
Hi Caleb,
Since I'm one of the authors of jModelTest 2 I highly recommend it, as you might expect ;) For our last paper we validated the tool with 40,000 simulations and got nearly 90% accuracy in finding the exact generating model and over 99% in finding the rate variation. Moreover, we found that BIC outperformed AIC by far in this regard, so keep this in mind if you finally decide to use jModelTest.
3 Recommendations

TOPALI v2 is also a nice software package for evolutionary analysis: http://www.topali.org/
"The extended TOPALi v2 provides phylogenetic model selection, Bayesian analysis (BA) and Maximum Likelihood (ML) phylogenetic tree estimation, detection of sites under positive selection, and recombination breakpoint location analysis."
5 Recommendations
King's College London
As there are a lot of informative replies from RG users to the original question, I would like to ask the following question here.
I am trying to do a phylogenetic analysis of up to 100 full-length viral genomes (each 34,000 to 36,000 bp). However, I am unable to align them and generate trees in MEGA5, probably because of the large size of the genomes. Whenever I run an alignment in MEGA5 with either ClustalW or MUSCLE, I get a "Not Enough Memory" error after about 15 minutes. The same error appears when running jModelTest 2 or the MEGA5 nucleotide substitution model analysis. Is there any way to overcome this issue? If it is due to a lack of computational resources, what are the alternatives? Are there any online servers that can perform multiple sequence alignment, model estimation and tree construction for larger datasets? I tried the TOPALi package mentioned above, but it didn't work either. My computer specs are: Intel Core i5 2.4 GHz, 500 GB hard disk and 6 GB RAM.
1 Recommendation
Ontario Institute for Cancer Research
Dear Azeem,
You may try MUSCLE from the command line on Linux; it will be less limited than through a graphical front end.
You may find a lot of information here: http://www.drive5.com/muscle/muscle.html.
e.g.
### tell the path and the input - output file
### ~/path/to/muscle3.8.31_<processor type> -in <input.file> -out <output.file>
For multiple alignment of nucleic acid and protein sequences there is the clustal web server: http://www.clustal.org/
For testing positive selection on multiple sequences you may try the datamonkey.org web server with the FUBAR algorithm.
For phylogenetic analyses in R (less computationally demanding): http://cran.r-project.org/web/views/Phylogenetics.html.
Hope it helps.
Fabien
1 Recommendation
Institue of Agrochemistry and Food Technology - CSIC
Faezeh Kharazyan: save your FASTA alignment in MEGA format (.meg), open that file, and click on it (the T...A... icon) to see your alignment. Select the V icon and all the variable sites of your alignment will be highlighted. At the bottom of the window you will then see the number of variable sites out of the total number of nucleotide positions.
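If you want to cross-check the count outside MEGA, the same idea can be sketched in a few lines: a site is counted as variable when at least two different bases occur in its column. The toy alignment is made up, and gaps and ambiguous characters are simply ignored here, which may differ from MEGA's own gap-handling settings.

```python
def count_variable_sites(alignment):
    """Return (number of variable sites, total number of sites)."""
    n_sites = len(alignment[0])
    if any(len(seq) != n_sites for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    variable = 0
    for column in zip(*alignment):
        # Collect the distinct unambiguous bases in this column.
        states = {c.upper() for c in column if c.upper() in "ACGT"}
        if len(states) > 1:
            variable += 1
    return variable, n_sites


# Hypothetical toy nucleotide alignment.
toy_alignment = [
    "ACGTAC",
    "ACGTTC",
    "ACCTAC",
    "AC-TAC",
]

n_variable, n_total = count_variable_sites(toy_alignment)
print(f"{n_variable} variable sites out of {n_total}")
```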
2 Recommendations
University of Veterinary and Animal Sciences
Hi everyone,
I am using the PAML package for model selection, but I could not run the program. Every setting looks OK and I have checked everything, but the program just starts and then closes, giving error 15313792. Where is the problem? Can anyone help me, or suggest another tool for detecting positive selection?
Thank you
Sikkim Professional University
Hi Caleb, I am stuck at exactly the same point. Did you find an answer to your relatively old query? I need to make a phylogenetic tree with the neighbor-joining method in MEGA7 (latest version). When I calculated the best-fit substitution model, I got a table listing the models for my data. However, the model with the lowest BIC score, at the top of the table, is not among the options for the NJ method. There is another model in that table, Dayhoff+G, in 7th place from the top. The Dayhoff model is in the drop-down list of substitution models for the NJ method, but +G is missing. Can I still use this Dayhoff+G model? Please help.
Italian National Research Council
As far as I know, some authors use several algorithms implemented in different software packages to cross-validate a phylogenetic reconstruction starting from the same multiple sequence alignment. For example, one can use MrBayes (for Bayesian analysis) and PhyML (or others, for the ML approach) and then compare the results. I usually keep the same substitution model while changing the software. In my experience, the same substitution model is selected even when the software changes. In your case, the uncertainty in model selection may be due to incorrectly aligned parts of the alignment, so I would try to improve the alignment first, for example by changing the BLOSUM matrix or by discarding poorly aligned regions (take a look at the T-Coffee manual page, it helped me a lot). Then check again whether different methods still find different models; if so, you might try the approach you described above.
1 Recommendation
TCG-CREST
I have found the best substitution model in MEGA7 and am making a phylogenetic tree using Maximum Likelihood. The best model suggested by the BIC/AIC scores was GTR+G+I. When setting the parameters for ML, only GTR is offered. How do we set the 'G' and 'I' parameters? Any suggestions?
Center for Data Driven Discovery in Biomedicine at Children's Hospital of Philadelphia
Dear Anusha Rai
You can choose the rates under the tab: Rates and Patterns/Rates among Sites/Gamma Distributed With Invariant Sites (G+I).
Best,
Toni
TCG-CREST
Antonia Chroni thank you so much! will try out. Best wishes.
IPB University
Dear Antonia Chroni
How do I determine the number or value of G and I that I should enter in the tab? The best model suggested for my data is also GTR+G+I. Thank you.
Center for Data Driven Discovery in Biomedicine at Children's Hospital of Philadelphia
Dear Taufan Sulaeman
You can either use the default number of categories or run several analyses with a different number each time, e.g., 4, 8, etc. That should let you see whether the setting has any effect on the tree you obtain, and so which setting fits your data better.
Best,
Toni
IPB University
Dear Antonia Chroni
Ah I see, I'll try. Thank You!