Phylogenetic Analysis - Science method

Phylogenetic Analysis is a topic for exchanging knowledge in the fields of molecular systematics and phylogenetic reconstruction, and their application to systematics, biogeography and evolutionary studies.
Questions related to Phylogenetic Analysis
  • asked a question related to Phylogenetic Analysis
Question
1 answer
I've run it three times to try different parameters, but it stops changing after running for a short while. Please help me understand how to fix this. Is there a problem with the sequence or the parameters?
6471000 -- (-192324.911) [-190903.882] (-191836.670) (-203192.846) * (-192291.336) (-195396.079) (-197058.880) [-190667.824] -- 36:24:32
6472000 -- (-192324.911) [-190903.882] (-191836.670) (-203199.097) * (-192291.336) (-195396.079) (-197058.880) [-190667.824] -- 36:23:56
6473000 -- (-192324.911) [-190903.882] (-191836.670) (-203206.485) * (-192291.336) (-195396.079) (-197058.880) [-190667.824] -- 36:23:22
6474000 -- (-192324.911) [-190903.882] (-191836.670) (-203186.976) * (-192291.336) (-195396.079) (-197058.880) [-190667.824] -- 36:22:47
6475000 -- (-192324.911) [-190903.882] (-191836.670) (-203176.349) * (-192291.336) (-195396.079) (-197058.880) [-190667.824] -- 36:22:12
Average standard deviation of split frequencies: 0.143548
6476000 -- (-192324.911) [-190903.882] (-191836.670) (-203174.778) * (-192291.336) (-195396.079) (-197058.880) [-190667.824] -- 36:21:37
6477000 -- (-192324.911) [-190903.882] (-191836.670) (-203190.326) * (-192291.336) (-195396.079) (-197058.880) [-190667.824] -- 36:21:02
6478000 -- (-192324.911) [-190903.882] (-191836.670) (-203201.748) * (-192291.336) (-195396.079) (-197058.880) [-190667.824] -- 36:20:27
6479000 -- (-192324.911) [-190903.882] (-191836.670) (-203196.704) * (-192291.336) (-195396.079) (-197058.880) [-190667.824] -- 36:19:52
6480000 -- (-192324.911) [-190903.882] (-191836.670) (-203173.040) * (-192291.336) (-195396.079) (-197058.880) [-190667.824] -- 36:19:18
Average standard deviation of split frequencies: 0.143548
6481000 -- (-192324.911) [-190903.882] (-191836.670) (-203178.787) * (-192291.336) (-195396.079) (-197058.880) [-190667.824] -- 36:18:43
6482000 -- (-192324.911) [-190903.882] (-191836.670) (-203151.441) * (-192291.336) (-195396.079) (-197058.880) [-190667.824] -- 36:18:08
6483000 -- (-192324.911) [-190903.882] (-191836.670) (-203159.321) * (-192291.336) (-195396.079) (-197058.880) [-190667.824] -- 36:17:33
6484000 -- (-192324.911) [-190903.882] (-191836.670) (-203156.963) * (-192291.336) (-195396.079) (-197058.880) [-190667.824] -- 36:16:58
6485000 -- (-192324.911) [-190903.882] (-191836.670) (-203142.242) * (-192291.336) (-195396.079) (-197058.880) [-190667.824] -- 36:16:24
Average standard deviation of split frequencies: 0.143548
Relevant answer
Answer
If the average standard deviation of split frequencies is not going down, then the runs are failing to converge, which means that there is too little phylogenetic information in the sequence alignment.
Note that the likelihood scores have not changed either over the 11,000 generations you show here.
Six million plus generations should result in multiple, progressively lower plateaus of better fit, but this does not seem to be happening.
If you wish, you could send me the alignment and I could assess whether the data are likely to be informative.
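The observation that the likelihoods are not changing can be checked mechanically. Below is a minimal Python sketch, assuming the MrBayes screen output has been saved with one generation per line in the format shown in the question; the function names `parse_log_line` and `frozen_chains` are hypothetical helpers, not part of MrBayes.

```python
import re

def parse_log_line(line):
    """Return (generation, [lnL of each chain]) or None for other lines."""
    m = re.match(r"\s*(\d+) -- (.*) -- \d+:\d+:\d+\s*$", line)
    if not m:
        return None
    values = [float(v) for v in re.findall(r"-?\d+\.\d+", m.group(2))]
    return int(m.group(1)), values

def frozen_chains(lines):
    """0-based indices of chains whose lnL is identical on every sampled line."""
    samples = [p[1] for p in map(parse_log_line, lines) if p]
    if not samples:
        return []
    n_chains = len(samples[0])
    return [i for i in range(n_chains) if len({s[i] for s in samples}) == 1]

# Two generations copied from the question, plus a summary line that is skipped.
template = ("{gen} -- (-192324.911) [-190903.882] (-191836.670) ({c4}) * "
            "(-192291.336) (-195396.079) (-197058.880) [-190667.824] -- {eta}")
log = [
    template.format(gen=6471000, c4="-203192.846", eta="36:24:32"),
    template.format(gen=6472000, c4="-203199.097", eta="36:23:56"),
    "Average standard deviation of split frequencies: 0.143548",
]

print(frozen_chains(log))  # [0, 1, 2, 4, 5, 6, 7]
```

In the excerpt above only the fourth chain moves at all, which is consistent with the answer's reading that the runs are stuck rather than slowly improving.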
  • asked a question related to Phylogenetic Analysis
Question
2 answers
I am quite confused because some sources distinguish between 5 types: folded, lamellar, villous, trabecular, labyrinthine (e.g. ; ); whereas some sources distinguish only 3 types: villous, trabecular, labyrinthine (e.g. ; ).
So, who is right? Why do the folded and lamellar types not appear in some sources? In the sources where they appear, Suidae are said to have folded interdigitations, but they are described as trabecular in sources where these types do not appear. Carnivora are said to have lamellar interdigitations, but their interdigitations are referred to as labyrinthine in sources where the word "lamellar" is not even mentioned. Is the distinction between folded and trabecular spurious, as well as the distinction between lamellar and labyrinthine? This seems odd, because in textbook diagrams these types look very different.
Relevant answer
Answer
Fetomaternal interdigitations in placentas refer to the intricate connections between fetal and maternal tissues that facilitate nutrient and gas exchange during pregnancy. There are several types:
Villous interdigitation: This occurs at the microscopic level within the placental villi, where fetal chorionic villi project into maternal blood spaces, increasing surface area for exchange.
Decidual interdigitation: Involves the maternal decidua, which is the lining of the uterus during pregnancy. Decidual cells extend between fetal villi, contributing to the structural integrity of the placenta.
Placental barrier interdigitation: This refers to the physical and biochemical barriers between maternal and fetal circulations within the placenta, ensuring selective exchange of nutrients, gases, and waste products while preventing direct mixing of blood.
  • asked a question related to Phylogenetic Analysis
Question
3 answers
Guys, I need help with phylogeny. I am very new to this field and currently trying to position a bacterial strain. I have selected four crucial loci for the genus, but I am encountering limitations in employing a multilocus approach. I attempted to use Paup/Mrmodelblocks - Mrbayes; however, even though the individual tree for each locus yields an acceptable topology (confirmed through marker sequences), concatenating the different genetic sequences results in a significant error in the tree's topology. I have tried various solutions to rectify this issue, but despite seeking assistance from experienced friends, I haven't been able to resolve it.
This led me to explore new approaches, considering a phylogeny based more on genomes using orthofinder. Initially, it worked well, but when I attempted to repeat the process, my PC couldn't support it, even though it is a robust computer. I'm using Linux, and every time I try to use orthofinder again, the system restarts. Consequently, I sought another approach, utilizing iqtree. I repeated the process using a multilocus strategy, but once again, I can create individual trees for each region, yet I struggle to obtain an acceptable topology when concatenating the regions to generate the tree for the four chosen regions.
The program does concatenate the sequences, but when I try to run the command to generate the tree with nucleotide substitution models, it produces an error due to the size of the concatenated sequences.
I am in need of tips and alternatives on how to address this issue. Although I successfully used the new cognac package in R, considering it's a recent approach, I need to validate my data. Please assist me with any possible alternatives. Thanks.
Relevant answer
Answer
You can try AMAS together with RAxML; from a model-of-evolution perspective, you could use the JTT model with the maximum likelihood method.
  • asked a question related to Phylogenetic Analysis
Question
5 answers
Suppose we are performing a phylogenetic analysis among different fungal species and we wanted to see how closely our species of interest is related with the other species across different families belonging to the same Order. After performing the multiple sequence alignment, can we extract the aligned regions (which are common across all the species) and use only that much length of nucleotides for constructing the phylogeny (and remove the unaligned regions)?
For example, if the length of our sequence of interest is 700 bp and only 200-300 bp aligns with the other retrieved sequences, can we extract the 200 bp aligned region and use only this portion for constructing the phylogeny? Or is the entire sequence length required for phylogenetic analysis?
Relevant answer
Answer
Yes you can, but do not remove gaps that are generated after alignment within the aligned sequences.
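A minimal Python sketch of that advice, assuming the alignment is held as a dictionary of equal-length gapped strings (the sequences and the names `shared_region` and `trim_alignment` are made up for illustration). Only the unaligned ends are cut; gaps inside the shared block are left in place.

```python
def shared_region(alignment):
    """Return (start, end) columns, end exclusive, covered by every sequence.

    Leading and trailing gaps mark where a sequence does not reach;
    internal gaps are ordinary alignment gaps and are not considered here.
    """
    starts, ends = [], []
    for seq in alignment.values():
        starts.append(len(seq) - len(seq.lstrip("-")))  # first non-gap column
        ends.append(len(seq.rstrip("-")))               # one past last non-gap column
    return max(starts), min(ends)

def trim_alignment(alignment):
    """Cut every sequence to the block shared by all, keeping internal gaps."""
    start, end = shared_region(alignment)
    return {name: seq[start:end] for name, seq in alignment.items()}

# Toy alignment (hypothetical data).
aln = {
    "query": "ATGCCGTA-GCTAGG",
    "ref_1": "---CCGTAAGCTA--",
    "ref_2": "--GCCGTA-GCT---",
}
trimmed = trim_alignment(aln)
for name, seq in trimmed.items():
    print(name, seq)
```

Here the shared block spans columns 3 to 11, and the internal gap in `query` and `ref_2` survives the trimming.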
  • asked a question related to Phylogenetic Analysis
Question
3 answers
Hello, I am working with closely related endogenous retrovirus (ERV) sequences. I suspect these might have been the result of several integration events from different but related exogenous retroviruses, according to host phylogeny and geography. Basically, distantly related hosts that share a geographical distribution have higher ERV identity between them than with more closely related species. The ERVs from a single species that is closely related to a few others and has the same habitat are, however, very different from those of its neighbors.
So two possibilities: each ERV clade comes from a different virus or all of them descend from the same insertion event.
But it remains as a hunch. Is there some sort of statistical test or other kind of test that might at least support or oppose this claim?
Thank you in advance.
Relevant answer
Answer
Thank you for your answer, unfortunately, the taxa I am studying are less studied and species are more distant (20+ my). Checking chromosomal locations among taxa that are not as well studied as Primates goes outside the scope of the paper, plus, the contigs we have located ERVs in are sometimes unplaced in the genome. We have been told by other sources that we might need to do nearest neighbor analysis or a maximum likelihood statistical test.
  • asked a question related to Phylogenetic Analysis
Question
14 answers
Can I decide with no phylogenetic analysis if individuals within a population form an infraspecific taxon?
How can I differentiate between subspecies, variety and forma? What are their formal definitions?
I would appreciate it if you could provide an updated bibliography and information about this issue.
Kind regards,
AIJ
Relevant answer
Answer
Hi Adriel,
Yes you can - some excellent discussion and definitions can be found in the following literature (I have used these often in my own taxonomic papers) - see below. These works define subspecies and forma very well and point out the ambiguity of variety - a rank I dislike because its definition is so labile and often clashes with subspecies.
Davis P.H., Heywood V.H. 1963. Principles of Angiosperm Taxonomy. Edinburgh & London, Oliver & Boyd, 556 pp.
Stace CA 1991. Plant taxonomy and biosystematics, 2nd edition. Cambridge University Press, Cambridge. 272 pp.
Stuessy TF 1990. Plant taxonomy. Columbia University Press, New York. 568 pp.
Other literature exists but these are good starting points.
It's also true that there seems to be no international consistency in the application of these ranks. For example, in Aotearoa / New Zealand we use subspecies for allopatric segregation within a species, yet others use it for variation within a sympatric / syntopic gene pool (see discussion on this problem in Davis & Heywood (1963)).
Hope that helps,
Cheers
Peter
  • asked a question related to Phylogenetic Analysis
Question
6 answers
I have obtained both the forward and reverse sequences of a PCR amplicon, which contain a few ambiguous bases and gaps.
I am considering manually editing the forward sequence using the reverse complementary sequence of the reverse sequence as a reference. Is this a recommended approach for resolving the ambiguous bases?
Furthermore, I intend to conduct a phylogenetic analysis,
But generating a consensus sequence from the forward sequence and reverse complement of the reverse sequence leads to a shortened sequence, with around 50 bases deleted from the beginning and 70 bases from the end.
In light of this, would it be more appropriate to merge the forward sequence and the reverse complement of the reverse sequence into a single CONTIG sequence, instead of creating a consensus sequence?
Can this contig sequence be submitted to the NCBI database, and is it acceptable to use for phylogenetic analysis?
Relevant answer
Answer
John-Erich Haight Thank you.
My PCR amplicon length was 349 bp.
My forward sequence has ambiguous bases within the first 59 positions, so I trimmed it from the start to the 59th base.
The reverse sequence has ambiguous bases within the first 81 bases, so I trimmed it from the start to the 81st base. After that, I generated the reverse complement and aligned the forward sequence with the reverse complement.
Now, after alignment, I have:
F: ---(77)to(349)
R: (1)to(248)---
Generating a consensus sequence in BioEdit gives a consensus of only 172 bp (only the overlapping region).
However, generating a consensus with EMBOSS Cons gives the full 349 bp sequence.
Should I use the EMBOSS Cons sequence or the BioEdit one?
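The two lengths reported above follow directly from the alignment coordinates: 172 bp is the overlap of the two reads and 349 bp is their union. The Python sketch below reproduces that arithmetic and shows one possible merge rule for two aligned reads. The function names and the toy reads are hypothetical, and the rule (keep the only available base, write N where the reads disagree) is an assumption for illustration, not a description of what BioEdit or EMBOSS Cons do internally.

```python
def covered_span(f_range, r_range):
    """Overlap and union lengths of two 1-based inclusive coordinate ranges.

    The union length is only meaningful when the ranges overlap.
    """
    overlap = min(f_range[1], r_range[1]) - max(f_range[0], r_range[0]) + 1
    union = max(f_range[1], r_range[1]) - min(f_range[0], r_range[0]) + 1
    return max(overlap, 0), union

def merge_reads(fwd, rev_rc):
    """Merge two aligned reads of equal length, column by column."""
    if len(fwd) != len(rev_rc):
        raise ValueError("reads must be aligned to the same length")
    merged = []
    for a, b in zip(fwd.upper(), rev_rc.upper()):
        if a in "-N":          # forward read missing or ambiguous here
            merged.append(b)
        elif b in "-N":        # reverse read missing or ambiguous here
            merged.append(a)
        elif a == b:           # both reads agree
            merged.append(a)
        else:                  # genuine conflict: leave it ambiguous
            merged.append("N")
    return "".join(merged)

print(covered_span((77, 349), (1, 248)))           # (172, 349)
print(merge_reads("----GTACCTGA", "ACGTGTANC---"))  # ACGTGTACCTGA
```

Note that outside the overlap each base in a merged sequence is supported by a single read only.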
  • asked a question related to Phylogenetic Analysis
Question
5 answers
I want to do a phylogenetic analysis to classify organisms using the following sequences.
I have downloaded some sequences (ITS region) from the NCBI database and aligned them, but a few sequences have caused long gaps in the alignment (see picture).
My question is what to do with these kinds of sequences. Can I remove the middle portion of these sequences to reduce the gaps, or should I just leave them as they are? Can I exclude those sequences from my study because of these long gaps?
Relevant answer
Answer
I don't think there is a single answer to that, as it will depend on what question(s) you are trying to answer. I would not generally discard a single sequence just because it has inserts that others don't. Also, this alignment just does not look very homologous (e.g. bases 180-210), so maybe these sequences just cannot be aligned well (you can also try different algorithms or settings). If in doubt, you can go back to the original article and see if this particular sequence was unique in any way.
  • asked a question related to Phylogenetic Analysis
Question
3 answers
I wish to use NONA or Hennig86 through Winclada to perform cladistic analysis based on morpho-cladistics characters of Coleoptera families. I am unable to find the two anywhere. I tried searching for the same but did not come across anything useful. What should I do? Thanks for your help.
Relevant answer
Answer
You're welcome Omkar Damle
  • asked a question related to Phylogenetic Analysis
Question
4 answers
Hi, so I am trying to follow the steps in the MrBayes manual to create a tree. I start MrBayes with mb. Then I type execute <file_name>. The manual says (and some youtube videos as well) show that it should only upload my file and then give me another prompt. However, it doesn't ask for anything else and just goes straight into the analysis. The manual also shows that every 100th generation I should be asked if I want to continue with the analysis but that doesn't happen either. Why am I not seeing these prompts? I know I'd like to eventually set "sumt burnin=2500". Do I need to do that before I do the execute command?
Lastly, how do I see what my final standard deviation is besides the terminal screen? I closed out of my first run and don't know what the number was and am not sure which output file it would be in.
Thanks!
Relevant answer
Answer
Hello, is it possible to specify in the .nex file so that mrbayes does not ask if I want to continue with the analysis, regardless if the standard deviation is less than 0.01?
Regards,
  • asked a question related to Phylogenetic Analysis
Question
2 answers
I have been given 4 organisms (insects) and need to manually construct the maximum number of possible trees, then choose the most parsimonious one and back this up with research. They are arthropods.
First, how can I verify the number of possible trees? I have already drawn 12, but got feedback that this is not enough. I am using 10 characters.
Relevant answer
Answer
So you are short three trees. You must have another three ways to root the tree.
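The count of 15 behind this answer comes from the standard formula for fully bifurcating, labelled trees: (2n-3)!! rooted trees and (2n-5)!! unrooted trees for n taxa. A short Python sketch (the function names are hypothetical):

```python
def double_factorial(k):
    """k!! = k * (k - 2) * (k - 4) * ... down to 1 or 2; equals 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def n_rooted_trees(n_taxa):
    """Number of rooted, bifurcating, labelled trees (n_taxa >= 2)."""
    return double_factorial(2 * n_taxa - 3)

def n_unrooted_trees(n_taxa):
    """Number of unrooted, bifurcating, labelled trees (n_taxa >= 3)."""
    return double_factorial(2 * n_taxa - 5)

print(n_rooted_trees(4), n_unrooted_trees(4))  # 15 3
```

For four taxa, 12 of the 15 rooted trees are ladder-shaped and 3 are symmetric, of the form ((A,B),(C,D)), so a set of 12 drawings may well be missing the symmetric ones.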
  • asked a question related to Phylogenetic Analysis
Question
2 answers
I'm trying to find an oligonucleotide primer set for HA gene amplification of H5N1.
I would appreciate it if you could point out a primer set for phylogenetic analysis by H5N1 HA gene sequencing.
Relevant answer
  • asked a question related to Phylogenetic Analysis
Question
2 answers
I am currently working on a paper and we need to discuss the relationship of these two variables. I already saw someone ask about this (May 2022) but the only answer provided was not that helpful.
Relevant answer
These are fundamentally different variables and there is no formal relationship between the two. Generally, lower nodes (closer to the root) have lower bootstrap support, but this is just an empirical observation that may or may not be true for a given dataset.
If I were in your position, I'd try plotting the support vs order for some trees of interest and see if it correlates with something.
  • asked a question related to Phylogenetic Analysis
Question
3 answers
Why do we partition our aligned molecular data in phylogenetic analysis?
Also I would like to ask what are the different ways of partitioning data in phylogenetic analysis and what are the best tools for data partitioning?
Thank you for your time and consideration
Relevant answer
Answer
Good evening,
In phylogenetic analyses, when we work with DNA we have to find the best DNA substitution model for our data, right? But in many cases our data are a mixture of different genes with different evolutionary histories: different mutation rates, selection pressures, or even modes of inheritance (mitochondrial versus nuclear). This is also true even within a single gene, since the genetic code means that a mutation in the third codon position is more likely to be synonymous than one in the first or second codon position, and therefore more likely to be fixed in the genome.
Thus, a single model may not fit this variation. The way to solve this problem is to partition your data into units that evolve similarly and look for the best model for each of them.
How do we check this? I suggest you use PartitionFinder https://www.robertlanfear.com/partitionfinder/
This open source software will search for the best partitioning scheme for your data, and the best model for each of the different partitions.
Good luck!
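As an illustration of the codon-position idea above, here is a minimal Python sketch that writes the three codon-position blocks of a protein-coding region in the common `start-end\3` notation and splits a sequence accordingly. The function names and the 1-900 gene coordinates are made up for the example; it assumes the region starts at a first codon position and has no frame shifts.

```python
def codon_position_partitions(start, end):
    """Codon-position blocks for a 1-based, inclusive coding region.

    'start-end\\3' means: every third site, beginning at 'start'.
    """
    return {f"pos{i + 1}": f"{start + i}-{end}\\3" for i in range(3)}

def split_by_codon_position(seq):
    """Return the 1st, 2nd and 3rd codon-position sites of an in-frame sequence."""
    return [seq[i::3] for i in range(3)]

for name, block in codon_position_partitions(1, 900).items():
    print(name, "=", block)

print(split_by_codon_position("ATGGCCAAA"))  # ['AGA', 'TCA', 'GCA']
```

Blocks like these are the candidate units that a partitioning tool can then merge or keep separate when searching for the best scheme.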
  • asked a question related to Phylogenetic Analysis
Question
1 answer
here is the code:
library(babette)
library(seqinr)
library(BeastJar)
library(beastier)
fasta <- read.fasta("nuc.fasta")
get_default_beast2_bin_path(
beast2_folder = get_default_beast2_folder(),
os = rappdirs::app_dir()$os
)
fasta_filename <- "nuc.fasta"
output <- bbt_run(fasta_filename)
##############after this run it gives following error:
Error in beastier::check_input_filename_validity(beast2_options) :
'input_filename' must be a valid BEAST2 XML file. File 'C:\Users\User\AppData\Local\beastier\beastier\Cache\beast2_9ec3aae162d.xml' is not a valid BEAST2 file. FALSE
Relevant answer
Answer
You appear to have something wrong with your input file. Perhaps the name is not valid in Beast2. BTW Next time realize that not everyone reading your question is a phylogeneticist. Regards David Booth
  • asked a question related to Phylogenetic Analysis
Question
2 answers
Why does it need to converge in phylogenetic analysis?
What does a bootstrapping value of 70 mean?
Which part of the result shows the bootstrapping value after running the following script?
!raxml-ng --bsconverge --bs-trees T2.raxml.bootstraps --prefix Test --threads 2 --bs-cutoff 0.03
Relevant answer
Answer
Thank you very much for the answer J. Leonardo Moreno-Gallego
  • asked a question related to Phylogenetic Analysis
Question
4 answers
Currently I don't have the scope of sequencing and DNA extraction.....wet lab facility.
I want to reconstruct the phylogeny of four subfamilies of Malvaceae using matK, rbcL, ITS.
Can I use public data from GenBank, i.e. download the gene sequences (FASTA) of different plants, use software to analyze the clusters (MEGA, MrBayes or other software), then compare the trees of the different genes to see whether there is any similarity or dissimilarity, and draw a conclusion based on my observations?
Will that flow of work be supported scientifically??
Please help. Sincerely-Sunzid.
Relevant answer
Answer
Dear Osabutey Samuel Sir, I have understood the point that you might said, that is, comparison. Comparison between my DNA isolates (suppose I have extracted DNA) and the DNA isolates stored in GenBank. Then building tree for both. Then compare between the two to see the similarity/dissimilarity. You might wanted to say that, if I am not wrong.
As far as I know when proposing new system of classification, authors take few representative specimens to construct phylogeny.
We have Angiosperm Phylogeny IV currently widely accepted. So in that, similarly few representative specimens were taken to build the tree/justify positions of existing clades or to solve previous debates regarding positioning.
So in that sense it is not possible to test different genes (rbcL, matK....etc) at the same time, also intergenic spacers (ITS, rpoC1, trnH-psbA etc). So I wanted to validate the use of matK gene/other genes, if it can justify the systemic position of the members of my working family. Like did they follow the classification or not, if not where are the dissimilarities, are there any morphological variations that can be related with that variation, if yes, then considering morphology it can be proposed to transfer the position of an existing clade.........all these things. The key vital point is that, in the classification they used small set of data, like they took, as for e.g. 5/10 specimens in consideration when they built new tree. But in my family there are more than 40+ members, so their internal variation might not be justified in that sense, in the tree. Also it is not possible to extract DNA and sequence 3,60,000 angiosperms while proposing new system of classification, so I think that in every system of classification there might be some errors and speculations. So if we can justify that using public data from GenBank in a large dataset in my family. That was the objective.
Thank you Sir.
  • asked a question related to Phylogenetic Analysis
Question
14 answers
I am a beginner studying phylogenetic analysis. For analyzing my angiosperm family, Sterculiaceae, how should I approach outgroup selection? Should the outgroup be closely related, like members of Tiliaceae/Malvaceae?
I have learned using MEGA 11 to some extent.
Also I need some resources / papers/ slides/ anything that might be helpful for me to start as a beginner.
Thank you so much.
Relevant answer
Answer
Any plant from a sister lineage of your study group can be selected as an outgroup. Since you are working on a large group of plants, i.e. a family, it is advisable to use more than one species from a related family or families as outgroups. As Roman Bohdan Hołyński has mentioned, these are used to "root" your phylogenetic tree, and hence are very important in phylogenetic analysis. There are a lot of papers published over many years, and many lectures and tutorials on YouTube, that you can use to understand how you should proceed with your analyses. I would, however, suggest that you use R packages instead of MEGA, since they are more flexible and will give you more options to play around with your phylogenetic tree.
  • asked a question related to Phylogenetic Analysis
Question
3 answers
Hello all, I have a sequence alignment of ~2000 sequences, which is likely more than is necessary. If I begin to remove sequences manually or using some software program I'm sure I can reduce the number of gaps, but this will of course reduce the size of the alignment (and may introduce some amount of bias/subjectivity). Is it better to keep the larger dataset at the expense of greater gap character? Is there a rough criteria for minimum amount of gaps an alignment should contain for reconstruction? Thanks very much.
Relevant answer
There are a lot of things to consider. You have ~2000 sequences, but: are these sequences orthologs? Are there paralogs? Do you know the species tree for the organisms from which these sequences were obtained? Is the taxonomic sampling of sequences balanced across the species tree? What is the mean sequence length? What is the MSA length? What is your ASR approach (maximum-likelihood, Bayesian, or parsimony)? Are those gap positions due to true insertions and deletions (indels), or are some of them caused by incomplete sequences? If they are indels, have you considered performing an MSA using a statistical model for indels (Poisson Indel Process) as implemented in the ProPIP MSA software? Have you compared different MSA programs? Are these gaps present in important and conserved sites, or are they only in variable sites?
I believe that the approach to dealing with the trade-off will vary depending on the answer to each of the above questions.
For instance, if you will use maximum-likelihood or Bayesian approaches for ASR, you should be careful with gap positions because the stochastic substitution models used in these approaches do not account for indels. If there are many incomplete sequences, I would remove them as much as possible. If the taxonomic sampling is biased toward one specific clade, and if many sequences of this clade are gappy, I would prune this clade.
I think that perhaps the closest thing to a general rule of thumb is to worry about the quality of your taxonomic sampling rather than the quantity.
Best wishes,
Pedro
  • asked a question related to Phylogenetic Analysis
Question
3 answers
Hi, I have previously heard in a conference someone said "The isolates within group A are similar to group B with R > 0.2"
Is there a way for me to calculate the R value of two different groups of isolates based on their nucleotide/amino acid sequences or based on their sequence homology, so that in the end I could reach a conclusion just like the example I've provided?? Thank you so much!
Relevant answer
Answer
Dear Felix Hojaya,
Yes, you can do it (calculating R, the transition/transversion index) by using the MEGA X software. You just have to include all the sequences that you need to analyze, then go to the Models section and click on "Estimate Transition/Transversion Bias". You will get the result based on a calculation over all the included sequences.
Wish you good luck
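To make the quantity concrete, here is a minimal Python sketch that counts transitions and transversions between two aligned nucleotide sequences and reports their raw ratio. This is a simple count on made-up sequences, not the model-based estimate that MEGA reports, and the function name `count_ts_tv` is hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_ts_tv(seq1, seq2):
    """Count transitions and transversions between two aligned sequences.

    Columns with gaps or ambiguity codes in either sequence are skipped.
    """
    ts = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1   # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1   # purine<->pyrimidine
    return ts, tv

ts, tv = count_ts_tv("ACGTACGTAC", "GCGTATGTCC")
print(ts, tv, ts / tv)  # 2 1 2.0
```

Whether this is the "R" quoted at the conference depends on what the speaker meant, so it is worth checking the original source before comparing values.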
  • asked a question related to Phylogenetic Analysis
Question
2 answers
Hi everyone,
I am working with 14074 PAVs (presence/absence variations) in a panel of 99 accessions.
I converted the PAVs table into a binary matrix to construct a PAV-based NJ. Now I would like to test its stability by bootstrapping the tree.
Someone can inform me about the R code to do that?
Attached is the CSV input file.
Here below the commands that I run in R:
library(ape)  # provides nj()
Data <- read.csv("Table.csv", sep = ";", row.names = 1)
dist_mat <- dist(Data, method = "binary")
tree <- nj(dist_mat)
Relevant answer
Answer
Hi Gaia, try the aboot function in the poppr package.
Best of luck
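For readers who want to see what such a bootstrap does under the hood, here is a minimal Python sketch of its two ingredients: the binary (Jaccard) distance that `dist(..., method = "binary")` computes, and resampling of the PAV columns with replacement. The data and function names are made up for illustration; building an NJ tree from each replicate and counting how often each clade recurs is left to an existing NJ implementation.

```python
import random

def binary_distance(x, y):
    """Jaccard distance: among columns where at least one value is 1,
    the fraction where the two rows differ (defined as 0 if no column is 1)."""
    either = sum(1 for a, b in zip(x, y) if a or b)
    both = sum(1 for a, b in zip(x, y) if a and b)
    return 0.0 if either == 0 else 1 - both / either

def bootstrap_matrices(rows, n_reps, seed=0):
    """Yield n_reps copies of the matrix with columns resampled with replacement."""
    rng = random.Random(seed)
    n_cols = len(next(iter(rows.values())))
    for _ in range(n_reps):
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        yield {name: [vals[c] for c in cols] for name, vals in rows.items()}

# Toy PAV matrix: accessions in rows, variants in columns (hypothetical data).
pav = {"acc1": [1, 1, 0, 0], "acc2": [1, 0, 1, 0], "acc3": [0, 0, 0, 1]}

print(binary_distance(pav["acc1"], pav["acc2"]))
for rep in bootstrap_matrices(pav, n_reps=2):
    print(rep)
```

Each replicate keeps the number of columns fixed, so the distance matrix computed from it is directly comparable with the one from the original data.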
  • asked a question related to Phylogenetic Analysis
Question
2 answers
We need to sequence markers to perform phylogenetic analysis of a snake but the only material we currently have available is a dried blood sample. I would like to know if it is feasible or possible to obtain information from this type of sample.
Thank you.
Relevant answer
Answer
The quality and quantity of DNA will also depend on the time and conditions of storage (UV light, etc.). Extraction is key, so I suggest carefully considering your options and choosing the best extraction method available.
We had success with dried blood samples using the Blood & Tissue Kit from QIAGEN (maybe increasing the initial volume depending on the support medium), but also with precipitation-based methods (dissolving the dried blood in buffer medium).
  • asked a question related to Phylogenetic Analysis
Question
5 answers
Hi. I have acquired 177 original (yet similar) sequences in my taxonomy research. In order to produce a concatenated background dataset for phylogenetic analysis, I did BLAST searches for each original sequence and ended up with ~7000 sequences in one file.
There is much redundancy in the BLAST results as expected, however, manually removing the repeat sequences is frustratingly tedious and not reliable.
Please advise how I can reduce my background dataset to only unique sequences? That is, the file must have only one identifier for each sequence found in it- whether my original sequence or one from the BLAST search.
Thank you for any advice offered.
Regards, Tamiko
Relevant answer
Answer
Our ElimDups program can be set to any level of % identity you like. 100% will remove true duplicates, 98% removes all but one of each group of highly similar but not identical sequences, etc.
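For the 100% identity case only, exact duplicates can also be removed with a few lines of Python. This is a minimal sketch, not the ElimDups program: it assumes a plain FASTA file, keeps the first identifier seen for each distinct sequence, and the function names and example records are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (identifier, sequence) tuples."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def unique_sequences(records):
    """Keep the first record for each distinct sequence (case and gaps ignored)."""
    seen, kept = set(), []
    for name, seq in records:
        key = seq.upper().replace("-", "")
        if key not in seen:
            seen.add(key)
            kept.append((name, seq))
    return kept

fasta_text = """>my_isolate_1
ACGTACGT
ACGT
>blast_hit_A
acgtacgtacgt
>blast_hit_B
ACGTACGTACGA
"""
for name, seq in unique_sequences(read_fasta(fasta_text)):
    print(f">{name}\n{seq}")
```

Placing your own sequences first in the file ensures that their identifiers are the ones retained when a BLAST hit is identical to them.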
  • asked a question related to Phylogenetic Analysis
Question
8 answers
Greetings to all. I wonder if anyone here faces the same problem as I do: I often cannot access the TreeBASE website. Also, can anyone suggest an alternative database that I can use to deposit my phylogenetic tree? Thank you very much for your time and your support.
Relevant answer
  • asked a question related to Phylogenetic Analysis
Question
1 answer
I am currently working on a project that aims to characterise in R on a pool of 500 bird species the traits that may be at the origin of their introduction outside their natural habitat and thus allowing them to become invasive or not.
Thus, out of my pool of 500 species, I ended up with 150 bird species that were introduced elsewhere (introduction = 1) versus 350 others that were not introduced (introduction = 0), with approximately 80 life history traits for each of them.
My idea was therefore to use PGLS (linear models correcting for the phylogenetic effect of species on their traits) on my pool of 500 species and see which traits could explain the "introduction" variable.
The problem is that by doing this my results are biased by the presence of many more non-introduced birds than introduced birds. My initial idea was to use bootstrapping to resample my n=350 birds to n=150 and run my PGLS on this new pool of 300 species (n=150 introduced and n=150 unintroduced), repeat it and then do some model averaging.
However by doing this my final models obtained in this way are completely different at each of my R sessions. I have tried increasing the number of bootstrap runs to 10,000 but this does not solve the problem. When I do this with basic GLMs I do not encounter this problem of non-repeatability.
Would you have a solution to solve this problem of repeatability with the PGLS in my process?
Relevant answer
Answer
I would approach this with logistic regression, which I assume is what you did. Given the problem you encountered, I would guess that you might have rare cases. If so, I would suggest Firth regression, a modification that deals with this problem. Google Firth regression for full details.
Best wishes David Booth
  • asked a question related to Phylogenetic Analysis
Question
6 answers
Hello everyone, I'm working on phylogenetic analysis and sequence retrieval by tBLASTn tool of NCBI. But the page is showing a maximum of 100 alignments and I wish to see all the alignments i.e. at least more than 100. How can I do so?
I'm seeking your quick answers.
Thanks.
Relevant answer
Answer
Manually you may change it by clicking output "options" at the bottom of the blast.
  • asked a question related to Phylogenetic Analysis
Question
2 answers
Traditional search in TNT in my case is retaining 7500 trees with the best score of 382. How do I find the particular tree with the best score for saving it?
Relevant answer
Answer
This means that 7500 trees share the same best parsimony score. You can check whether these 7500 trees carry any information by computing a 50% majority-rule consensus tree from them. If that does not give you a well-resolved tree, then your data are too sparse to resolve the tree further. One rarely gets just one best tree using parsimony.
  • asked a question related to Phylogenetic Analysis
Question
5 answers
Is there any option, so that we can get the results very faster for 1 million generations?
It takes one month to finish the phylogenetic analysis for the same.
Relevant answer
Answer
If you co-install BEAGLE, it allows MrBayes to run on multiple CPUs. This will (possibly) make the analysis much faster. Of course, the exact speed depends on your laptop specifications. (The systems behind the online platforms usually have strong computing power, so they will probably always be much faster.) Usually you can also use multi-threading on the online platforms or supercomputers, as BEAGLE (or something similar?) is co-installed. You may have to specify an option to do so (not sure how it is on CIPRES).
  • asked a question related to Phylogenetic Analysis
Question
5 answers
I am analyzing population structure of a fungal organism collected from various locations using RADseq data. For this purpose I ran few analyses, but want to focus on maximum likelihood phylogeny and estimation of inbreeding coefficient (Fis) here. Phylogenetic analysis identified three well supported clades within one population (similar branching was observed with phylogeny based on protein coding loci). When I estimate inbreeding coefficient using the same dataset as for the phylogeny I get a value of 0.009. These two results do not seem to agree with each other, since phylogeny seems to detect signature of subpopulation structure within this population, but Fis doesn't. What may explain this pattern?
Thanks for the ideas!
Olga
Relevant answer
Population genetic indicators, such as allele and genotype frequencies and observed and expected heterozygosity, can be calculated using POPGENE version 1.31.
The Wright fixation index (FIS) measures the difference between observed and expected heterozygosity. A value of 0.009 means that the two are nearly equal.
A value close to 0 is what random (unrelated) mating within the population would produce; an excess of heterozygotes would give a negative value.
  • asked a question related to Phylogenetic Analysis
Question
1 answer
Hello everyone!
I'm trying to replicate analysis 1.3 from the paper: "Phylogenetic analysis of a new morphological dataset elucidates the evolutionary history of Crocodylia and resolves the long-standing gharial problem" of J.P.Rio and P.D.Mannion. With their supplementary files "Raw data: TNT file with continuous and discrete morphological characters (used in Analysis 1 and 3)." and the following protocol :
piwe=
open the matrix
piwe=12 ; (weight of 12)
xpiwe= ; (active EIW)
piwe; (see if all is ok weight of 12 and EIW ON)
Then NewTechnologySearch with :
- on left side : Sect Search + Ratchet + Drift + Tree fusing (on default setting)
- on right side : Driven search with init addseqs of 5 + Stabilize consensus 5 times with 75 factor
Random seed of 1 and Auto-constrain and replace existing trees ON
Then TBR with Tree on ram.
But I get a tree with the best score of 8900 something whereas in the paper the best score is 8181.9.
Does someone have an idea where I went wrong?
Relevant answer
Answer
Hi Jonathan, it is quite difficult to determine the cause of the difference in score just by reading the commands you used. I think that you should try to contact the authors.
best,
Santiago
  • asked a question related to Phylogenetic Analysis
Question
3 answers
I have performed a phylogenetic analysis using CLUSTALW and obtained a phylogenetic tree. I know how to read a phylogenetic tree and how to relate the sequence similarity between two sequences by looking at the tree, but what I am concerned about is how one can decipher the father-daughter relationship by looking at the tree? Also, if I want to isolate the sequence of the father corresponding to a daughter sequence how can I do that?
I have attached the corresponding image. In the image, I know that sequences "QXN18196" and "QXN18436" are closely related. But will it be safe to say that sequence "QXN18520" is the father sequence for both the sequences? Or if I want to establish a father-daughter relationship for the whole tree, what will be the relation between the sequences?
Relevant answer
Answer
Dear Satyam,
as David pointed out, you CANNOT infer parent-child relationships from a phylogenetic tree. All tips represent taxa that can be in a sister-group relationship, but their parents (usually referred to as mother, not father) are the respective preceding nodes, not the closest sister group. So to answer your question: no, it is not possible. There are some programs that can compute the hypothetical ancestral sequence, but these should always be treated with caution.
  • asked a question related to Phylogenetic Analysis
Question
7 answers
I’m a master's student and I’m going to start working with genomic data in the field of population genomics. I need to buy a new laptop and I was wondering if the new MacBooks with the M1 processor are suitable for genomic and population genomic analysis. I’m mainly afraid of incompatibility issues with the most commonly used programs. I would also like to know if it’s possible to run programs for phylogenetic analysis such as BEAST, MEGA, etc. on an M1 MacBook.
Thank you so much for the help!
Andrea.
Relevant answer
Answer
Dear Andrea,
Do you know if you'll have access to any computer cluster? It might be helpful to ask in your institution.
If so, do not worry too much about your laptop specs. You may be only using it for plotting results, running some things in R/python and doing the last steps of your analyses, and, most of the time, accessing the mentioned cluster via your bash terminal (so it would also be helpful to see how to submit jobs on it).
If you have to run everything locally, in my own experience, they do perform well with population genetics and phylogenetics software.
  • asked a question related to Phylogenetic Analysis
Question
1 answer
I'm trying to partition a set of characters. Is there a way of doing this by specifying the interval (like 0-1044) instead of the whole sequence of numbers?
Relevant answer
Answer
Are you using xgroup?
xgroup =1 0.9
puts the first 10 characters in Group_1. Can be repeated on one line:
xgroup =1 0.9 15.20 35.50
puts all the specified intervals in Group_1.
  • asked a question related to Phylogenetic Analysis
Question
7 answers
I'm very new to phylogenetic analysis, especially Bayesian inference. I'm building a phylogenetic tree with 3 mitochondrial markers, and after running PartitionFinder I obtained two partitions whose best substitution models are HKY+I+G+X and HKY+G+X. I would like to receive some advice on how to set up these models, either directly from the BEAUti interface or by modifying the XML file.
Thank you in advance
Relevant answer
Answer
I am not sure if BEAST has an option for a +X model. You could try to use PartitionFinder2 (https://www.robertlanfear.com/partitionfinder/) and specify that you want models available in BEAST.
  • asked a question related to Phylogenetic Analysis
Question
4 answers
CLOUD COMPUTING
Relevant answer
Answer
Yes, it is highly possible
  • asked a question related to Phylogenetic Analysis
Question
12 answers
Based on 16S rRNA gene sequencing
Relevant answer
Answer
You can use the MEGA software.
You can also have a look at this paper:
Molecular characterization of lactobacilli isolated from fermented idli batter
DOI:10.1590/S151783822013000400025
  • asked a question related to Phylogenetic Analysis
Question
5 answers
I would like to add a data matrix of morphological data, assembled in the software Mesquite, to a manuscript. I would either add an electronic supplement (MS Excel format) or a table as *.txt or *.dic file. Anyone with experience around? I find Mesquite to be a bit user-unfriendly with this regard.
Relevant answer
Answer
You can just copy and paste the selected range using Ctrl+C and Ctrl+V, as long as a range of the same size is selected in the Mesquite matrix. So, for example, if you want to copy a range of 3 rows and 3 columns from Excel, you have to copy that range, go to the Mesquite matrix, select 3 rows and 3 columns, and then paste it.
  • asked a question related to Phylogenetic Analysis
Question
3 answers
I am trying to calculate the origin time of some bacterial lineages, and I am testing BEAST2 with a very simple dataset of only 12 taxa and 1 protein sequence of 1000 AAs, with the WAG model. In the "Priors" tab of BEAUti, I set a prior root age of 3500 Ma and an age of 1200 Ma for one cyanobacterial lineage, with normal distributions, and used a calibrated Yule model with a fixed starting tree (with 4 parameters set to 0). However, I keep getting results with very short branch lengths, and the ESS is always low even when I set the chain length to 40,000,000. Could anyone give me some suggestions? Thanks a lot!
Relevant answer
Answer
Hi Jessy,
My first guess is that there is nothing wrong with it, only the proportional differences between the two main groups are hugely higher than the difference among sequences within it. So if you zoom it in you will see taxa relationship as expected. You can also try to plot each group separately.
I would try that first if I was you.
I hope it helps.
Best
  • asked a question related to Phylogenetic Analysis
Question
3 answers
I'm currently learning the hows and whys of bioinformatics. This is the third tree I have built, and I noticed that the more sequences I add, the lower the bootstrap support.
My sequences were trimmed using trimAl, and I chose to obtain them with no gaps. The evolutionary model was chosen using the jModelTest software and the phylogenetic tree was generated in MEGA.
This is a Maximum Likelihood tree, my log likelihood is -24465.21, and the tree was built using the General Time Reversible model, G+I, gamma 6, NNI, 25 threads, and 500 bootstrap reps.
Can anyone elucidate why this is happening and what I should keep in mind when constructing the next tree?
Thank you very much
Relevant answer
Answer
Keep in mind that for a molecular phylogenetic tree to make sense, homologous nucleotide positions (those that descend from a common ancestor) need to be aligned with each other. While we can't directly "see" homology, the degree of nucleotide similarity is the best guide, and is the tool that alignment programs such as Clustal or Muscle use. If you constructed your tree with no gaps, as you put it, it sounds like you've made no attempt to do this.
You don't describe what locus (or loci) your sequences are from or the length of the sequences. However, I'm guessing these are coronavirus sequences. If a lot of these are SARS-CoV-2, the number of nucleotide differences between correctly aligned sites is still pretty small, so a high-confidence support for most branches simply can't be achieved.
Finally, the cladogram display you're using throws away most of the information from maximum-likelihood tree construction on just how much change has taken place on each branch. A phylogram display would give you that information. I suspect that an aligned set of sequences would show a lot of very short branches, consistent with the things I mention in the previous paragraph.
  • asked a question related to Phylogenetic Analysis
Question
3 answers
Hi everyone, do you know any phylogenetic analysis to assess the dependence between both categorical and continuous independent variables (plant traits in this case) and phylogenetic relationships?
I have looked at 'phytools' but I'm not sure if it has a function for such a variety of variables at the same time.
Relevant answer
Answer
Hi Willian, many indices of phylogenetic signal have been published so far, but we still lack consensus about what the indices quantify, how redundant they are and which ones are most recommended. Abouheif’s C-mean and Pagel’s λ have high analytical power for complex models of trait evolution, as do Fritz and Purvis’ D and Blomberg’s K. In this context, I recommend that you take a look at the guidelines proposed by Münkemüller et al. (2012) (DOI: 10.1111/j.2041-210X.2012.00196.x). They can be very useful for assessing trait evolution processes based on multiple and discrete traits. Best!
  • asked a question related to Phylogenetic Analysis
Question
2 answers
I saw that some papers use both cpDNA and nrDNA in a single phylogenetic study. Why not just choose one? What are the functions of each type of DNA? What is the benefit of using both?
Relevant answer
Answer
In most green plants, chloroplast DNA is usually transmitted to the progeny through the maternal gamete. We can therefore determine the (cytoplasmic) maternal parent of a hybrid from its cpDNA haplotype when the two parents have different cpDNA haplotypes. Nuclear DNA is transmitted to the progeny from both parents. By examining both cpDNA and nrDNA, we can, with luck, determine both parents of a hybrid and the direction of the cross.
  • asked a question related to Phylogenetic Analysis
Question
3 answers
We have a phylogenetic tree of 16S rRNA sequences from many bacteria, and we also have a phylogenetic tree of protein sequences from those bacteria. Now, how can we correlate these two types of sequence in terms of evolution?
Relevant answer
Answer
Hi there, you can assess the Mantel correlation between the phylogenetic distance matrices.
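A rough illustration of what a Mantel test does, written in plain Python: it correlates the off-diagonal entries of two distance matrices and judges significance by permuting the taxon labels of one matrix. The two toy matrices and the names d_rrna and d_protein are hypothetical; in practice the matrices would come from the two trees (for example as patristic distances), with taxa in the same order in both.

```python
import math
import random

def upper_triangle(matrix):
    """Return the entries above the diagonal, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel(d1, d2, permutations=999, seed=1):
    """Return (r, one-sided p-value) for two symmetric distance matrices."""
    rng = random.Random(seed)
    n = len(d1)
    y = upper_triangle(d2)
    observed = pearson(upper_triangle(d1), y)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Permute rows and columns of d1 together (relabel the taxa)
        permuted = [[d1[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(permuted), y) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)

# Hypothetical distances among the same four taxa from the two trees
d_rrna = [[0, 2, 6, 7], [2, 0, 5, 6], [6, 5, 0, 3], [7, 6, 3, 0]]
d_protein = [[0, 3, 8, 9], [3, 0, 7, 8], [8, 7, 0, 4], [9, 8, 4, 0]]

r, p = mantel(d_rrna, d_protein)
print(round(r, 3), p)
```

With only four taxa there are few distinct permutations, so the p-value here is not meaningful; the example only shows the mechanics.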
  • asked a question related to Phylogenetic Analysis
Question
15 answers
I am eager to be a researcher in an international project about statistical, mutational, phylogenetic analysis management topics.
Relevant answer
Answer
Thanks, Ozodbek Abduraimov, for your compliment.
  • asked a question related to Phylogenetic Analysis
Question
2 answers
Dear RG community,
For robust phylogenetic analysis, it is often necessary to base a species tree on concatenated alignments for many genes rather than for just one gene such as 18S rDNA etc. Finding multi-gene sequences that are available for every species in the tree, with one's species of interest as a starting point, is the most time-consuming step in this procedure.
I wonder if there is any commercial or free software that can conduct such a search automatically?
For example, Wiegmann et al. (2009) reconstructed the phylogeny of holometabolous insects using six genes: AATS, CAD, TPI, SNF, PGD, and RNA POL II. This group obtained sequences of these genes from 29 species based on their own sequencing data, which is very cool. When one needs to get such information manually from GenBank, the task becomes very tedious and time-consuming, because it is almost impossible to find species for which sequences of all six genes are available.
Thank you.
Relevant answer
Answer
Thank you very much, Chan Kin Onn.
  • asked a question related to Phylogenetic Analysis
Question
3 answers
What are the important points one must keep in mind when running a BEAST v2 analysis for molecular dating?
Relevant answer
Answer
1) Quality of your phylogeny
2) Quality and/or number of calibration points
3) Proper setting of priors
  • asked a question related to Phylogenetic Analysis
Question
3 answers
Dear All,
I am struggling with a persistent problem with a CSV file while preparing data for MuSSE model analysis. I have tried a bunch of things to fix the problem, but with no success: always the same error ("All names must be length 1"). I would be very grateful for your help! :)
library(diversitree)
dat="MuSSE_hosts.csv"
dat<- read.table("MuSSE_hosts.csv", header=TRUE, dec=".", sep=",", row.names=1)
mat <- dat[,2:ncol(dat)]
lik.0 <- make.musse.multitrait(tree, mat, depth=0)
Error in check.states.musse.multitrait(tree, states, strict = strict, : All names must be length 1
Thank you a lot in advance!
Relevant answer
Answer
Thank you a lot for your help!
  • asked a question related to Phylogenetic Analysis
Question
2 answers
What average genetic distance values of the mtDNA control region (D-loop) indicate conspecific populations versus valid species in Chiroptera? Thank you.
Relevant answer
Answer
Hi Husni,
as far as I can see, there is no "magic number" for Chiroptera. Only three species have records in GenBank for the D-loop region. Therefore, it's really tricky to get a safe threshold, if such a thing exists at all.
I'm personally not a fan of thresholds, but you could try to explore the intraspecific genetic distances of the sequences in the database. This will give you a clue about how polymorphic this sequence is within species, as well as how much it overlaps with the between-species distances.
I would also recommend that you take a look at coalescent-based analyses like GMYC or PTP (Poisson Tree Processes) and their Bayesian implementations.
Good luck
  • asked a question related to Phylogenetic Analysis
Question
7 answers
Greetings to all,
Could anybody please advise me? How important is Bayesian posterior probability in phylogenetic analysis? Are MP and ML sufficient for phylogenetic analysis? I didn't get consistent PP values across the analyses I performed using MrBayes. What could be the reason? Also, the SD of split frequencies never falls below 0.01, even after increasing the number of generations.
Relevant answer
Answer
Posterior probabilities (PPs) are measures of confidence that a particular inference is true, given the data and the model. In phylogenetics, that inference is the bifurcating node - PPs close to 1 support the node as very probable. PPs below 0.5 indicate there is weak support for that node being true, but in published trees these nodes with PPs < 0.5 are not shown by convention, but instead are collapsed into a polytomy. To draw a bifurcating node with PP less than 0.5 is a contradiction in terms: you are saying both "this node is probably true" (the node), and "this node is probably not true" (the PP) simultaneously.
How "important" they are is up to you. If you are making conclusions on results using Bayesian statistics, they are absolutely critical, like a p-value is to other statistics. If you are skeptical of Bayesian methodology, you might instead compare PPs with bootstraps of other methods, to question the most reasonable approach, given your research question. Whether parsimony, maximum likelihood or Bayesian are "sufficient" is not really the right question, and depends on context. For example, parsimony is unsupportable as a meaningful method if you are using genetic sequence information, but might be suitable for categorical morphological data.
If a Bayesian process does not converge (as your split frequencies and inconsistent PPs show), it is often because there is not enough signal in your data. You would investigate this by looking at PPs across the unresolved phylogeny, whether it is a single node or a general polytomy. Adding data is often a good idea, whether through more species / genes / OTUs. Or you might infer phylogenies for subsets of your matrix (say, for some genes or some species), to try to identify where the lack of convergence comes from. Keep in mind that your dataset does not owe you a phylogeny. And that any phylogeny that does emerge, even with high PPs, could be biased for a number of factors. If anything, the higher my confidence statistics are, the more I suspect the inference method to be biased.
Hope that helps :)
  • asked a question related to Phylogenetic Analysis
Question
9 answers
Rather than using sequence alignment data, I wanted to have phylogenetic tree from distance matrix and bootstrap as part of statistical analysis. Anyone to tell me how to execute this analysis?
Relevant answer
Answer
Dear all,
I have been working with MEGA and distance matrices. I am attaching a mini-tutorial to compute UPGMA dendrograms.
I have tried with MEGA-X and it perfectly works.
#############################
1-Calculate your distance matrix with your favourite software (e.g. in ".xls" format)
2-Prepare the matrix as explained by MEGA developers (Figure 1) in a ".txt" file. I show my own datafile (Figure 2). Warning: Keep the structure, including blanks.
3-Save this file as ".meg" file. It will be ready to use in MEGA.
4-Open ".meg" file with MEGA, select Pairwise distance > Lower left matrix > OK
5-Check the data in the file read by MEGA. The matrix should be identical to your previous one (".xls" or similar; Step 1)
6-Select data and click "Phylogeny". You could compute UPGMA tree
7-MEGA offers a good variety of options to customize your tree
Good luck!
  • asked a question related to Phylogenetic Analysis
Question
1 answer
I am using Unipro UGENE for sequence alignment. After alignment, I usually export the file in MEGA format, and it has always worked. But for some reason it's not working now. Is anyone facing the same problem? Is there any alternative, or an online website to fix the line errors?
Thanks in Advance
Relevant answer
Answer
Either remove the invalid bases manually or use this script
  • asked a question related to Phylogenetic Analysis
Question
4 answers
How would this appear on a tree if COI only resolved those closely related species and not more distantly related species? Thank you
Relevant answer
Answer
support values would be low for deep branches and/or they would be unresolved (as Artur already mentioned). And COI is NOT suited to resolve deep branches. For closely related species it is fine, but also for closely related species you always have the problem that if you analyze a single gene, you will only receive a gene tree. Not the underlying species tree and that might be quite different
  • asked a question related to Phylogenetic Analysis
Question
10 answers
I have a question, when I used the picante function to analyze the phylogenetic community structure and phylogenetic signals in R, it appeared an error:"'phylo' is not rooted and fully dichotomous", I don't understand what's the problem, the attached file is my phylogenetic information, please check it, I am sorry to trouble you all, but I really want to solve this problem, thanks a lot.
Relevant answer
Answer
Yes, i think too
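*The tree needs to be rooted and fully dichotomous. Try this code, phy=multi2di(phy1), where phy1 is your original tree; multi2di() from the ape package resolves polytomies into dichotomies. If the tree is also unrooted, root it first with ape's root() function.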
Thanks Wenjie Wan
  • asked a question related to Phylogenetic Analysis
Question
3 answers
I will do an experimental analysis to perform a comparative genomic analysis among Avian Pathogenic E. coli to explore the characteristics of type 3 secretion system loci in APEC. The idea is to isolate and compare APEC strains from chickens' brains with septicaemic symptoms and from chickens without symptoms (control). First, I will use RT-PCR to screen all APEC isolates for the presence of conserved T3SS structural core genes. Then, I will use whole genome sequencing and bioinformatics to screen the genomes for the presence/absence of a full set of genes known to be part of a complete T3SS, MLST, and other genes. To compare the APEC field strains with the control, I will use virulence and AMR gene profiling and phylogenetic analysis (based on MLST and average nucleotide identity). I wondered which is/are the best way/ways to analyze the results and what I should expect in terms of results. Thank you for your thoughts; I appreciate it.
Relevant answer
Answer
I like the CGE website for a quick estimation, whether and which virulence and resistance factors are present. It also works for a gMLST analysis: https://www.genomicepidemiology.org/
For comparative genomics, we use Prokka, Snippy and Roary.
  • asked a question related to Phylogenetic Analysis
Question
3 answers
Hello, everyone,
One of the aims of my current study is to place a particular plant species within the context of the whole genus phylogenetically. The systematic position of these plant species is well known. But the species that I have in hand is rare and endemic and its phylogenetic position is obscured.
I will launch a phylogenetic analysis of the DNA sequences of this plant species. I have a few DNA sequences of 5 markers. Please I need help and answers to the following points:
1- I did a BLASTn search, should I use Mega blast search instead?
2- If the retrieved list includes a plant species with several accessions, should I download all accessions or just one accession for each species is enough for the phylogenetic analysis?
3- What is the threshold of similarity percent to the query sequence I should select?
4- How many accessions needed to cover the diversity of the plant species under investigation?
5- Which phylogenetic analysis methodology suits systematic research, (i.e.) Bayesian analysis, Maximum likelihood, or something else?
Relevant answer
Answer
1- Any homology search tool should be fine. Yes, it is better to use megablast, and even better to use protein sequences rather than nucleotide sequences; in that case I recommend DELTA-BLAST (Domain Enhanced Lookup Time Accelerated BLAST).
2- Generally, the more representative sequences you include, the more robust the tree you get. A balanced number of representative sequences per group is also recommended.
3- There is no similarity cut-off, but it is better to go for 90% similarity or more, keeping the sequence coverage in mind.
4- I think this point was covered in #2.
5- Both Bayesian analysis and Maximum likelihood are recommended.
  • asked a question related to Phylogenetic Analysis
Question
4 answers
I am interested in the different tools that can be used to create custom databases for targeted sequencing and how to trim the databases based on the amplicon size? Also, should custom databases contain species not assigned to a species level?
Relevant answer
Answer
Not familiar with taxonomy databases, but I can always recommend SDL Multiterm. It is a customisable type of database that is used for storing terms and their explanations, create 'fields' between them etc. Hopefully the program can be of help for you, or at least point you in the right direction.
  • asked a question related to Phylogenetic Analysis
Question
8 answers
I am carrying out a phylogenetic analysis on the relationships of several families within Anura and was planning on using cytb and COI in my analysis but I am not sure whether I should be choosing less conserved genes?
Relevant answer
Answer
There is no single correct answer to this question, and the above three responses all make good points.
In response to Shi, I would say that the matrilineal bias of mtDNA is mainly a problem at the level of populations or phylogeography, and that by the time species have separated and no longer interbreed, the entire genome is an isolated, independently evolving unit from the genomes of other taxa (barring horizontal transfer or other non-phylogenetic phenomena). However, of course, any individual gene or the entire genome might not reveal the "true evolutionary history".
Mustafa is correct that patterns implied by multiple genes can be confusing, and it is arguable that the entire field of phylogenomics has become confused about how to treat gene trees vs. "the species tree" with "deep coalescence" and so forth.
Filipe makes a good practical point, and probably the best advice for Lucy: look at studies of comparable levels of diversity in related groups of frogs - particularly ones that are not belabored with hand-wringing about gene tree incongruence, and see which markers they have used. Using the same markers as colleagues in your field provides you with sequences you can use for outgroups or to enlarge your own data set, and likely means that your paper will be more widely cited than if you develop markers that others have not used before (unless you are extremely lucky and everyone else starts using them, too).
  • asked a question related to Phylogenetic Analysis
Question
10 answers
What are the benefits of concatenating two or more gene sequences in alignments, and is it better to use nucleotide sequences or amino acid sequences in this? Also, does this work for all genes from the same taxa or are there exceptions?
Relevant answer
Answer
The benefit is to increase the power of the test to estimate the tree and its various parameters. Too little data (short sequences) can contain too little phylogenetically informative signal to resolve the tree. The downside, as with any statistical analysis, is that such concatenation can join genes with rather different evolutionary histories that need to be described by different models of sequence evolution. So you increase the number of parameters (model, tree and branch lengths) needed to resolve the tree in an unbiased manner. The longer the sequence, the better the chance you have of finding a tree with high support. But if you fail to account for the heterogeneity of the evolutionary process across genes, you are likely to get an erroneous tree with high support: increased confidence in the wrong tree. So you must estimate model parameters for each gene and then do an analysis with the data stratified. Let me know what your software is and I can tell you how to accomplish this.
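To make the idea of a stratified (partitioned) analysis concrete, here is a minimal Python sketch that joins per-gene alignments into one supermatrix and records where each gene starts and ends, so that a separate model can later be assigned to each block. The gene names, taxa and sequences are made up, and padding absent taxa with "?" is an assumption about how missing data should be coded for your software.

```python
def concatenate_alignments(genes):
    """Concatenate per-gene alignments.

    genes maps gene name -> {taxon: aligned sequence}.
    Returns the supermatrix {taxon: sequence} and a list of
    (gene, start, end) partitions with 1-based inclusive coordinates.
    """
    taxa = sorted({taxon for aln in genes.values() for taxon in aln})
    supermatrix = {taxon: "" for taxon in taxa}
    partitions = []
    position = 0
    for gene, aln in genes.items():
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: sequences differ in length, align them first")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa lacking this gene are padded with missing-data symbols
            supermatrix[taxon] += aln.get(taxon, "?" * length)
        partitions.append((gene, position + 1, position + length))
        position += length
    return supermatrix, partitions

# Hypothetical toy alignments
genes = {
    "geneA": {"sp1": "ATGC", "sp2": "ATGA"},
    "geneB": {"sp1": "GG-TT", "sp3": "GGCTT"},
}
supermatrix, partitions = concatenate_alignments(genes)
for gene, start, end in partitions:
    print(f"{gene} = {start}-{end}")
```

The printed ranges are the kind of information a partition or charset definition needs; the exact syntax depends on the program used.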
  • asked a question related to Phylogenetic Analysis
Question
3 answers
Hello!
I now have a complete list of about 23k plant species that I want to check if Genbank has the sequencing data for a phylogenetic analysis later, and I am really new to the topic.
Is there a way to do this using R?
Thanks!
Relevant answer
Answer
You can download the sequences from the NCBI, and for phylogenetic analysis, you can use MEGA or R software
I hope this answer helps
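For a list of about 23k species, checking by hand is not realistic, so the search would normally be scripted against the NCBI E-utilities. The question asks about R, where the rentrez package wraps the same service; the sketch below uses Python only to show how one esearch query per species could be built. The species name and the marker rbcL are placeholders, and the code only constructs the URL without sending any request.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(species, gene):
    """Build an E-utilities esearch URL that counts nucleotide records
    for one species and one marker (both supplied by the caller)."""
    term = f'"{species}"[Organism] AND {gene}[Gene]'
    params = {
        "db": "nuccore",    # GenBank nucleotide records
        "term": term,
        "retmax": 0,        # only the hit count is needed, not the IDs
        "retmode": "json",  # the reply reports the count under esearchresult.count
    }
    return f"{ESEARCH}?{urlencode(params)}"

# Placeholder species and marker
url = esearch_url("Quercus robur", "rbcL")
print(url)
```

When looping over thousands of species, NCBI's usage guidelines on request rates apply, so the requests should be throttled.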
  • asked a question related to Phylogenetic Analysis
Question
6 answers
I am getting negative branches with the Neighbor-joining method, which I set to zero. However, I've read that I should transfer negative distances somewhere else and I do not know how. Does anyone have a script/method to transfer negative distances to the corresponding branches? Thanks in advance for any help.
Relevant answer
Answer
In addition to Michael B Black's comments, here is a function to fix negative edge lengths, in case anyone uses the ape package:
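library(data.table)  # provides as.data.table(), setnames() and the [i, j] syntax

# Remove negative edge lengths from a neighbor-joining tree (an ape "phylo" object)
fix_negative_edge_length <- function(nj.tree) {
  # One row per edge: parent node, child node, edge length
  edge_infos <- as.data.table(cbind(nj.tree$edge, nj.tree$edge.length))
  setnames(edge_infos, c('from', 'to', 'length'))
  # Parent nodes with at least one negative child edge
  nega_froms <- edge_infos[length < 0, sort(unique(from))]
  for (nega_from in nega_froms) {
    # Most negative edge length among the children of this node
    minus_length <- edge_infos[from == nega_from, ][order(length)][1, length]
    # Lengthen all child edges so that the most negative one becomes zero
    edge_infos[from == nega_from, length := length - minus_length]
    # Shorten the edge leading to this node by the same amount
    edge_infos[to == nega_from, length := length + minus_length]
  }
  nj.tree$edge.length <- edge_infos$length
  nj.tree
}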
  • asked a question related to Phylogenetic Analysis
Question
3 answers
I have a wide range of freshwater microalgae collected from various districts of Tamil Nadu. I am looking for a collaborator who could help me with molecular identification of some rare unknown taxa. Please ping me for further details.
Relevant answer
Answer
Hi,
rbcL and ITS are the most used ones.
Good luck.
  • asked a question related to Phylogenetic Analysis
Question
12 answers
Normally, dN/dS ratios are calculated and interpreted as follows: below 1 indicates negative selection, above 1 indicates positive selection, and 1 means neutral evolution. How should dS/dN ratios be interpreted? Programs like SNAP provide dS/dN graphs and ratios.
For example:
Averages of all pairwise comparisons: ds = 0.1678, dn = 0.4090, ds/dn = 0.4072, ps/pn = 0.4707
Please see image as well.
Can somebody explain it in simple words, as I am not very familiar with this?
Relevant answer
Answer
Can anyone interpret dN/dS = .099, which is defined as the default in PAML when dS = 0 or when dN = dS = 0?
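Since dS/dN is simply the reciprocal of dN/dS, the SNAP averages quoted in the question can be converted and then read with the usual rule stated there. A small Python sketch, using only the numbers from the question; the function name and the wording of the labels are illustrative.

```python
def interpret(dn, ds):
    """Return (dN/dS, verbal reading) using the usual rule of thumb."""
    if ds == 0:
        return None, "undefined (dS = 0)"
    omega = dn / ds
    if omega > 1:
        label = "positive selection"
    elif omega < 1:
        label = "negative (purifying) selection"
    else:
        label = "neutral"
    return omega, label

# Averages of all pairwise comparisons quoted from the SNAP output
ds, dn = 0.1678, 0.4090
omega, label = interpret(dn, ds)
print(round(omega, 2), label)  # 2.44 positive selection
```

Inverting the reported average ratio instead (1 / 0.4072, about 2.46) gives a slightly different number, because an average of pairwise ratios is not the same as a ratio of averages. Either way, a dS/dN below 1 corresponds to a dN/dS above 1.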
  • asked a question related to Phylogenetic Analysis
Question
11 answers
Hello all,
I am trying to obtain information about the specific conditions for the method of preserving feces samples for mammals ( preservative type, preservative concentrate, preservation temperature, ....) in order to perform a phylogenetic analysis.
Thank you
Relevant answer
Answer
Thank you so much for this response
  • asked a question related to Phylogenetic Analysis
Question
2 answers
I am conducting a phylogenetic analysis on LuxR solo of alpha, beta and gammaproteobacteria. I am building three separate trees for each. What would serve as the best outgroup in all three trees?
Relevant answer
Answer
Thank you very much!!
  • asked a question related to Phylogenetic Analysis
Question
7 answers
I found some bases with disagreement in the assembly of the contig. I read that because it is a nuclear marker, this disagreement may not be due to a failure in sequencing but due to heterozygosity, and therefore I should not change the base in the contig by the highest-quality criterion, but set it as an IUPAC ambiguity code.
On the other hand, I was told that when it is heterozygous, the peaks are usually the same size. But it is not always easy to determine how equal it is.
So I would like to know if there is a cut-off point or defined protocol to determine if a mismatch is due to heterozygosity or an error in sequencing on one of the strands?
Relevant answer
Answer
Look at the read coverage over the sites in question and observe the allele frequencies before deciding how to proceed. With coverage you can gain confidence and establish a p-value, which you then use to determine whether the polymorphic site you are observing is likely to be the result of some random process, most likely sequencing errors. If your p-value is <= 0.05 (a typical threshold), look at the allele frequencies and only then decide what type of allelic variation you are observing. Read base quality scores should be treated as guides only: they relate only to the sequencer's base calls and do not account for errors arising during sample library preparation. Personally, I ignore quality scores completely when aligning reads, as the relationship between base call quality scores and alignment substitution rates is usually swamped by library preparation errors.
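The coverage-based reasoning in this answer can be sketched as a one-sided binomial test: given the coverage at a site and the number of reads carrying the second base, how likely is that count if only sequencing errors were at work? This is a minimal illustration. The per-base error rate of 0.01, the 0.05 threshold, and the 0.3 to 0.7 frequency band for calling a heterozygote are all assumptions chosen for the example, and the function names are hypothetical.

```python
from math import comb


def binomial_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def classify_site(coverage, alt_reads, error_rate=0.01, alpha=0.05,
                  het_band=(0.3, 0.7)):
    """Classify a mismatching site from read counts.

    error_rate, alpha and het_band are illustrative assumptions,
    not values taken from any particular pipeline.
    """
    p_value = binomial_tail(coverage, alt_reads, error_rate)
    frequency = alt_reads / coverage
    if p_value > alpha:
        call = "likely sequencing error"
    elif het_band[0] <= frequency <= het_band[1]:
        call = "consistent with heterozygosity (IUPAC ambiguity code)"
    else:
        call = "unlikely to be random error, but skewed; inspect manually"
    return p_value, frequency, call


for coverage, alt in [(40, 18), (40, 1), (100, 8)]:
    p, f, call = classify_site(coverage, alt)
    print(f"coverage={coverage} alt={alt} freq={f:.2f} p={p:.3g} -> {call}")
```

With only two Sanger reads (forward and reverse) there is no coverage to test, so this kind of check applies to read-based data rather than to a single pair of chromatograms.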
  • asked a question related to Phylogenetic Analysis
Question
1 answer
What is the cophenetic correlation coefficient, and how is it computed for a clustering method?
Also, how can it be used to compare two different distance matrices?
Relevant answer
Answer
A clustering method operates on some measure of resemblance (similarity/dissimilarity/distance) among objects. It uses those resemblances to produce a result, be it a dendrogram or some other result. Cophenetic correlation is a measure of how well the clustering result matches the original resemblances. So, as an example, similarities among samples are clustered using a method like UPGMA to produce a dendrogram. The distances among samples are calculated through the dendrogram (actually to a common node, but the idea holds) to give cophenetic distances. The between-sample original resemblances are correlated with the cophenetic distances to give cophenetic correlation. If the value is high (near 1) the clustering result is an excellent representation of the original distances, if it is <<1 then it is not. See Clarke KR, Somerfield PJ, Gorley RN (2016) Clustering in non-parametric multivariate analyses. J. Exp. Mar. Biol. Ecol. 483: 147-155 doi: 10.1016/j.jembe.2016.07.010 which incidentally contains an error as 2 figures are swapped, but it illustrates the idea.
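The procedure described in this answer (cluster with UPGMA, read the cophenetic distances off the dendrogram, then correlate them with the original distances) can be sketched in plain Python as follows. This is a small, unoptimized illustration under stated assumptions: the function names and the two example matrices are made up, the cophenetic distance is taken as the average-linkage height at which two samples first join, and Pearson correlation is used.

```python
from itertools import combinations
from math import sqrt


def upgma_cophenetic(dist):
    """Cluster a square distance matrix with UPGMA (average linkage) and
    return the matrix of cophenetic distances (merge heights)."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}  # cluster id -> member indices
    coph = [[0.0] * n for _ in range(n)]
    while len(clusters) > 1:
        best = None
        for a, b in combinations(list(clusters), 2):
            pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
            avg = sum(dist[i][j] for i, j in pairs) / len(pairs)
            if best is None or avg < best[0]:
                best = (avg, a, b)
        height, a, b = best
        # Every pair split between the two merged clusters joins at this height.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[i][j] = coph[j][i] = height
        clusters[a] = clusters[a] + clusters.pop(b)
    return coph


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def cophenetic_correlation(dist):
    """Correlate original distances with UPGMA cophenetic distances."""
    coph = upgma_cophenetic(dist)
    idx = list(combinations(range(len(dist)), 2))
    return pearson([dist[i][j] for i, j in idx], [coph[i][j] for i, j in idx])


# Hypothetical example matrices: the first is ultrametric, the second is not.
ultrametric = [[0, 2, 8, 8], [2, 0, 8, 8], [8, 8, 0, 4], [8, 8, 4, 0]]
distorted = [[0, 2, 6, 10], [2, 0, 5, 9], [6, 5, 0, 4], [10, 9, 4, 0]]
print(cophenetic_correlation(ultrametric))
print(cophenetic_correlation(distorted))
```

Because correlation is insensitive to scale, it makes no difference whether the merge height or half of it is used as the cophenetic distance. A dendrogram reproduces an ultrametric matrix exactly, so the first value reaches 1, while the second falls below 1 because the tree distorts the original distances.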
  • asked a question related to Phylogenetic Analysis
Question
3 answers
Hi,
I'm currently trying to do a multiple sequence alignment for a gene family using the coding nucleotide sequences. I want to create an alignment based on the amino acid sequences but that is in the form of nucleotides, so that I'm able to account for synonymous mutations in my subsequent analyses. The ClustalW (codons) and MUSCLE (codons) alignment options in MEGA-X sound like they fit my needs. However, I am unable to select them in the drop down menu and they are greyed out (see picture attached).
Does anyone know of any reasons why these options might not be available to me?
Thanks in advance,
Jacob
Relevant answer
First option: Have you tried translating your data from genome sequences to amino-acids? That may help MEGA to interpret the data as proteins.
Second option: When you open your file, MEGA "asks" if the sequences are protein-coding (and even asks the origin of them). Try saving your data as a new FASTA file and then opening like that in MEGA.
I hope one of these helps.
*Although I believe you may be able to align your data as genetic sequences, I do not know much of proteomics and what would be the adequate method (via ACTG or amino-acids) to perform a phylogenetic analysis with that specific objective.
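What the question asks for (an alignment guided by amino acids but expressed as nucleotides) can also be built by hand: align the translated proteins, then thread each original codon back onto its aligned residue and write three gap characters wherever the protein alignment has a gap. Dedicated tools such as PAL2NAL perform this step; the sketch below only illustrates the idea. The function name and the toy sequences are hypothetical, and the code assumes the coding sequence is in frame and corresponds residue for residue to the aligned protein.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def thread_codons(aligned_protein, cds, gap="-"):
    """Map an unaligned coding sequence onto its gapped protein alignment.

    Assumes the CDS is in frame and encodes exactly the residues in
    aligned_protein (a single trailing stop codon is tolerated).
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(codons) == n_residues + 1 and codons[-1] in STOP_CODONS:
        codons = codons[:-1]  # drop the terminal stop codon
    if len(codons) != n_residues:
        raise ValueError("codon count does not match residue count")
    codon_iter = iter(codons)
    # One codon per residue, three gap characters per alignment gap.
    return "".join(gap * 3 if aa == gap else next(codon_iter)
                   for aa in aligned_protein)


# Toy example: two hypothetical sequences from a protein alignment.
print(thread_codons("MK-L", "ATGAAACTG"))
print(thread_codons("MKAL", "ATGAAGGCTCTGTAA"))
```

The resulting rows have equal length (three times the protein alignment length), and gaps never split a codon, which is what downstream synonymous and nonsynonymous analyses require.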
  • asked a question related to Phylogenetic Analysis
Question
2 answers
Hi,
In my study I performed a coalescent-model phylogenetic analysis, but beforehand I ran an admixture analysis to identify the probable number of ancestral populations. I would like to relate my admixture results to the coalescent tree, that is, to derive a connection between these two analyses. If someone knows an answer to this, please help.
Relevant answer
Answer
While the coalescent-model phylogenetic analysis tells you about the phylogenetic signal among individuals or species (i.e. who is related to whom), admixture assigns individuals to clusters in a way somewhat similar to an axis (dimensionality) reduction: it tells you how many components best explain the distribution of the data and the probability that an individual belongs to each component.
If the structure and phylogenetic signal in your data are strong, you may see that the phylogeny and the admixture/structure results "match", so that the n individuals belonging to one admixture cluster form a monophyletic group on the phylogenetic tree.
  • asked a question related to Phylogenetic Analysis
Question
4 answers
Hi all,
I'm a structural biologist and I'm attempting to trace the evolution of a protein domain family of interest. I have a very large collection of domain sequences that (we hypothesise) have diverged in sequence and structure over time from a common ancestor to the specialised domains that we're looking at (~15 domains in total). This is challenging because the domains are very different from each other in sequence, so we're trying to see if we can trace the evolution by increasing the number of sequences to tease out the phylogenetic signal. To trace the evolution, I acquired a large number of sequences of these domains (~10,000-15,000 each) and performed a super-huge alignment of all of them using PASTA over 20 iterations, coming out at ~150,000 sequences (~300aa long on average).
I now need to perform a phylogenetic analysis, and I did attempt to use PhyML on my university's supercomputer. I get error messages saying that I should not use more than 4000 taxa. My question to you, dear phylogenetic fellows, is: which software do you think is best for performing phylogenetic analysis of such a large number of aligned sequences? Any help would be appreciated!
P.S. as a side question - how many bootstraps should I be aiming for? I've read that ~100 is generally acceptable, and I set PhyML to run 500 (though I fear this may be unrealistic given the size of the input data).
Thanks very much!
Rob
Relevant answer
Answer
Hi Rob,
You may try RAxML and IQ-TREE, both of which are also maximum likelihood methods but have been optimized for big datasets. With 150k sequences, I guess they will still take a long time to complete. An alternative approach is to use a distance-based method and build an NJ tree. Distance methods are computationally much less intensive but do not guarantee the best tree given the data.
As for the number of bootstrap replicates, it depends on the method being used. I remember there is a paper evaluating the optimal number of bootstrap replicates for RAxML. It found that 100 replicates were not enough, 1000 replicates were not necessary, and 200 replicates were satisfactory in most cases. For the ultrafast bootstrap method used by IQ-TREE, at least 1000 replicates are recommended.
  • asked a question related to Phylogenetic Analysis
Question
8 answers
I am studying two types of ophidiid fishes that are morphologically similar and would like to classify them genetically.
I have already investigated the COI region of mtDNA, but the genetic difference is quite small (2 to 7 mutational steps between the nearest and farthest haplotypes), so I am considering conducting experiments using nuclear DNA.
However, I am not sure which genetic marker would be suitable for showing a more marked difference than COI.
So far, I am considering using the RAG1 region, which has been studied in related species, or ITS, where base substitutions are likely to occur because it is a non-coding region.
Could you tell me if there are better genetic markers that have a fast base substitution rate and are effective in clarifying interspecific or intraspecific differences?
Relevant answer
Answer
Hi Mita,
Given that your fish species are closely related, as revealed by mtDNA, it is preferable to choose hypervariable regions of the nuclear genome. Markers that occur to me are microsatellites (STRs) and MHC. In animals, STRs generally have a faster mutation rate than mtDNA. Some regions of MHC have also evolved very fast due to selection. The review paper below might help.
Zhang DX, Hewitt GM (2003) Nuclear DNA analyses in genetic studies of populations: practice, problems and prospects. Molecular Ecology, 12: 563-584.
  • asked a question related to Phylogenetic Analysis
Question
4 answers
I've tried several programs (e.g. TreeView, FigTree, MEGA, Archaeopteryx, Mesquite) but they just display node numbers and branch lengths. I'm using the software SYMMETREE to infer diversification rate shifts on particular nodes within a tree. Results refer to "branch numbers" which, according to the manual, can be displayed using MacClade. However, I'm not using a Mac OS platform.
Relevant answer
Answer
Yes, ape in R is a good option.
To give a direct answer to the specific question you asked, the following R code should do what you require and write the tree to a PDF file in your current working directory:
library(ape)
# Use forward slashes in the path: inside an R string, "\t" and "\f" are escape sequences
tree <- read.nexus(file = "path/to/file")
tree_export <- "tree.pdf"
pdf(file = tree_export, width = 6, height = 6)
# Plot the topology only (branch lengths ignored), with small tip labels
plot(tree, cex = 0.5, use.edge.length = FALSE)
axisPhylo()
# Label each branch with its edge number
edgelabels(cex = 0.25, width = 0.1)
dev.off()
Depending on your tree structure, just play around with the "cex" and "width" parameters to make the branch numbers readable; the above parameters are reasonable starting points.
The above code assumes your tree is in Nexus format. For the commonly used Newick format do this:
library(ape)
# Same as above, but read.tree() parses Newick instead of Nexus
tree <- read.tree(file = "path/to/file")
tree_export <- "tree.pdf"
pdf(file = tree_export, width = 6, height = 6)
plot(tree, cex = 0.5, use.edge.length = FALSE)
axisPhylo()
edgelabels(cex = 0.25, width = 0.1)
dev.off()
i.e. swap the "read.nexus" function for "read.tree".
  • asked a question related to Phylogenetic Analysis
Question
3 answers
Is it possible to enter two different types of information into Mesquite?
I have two sets of species data:
  • the divergence times (in millions of years) i.e. how old each species is and the relationships between them i.e. how they are grouped on a tree, using brackets.
  • the sexual system status of each species (e.g. 0 = gonochoristic, 1 = hermaphroditic).
It seems I can only open one of the following to make a tree:
  • a nexus file containing the divergence times/relationship data (begin trees;)
or
  • a nexus file containing character states (begin data;)
I have combined both sets of information (see attachment) into a single nexus file but this doesn't seem to work; I get a brilliant tree with all the species in the correct positions and branching correctly, but the character states are not shown.
It appears the only way to combine them is to open the character nexus file into a tree, re-position every branch/species until they match a reliable existing phylogeny, and then add divergence times manually too.
As the fish families I am studying have upwards of 500 species in them, this is a long, slow and mind-numbing process!
Does anyone know of a better way? I feel I am missing something really obvious that could save me a heap of time!
Many thanks!
Relevant answer
Answer
You have to do it slightly differently. Mesquite needs to know that the taxa on your tree and those in your matrix refer to the same things. It doesn't assume that the way you have represented it. Instead:
  • Create a "taxa" block that has all the names that are in both the tree and the data. If there is only a single taxa block in your file, Mesquite will assume that the trees and characters blocks refer to that taxa block.
  • Encode your characters not in a "data" block (which is deprecated) but in a "characters" block.
If you get the syntax sorted out correctly, then what you are trying to accomplish is very much possible. To demonstrate the end result I have done this for you and activated the 'trace character history' window so you can see it all lines up. File attached.
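For readers without access to the attachment, the layout described above (one taxa block shared by a characters block and a trees block) can be sketched by generating the file from a small script. This is an illustration only: the species names, character states, branch lengths and tree name below are hypothetical placeholders, the function name is made up, and the output has not been checked against any particular Mesquite version.

```python
def build_nexus(states, newick, tree_name="dated_tree"):
    """Assemble a NEXUS file with TAXA, CHARACTERS and TREES blocks.

    states maps taxon name -> character state ("0" or "1");
    newick is a tree string using exactly the same taxon names.
    """
    taxa = list(states)
    lines = [
        "#NEXUS",
        "",
        "BEGIN TAXA;",
        f"  DIMENSIONS NTAX={len(taxa)};",
        "  TAXLABELS " + " ".join(taxa) + ";",
        "END;",
        "",
        "BEGIN CHARACTERS;",
        "  DIMENSIONS NCHAR=1;",
        '  FORMAT DATATYPE=STANDARD SYMBOLS="0 1" MISSING=? GAP=-;',
        "  MATRIX",
    ]
    lines += [f"    {name}  {state}" for name, state in states.items()]
    lines += [
        "  ;",
        "END;",
        "",
        "BEGIN TREES;",
        f"  TREE {tree_name} = {newick}",
        "END;",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical data: 0 = gonochoristic, 1 = hermaphroditic; ages in Myr.
sexual_system = {"Species_A": "0", "Species_B": "1", "Species_C": "0"}
tree = "((Species_A:2.5,Species_B:2.5):4.0,Species_C:6.5);"
print(build_nexus(sexual_system, tree))
```

The key point from the answer is that the names in TAXLABELS, in the matrix rows and in the tree string are identical, so the tree and the character states are tied to the same set of taxa.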
  • asked a question related to Phylogenetic Analysis
Question
9 answers
What is the best way to date a phylogenomic tree using fossil calibration? It is more or less straightforward with a few Sanger loci using programs like BEAST, but it becomes intractable with hundreds of genes, as produced with phylogenomic approaches (e.g., target capture). Just wondering if anyone had any opinions?
Thanks a lot!
Relevant answer
Answer
First, you must accept that there is no such thing as "the best". Instead, you need to treat estimating the divergence time of any lineage with any kind of data as a serious endeavour that requires a lot of attention and exploration of the data. Secondly, you have to explore both the molecular data and any calibrations available (mainly from the fossil record) with due attention to detail. Thirdly, you need to pay attention to the biology of your organisms. Finally, you can explore different programs such as MCMC, but keep in mind that none of the established models will be suitable for all cases, and that the differences lie less between programs than between the models of evolution available in those programs.
  • asked a question related to Phylogenetic Analysis
Question
4 answers
which software can be used to prepare nexus file for phylogenetics analysis in Mr Bayes? which form