Technische Universität München
Question
Asked 11 March 2015
When do you consider two proteins to be homologous?
Based on amino acid sequence similarity only, without having a closer look at structure! Of course I know how important the structure is. Anyway, when comparing a lot of proteins you might start with sequence similarity first, e.g. if you want to define a core genome or look for conserved proteins (functions) in different species in order to obtain molecular markers. Different tools use different threshold levels (CMG biotools, panseq, etc.).
So I am interested in a percent-identity / query-coverage combination to use for a first screening for marker proteins that have the same function across species.
I am interested in your opinion!
Most recent answer
Thank you guys for the valuable input! Especially @Robert Rentzsch, as his paper is exactly what I needed (-:
1 Recommendation
Popular answers (1)
I think underlying the question and some of the responses, there's a mistaken use of the term Homologous.
Two proteins are homologous if they have a common ancestor, whatever their sequences, structures, or functions. Homology = common ancestry.
Similarity (in any of those levels, sequence, structure, function, etc) may or may not imply homology (e.g. you may have the same function due to convergent evolution, and proteins may have similar architectures just because there're only so many ways for amino acid sequences to fold stably).
Homology may or may not result in Similarity: a single mutation leads to a homologous protein, and yet may drastically change the structure and/or function.
17 Recommendations
All Answers (9)
ProCogia
I think you have an issue with the wording. To be IDENTICAL, they need to have an identity of 100%, and therefore the sequences have to be the same. Now if you are asking about homology, that is a very different question. I am not a friend of rules of thumb since normally they do poorly in complex cases. If you are talking about homology, most people feel comfortable if the sequence identity is >40%. Then there is the twilight zone (20-35% sequence identity), where things get a bit more complicated and structure (in my opinion) is needed to truly establish homology.
16 Recommendations
Postdoctoral Institute for Computational Studies
I agree with Jose's answer. As an addition, I would still consider two proteins with the same name that come from the same organism, yet differ by one or a few point mutations, to be identical.
1 Recommendation
Arterna Biosciences
Hi Andreas, I agree with the above comments. If you have 100% amino acid sequence identity, then the proteins are identical.
But many identical proteins have isoforms due to differential splicing, post-translational modifications, etc. Their identity will have to be evaluated on a case-by-case basis.
Comparing a protein sequence across species, you will find sequences that are similar. They are not identical proteins in the strict sense of the term. They are homologous proteins that perform the same function in a different species.
More distant species will have more variable sequences.
3 Recommendations
Technische Universität München
Hi there!
Thanks for all the answers! I am sorry for the misleading use of "identical". As Jose Sergio Hleap immediately suggested, I actually meant very similar on the sequence level (but identical in function)! I know that structure, domains etc. are quite important. Still, I am only looking for a measure of when to consider 2 or more proteins to be homologous based on sequence similarity only, as we are trying to optimize a bit of code that uses blastp to find marker proteins for particular ecotypes. To define it more clearly: let's say strains of 10 species use the same mechanism to deal with low Mg content using transporter X, while some of them differ enough that you won't find that target gene by comparing genes on the nucleotide level. So I am looking for proteins that are identical, but on a functional level. Anyway, for a first screening one might start with sequence alignment, which is why I am looking for reasonable measures! @Jose Sergio Hleap: I assume you mean using 40% similarity with a query coverage of at least 90%? I was also wondering about an appropriate value for query coverage. At the moment I am thinking about 70% identity (suggested by others) and 90% query coverage. I know that by applying "rules of thumb" I might lose potential targets and that it is necessary to have a closer look afterwards. Anyway, I need a first rough screening for my data.
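A first rough screening of this kind could be sketched in Python as below. This is only a minimal sketch under stated assumptions: it assumes blastp was run with a tabular output whose first four columns are query ID, subject ID, percent identity and query coverage (e.g. `-outfmt "6 qseqid sseqid pident qcovs"`), and it uses the tentative 70% identity / 90% query coverage values from this post as defaults, not as validated thresholds. The names `Hit` and `filter_hits` and the example IDs are made up for illustration.

```python
import csv
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One row of the assumed tabular layout: qseqid, sseqid, pident, qcovs."""
    query: str
    subject: str
    pident: float  # percent identity of the alignment
    qcovs: float   # percent of the query covered by the subject


def filter_hits(lines: Iterable[str],
                min_pident: float = 70.0,
                min_qcovs: float = 90.0) -> List[Hit]:
    """Keep hits that reach both the identity and the query-coverage cutoff."""
    kept = []
    for row in csv.reader(lines, delimiter="\t"):
        # Skip blank lines and comment lines (as written by outfmt 7).
        if not row or row[0].startswith("#"):
            continue
        hit = Hit(row[0], row[1], float(row[2]), float(row[3]))
        if hit.pident >= min_pident and hit.qcovs >= min_qcovs:
            kept.append(hit)
    return kept


# Hypothetical example rows, tab-separated in the assumed column order.
example = [
    "geneA\tstrain2_0001\t82.5\t97",
    "geneA\tstrain3_0420\t45.1\t95",
    "geneB\tstrain2_0107\t91.0\t60",
]

strict = filter_hits(example)                    # 70% identity, 90% coverage
relaxed = filter_hits(example, min_pident=40.0)  # 40% identity, 90% coverage
print([h.subject for h in strict])
print([h.subject for h in relaxed])
```

Keeping both cutoffs as parameters makes it easy to compare a strict and a relaxed setting on the same data before looking at the surviving candidates more closely.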
Thanks for your help!
2 Recommendations
Fu Jen Catholic University
Try this definition: orthologous core genes are those sharing over 50 percent homologous sequence between two strains (the 50/50 rule).
Ref. : Ussery DW, Wassenaar TM, Borini S: Microbial communities: core and pan-genomics. Comparative Microbial Genomics 2009.
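As a rough illustration, a 50/50-style check might look like the sketch below. It assumes one common reading of the rule, namely at least 50% identity in an alignment that spans at least 50% of the longer of the two sequences; the answer above does not spell this out, so treat that reading, the function name `passes_50_50` and all parameter names as assumptions.

```python
def passes_50_50(identity_pct: float,
                 alignment_len: int,
                 len_a: int,
                 len_b: int,
                 min_identity: float = 50.0,
                 min_coverage: float = 50.0) -> bool:
    """Return True if a pairwise hit meets both 50% cutoffs.

    Coverage is measured against the longer sequence (an assumption),
    so a short fragment matching a long protein does not pass easily.
    """
    longer = max(len_a, len_b)
    if longer <= 0:
        raise ValueError("sequence lengths must be positive")
    coverage_pct = 100.0 * alignment_len / longer
    return identity_pct >= min_identity and coverage_pct >= min_coverage


# Hypothetical numbers: 62% identity over 300 aligned residues,
# proteins of 400 and 350 residues, so coverage is 300 / 400 = 75%.
print(passes_50_50(62.0, 300, 400, 350))
```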
Institut für Innovation und Technik (iit)
I highly recommend William Pearson's (the man behind FASTA, amongst other things) talks on the homology/sequence similarity subject, also answering the question: 'if there was one last common ancestor - or few - wouldn't all - or most - sequences be homologous?' His concept of 'excess similarity' may help to draw a reasonable line there...esp. with regard to the detectability of homology and distinguishing it from random noise. For example:
Just search for his talks (adding ".ppt" to Google search, or ".pdf") and publications.
HTH,
Rob
5 Recommendations
Institut für Innovation und Technik (iit)
Btw, on a more pragmatic note and possibly more helpful to Andreas short-term: there has been a long line of publications since at least the turn of the century dealing with exactly this question: how well is function conserved depending on sequence similarity or %identity? One of those papers was written by me et al. and it briefly reviews the earlier studies: http://www.sciencedirect.com/science/article/pii/S0022283608015660. Other studies followed, however, and these included the now dominant Gene Ontology (which wasn't in good shape for statistical comparisons when we did our study but is better now), e.g.: http://www.biomedcentral.com/1471-2164/8/222/.
Somewhat related to this: there has been a long discussion recently (over several publications) about the so-called 'orthologue conjecture', i.e. the assumption that orthologues are better conserved functionally than paralogues, which basically culminated in the notion that the current state of the GO, and genome coverage with GO annotations, makes this assumption untestable: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3510086/. Therefore, in essence: don't worry too much about this distinction for now. The average thresholds reported in the above papers are probably not better or worse than doing something more complicated at this point (like orthology and paralogy assignment with phylogenomics, etc.). To any rule in biology, there are lots of exceptions...
10 Recommendations