Question
Asked 11th Mar, 2015

When do you consider two proteins to be homologous?

Based on amino acid sequence similarity only, without having a closer look at structure! Of course I know how important structure is. Still, when comparing a lot of proteins you might start with sequence similarity, e.g. when you want to define a core genome or look for conserved proteins (functions) in different species in order to obtain molecular markers. Different tools use different threshold levels (CMG biotools, panseq, etc.).
So I am interested in a percent-identity / query-coverage combination to use for a first screening for marker proteins that have the same function across species.
I am interested in your opinion!

Most recent answer

20th Mar, 2015
Andreas J Geissler
Technische Universität München
Thank you guys for the valuable input! Especially @ Robert Rentzsch, as his paper is exactly what I needed (-:
1 Recommendation

Popular Answers (1)

13th Mar, 2015
Julian Echave
National University of General San Martín
I think underlying the question and some of the responses, there's a mistaken use of the term Homologous.
Two proteins are homologous if they have a common ancestor, whatever their sequences, structures, or functions. Homology = common ancestry.
Similarity (in any of those levels, sequence, structure, function, etc) may or may not imply homology (e.g. you may have the same function due to convergent evolution, and proteins may have similar architectures just because there're only so many ways for amino acid sequences to fold stably).
Homology may or may not result in Similarity: a single mutation leads to a homologous protein, and yet may drastically change the structure and/or function.
16 Recommendations

All Answers (9)

11th Mar, 2015
Jose Sergio Hleap
SHARCNET
I think you have an issue with the wording. To be IDENTICAL, they need to have an identity of 100%, and therefore the sequences have to be the same. Now if you are asking about homology, that is a very different question. I am not a friend of rules of thumb, since they normally do poorly in complex cases. If you are talking about homology, most people feel comfortable if the sequence identity is >40%. Then there is the twilight zone (20-35% sequence identity), where things get a bit more complicated and structure (in my opinion) is needed to truly establish homology.
15 Recommendations
11th Mar, 2015
Muhammad Radifar
Universitas Gadjah Mada
I agree with Jose's answer. In addition, I would still consider two proteins with the same name and from the same organism, differing by only one or a few point mutations, to be identical.
1 Recommendation
Hi Andreas, I agree with the above comments. If you have 100% amino acid sequence identity, then the proteins are identical.
But many identical proteins have isoforms due to differential splicing, post-translational modifications, etc. Their identity will have to be evaluated on a case-by-case basis.
Comparing a protein sequence across species, you will find sequences that are similar. They are not identical proteins in the strict sense of the term. They are homologous proteins that perform the same function in a different species.
More distant species will have more variable sequences.
3 Recommendations
12th Mar, 2015
Andreas J Geissler
Technische Universität München
Hi there!
Thanks for all the answers! I am sorry for the misleading use of "identical". As Jose Sergio Hleap immediately suggested, I actually meant very similar on the sequence level (but identical in function)! I know that structure, domains etc. are quite important. Still, I am only looking for a measure of when to consider 2 or more proteins to be homologous based on sequence similarity only, as we are trying to optimize a bit of code using blastp in order to find some marker proteins for particular ecotypes. To define it more clearly: let's say strains of 10 species use the same mechanism to deal with low Mg content using transporter X, while some of them differ enough that you won't find that target gene by comparing genes on the nucleotide level. So I am looking for proteins that are identical, but on a functional level. Anyway, for a first screening one might start with sequence alignment, which is why I am looking for reasonable measures! @Jose Sergio Hleap: I assume using 40% similarity in the case of a query coverage of at least 90%? I was also thinking about an appropriate value for query coverage. At the moment I am thinking about 70% identity (suggested by others) and 90% query coverage. I know that by applying rules of thumb I might lose potential targets and that it is necessary to have a closer look afterwards. Anyway, I need a first rough screening for my data.
Thanks for your help!
1 Recommendation
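As a rough illustration of the first-pass screening described above, here is a minimal Python sketch that filters blastp tabular hits by percent identity and query coverage. It assumes the hits were written as tab-separated rows with the 12 standard BLAST tabular columns followed by the query length as a 13th column (for example via `-outfmt "6 std qlen"`). The function name `filter_hits`, the example rows, and the sequence IDs are hypothetical, and the 70% / 90% defaults are simply the values floated in the post, not a recommendation.

```python
def filter_hits(lines, min_identity=70.0, min_query_coverage=90.0):
    """Keep hits that pass both thresholds.

    Assumed row layout (tab-separated): the 12 standard BLAST tabular
    columns, then the query length in column 13.
    Coverage is computed per alignment (HSP), not summed over HSPs.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident = float(fields[2])                 # percent identity
        qstart, qend = int(fields[6]), int(fields[7])
        qlen = int(fields[12])                    # assumed extra column
        coverage = 100.0 * (abs(qend - qstart) + 1) / qlen
        if pident >= min_identity and coverage >= min_query_coverage:
            kept.append((qseqid, sseqid, pident, round(coverage, 1)))
    return kept


# Hypothetical example rows (made-up IDs and values).
example = [
    "protX\tstrainA_0001\t82.5\t190\t33\t0\t5\t194\t3\t192\t1e-90\t320\t200",
    "protX\tstrainB_0417\t45.0\t100\t55\t0\t1\t100\t1\t100\t1e-20\t90\t200",
]
hits = filter_hits(example)
relaxed_hits = filter_hits(example, min_identity=40.0, min_query_coverage=50.0)
```

With the default thresholds only the first row is kept; relaxing them to 40% identity and 50% coverage keeps both. Hits that pass such a filter are candidates for closer inspection, not confirmed homologs.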
12th Mar, 2015
Vincent Chiang
National Defense Medical Center
Try this definition: orthologous core genes are those that share over 50 percent homologous sequence between two strains (the 50/50 rule).
Ref. : Ussery DW, Wassenaar TM, Borini S: Microbial communities: core and pan-genomics. Comparative Microbial Genomics 2009.
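The post does not spell out the exact criteria, so the following Python sketch rests on an assumption: it reads the 50/50 rule as "at least 50% identity over an alignment that covers at least 50% of the longer of the two sequences". The function name `passes_50_50` and its parameters are hypothetical; check the cited reference before relying on this interpretation.

```python
def passes_50_50(identity_pct, alignment_length, len_a, len_b, threshold=50.0):
    """Return True if a pairwise hit satisfies the assumed 50/50 reading.

    identity_pct      percent identity of the alignment
    alignment_length  number of aligned positions
    len_a, len_b      lengths of the two protein sequences
    """
    longer = max(len_a, len_b)
    coverage_pct = 100.0 * alignment_length / longer
    return identity_pct >= threshold and coverage_pct >= threshold


# Hypothetical values for illustration only.
ok = passes_50_50(62.0, 180, 300, 250)         # 60% of the longer sequence
too_short = passes_50_50(62.0, 120, 300, 250)  # only 40% of the longer sequence
too_diverged = passes_50_50(45.0, 280, 300, 250)
```

Measuring coverage against the longer sequence is stricter than using query coverage alone, because a short protein matching one domain of a much longer one would fail the test.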
14th Mar, 2015
Robert Rentzsch
Institut für Innovation und Technik (iit)
I highly recommend William Pearson's (the man behind FASTA, amongst other things) talks on the homology/sequence similarity subject, also answering the question: 'if there was one last common ancestor - or few - wouldn't all - or most - sequences be homologous?' His concept of 'excess similarity' may help to draw a reasonable line there...esp. with regard to the detectability of homology and distinguishing it from random noise. For example:
Just search for his talks (adding ".ppt" to Google search, or ".pdf") and publications.
HTH,
Rob
5 Recommendations
15th Mar, 2015
Robert Rentzsch
Institut für Innovation und Technik (iit)
Btw, on a more pragmatic note and possibly more helpful to Andreas short-term: there has been a long line of publications since at least the turn of the century dealing with exactly the question: how much is function conserved depending on sequence similarity or %identity. One of those papers was written by me et al and it briefly reviews the earlier studies: http://www.sciencedirect.com/science/article/pii/S0022283608015660. Other studies followed, however, and these included the now dominant Gene Ontology (which wasn't in a good shape for statistical comparisons when we did our study but is better now), e.g.: http://www.biomedcentral.com/1471-2164/8/222/.
Somewhat related to this: there has been a long discussion recently (over several publications) about the so-called 'orthologue conjecture', i.e. the assumption that orthologues are better conserved functionally than paralogues, which basically culminated in the notion that the current state of the GO, and genome coverage with GO annotations, makes this assumption untestable: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3510086/. Therefore, in essence: don't worry too much about this distinction for now. The average thresholds reported in the above papers are probably not better or worse than doing something more complicated at this point (like orthology and paralogy assignment with phylogenomics, etc). To any rule in biology, there are lots of exceptions...
9 Recommendations