Question
Asked 11 March 2015

When do you consider two proteins to be homologous?

Based on amino acid sequence similarity only, without having a closer look at structure! Of course I know how important the structure is. Anyway, when comparing a lot of proteins you might start with sequence similarity first, e.g. in case you want to define a core genome or look for conserved proteins (functions) in different species in order to obtain molecular markers. Different tools use different threshold levels (CMG biotools, panseq, etc.).
So I am interested in a percent-identity / query-coverage combination to use for a first screening for marker proteins having the same function across species.
I am interested in your opinion!

All Answers (9)

I think you have an issue with the wording. To be IDENTICAL, they need to have an identity of 100%, and therefore the sequences have to be the same. Now if you are asking about homology, that is a very different question. I am not a friend of rules of thumb, since normally they do poorly in complex cases. If you are talking about homology, most people feel comfortable if the sequence identity is >40%. Then there is the twilight zone (20-35% sequence identity), where things get a bit more complicated and structure (in my opinion) is needed to truly establish homology.
16 Recommendations
Muhammad Radifar
Postdoctoral Institute for Computational Studies
I agree with Jose's answer, and as an addition, I still consider two proteins with the same name that come from the same organism, yet differ by one or a few point mutations, to be identical.
1 Recommendation
Steingrimur Stefansson
Arterna Biosciences
Hi Andreas, I agree with the above comments. If you have 100% amino acid sequence identity, then the proteins are identical.
But many identical proteins have isoforms due to differential splicing, post-translational modifications, etc. Their identity will have to be evaluated on a case-by-case basis.
Comparing a protein sequence across species, you will find sequences that are similar. They are not identical proteins in the strict sense of the term. They are homologous proteins that perform the same function in a different species.
More distant species will have more variable sequences.
3 Recommendations
Andreas J Geissler
Technische Universität München
Hi there!
Thanks for all the answers! I am sorry for the misleading use of "identical". As Jose Sergio Hleap immediately suggested, I actually meant very similar on the sequence level (but identical in function)! I know that structure, domains etc. are quite important. Still, I am only looking for a measure of when to consider two or more proteins to be homologous based on sequence similarity only, as we are trying to optimize a bit of code that uses blastp to find marker proteins for particular ecotypes.
So to define it more clearly: let's say strains of 10 species use the same mechanism to deal with low Mg content using transporter X, while some of them differ enough that you won't find that target gene by comparing genes on the nucleotide level. So I am looking for proteins that are identical, but on a functional level. Anyway, for a first screening one might start with sequence alignment, which is why I am looking for reasonable measures!
@Jose Sergio Hleap: I assume you mean 40% similarity in combination with a query coverage of at least 90%? I was also wondering about an appropriate value for query coverage. At the moment I am thinking about 70% identity (suggested by others) and 90% query coverage. I know that by applying rules of thumb I might lose potential targets and that it is necessary to have a closer look afterwards. Anyway, I need a first rough screening for my data.
Thanks for your help!
2 Recommendations
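A first-pass screen of the kind described above could be sketched roughly as follows. This is only an illustration, not the code used by the poster: it assumes blastp was run with a tab-separated output containing the columns qseqid, sseqid, pident, qstart, qend and qlen (for instance via `-outfmt "6 qseqid sseqid pident qstart qend qlen"`), and the function name `filter_hits`, the sequence names and the numbers in the example are made up. The 70% identity / 90% query coverage defaults are simply the values mentioned in the post, and coverage is computed per alignment (HSP).

```python
import csv
import io

# Assumed column order of the tab-separated blastp output (an illustration,
# not something specified in the thread).
COLUMNS = ["qseqid", "sseqid", "pident", "qstart", "qend", "qlen"]


def filter_hits(handle, min_identity=70.0, min_query_coverage=90.0):
    """Yield (query, subject, identity, coverage) for hits passing both cutoffs."""
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(COLUMNS, row))
        identity = float(rec["pident"])
        # Query coverage of this single alignment (HSP), in percent.
        aligned = int(rec["qend"]) - int(rec["qstart"]) + 1
        coverage = 100.0 * aligned / int(rec["qlen"])
        if identity >= min_identity and coverage >= min_query_coverage:
            yield rec["qseqid"], rec["sseqid"], identity, round(coverage, 1)


# Made-up hits for a hypothetical query "transporterX" of length 310.
example = (
    "transporterX\tstrainA_0001\t82.5\t1\t300\t310\n"
    "transporterX\tstrainB_0417\t71.0\t50\t200\t310\n"
    "transporterX\tstrainC_0093\t45.2\t5\t305\t310\n"
)
hits = list(filter_hits(io.StringIO(example)))
for hit in hits:
    print(hit)
```

Because both cutoffs are plain parameters, the same function can be rerun with a looser identity threshold (e.g. `min_identity=40.0`) to see how many additional candidates would need a closer look afterwards.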
Vincent Chiang
Fu Jen Catholic University
Try this definition: the orthologous core genes are those which share over 50 percent homologous sequence between two strains (the 50/50 rule).
Ref. : Ussery DW, Wassenaar TM, Borini S: Microbial communities: core and pan-genomics. Comparative Microbial Genomics 2009.
I think underlying the question and some of the responses, there's a mistaken use of the term Homologous.
Two proteins are homologous if they have a common ancestor, whatever their sequences, structures, or functions. Homology = common ancestry.
Similarity (in any of those levels, sequence, structure, function, etc) may or may not imply homology (e.g. you may have the same function due to convergent evolution, and proteins may have similar architectures just because there're only so many ways for amino acid sequences to fold stably).
Homology may or may not result in Similarity: a single mutation leads to a homologous protein, and yet may drastically change the structure and/or function.
17 Recommendations
Robert Rentzsch
Institut für Innovation und Technik (iit)
I highly recommend William Pearson's (the man behind FASTA, amongst other things) talks on the homology/sequence similarity subject, also answering the question: 'if there was one last common ancestor - or few - wouldn't all - or most - sequences be homologous?' His concept of 'excess similarity' may help to draw a reasonable line there...esp. with regard to the detectability of homology and distinguishing it from random noise. For example:
Just search for his talks (adding ".ppt" to Google search, or ".pdf") and publications.
HTH,
Rob
5 Recommendations
Robert Rentzsch
Institut für Innovation und Technik (iit)
Btw, on a more pragmatic note and possibly more helpful to Andreas short-term: there has been a long line of publications since at least the turn of the century dealing with exactly the question: how much is function conserved depending on sequence similarity or %identity. One of those papers was written by me et al and it briefly reviews the earlier studies: http://www.sciencedirect.com/science/article/pii/S0022283608015660. Other studies followed, however, and these included the now dominant Gene Ontology (which wasn't in a good shape for statistical comparisons when we did our study but is better now), e.g.: http://www.biomedcentral.com/1471-2164/8/222/.
Somewhat related to this: there has been a long discussion recently (over several publications) about the so-called 'orthologue conjecture', i.e. the assumption that orthologues are better conserved functionally than paralogues, which basically culminated in the notion that the current state of the GO, and genome coverage with GO annotations, makes this assumption untestable: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3510086/. Therefore, in essence: don't worry too much about this distinction for now. The average thresholds reported in the above papers are probably not better or worse than doing something more complicated at this point (like orthology and paralogy assignment with phylogenomics, etc). To any rule in biology, there are lots of exceptions...
10 Recommendations
Andreas J Geissler
Technische Universität München
Thank you guys for the valuable input! Especially @Robert Rentzsch, as his paper is exactly what I needed (-:
1 Recommendation
