Context-specific amino acid substitution matrices and their use in the detection of protein homologs

Laboratory of Molecular Biology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892-4264, USA.
Proteins: Structure, Function, and Bioinformatics (Impact Factor: 2.63). 05/2008; 71(2):910-9. DOI: 10.1002/prot.21775
Source: PubMed


Sequence homology detection relies on score matrices, which reflect the frequency of amino acid substitutions observed in a dataset of homologous sequences. The substitution matrices in popular use today are usually constructed without considering the structural context in which the substitution takes place. Here, we present amino acid substitution matrices specific to the polar-nonpolar environment of the amino acid. As expected, these context-specific substitution matrices (CSSMs) show striking differences from the popular BLOSUM62 matrix, which does not include structural information. When incorporated into BLAST and PSI-BLAST, CSSMs outperformed BLOSUM matrices, as assessed by ROC curve analyses of the number of true and false hits and by the accuracy of the sequence alignments to the hit sequences. These findings are also relevant to profile-profile methods of homology detection, since CSSMs may help build a better profile. Profiles generated for protein sequences in PDB using CSSM-PSI-BLAST will be made available for searching via RPSBLAST through our web site
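As a rough illustration of how a score matrix can "reflect the frequency of amino acid substitutions", the sketch below builds generic BLOSUM-style log-odds scores from counts of aligned residue pairs, once per structural context. It is not the authors' published procedure: the function names, the `(a, b)` pair input format, the context labels, and the half-bit scaling are all assumptions made for the example.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs, scale=2.0):
    """Build log-odds scores from observed aligned residue pairs.

    aligned_pairs: iterable of (a, b) residue pairs (hypothetical input format).
    scale: multiplier applied to log2(observed/expected); 2.0 gives half-bit
           units (an assumption borrowed from common BLOSUM practice).
    Returns a dict mapping (a, b) -> rounded integer score.
    """
    pair_counts = Counter()
    for a, b in aligned_pairs:
        # Count every pair in both orientations so that s(a, b) == s(b, a).
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1
    total = sum(pair_counts.values())

    # Background frequency of each residue is the marginal of the pair distribution.
    background = Counter()
    for (a, _), n in pair_counts.items():
        background[a] += n / total

    scores = {}
    for (a, b), n in pair_counts.items():
        observed = n / total
        expected = background[a] * background[b]
        scores[(a, b)] = round(scale * math.log2(observed / expected))
    return scores


def context_specific_matrices(pairs_by_context):
    """Build one matrix per context label (labels here are illustrative only)."""
    return {ctx: log_odds_matrix(pairs) for ctx, pairs in pairs_by_context.items()}


# Toy data, invented for the example: pairs observed in two environments.
toy_pairs = {
    "nonpolar": [("L", "L"), ("L", "L"), ("L", "I"), ("I", "I")],
    "polar": [("K", "K"), ("K", "R"), ("K", "R"), ("R", "R")],
}
matrices = context_specific_matrices(toy_pairs)
```

Pairs that were never observed get no entry in this sketch; a real matrix would need pseudocounts or a much larger dataset to score them.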


Available from: BK Lee, Jul 29, 2014
  • Source
    • "On average, the conservation score for SSE packing clusters (5.0) is significantly higher than that for other regions (4.1) (t-test, P < 2.2e-16). Similar results were also observed when we attempted another two different amino acid substitution matrices, BLOSUM62 matrix (Henikoff and Henikoff, 1992) and CSSM matrices (context-specific amino acid substitution matrices) (Goonesekere and Lee, 2008) (Supplementary Fig. S5). The higher degree of conservation for SSE packing clusters further indicates their important role in domain organization. "
    ABSTRACT: Protein domains are fundamental units of protein structure, function and evolution, thus it is critical to gain a deep understanding of protein domain organization. Previous works have attempted to identify key residues involved in organization of domain architecture. Since one of the most important characteristics of domain architecture is the arrangement of secondary structure elements (SSEs), here we present a picture of domain organization through an integrated consideration of SSE arrangements and residue contact networks. In this work, by representing SSEs as main-chain scaffolds and side-chain interfaces and through construction of residue contact networks, we have identified the SSE interfaces well packed within protein domains as SSE packing clusters. 17334 SSE packing clusters were recognized from 9015 SCOP domains of less than 40% sequence identity. The similar SSE packing clusters were observed not only among domains of the same folds, but also among domains of different folds, indicating their roles as common scaffolds for organization of protein domains. Further analysis of 14 small single-domain proteins reveals a high correlation between the SSE packing clusters and the folding nuclei. Consistent with their important roles in domain organization, SSE packing clusters were found to be more conserved than other regions within the same proteins. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
    Full-text · Article · May 2014 · Bioinformatics
  • Source
    • "The BLAST results (Altschul et al., 1990) and multiple alignments were performed using the COBALT or MUSCLE algorithms (Thompson et al., 1994) with Geneious v5.5.7 software (http:// The NCBI CDD results were processed using RPSBLAST (Goonesekere and Lee, 2008). The alignment results were summarised using Geneious v5.5.7 scripts based on the alignment percent identity (PID). "
    ABSTRACT: The attachment to host skin by Rhipicephalus microplus larvae induces a series of physiological events at the attachment site. The host-parasite interaction might induce a rejection of the larvae, as is frequently observed in Bos taurus indicus cattle, and under certain conditions in Bos taurus taurus cattle. Ticks deactivate the host rejection response by secreting specific proteins and lipids that play an essential role in manipulation of the host immune response. The available genomic information on the R. microplus tick was mined using bioinformatics approaches to identify R. microplus lipocalins (LRMs). This in silico examination revealed a total of 12 different putative R. microplus LRMs (LRM1 - LRM12). The identity of the LRM family showed high sequence variability: from 6% between LRM7 and LRM8 to 55.9% between LRM2 and LRM6. However, the three-dimensional structure of the lipocalin family was conserved in the LRMs. The B and T cell epitopes in these lipocalins were then predicted, and six of the LRMs (5, 6, 9, 10, 11 and 12) were used to examine the host immune interactions with sera and peripheral blood mononuclear cells (PBMCs) collected from tick-susceptible and tick-resistant cattle challenged with R. microplus. On days 28 - 60 after tick infestation, the anti-LRM titres were higher in the resistant group compared with the susceptible cattle. After 60 days, the anti-LRM titres (except LRM9 and LRM11) decreased to zero in the sera of both the tick-resistant and tick-susceptible cattle. Using cell proliferation assays, the PBMCs challenged with some of the predicted T cell epitopes (LRM1_T1, T2; LRM_T1, T2 and LRM12_T) exhibited a significantly higher number of IFN-γ-secreting cells (Th1) in tick-susceptible Holstein-Friesians compared with tick-resistant Brahman cattle. In contrast, expression of the Th2 cytokine (IL-4) was lower in Holstein-Friesian cattle compared with Brahman cattle. Moreover, this study found that LRM6, LRM9 and LRM11 play important roles in the mechanism by which R. microplus interferes with the host's haemostasis mechanisms.
    Full-text · Article · Jun 2013 · International journal for parasitology
  • Source
    • "Since protein sequences of folded domains are constrained by the necessity to maintain a stable structure, the substitution probabilities for a given residue are largely determined by the structural context within which it resides. Substitution matrices have therefore been trained for particular structural contexts, for example depending on the residue's secondary structure, solvent accessibility, or polarity (Overington et al., 1992; Rice and Eisenberg, 1997; Shi et al., 2001; Goonesekere and Lee, 2008). Methods that infer substitution probabilities of amino acids solely from their local sequence context have the advantage that they do not require the structure of the query protein to be known (Jones et al., 1994; Baussand et al., 2007; Huang and Bystroff, 2006). "
    ABSTRACT: Motivation: Protein sequence searching and alignment are fundamental tools of modern biology. Alignments are assessed using their similarity scores, essentially the sum of substitution matrix scores over all pairs of aligned amino acids. We previously proposed a generative probabilistic method that yields scores that take the sequence context around each aligned residue into account. This method showed drastically improved sensitivity and alignment quality compared with standard substitution matrix-based alignment. Results: Here, we develop an alternative discriminative approach to predict sequence context-specific substitution scores. We applied our approach to compute context-specific sequence profiles for Basic Local Alignment Search Tool (BLAST) and compared the new tool (CS-BLASTdis) to BLAST and the previous context-specific version (CS-BLASTgen). On a dataset filtered to 20% maximum sequence identity, CS-BLASTdis was 51% more sensitive than BLAST and 17% more sensitive than CS-BLASTgen in detecting remote homologues at 10% false discovery rate. At 30% maximum sequence identity, its alignments contain 21 and 12% more correct residue pairs than those of BLAST and CS-BLASTgen, respectively. Clear improvements are also seen when the approach is combined with PSI-BLAST and HHblits. We believe the context-specific approach should replace substitution matrices wherever sensitivity and alignment quality are critical.
    Full-text · Article · Oct 2012 · Bioinformatics
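
The description of a similarity score as "the sum of substitution matrix scores over all pairs of aligned amino acids" can be sketched in a few lines. The example below is only an illustration, not the scoring used by CS-BLAST or CSSM-BLAST: the per-position context labels, the toy score values, and the flat per-column gap penalty are all assumptions. It scores each aligned pair with the matrix chosen for that position's context.

```python
def alignment_score(query, subject, contexts, matrices, gap_penalty=-4):
    """Sum substitution scores over aligned columns.

    query, subject: aligned strings of equal length, with '-' marking gaps.
    contexts: one context label per column (hypothetical labels).
    matrices: dict mapping context label -> {(a, b): score}.
    gap_penalty: flat cost per gapped column (a simplifying assumption;
                 BLAST itself uses affine gap costs).
    """
    if not (len(query) == len(subject) == len(contexts)):
        raise ValueError("query, subject and contexts must have equal length")
    score = 0
    for q, s, ctx in zip(query, subject, contexts):
        if q == "-" or s == "-":
            score += gap_penalty
        else:
            # Pick the matrix that matches the context of this column.
            score += matrices[ctx][(q, s)]
    return score


# Toy matrices with invented values, for illustration only.
toy_matrices = {
    "buried": {("L", "L"): 4, ("L", "I"): 2, ("I", "L"): 2, ("I", "I"): 4},
    "exposed": {("L", "L"): 3, ("L", "I"): 0, ("I", "L"): 0, ("I", "I"): 3},
}

total = alignment_score(
    query="LI-L",
    subject="LLIL",
    contexts=["buried", "buried", "exposed", "exposed"],
    matrices=toy_matrices,
)
```

With a single context label for every column, this reduces to ordinary scoring against one fixed matrix such as BLOSUM62.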