Article

Secondary Structure Characterization Based on Amino Acid Composition and Availability in Proteins

Abstract

The importance of thorough analyses of the secondary structures in proteins as basic structural units cannot be overemphasized. Although recent computational methods have achieved reasonably high accuracy for predicting secondary structures from amino acid sequences, a simple and fundamental empirical approach to characterize the amino acid composition of secondary structures was performed mainly in the 1970s, with a small number of analyzed structures. To extend this classical approach using a large number of analyzed structures, here we characterized the amino acid sequences of secondary structures (12 154 alpha-helix units, 4592 3(10)-helix units, 16 787 beta-strand units, and 30 811 "other" units), using the representative three-dimensional protein structure records (1641 protein chains) from the Protein Data Bank. We first examined the length and the amino acid compositions of secondary structures, including rank order differences and assignment relationships among amino acids. These compositional results were largely, but not entirely, consistent with the previous studies. In addition, we examined the frequency of 400 amino acid doublets and 8000 triplets in secondary structures based on their relative counts, termed the availability. We identified not only some triplets that were specific to a certain secondary structure but also so-called zero-count triplets, which did not occur in a given secondary structure at all, even though they were probabilistically predicted to occur several times. Taken together, the present study revealed essential features of secondary structures and suggests potential applications in secondary structure prediction and the functional design of protein sequences.
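The availability and zero-count ideas in the abstract can be illustrated with a minimal Python sketch. It assumes that the expected count of a triplet is the number of triplet windows multiplied by the product of the single-residue frequencies; the exact definition used in the paper is not given in the abstract and may differ. The function names and the `min_expected` threshold are hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def triplet_availability(segments):
    """Map each of the 8000 triplets to (observed, expected, availability).

    Assumption: expected = windows * f(a1) * f(a2) * f(a3), where f is the
    single-residue frequency in the same set of segments.
    """
    residue_counts = Counter(
        aa for seg in segments for aa in seg if aa in AMINO_ACIDS
    )
    total = sum(residue_counts.values())
    if total == 0:
        return {}
    freq = {aa: residue_counts[aa] / total for aa in AMINO_ACIDS}

    # Count overlapping triplets inside each segment (never across segments).
    observed = Counter()
    n_windows = 0
    for seg in segments:
        for i in range(len(seg) - 2):
            tri = seg[i:i + 3]
            if all(aa in freq for aa in tri):
                observed[tri] += 1
                n_windows += 1

    result = {}
    for tri in map("".join, product(AMINO_ACIDS, repeat=3)):
        expected = n_windows * freq[tri[0]] * freq[tri[1]] * freq[tri[2]]
        obs = observed[tri]
        availability = obs / expected if expected > 0 else None
        result[tri] = (obs, expected, availability)
    return result


def zero_count_triplets(result, min_expected=3.0):
    """Triplets never observed although at least min_expected were expected."""
    return sorted(
        tri for tri, (obs, exp, _) in result.items()
        if obs == 0 and exp >= min_expected
    )
```

In practice `segments` would be the sequences of one secondary-structure class (for example, all alpha-helix units), so that availabilities can be compared between classes.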

... To understand how the human immune system discriminates SARS-CoV-2 proteins (nonself) from its own proteins (self), we have been using the concept of short constituent sequences (SCSs) of amino acids in proteins [77][78][79][80][81][82][83]. The idea of information extraction from protein amino acid sequences based on SCSs is as old as Chou and Fasman (1974) [84] and Garnier et al. (1978) [85]. ...
... In addition to ours, there are similar but independent approaches in the literature [86][87][88][89][90][91][92][93][94][95][96][97][98][99]. The number of amino acids in an SCS unit can vary, but we primarily use five amino acids (5-aa SCSs) [77][78][79][80][81][82][83]. The 5-aa window is technically convenient because there are only 20^5 combinations of 5-aa SCSs (also called pentats in our system but called pentapeptides or pentamers in others), which is not computationally demanding. ...
... On the contrary, our search for 5-aa SCSs can identify not only 5-aa SCSs but also longer ones simultaneously as consecutive 5-aa SCSs. Previously, our protein studies demonstrated that SCS analyses are useful for identifying protein characteristics, including secondary structures [81,82] and functional sites [79,80], and human-specific proteins [83]. Alignment-independent SCS analyses are useful for identifying commonalities (instead of similarities) that have been overlooked by conventional alignment-dependent approaches. ...
Article
Spike protein sequences in SARS-CoV-2 have been employed for vaccine epitopes, but many short constituent sequences (SCSs) in the spike protein are present in the human proteome, suggesting that some anti-spike antibodies induced by infection or vaccination may be autoantibodies against human proteins. To evaluate this possibility of “molecular mimicry” in silico and in vitro, we exhaustively identified common SCSs (cSCSs) found both in spike and human proteins bioinformatically. The commonality of SCSs between the two systems seemed to be coincidental, and only some cSCSs were likely to be relevant to potential self-epitopes based on three-dimensional information. Among three antibodies raised against cSCS-containing spike peptides, only the antibody against EPLDVL showed high affinity for the spike protein and reacted with an EPLDVL-containing peptide from the human unc-80 homolog protein. Western blot analysis revealed that this antibody also reacted with several human proteins expressed mainly in the small intestine, ovary, and stomach. Taken together, these results showed that most cSCSs are likely incapable of inducing autoantibodies but that at least EPLDVL functions as a self-epitope, suggesting a serious possibility of infection-induced or vaccine-induced autoantibodies in humans. High-risk cSCSs, including EPLDVL, should be excluded from vaccine epitopes to prevent potential autoimmune disorders.
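The remark in the citing excerpts that longer shared sequences emerge as consecutive 5-aa SCSs can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name and the toy sequences are hypothetical, and the sketch only checks that each 5-aa window of the first sequence occurs somewhere in the second, so a merged stretch should still be confirmed as contiguous in the second sequence.

```python
def shared_stretches(seq_a, seq_b, k=5):
    """Return maximal stretches of seq_a covered by consecutive shared k-mers.

    A k-mer is "shared" if it occurs anywhere in seq_b. Runs of consecutive
    shared k-mers in seq_a are merged into one longer stretch.
    """
    kmers_b = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    stretches = []
    start = None  # index in seq_a where the current run of hits began
    for i in range(len(seq_a) - k + 1):
        hit = seq_a[i:i + k] in kmers_b
        if hit and start is None:
            start = i
        elif not hit and start is not None:
            # The last hit started at i - 1 and ends at i - 1 + k (exclusive).
            stretches.append(seq_a[start:i - 1 + k])
            start = None
    if start is not None:
        # The run reached the final window, so it extends to the end of seq_a.
        stretches.append(seq_a[start:])
    return stretches


# Toy sequences (made up for illustration) that share the 6-aa stretch EPLDVL.
print(shared_stretches("MKEPLDVLAA", "GGEPLDVLTT"))
```

With these toy inputs the two consecutive shared pentats EPLDV and PLDVL merge into the single 6-aa stretch EPLDVL.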
... Knowledge of amino acid propensities throughout known beta hairpin sub-structures could inform such design principles but existing catalogs are too broadly focused on beta sheets, outdated, or limited in scope. 16,20,[28][29][30][31] An up-to-date characterization of amino acid distributions at specific positions within beta hairpins does not exist. ...
... It has long been known that different secondary structural elements tend to favor the inclusion of certain amino acids over others. 29,30,34,35 This is exactly what we see with our analysis of beta hairpin motifs (Figure 2), with a clear difference in average amino acid frequencies between beta strands, the turn region, and background levels (i.e., universal average frequencies for amino acids across all included protein structures). Our analysis agrees with previous work illustrating a strong preference for glycine, asparagine, and aspartic acid in flexible turn regions. ...
... Our analysis agrees with previous work illustrating a strong preference for glycine, asparagine, and aspartic acid in flexible turn regions. 29,30 While proline is also more common in the turn region than in either beta strand, we see no difference in turn region prevalence when compared to background levels. This is in contrast to previous findings that saw significant enrichment of proline in turn regions. ...
Article
The beta hairpin motif is a ubiquitous protein structural motif that can be found in molecules across the tree of life. This motif, which is also popular in synthetically designed proteins and peptides, is known for its stability and adaptability to broad functions. Here, we systematically probe all 49,000 unique beta hairpin substructures contained within the Protein Data Bank (PDB) to uncover key characteristics correlated with stable beta hairpin structure, including amino acid biases and enriched interstrand contacts. We find that position specific amino acid preferences, while seen throughout the beta hairpin structure, are most evident within the turn region, where they depend on subtle turn dynamics associated with turn length and secondary structure. We also establish a set of broad design principles, such as the inclusion of aspartic acid residues at a specific position and the careful consideration of desired secondary structure when selecting residues for the turn region, that can be applied to the generation of libraries encoding proteins or peptides containing beta hairpin structures.
... Besides, protein structure decoding and characterization have been based mostly on statistical data derived from experimental structures solved at high resolutions and stored in the PDB database (Bernstein et al., 1977). This resulted in the assessment of amino acid occurrence in secondary structure elements as well as conformational preferences of amino acid residues that yielded propensity scales and data on short sequence availabilities in proteins (Koehl & Levitt, 1999;Otaki, Tsutsumi, Gotoh, & Yamamoto, 2010). For example, Otaki et al. proposed a linguistic approach to reveal short constituent sequences with the optimal length of 3-7 residues favored in α-helices and β-strands (Motomura, Nakamura, & Otaki, 2013). ...
... Statistical data obtained by the group of Otaki (Motomura et al., 2013;Otaki et al., 2010) with the use of experimentally resolved crystal structures extracted from the PDB database have resulted in the evaluation of conformational propensities of amino acid residues and the frequencies of their occurrence in a definite type of secondary structure. For example, A, L, E, Q, K, M, and R were the most frequent in right-handed α-helices, while β-structures were enriched in V, I, T, Y, F, and C residues. ...
... For example, A, L, E, Q, K, M, and R were the most frequent in right-handed α-helices, while β-structures were enriched in V, I, T, Y, F, and C residues. Additionally, 3(10)-helices contained a large amount of D, N, S, H, and W residues, while disordered regions were enriched in P, G, and D (Otaki et al., 2010). Despite the importance of these statistical data, they do not explain how a protein's primary sequence encodes its secondary and tertiary structure. ...
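Statements of this kind (residue X is enriched in structure Y) are usually backed by a ratio of frequencies. The Python sketch below computes a simple propensity-style ratio, the frequency of a residue within one secondary-structure state divided by its overall frequency. This is an assumed, generic measure in the spirit of classical propensity scales and not necessarily the statistic used in the cited studies; the function name and the one-letter state labels are hypothetical.

```python
from collections import Counter, defaultdict


def propensities(pairs):
    """Compute per-state residue propensities.

    pairs: iterable of (sequence, structure) strings of equal length, where
    structure holds one state label per residue (e.g. "H", "E", "-").
    Returns {state: {residue: propensity}}, with propensity defined here as
    (frequency of residue within state) / (frequency of residue overall).
    """
    background = Counter()
    by_state = defaultdict(Counter)
    for seq, ss in pairs:
        if len(seq) != len(ss):
            raise ValueError("sequence and structure strings differ in length")
        for aa, state in zip(seq, ss):
            background[aa] += 1
            by_state[state][aa] += 1

    total = sum(background.values())
    result = {}
    for state, counts in by_state.items():
        n_state = sum(counts.values())
        result[state] = {
            aa: (counts[aa] / n_state) / (background[aa] / total)
            for aa in background
        }
    return result
```

A value above 1 indicates that the residue is over-represented in that state relative to the whole data set, and a value below 1 indicates under-representation.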
Article
Short linear motifs (SLiMs) have been recognized to perform diverse functions in a variety of regulatory proteins through the involvement in protein-protein interactions, signal transduction, cell cycle regulation, protein secretion, etc. However, detailed molecular mechanisms underlying their functions including roles of definite amino acid residues remain obscure. In our previous studies, we demonstrated that conformational dynamics of amino acid residues in oligopeptides derived from regulatory proteins such as alpha-fetoprotein (AFP), carcino-embryonic antigen (CEA) and pregnancy specific β1-glycoproteins (PSGs) contributes greatly to their biological activities. In the present work, we revealed the 22-member linear modules composed of direct and reverse AFP14-20-like heptapeptide motifs linked by CxxGY/FxGx consensus motif within epidermal growth factor (EGF), growth factors of EGF family and numerous regulatory proteins containing EGF-like modules. We showed, first, the existence of similarity in amino acid signatures of both direct and reverse motifs in terms of their physicochemical properties. Second, molecular dynamics (MD) simulation study demonstrated that key receptor-binding residues in human EGF in the aligned positions of the direct and reverse motifs may have similar distribution of conformational probability densities and dynamic behaviour despite their distinct physicochemical properties. Third, we found that the length of a polypeptide chain (from 7 to 53 residues) has no effect, while disulfide bridging and backbone direction significantly influence the conformational distribution and dynamics of the residues. Our data may contribute to the atomic level structure-function analysis and protein structure decoding; additionally, they may provide a basis for novel protein/peptide engineering and peptide-mimetic drug design.
... Additionally, secondary structure characterization is one of the important applications of the frequency-based word analysis [18, 19, 31, 32, 35]. Through the construction and analysis of secondary-structure-specific databases, we have shown that some SCSs are favored in α-helices and others in β-strands [31]. These structure-specific SCSs may be used as markers or discriminant sequences for particular secondary structures. ...
... But what is the optimal SCS length? Rare or non-existent SCSs in a given database of interest (such as a secondary structure database) can be found relatively easily if a set of 5-aa SCSs is used [14, 15, 31, 32]. This is because the repertoire of 5-aa SCSs (20^5 = 3.2 × 10^6) is large enough to describe the sequence complexity of proteins and small enough to find similarities among different proteins. ...
Article
Protein structure and function information is coded in amino acid sequences. However, the relationship between primary sequences and three-dimensional structures and functions remains enigmatic. Our approach to this fundamental biochemistry problem is based on the frequencies of short constituent sequences (SCSs) or words. A protein amino acid sequence is considered analogous to an English sentence, where SCSs are equivalent to words. Availability scores, which are defined as real SCS frequencies in the non-redundant amino acid database relative to their probabilistically expected frequencies, demonstrate the biological usage bias of SCSs. As a result, this frequency-based linguistic approach is expected to have diverse applications, such as secondary structure specifications by structure-specific SCSs and immunological adjuvants with rare or non-existent SCSs. Linguistic similarities (e.g., wide ranges of scale-free distributions) and dissimilarities (e.g., behaviors of low-rank samples) between proteins and the natural English language have been revealed in the rank-frequency relationships of SCSs or words. We have developed a web server, the SCS Package, which contains five applications for analyzing protein sequences based on the linguistic concept. These tools have the potential to assist researchers in deciphering structurally and functionally important protein sites, species-specific sequences, and functional relationships between SCSs. The SCS Package also provides researchers with a tool to construct amino acid sequences de novo based on the idiomatic usage of SCSs.
... English sentences are also composed of an organized collection of words, and one could examine proteins based on a working hypothesis that amino acid sequences can be considered to be collections of short constituent sequences (SCSs), or "words", such as triplets (three-amino-acid stretches), quartets (four-amino-acid stretches), and pentats (five-amino-acid stretches), that are meaningfully localized by a set of rules, i.e., "grammar". We previously demonstrated the importance of short stretches in proteins by showing that length of secondary structures peaked at 5 or 6 amino acids [11,12], which could justify our protein analysis based on SCSs. Our SCSs are basically identical to k-tuples [13-16], but we consider SCSs more like real linguistic words for general decoding, not merely a collection of simple analysis units for alignment-free sequence comparisons. ...
... Therefore, there is a possibility that protein sequences evolved based on the principle of least effort and hence at least partially follow Zipf's law, or more generally, a power law. In this study, we examined whether the SCSs in proteins [11,12,25-27], which could be called "words", exhibit a similar or dissimilar distribution to a power-law distribution. We note that variations of SCSs are very large but exactly limited in number. ...
... Together, we can state that high-rank protein amino acid sequences tend to exhibit a scale-free distribution but low-rank tails do not, and this fact suggests that they may have language-like characteristics at least partly. The positive discriminant R value in the pentat system may be consistent with the fact that the length of secondary structures peaked at 5 or 6 amino acids [11,12]. We observed a few characteristics that may be unique to protein distributions, which may be as important as similarity to English. ...
Article
The amino acid sequences of proteins determine their three-dimensional structures and functions. However, how sequence information is related to structures and functions is still enigmatic. In this study, we show that at least a part of the sequence information can be extracted by treating amino acid sequences of proteins as a collection of English words, based on a working hypothesis that amino acid sequences of proteins are composed of short constituent amino acid sequences (SCSs) or "words". We first confirmed that the English language highly likely follows Zipf's law, a special case of power law. We found that the rank-frequency plot of SCSs in proteins exhibits a similar distribution when low-rank tails are excluded. In comparison with natural English and "compressed" English without spaces between words, amino acid sequences of proteins show larger linear ranges and smaller exponents with heavier low-rank tails, demonstrating that the SCS distribution in proteins is largely scale-free. A distribution pattern of SCSs in proteins is similar among species, but species-specific features are also present. Based on the availability scores of SCSs, we found that sequence motifs are enriched in high-availability sites (i.e., "key words") and vice versa. In fact, the highest availability peak within a given protein sequence often directly corresponds to a sequence motif. The amino acid composition of high-availability sites within motifs is different from that of entire motifs and all protein sequences, suggesting the possible functional importance of specific SCSs and their compositional amino acids within motifs. We anticipate that our availability-based word decoding approach is complementary to sequence alignment approaches in predicting functionally important sites of unknown proteins from their amino acid sequences.
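The rank-frequency comparison described above can be sketched in Python. This is a crude illustration under stated assumptions: SCSs are counted as overlapping k-mers, and the exponent is estimated by an ordinary least-squares fit of log(frequency) against log(rank) over all ranks supplied. The abstract notes that low-rank tails deviate from a power law, so a real analysis would restrict the fit to the linear range; the function names are hypothetical.

```python
import math
from collections import Counter


def rank_frequency(sequences, k=3):
    """Return SCS (k-mer) counts sorted from most to least frequent."""
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    return sorted(counts.values(), reverse=True)


def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) versus log(rank).

    For an ideal Zipf distribution (frequency proportional to 1/rank)
    the slope is -1. Requires at least two positive frequencies.
    """
    if len(frequencies) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A typical use would pass the output of `rank_frequency` (optionally truncated to the high-rank, linear part) to `loglog_slope` and compare the fitted exponent between proteins and English text.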
... We then used IUPRED3 [40] to evaluate their structural disorder content and identified 1263 sequences with at least one IDR of more than 50 consecutive residues (Figure 2C). More than half of these galectin-3-like proteins (57.9%; 716/1263) contain more than 9% of aromatic residues (the average content of the proteome as a whole: [41] tryptophan, 1.5%; tyrosine, 3.5%; phenylalanine, 4.0%) in their IDRs (Figure 2D). An increased prevalence of aromatic residues is uncommon in IDRs, [42][43] but is observed in RNA-binding proteins, [44] suggesting potentially conserved functions, [44] and our previous findings for human galectin-3 suggest they have a functional role in agglutination. ...
... An increased prevalence of aromatic residues is uncommon in IDRs, [42][43] but is observed in RNA-binding proteins, [44] suggesting potentially conserved functions, [44] and our previous findings for human galectin-3 suggest they have a functional role in agglutination. [31] Additionally, 720/1263 sequences (57.0%) have IDRs with a net negative charge, of which nearly two-thirds (480/720) have <3% of negatively charged residues (Figure 2E), compared with 12.5% in the proteome as a whole, [41] which is striking because negatively charged residues are typically more prevalent in IDRs. [42] The increased prevalence of aromatic residues and the low prevalence of negatively charged residues are expected, therefore, to be an evolutionarily conserved pattern in the IDRs of galectin-3-like proteins. ...
Article
Proteins with intrinsically disordered regions (IDRs) often undergo phase separation to control their functions spatiotemporally. Changing the pH alters the protonation levels of charged sidechains, which in turn affects the attractive or repulsive force for phase separation. In a cell, the rupture of membrane‐bound compartments, such as lysosomes, creates an abrupt change in pH. However, how proteins’ phase separation reacts to different pH environments remains largely unexplored. Here, using extensive mutagenesis, NMR spectroscopy, and biophysical techniques, it is shown that the assembly of galectin‐3, a widely studied lysosomal damage marker, is driven by cation‐π interactions between positively charged residues in its folded domain with aromatic residues in the IDR in addition to π–π interaction between IDRs. It is also found that the sole two negatively charged residues in its IDR sense pH changes for tuning the condensation tendency. Also, these two residues may prevent this prion‐like IDR domain from forming rapid and extensive aggregates. These results demonstrate how cation‐π, π–π, and electrostatic interactions can regulate protein condensation between disordered and structured domains and highlight the importance of sparse negatively charged residues in prion‐like IDRs.
... The advantage of the alignment-free approach is that any collections of proteins can be compared quantitatively. Although various types of alignment-free approaches have been developed [24,25], including our previous attempts to use membrane topology [26] and a self-organizing map [27], the alignment-free approach in the present study is based on the "availability" (frequency bias) of short constituent sequences (SCSs) of amino acids (aa) in proteins [28][29][30][31][32][33]. The length of SCSs can be 2 aa (doublet), 3 aa (triplet), 4 aa (quartet), 5 aa (pentat), and more in a given protein. ...
... Using this simple concept of availability score, secondary structure characterization has been performed; SCS frequencies (and thus availability scores) are different among different secondary structures [30]. Availability scores are also different between parallel and antiparallel β-strands [31]. ...
Chapter
Little is known about protein sequences unique in humans. Here, we performed alignment-free sequence comparisons based on the availability (frequency bias) of short constituent amino acid (aa) sequences (SCSs) in proteins to search for human-specific proteins. Focusing on 5-aa SCSs (pentats), exhaustive comparisons of availability scores among the human proteome and other nine mammalian proteomes in the nonredundant (nr) database identified a candidate protein containing WRWSH, here called FAM75, as human-specific. Examination of various human genome sequences revealed that FAM75 had genomic DNA sequences for either WRWSH or WRWSR due to a single nucleotide polymorphism (SNP). FAM75 and its related protein FAM205A were found to be produced through alternative splicing. The FAM75 transcript was found only in humans, but the FAM205A transcript was also present in other mammals. In humans, both FAM75 and FAM205A were expressed specifically in testis at the mRNA level, and they were immunohistochemically located in cells in seminiferous ducts and in acrosomes in spermatids at the protein level, suggesting their possible function in sperm development and fertilization. This study highlights a practical application of SCS-based methods for protein searches and suggests possible contributions of SNP variants and alternative splicing of FAM75 to human evolution.
... In this respect, the latter are similar to transit peptides of rice and Chlamydomonas, while Arabidopsis transit peptides are exceptional (Kleffmann et al., 2007;Patron and Waller, 2007). Because alanine tends to form alpha-helices, it is conceivable that alpha-helical structures might be functional determinants of Cyanophora paradoxa transit peptides (Otaki et al., 2010). However, alanine is overrepresented along the entire transit peptide (Figure 2B), speaking against the formation of specific regulatory helical structures. ...
... The hydrophobic N-terminal part and the less hydrophobic middle region of cyanelle transit peptides are separated by several proline residues between amino acid positions 10-20. Proline is present in loops and in regions between secondary structures (Otaki et al., 2010). It is conceivable that the frequent occurrence of proline up to amino acid 20 constitutes a structural separation of the N-terminal 10-20% of the cyanelle transit peptide from the rest. ...
Article
Glaucophyta, rhodophyta, and chloroplastida represent the three main evolutionary lineages that diverged from a common ancestor after primary endosymbiosis. Comparative analyses between members of these three lineages are a rich source of information on ancestral plastid features. We analyzed the composition and the cleavage site of cyanelle transit peptides from the glaucophyte Cyanophora paradoxa by terminal amine labeling of substrates (TAILS), and compared their characteristics to those of representatives of the chloroplastida. Our data show that transit peptide architecture is similar between members of these two lineages. This entails a comparable modular structure, an overrepresentation of serine or alanine and similarities in the amino acid composition around the processing peptidase cleavage site. The most distinctive difference is the overrepresentation of phenylalanine in the N-terminal 1–10 amino acids of cyanelle transit peptides. A quantitative proteome analysis with periplasm-free cyanelles identified 42 out of 262 proteins without the N-terminal phenylalanine, suggesting that the requirement for phenylalanine in the N-terminal region is not absolute. Proteins in this set are on average of low abundance, suggesting that either alternative import pathways are operating specifically for low abundance proteins or that the gene model annotation is incorrect for proteins with fewer EST sequences. We discuss these two possibilities and provide examples for both interpretations.
... The validity and significance of the inhibitory effects under nonoptimum conditions may be arguable, but the PIA results under nonoptimum conditions should be considered a first step in identifying functional sites, and thus in the development of novel drugs. Another support for PIA comes from a group of studies that stresses the important contribution of short constituent sequences (SCSs) of proteins to secondary structures and the functionality of proteins [42][43][44][45][46][47][48][49][50][51][52][53][54]. A well-known structural prediction approach, ROSETTA, is also based on data collection of SCSs from the PDB [55,56]. ...
Article
Functionally important amino acid sequences in proteins are often located at multiple sites. Three-dimensional structural analysis and site-directed mutagenesis may be performed to allocate functional sites for understanding structure‒function relationships and for developing novel inhibitory drugs. However, such methods are too demanding to comprehensively cover potential functional sites throughout a protein chain. Here, a peptide inhibitor assay (PIA) was devised to allocate functionally important accessible sites in proteins. This simple method presumes that protein‒ligand interactions, intramolecular interactions, and dimerization interactions can be partially inhibited by high concentrations of competitive “endogenous” peptides of the protein of interest. Focusing on the restriction endonuclease EcoRI as a model protein system, many endogenous peptides (6mer-14mer) were synthesized, covering the entire EcoRI protein chain. Some of them were highly inhibitory, but interestingly, the nine most effective peptides were located outside the active sites, with the exception of one. Relatively long peptides with aromatic residues (F, H, W, and Y) corresponding to secondary structures were generally effective. Because synthetic peptides are flexible enough to change length and amino acid residues, this method may be useful for quickly and comprehensively understanding structure‒function relationships and developing novel drugs or epitopes for neutralizing antibodies.
... In addition to the β-strands, the α-helices exhibit distinctive characters. Despite glycine being one of the most destabilizing residues for an α-helix due to the wider range of backbone dihedral angles, it is frequently observed in membranous proteins, especially in the region of helix crossings [23][24][25]. This allows helix packing, providing favorable hydrophobic interaction in the confined space between adjacent helices. ...
Article
Staphylococcus aureus remains a public health threat with the WHO classifying the pathogen as a high priority in the development of new antimicrobial agents. Whole genome sequencing has revealed a number of conserved genes that may be essential for cell viability and infection. Characterising the structure and function of these proteins will inevitably aid development of new antimicrobials. Therefore, this study elucidated the structure of hypothetical protein DUF3055 from S. aureus strain Mu50. The protein possesses an as yet undefined function and a unique fold. The size of DUF3055 made it an ideal candidate for NMR characterisation which in conjunction with circular dichroism revealed the protein to be folded. Crystallisation and structural solution found that the overall dimer fold has a negatively charged surface formed by a β-bulge and tightly crossed α-helices, with a complementary size to a DNA single turn. Our structural observations suggest that hypothetical protein DUF3055 from S. aureus has a role in DNA binding and gene regulation.
... [Otaki et al., 2010] that this primary sequence dictates the so-called secondary structure of the protein. The secondary structural elements can be thought of as the building blocks of the protein curve. ...
... In fact, N, Q, D, and especially P are more frequent in loops than in other secondary structures. 45 Furthermore, the usage of P is remarkable in positions 114 and 114 of CDRκ3-10aa and 113 and 114 in CDRκ3-11aa. ...
Article
The human immune system uses antibodies to neutralize foreign antigens. They are composed of heavy and light chains, both with constant and variable regions. The variable region has six hypervariable loops, also known as complementary-determining regions (CDRs) that determine antibody diversity and antigen specificity. Knowledge of their significance, and certain residues present in these areas, is vital for antibody therapeutics development. This study includes an analysis of more than 11,000 human antibody sequences from the International Immunogenetics information system (IMGT). The analysis included parameters such as length distribution, overall amino acid diversity, amino acid frequency per CDR and residue position within antibody chains. Overall, our findings confirm existing knowledge, such as CDRH3's high length diversity and amino acid variability, increased aromatic residue usage, particularly tyrosine, charged and polar residues like aspartic acid, serine, and the flexible residue glycine. Specific residue positions within each CDR influence these occurrences, implying a unique amino acid type distribution pattern. We compared amino acid type usage in CDRs and non-CDR regions, both in globular and transmembrane proteins, which revealed distinguishing features, such as increased frequency of tyrosine, serine, aspartic acid, and arginine. These findings should prove useful for future optimization, improvement of affinity, synthetic antibody library design, or the creation of antibodies de-novo in silico.
... The sequence of amino acids which form a protein is its primary structure and it is always identifiable. Researchers often visualise a protein's global structure via its backbone curve, the discrete 3-dimensional curve whose points represent the central α-carbon atom of each amino acid residue, in the form of a ribbon diagram as seen in 1. It has been shown [2] that the specific sequence of amino acids determines the secondary structure of a protein. This secondary structure represents the shape of local segments of the protein's backbone curve. ...
Preprint
We present fast and simple-to-implement measures of the entanglement of protein tertiary structures which are appropriate for highly flexible structure comparison. These quantities are based on the writhing and crossing numbers heavily utilised in DNA topology studies, which have shown some promising results when applied to proteins recently. Here we show how they can be applied in a novel manner across various scales of the protein's backbone to identify similar topologies which can be missed by more common RMSD, secondary structure or primary sequence based comparison methods. We derive empirical bounds on the entanglement implied by these measures and show how they can be used to constrain the search space of a protein for solution scattering, a method highly suited to determining the likely structure of proteins in solution where crystal structure or machine learning based predictions often fail to match experimental data. In addition we identify large scale helical geometries present in a large array of proteins, which are consistent across a number of different protein structure types and sequences. This is used in one specific case to demonstrate significant structural similarity between Rossmann fold and TIM Barrel proteins, a link which is potentially significant as attempts to engineer the latter have in the past produced the former. Finally we provide the SWRITHE python notebook to calculate these metrics.
Author summary
There is much interest in developing quantitative methods to compare different protein structures or identify common sub-structures across protein families. We present novel methods for studying and comparing protein structures based on the entanglement of their amino-acid backbone and demonstrate a number of their critical properties. First, they are shown to be especially useful in identifying similar protein entanglement for structures which may be seen as distinct via more established methods.
Second, by studying the distribution of entanglement across a wide sample of proteins, we show that there exists a minimum expected amount (a lower bound) of entanglement given the protein’s length. This bound is shown to be useful in ensuring realistic predictions from experimental structural determination methods. Third, using fundamental properties of this entanglement measure, we identify two common classes of protein sub-structure. The first are large scale helices, which provide stability to the structure. These helical structures indicate strong structural similarity of two protein families usually regarded as differing significantly. The second class of substructure is one which, though complex, has a small net entanglement. This configuration is physically useful in other disciplines, but its function in proteins is not yet clear. Finally, we provide an interactive python notebook to compute these measures for a given protein.
... These two libraries were intentionally designed to facilitate the development of stable antiparallel β strands with amphipathic faces generated via the periodic alternation of aliphatic and charged/polar residues. We identified and applied a preference for glycine, asparagine, and aspartic acid in or near the turn regions (14)(15)(16) and excluded proline in contrast to canonical β-turn design (17)(18)(19)(20). We also excluded aromatic residues because they promote mammalian toxicity despite their positive effects on β-hairpin structure (21). ...
Article
Full-text available
Peptide macrocycles are a rapidly emerging class of therapeutic, yet the design of their structure and activity remains challenging. This is especially true for those with β-hairpin structure due to weak folding properties and a propensity for aggregation. Here, we use proteomic analysis and common antimicrobial features to design a large peptide library with macrocyclic β-hairpin structure. Using an activity-driven high-throughput screen, we identify dozens of peptides killing bacteria through selective membrane disruption and analyze their biochemical features via machine learning. Active peptides contain a unique constrained structure and are highly enriched for cationic charge with arginine in their turn region. Our results provide a synthetic strategy for structured macrocyclic peptide design and discovery while also elucidating characteristics important for β-hairpin antimicrobial peptide activity.
... These libraries were intentionally designed to facilitate the development of stable antiparallel β-strands with amphipathic faces generated via the periodic alternation of aliphatic and charged/polar residues. We identified and applied a preference for glycine, asparagine, and aspartic acid in or near the turn regions (15)(16)(17) and excluded proline in contrast to canonical β-turn design (18)(19)(20)(21). We also excluded aromatic residues because they promote mammalian toxicity despite their positive effects on β-hairpin structure (22). ...
Preprint
Full-text available
Peptide macrocycles are a rapidly emerging new class of therapeutic, yet the design of their structure and activity remains challenging. This is especially true for those with β-hairpin structure due to weak folding properties and a propensity for aggregation. Here we use proteomic analysis and common antimicrobial features to design a large peptide library with macrocyclic β-hairpin structure. Using an activity-driven high-throughput screen we identify dozens of peptides killing bacteria through selective membrane disruption and analyze their biochemical features via machine learning. Active peptides contain a unique constrained structure and are highly enriched for cationic charge with arginine in their turn region. Our results provide a synthetic strategy for structured macrocyclic peptide design and discovery, while also elucidating characteristics important for β-hairpin antimicrobial peptide activity. Brief Summary We design, screen, and computationally analyze a synthetic macrocyclic β-hairpin peptide library for antibiotic potential.
... The SCS concept is simple, and its applications are diverse [36]. This field of study has expanded in silico [35][36][37][67][68][69][70][71][72][73][74][75], but the SCS concept has not yet been sufficiently explored to understand physiological systems. To our knowledge, the first physiological application in this field is the use of a group of peptides as immunological adjuvants [76,77]. ...
Article
Full-text available
Current SARS-CoV-2 vaccines take advantage of the viral spike protein required for infection in humans. Considering that spike proteins may contain both “self” and “nonself” sequences (sequences that exist in the human proteome and those that do not, respectively), nonself sequences are likely to be better candidate epitopes than self sequences for vaccines to efficiently eliminate pathogenic proteins and to reduce the potential long-term risks of autoimmune diseases. This viewpoint is likely important when one considers that various autoantibodies are produced in COVID-19 patients. Here, we comprehensively identified self and nonself short constituent sequences (SCSs) of 5 amino acid residues in the proteome of SARS-CoV-2. Self and nonself SCSs comprised 91.2% and 8.8% of the SARS-CoV-2 proteome, respectively. We identified potentially important nonself SCS clusters in the receptor-binding domain of the spike protein that overlap with previously identified epitopes of neutralizing antibodies. These nonself SCS clusters may serve as functional epitopes for effective, safe, and long-term vaccines against SARS-CoV-2 infection. Additionally, analyses of self/nonself status changes in mutants revealed that the SARS-CoV-2 proteome may be evolving to mimic the human proteome. Further SCS-based proteome analyses may provide useful information to predict epidemiological dynamics of the current COVID-19 pandemic.
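The self/nonself classification of 5-residue short constituent sequences described in this abstract can be illustrated with a minimal sketch. It assumes that a query SCS counts as "self" when the same 5-mer occurs anywhere in a reference proteome; the function names and the toy sequences are hypothetical and are not taken from the cited work.

```python
def kmers(seq, k=5):
    """Yield every overlapping window of k residues in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def classify_scs(query, reference_proteome, k=5):
    """Split the k-mers of `query` into self and nonself lists.

    A k-mer is treated as "self" if it occurs in any reference sequence
    (an assumption made for this sketch).
    """
    reference = set()
    for protein in reference_proteome:
        reference.update(kmers(protein, k))

    self_scs, nonself_scs = [], []
    for pos, scs in enumerate(kmers(query, k)):
        target = self_scs if scs in reference else nonself_scs
        target.append((pos, scs))
    return self_scs, nonself_scs


# Toy sequences for illustration only, not real proteome data.
toy_reference = ["MKTAYIAKQR", "GGSAYIAKLL"]
toy_query = "TAYIAKQW"
self_scs, nonself_scs = classify_scs(toy_query, toy_reference)
```

Storing the reference 5-mers in a set keeps each lookup constant-time on average, which matters when the reference is a whole proteome rather than two toy strings.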
... The FTIR spectroscopy analysis additionally reflects the possible secondary structure of CPI peptides [34]. The abundance of valine, isoleucine, and threonine, along with the FTIR peaks at 1,635 and 1,517 cm⁻¹, are correlated with the possible involvement of β-sheet conformations in the CPI. ...
Article
Full-text available
Naturally-derived proteins or peptides are promising biopolymers for tissue engineering applications owing to their health-promoting activity. Herein, we extracted proteins (~90%) from two-spotted cricket (Gryllus bimaculatus) and evaluated their osteoinductive potential in human bone marrow-derived mesenchymal stem cells (hBMSCs) under in vitro conditions. The extracted protein isolate was analyzed for the amino acid composition and the mass distribution of the constituent peptide fraction. Fourier transform infrared (FTIR) spectroscopy was used to determine the presence of biologically significant functional groups. The cricket protein isolate (CPI) exhibited characteristic protein peaks in the FTIR spectrum. Notably, an enhanced cell viability was observed in the presence of the extracted proteins, showing their biocompatibility. The CPI also exhibited antioxidant properties in a concentration-dependent manner. More significant mineralization was observed in the CPI-treated cells than in the control, suggesting their osteoinductive potential. The upregulation of the osteogenic marker genes (Runx2, ALP, OCN, and BSP) in CPI treated media compared with the control supports their osteoinductive nature. Therefore, cricket-derived protein isolates could be used as functional protein isolate for tissue engineering applications, especially for bone regeneration.
... There is a well-documented prevalence of certain amino acids in specific secondary structures. This has been observed since the 1970s in the pioneering work of Chou and Fasman (1974) and Garnier (1978), in a large survey of PDB proteins (Otaki et al. 2010), as well as in the cSP92 protein set (De Meutter and Goormaghtigh 2020). In cSP92, correlation coefficients between secondary structure content and each amino acid content can reach 0.4-0.5 (Fig. S1). ...
Article
Full-text available
Prediction of protein secondary structure from FTIR spectra usually relies on the absorbance in the amide I–amide II region of the spectrum. It assumes that the absorbance in this spectral region, i.e., roughly 1700–1500 cm⁻¹, arises solely from amide contributions. Yet it is accepted that, on average, about 20% of the absorbance is due to amino acid side chains. The present paper evaluates the contribution of amino acid side chains in this spectral region and the potential to improve secondary structure prediction after correcting for their contribution. We show that the β-sheet content prediction is improved upon subtraction of amino acid side chain contributions in the amide I–amide II spectral range. The improvement is appreciable; for instance, the error of prediction of β-sheet content decreases from 5.42 to 4.97% when evaluated by ascending stepwise regression. Other methods tested, such as partial least squares regression and support vector machines, also show improved accuracy for β-sheet content evaluation. The other structures, such as α-helix, do not significantly benefit from side chain contribution subtraction; in some cases prediction is even degraded. We show that co-linearity between secondary structure content and amino acid composition is not a main limitation for improving secondary structure prediction. We also show that, even though based on different criteria, secondary structures defined by DSSP and XTLSSTR both arrive at the same conclusion: only the β-sheet structure clearly benefits from side chain subtraction. It must be concluded that the benefit of side chain contribution subtraction for the evaluation of other secondary structure contents is limited by the very rough description of side chain absorbance, which does not take into account the variations related to their environment. The study was performed on a large protein set.
To deal with the large number of proteins present, we worked on protein microarrays deposited on BaF₂ slides, and FTIR spectra were acquired with an imaging system.
... Among all the lysine residues, most were found in helix and coil structures, while the percentage of lysine in the β-strand structure was the lowest (Figure 4b). This is consistent with the previous result that hydrophilic lysine occurs less frequently in the β-strand [58]. The trends were the same as those of the glycated lysine residues in the three cell lines. ...
Article
Glycation as a type of non-enzymatic protein modification is related to aging and chronic diseases, especially diabetes. Global analysis of protein glycation will aid in a better understanding of its formation mechanism and biological significance. In this work, we comprehensively investigated protein glycation in human cells (HEK293T, Jurkat, and MCF7 cells). The current results indicated that this non-enzymatic modification was not random, and protein at the extracellular regions and the nucleus were more frequently glycated. Systematic and site-specific analysis of glycated proteins allowed us to study the effect of the primary sequences and secondary structures of proteins on glycation. Furthermore, nearly every enzyme in the glycolytic pathway was found to be glycated and a possible mechanism was proposed. Many glycation sites were also previously reported as acetylation and ubiquitination sites, which strongly suggested that this non-enzymatic modification may disturb protein degradation and gene expression. The current results will facilitate further studies of protein glycation in biomedical and clinical research.
... Classification results for AAindex1 properties are shown in Table 7. In particular, most of these descriptors are Alpha and Turn propensities, which are conformational indices of amino acids. The amino acid conformational bias can affect the secondary structures of the protein interaction interface, and the frequency of occurrence of amino acids in different secondary structures is also different [42]. A few descriptors are physicochemical properties, such as pH. ...
Article
Full-text available
Background: Hot spot residues are functional sites in protein interaction interfaces. The identification of hot spot residues is time-consuming and laborious using experimental methods. In order to address the issue, many computational methods have been developed to predict hot spot residues. Moreover, most prediction methods are based on structural features, sequence characteristics, and/or other protein features. Results: This paper proposed an ensemble learning method to predict hot spot residues that only uses sequence features and the relative accessible surface area of amino acid sequences. In this work, a novel feature selection technique was developed, an auto-correlation function combined with a sliding window technique was applied to obtain the characteristics of amino acid residues in protein sequence, and an ensemble classifier with SVM and KNN base classifiers was built to achieve the best classification performance. Conclusion: The experimental results showed that our model yields the highest F1 score of 0.92 and an MCC value of 0.87 on ASEdb dataset. Compared with other machine learning methods, our model achieves a big improvement in hot spot prediction. Availability: http://deeplearner.ahu.edu.cn/web/HotspotEL.htm .
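The F1 score and MCC reported in this abstract are standard confusion-matrix metrics. As a reading aid, a minimal sketch of both formulas follows; the function names are illustrative and the counts used in the usage note are made up, not data from the cited study.

```python
import math


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For example, with hypothetical counts TP = 8, TN = 9, FP = 1, FN = 2, `f1_score(8, 1, 2)` equals 16/19 and `mcc(8, 9, 1, 2)` equals 70/√9900. Unlike F1, MCC also uses the true negatives, which is why both are often reported together for imbalanced classes.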
... Propensity scales that predict the probabilities of amino acids to be found in α-helices (Pα), β-sheets (Pβ) or coils (Pc) based on the analysis of 2216 proteins have been developed using three different criteria [43] and we have generated an averaged propensity scale derived from these values (Table 1). The averaged propensity scale only deviated by 1.2% from the three individual scales [43] and was in very good agreement with other thermodynamic studies that have predicted the propensity of amino acids to form α-helices [44,45] or β-sheets [46,47] using ΔΔG values, as well as a survey study which analyzed the frequency of amino acids that were found in α-helices, β-sheets or coils from 1590 proteins [48]. Each missense mutant was analyzed for how the amino acid change affected the Pα, Pβ or Pc value of the wild-type amino acid. ...
Article
Full-text available
The isolation and characterization of 42 unique nonfunctional missense mutants in the bacterial cytosolic β-galactosidase and catechol 2,3-dioxygenase enzymes allowed us to examine some of the basic general trends regarding protein structure and function. A total of 6 out of the 42, or 14.29% of the missense mutants were in α-helices, 17 out of the 42, or 40.48%, of the missense mutants were in β-sheets and 19 out of the 42, or 45.24% of the missense mutants were in unstructured coil, turn or loop regions. While α-helices and β-sheets are undeniably important in protein structure, our results clearly indicate that the unstructured regions are just as important. A total of 21 out of the 42, or 50.00% of the missense mutants caused either amino acids located on the surface of the protein to shift from hydrophilic to hydrophobic or buried amino acids to shift from hydrophobic to hydrophilic and resulted in drastic changes in hydropathy that would not be preferable. There was generally good consensus amongst the widely used algorithms, Chou-Fasman, GOR, Qian-Sejnowski, JPred, PSIPRED, Porter and SPIDER, in their ability to predict the presence of the secondary structures that were affected by the missense mutants and most of the algorithms predicted that the majority of the 42 inactive missense mutants would impact the α-helical and β-sheet secondary structures or the unstructured coil, turn or loop regions that they altered.
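The excerpt for this entry refers to propensity scales (Pα, Pβ, Pc) and to an averaged scale. A minimal sketch of how such values can be computed is given here, assuming the classical Chou-Fasman-style definition: the fraction of a residue type within one secondary structure state divided by its fraction over all residues. The scales actually used in the cited work may be defined differently, and the function names and toy strings are hypothetical.

```python
from collections import Counter


def propensities(sequence, states):
    """Return {state: {residue: propensity}} for aligned residue/state strings.

    propensity = (share of the residue within the state)
                 / (share of the residue overall)
    """
    if len(sequence) != len(states):
        raise ValueError("sequence and states must have equal length")
    total = len(sequence)
    residue_counts = Counter(sequence)
    state_counts = Counter(states)
    pair_counts = Counter(zip(sequence, states))

    table = {}
    for state, n_state in state_counts.items():
        table[state] = {}
        for residue, n_res in residue_counts.items():
            share_in_state = pair_counts[(residue, state)] / n_state
            share_overall = n_res / total
            table[state][residue] = share_in_state / share_overall
    return table


def average_scales(scales):
    """Average several {residue: value} scales residue by residue."""
    shared = set.intersection(*(set(s) for s in scales))
    return {r: sum(s[r] for s in scales) / len(scales) for r in shared}


# Toy data: H = helix, E = strand, C = coil (illustrative only).
toy_seq = "AAAVVAVVGG"
toy_states = "HHHHEEEECC"
toy_table = propensities(toy_seq, toy_states)
```

A value above 1 means the residue is over-represented in that state relative to its overall frequency; a value below 1 means it is under-represented.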
... Different amino acids, mainly owing to their physicochemical properties (e.g., the structure-determining torsion angles phi and psi, see Fig. 1.1), have different preferences for forming α-helices, β-sheet structures, β-turns, and loops, and therefore occur at different frequencies in these secondary structure elements (Otaki et al., 2010; Pace et al., 1996; Yan et al., 2008). Proteins thus naturally always fold into the thermodynamically most stable conformation with the lowest energy (Kumar and Nussinov, 2001; Pace et al., 1996; Dawson and Kent, 2000; Stryer; Alberts). ...
Thesis
The surprising discovery of intramolecular isopeptide bonds within immunoglobulin-like domains of adhesive extracellular proteins of Gram-positive bacteria has revolutionized biotechnology and synthetic biology. The "superglue" system resulting from splitting the CnaB2 domain of the fibronectin-binding protein of Streptococcus pyogenes, consisting of the protein SpyCatcher and its specific peptide partner SpyTag, is a frequently and diversely used molecular tool for bioconjugation. The post-translational control over proteins mediated by this SpyCatcher-SpyTag system (linking and immobilization of proteins, functionalization of materials) benefits from the covalent, irreversible, stable, specific, and robust character of the genetically encoded Catcher-Tag reaction partners, which react with each other very efficiently by forming an autocatalytic isopeptide bond. To extend and improve the range of applications of this technology, above all through programmability and controllability, alternative new Catcher-Tag systems based on isopeptide bonds were identified, characterized, and optimized in this work. Through a structure-based database search among cell-surface proteins of Gram-positive bacteria that are built from Ig-like domains of the CnaB type, four potential split-domain candidates were selected for closer analysis. The CnaB domains were split by in silico analyses into two parts, Catcher protein and Tag peptide, and then analyzed in vitro for their ability to reconstitute the split domain by forming an intermolecular isopeptide bond. Two of the four CnaB candidate domains, 4oq1 and 3kptC, were successfully identified in this work as new split-isopeptide systems by means of reconstitution experiments.
The 4oq1Catcher-4oq1Tag system is based on the D2 domain of the basal pilin RrgC of Streptococcus pneumoniae, whereas the 3kptCCatcher-3kptCTag system is derived from the CNA3 domain of the major pilin BcpA of Bacillus cereus. Identification was followed by a more detailed characterization of the new Catcher-Tag systems. After verification of the isopeptide triad, analysis of the reconstitution efficiency, and orthogonality tests of the new Catcher-Tag systems, the optimal reaction conditions (optimal Catcher-to-Tag ratio, testing of particular chemical additives, temperature and pH optima) were determined. It became clear that the newly identified Catcher-Tag systems were relatively slow and inefficient, especially in direct comparison with the highly efficient SpyCatcher-SpyTag system. Under optimal conditions, the reactivity of the systems could already be improved markedly. To further optimize the efficiency of the covalent reaction between the corresponding Catcher and Tag reaction partners, various protein engineering strategies were applied and the protein components were genetically manipulated. First, the initial products of the CnaB domain splits were re-examined by testing N- and C-terminal protein variants for improved reconstitution efficiency. Both essential amino acids and amino acids dispensable for the reaction were thereby identified. Diverse rational mutagenesis strategies followed, aimed above all at stabilizing the larger Catcher proteins through targeted amino acid substitutions. A few mutations improved the covalent reaction between Catcher and Tag, whereas most mutations brought no improvement or even reduced the reconstitution efficiency, which illustrates the complexity of protein engineering.
After confirming the functionality of the SpyCatcher-SpyTag system and the 4oq1Catcher-4oq1Tag system in planta, a first plant-biotechnological application (chloroplast isolation) of these covalent Catcher-Tag systems was tested. Possible other and new applications of the Catcher-Tag systems were discussed on the basis of initial preliminary experiments (production of a white trimer for BioLED applications, SpyCatcher- or SpyTag-functionalized magnetic particles). The Catcher-Tag systems investigated in this work thus expand the repertoire of useful, versatile molecular tools for irreversible, post-translational control over proteins and open up new applications, including in plant biotechnology.
... On this basis, it was suggested that short pentapeptide fragments are the lowest-level informational and structural units in a protein molecule, and that domains are the highest-level elements. MD calculations showed that pentapeptide fragments can have a predominantly realized (preferred) conformation and can therefore act as rigid "reinforcing" elements responsible for forming the spatial structure of the protein. However, an analysis of the primary structures of proteins contained in databases, carried out by Otaki et al., showed that not all theoretically possible pentapeptides (in contrast to tripeptides) can be found in real proteins (Otaki et al., 2005, 2010). In addition, the existence of functionally significant tripeptides (for example, RGD) indicates that it is precisely these that may be the structural units of minimal length in protein molecules. We have put forward the hypothesis that conformational-dynamic constraints play an important role in the molecular evolution of proteins. ...
... Parallel and anti-parallel strands differ by the mean values for the dihedral angles Phi and Psi which are respectively about -115° and +115° for parallel strands and about -145° and +145° for anti-parallel strands (Nesloney and Kelly, 1996). Parallel and anti-parallel strands differ also by their amino acid composition (Lifson and Sander, 1979) (Otaki et al., 2010). A rule predicting the anti-parallel character from the beta-strands' sequence was derived (Caudron and Jestin, 2012). ...
Article
Companion article to: Guilloux A, Caudron B, Jestin JL (2013) A method to predict edge strands in beta-sheets from protein sequences. Computational and Structural Biotechnology Journal. 7 (9): e201305001. doi: http://dx.doi.org/10.5936/csbj.201305001
Article
Full-text available
Despite extensive worldwide vaccination, the current COVID-19 pandemic caused by SARS-CoV-2 continues. The Omicron variant is a recently emerged variant of concern and is now overtaking the Delta variant. To characterize the potential antigenicity of the Omicron variant, we examined the distributions of SARS-CoV-2 nonself mutations (in reference to the human proteome) as five amino acid stretches of short constituent sequences (SCSs) in the Omicron and Delta proteomes. The number of nonself SCSs did not differ much throughout the Omicron, Delta, and reference sequence (RefSeq) proteomes but markedly increased in the receptor binding domain (RBD) of the Omicron spike protein compared to those of the Delta and RefSeq proteins. In contrast, the number of nonself SCSs decreased in non-RBD regions in the Omicron spike protein, compensating for the increase in the RBD. Several nonself SCSs were tandemly present in the RBD of the Omicron spike protein, likely as a result of selection for higher binding affinity to the ACE2 receptor (and, hence, higher infectivity and transmissibility) at the expense of increased antigenicity. Taken together, the present results suggest that the Omicron variant has evolved to have higher antigenicity and less virulence in humans despite increased infectivity and transmissibility.
Article
Programmed cell death ligand 1 (PD-L1) is considered a major immune checkpoint protein that mediates antitumor immune suppression and response. Effectively regulating PD-L1 expression and dynamic monitoring has become a significant challenge in immunotherapy. Herein, we adopted smart surface-enhanced Raman scattering (SERS) nanoprobes to discriminate and monitor the dynamic expression of PD-L1 under external electrostimulation (ES). The PD-L1 expression levels in three cell lines (MCF-7 cells, HeLa cells, and H8 cells) were assessed before and after ES. The results reveal that ES could effectively and rapidly mediate a transformation in the PD-L1 content (or activity) on the cell membrane. Moreover, the molecular profiles of the cell membrane before and after ES were revealed by using the label-free SERS method with the help of immune plasmonic nanoparticles. The cell membrane protein information presented identifiable conformation changes after ES, showing a significant inhibitory effect on the bridge of PD-L1 and its antibody. This study indicates that ES is superior to chemical drugs due to lesser side effects because ES-based regulation does not depend on intracellular signalling pathways. This strategy is versatile and robust for discriminating and monitoring PD-L1 on cell membranes, thus providing potential clinical application value to PD-L1-mediated systems. This study also offers a practical way to assess the molecular profiles of cell membrane proteins in the presence of an external stimulus, which may be applicable to many membrane protein-related studies.
Chapter
Being responsible for more than 90% of cellular functions, protein molecules are workhorses in all life forms. In order to cater for such a high demand, proteins have evolved to adopt diverse structures that allow them to perform a myriad of functions. Beginning with the genetically directed amino acid sequence, the classical understanding of protein function involves adoption of hierarchically complex yet ordered structures. However, advances made over the last two decades have revealed that as much as 50% of the eukaryotic proteome exists as partially or fully disordered structures. The significance of such intrinsically disordered proteins (IDPs) is further realized from their ability to exhibit multifunctionality, a feature attributable to their conformational plasticity. Among the coded amino acids, cysteines are considered to be “order-promoting” due to their ability to form inter- or intramolecular disulfide bonds, which confer robust thermal stability to the protein structure in oxidizing conditions. The co-existence of order-promoting cysteines with disorder-promoting sequences seems counter-intuitive, yet many proteins have evolved to contain such sequences. In this chapter, we review some of the known cysteine-containing protein domains categorized based on the number of cysteines they possess. We show that many protein domains contain disordered sequences interspersed with cysteines. We show that a positive correlation exists between the degree of cysteines and disorder within the sequences that flank them. Furthermore, based on the computational platform, IUPred2A, we show that cysteine-rich sequences display significant disorder in the reduced but not the oxidized form, increasing the potential for such sequences to function in a redox-sensitive manner.
Overall, this chapter provides insights into an exquisite evolutionary design wherein disordered sequences with interspersed cysteines enable potential modulatory protein functions under stress and environmental conditions, which thus far remained largely inconspicuous.
Article
Full-text available
Quantifying the chemical composition of unstained intact tissue and cellular samples with high spatio-temporal resolution in three dimensions would provide a step change in cell and tissue analytics critical to progress the field of cell biology. Label-free optical microscopy offers the required resolution and non-invasiveness, yet quantitative imaging with chemical specificity is a challenging endeavor. In this work, we show that hyperspectral coherent anti-Stokes Raman scattering (CARS) microscopy can be used to provide quantitative volumetric imaging of human osteosarcoma cells at various stages through cell division, a fundamental component of the cell cycle progress resulting in the segregation of cellular content to produce two progeny. We have developed and applied a quantitative data analysis method to produce volumetric three-dimensional images of the chemical composition of the dividing cell in terms of water, proteins, DNAP (a mixture of proteins and DNA, similar to chromatin), and lipids. We then used these images to determine the dry masses of the corresponding organic components. The attribution of proteins and DNAP components was validated using specific well-characterised fluorescent probes, by comparison with correlative two-photon fluorescence microscopy of DNA and mitochondria. Furthermore, we map the same chemical components under perturbed conditions, employing a drug that interferes directly with cell division (Taxol), showing its influence on cell organization and the masses of proteins, DNAP, and lipids.
Article
Full-text available
In this work, using molecular dynamics simulation, we study conformational and dynamic properties of biologically active penta- and tetrapeptides derived from fetoplacental proteins such as alpha-fetoprotein, pregnancy specific β1-glycoprotein, and carcinoembryonic antigen. Existence of correlation between flexibility of peptide backbone and biological activity of the investigated peptides was shown. It was also demonstrated that flexibility of peptide backbone depends not only on its length, but also on the presence of reactive functional groups in amino acid side chains that participate in intramolecular interactions. Peptides that demonstrate similar biological effects in regulation of proliferation of lymphocytes and expression of differentiation antigens on their surface (LDSYQCT, PYECE, YECE, and YVCE) are characterized by rigidity of their peptide backbone. Increased backbone flexibility in peptides PYQCE, YQCE, SYKCE, YQCT, YQCS, YVCS, YACS, and YACE is correlated with decreased biological activity. Conformational mobility of amino acid residues does not depend on physicochemical properties only, but also on intramolecular interactions. So, evolutionary restrictions should exist to maintain such interactions in the environment of functionally important sites.
Conference Paper
This paper reports the use of lacunarity analysis of protein sequences as a new method to analyze the distribution of amino acids in a protein sequence. One of the key results is that distribution of hydrophobic amino acids in a protein sequence exhibit fractal like behavior. It is found that lacunarity plots of distribution of hydrophobic amino acids follow similar patterns for a given protein sequence as well as for amino acid sequences that are extracted from the given protein sequence as prefixes with length reduced by half from the original sequence length. Another interesting result is that using the lacunarity values of chaos game representations of amino acid sequences, we can prove the non-random nature of protein sequences. Lacunarity values also help us to classify a set of true and random protein sequences. These two findings affirm lacunarity analysis as a novel and promising bio-sequence analysis method.
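To make the idea of lacunarity on a sequence concrete, here is a minimal sketch using the standard gliding-box definition: for a box of width r slid along a binary sequence, lacunarity is the second moment of the box mass divided by the square of its mean. The hydrophobic residue set below is an assumption chosen for illustration, and the cited paper may use a different encoding or a chaos game representation instead.

```python
# Assumed hydrophobic set, for illustration only.
HYDROPHOBIC = set("AVILMFWC")


def to_binary(sequence, positives=HYDROPHOBIC):
    """Encode a protein sequence as 1 (hydrophobic) or 0 (other)."""
    return [1 if residue in positives else 0 for residue in sequence]


def lacunarity(binary, box):
    """Gliding-box lacunarity: E[M^2] / E[M]^2 over all boxes of width `box`."""
    if box < 1 or box > len(binary):
        raise ValueError("box must be between 1 and len(binary)")
    masses = [sum(binary[i:i + box]) for i in range(len(binary) - box + 1)]
    mean = sum(masses) / len(masses)
    if mean == 0:
        raise ValueError("lacunarity is undefined when every box is empty")
    second_moment = sum(m * m for m in masses) / len(masses)
    return second_moment / (mean * mean)
```

A perfectly regular pattern gives a lacunarity of 1 at box sizes that match its period, while clumped patterns give larger values; plotting lacunarity against box size on log-log axes is the usual way to look for the fractal-like behavior the abstract mentions.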
Article
One of the important secondary structures in proteins is the β-strand. However, due to its complexity, it is less characterized than helical structures. Using the 1641 representative three-dimensional protein structure data from the Protein Data Bank, we characterized β-strand structures based on strand length and amino acid composition, focusing on differences between parallel and antiparallel β-strands. Antiparallel strands were more frequent and slightly longer than parallel strands. Overall, the majority of β-sheets were antiparallel sheets; however, mixed sheets were reasonably abundant, and parallel sheets were relatively rare. Notably, the nonpolar, aliphatic hydrocarbon amino acids, valine, isoleucine, and leucine were observed at a high frequency in both strands but were more abundant in parallel than in antiparallel strands. The relative amino acid occurrence in β-sheets, especially in parallel strands, was highly correlated with amino acid hydrophobicity. This correlation was not observed in α-helices and 3(10)-helices. In addition, we examined the frequency of 400 amino acid doublets and 8000 amino acid triplets in β-strands based on availability, a measurement of the relative counts of the doublets and triplets. We identified some triplets that were specifically found in either parallel or antiparallel strands. We further identified "zero-count triplets" which did not occur in either parallel or antiparallel strands, despite the fact that they were probabilistically supposed to occur several times. Taken together, the present study revealed essential features of β-strand structures and the differences between parallel and antiparallel β-strands, which can potentially be applied to the secondary structure prediction and the functional design of protein sequences in the future.
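The notion of "zero-count triplets" (triplets that never occur although they would be expected several times by chance) can be sketched as follows. This is a simplified illustration under stated assumptions: the expected count is taken from an independence model based on single-residue frequencies within the analyzed segments, and the threshold `min_expected` is hypothetical. The abstract does not give the exact formula for "availability", so this sketch should not be read as the authors' definition.

```python
from collections import Counter
from itertools import product


def triplet_stats(segments):
    """Return (observed triplet counts, expected counts under independence)."""
    residue_counts = Counter()
    triplet_counts = Counter()
    for seg in segments:
        residue_counts.update(seg)
        triplet_counts.update(seg[i:i + 3] for i in range(len(seg) - 2))

    n_residues = sum(residue_counts.values())
    n_windows = sum(triplet_counts.values())
    freq = {aa: n / n_residues for aa, n in residue_counts.items()}

    expected = {}
    for a, b, c in product(freq, repeat=3):
        expected[a + b + c] = n_windows * freq[a] * freq[b] * freq[c]
    return triplet_counts, expected


def zero_count_triplets(segments, min_expected=1.0):
    """Triplets never observed although expected at least `min_expected` times."""
    observed, expected = triplet_stats(segments)
    return sorted(t for t, e in expected.items()
                  if observed[t] == 0 and e >= min_expected)


# Toy strand-like segments, illustrative only.
toy_segments = ["VIVIVIVI", "IVIVIVIV"]
toy_zero = zero_count_triplets(toy_segments)
```

In the toy data only the alternating triplets VIV and IVI occur, so every other combination of V and I is reported as a zero-count triplet because each has an expected count of 1.5.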
Article
An empirical correlation between the fluorine isotropic chemical shifts, measured by ¹⁹F NMR spectroscopy, and the type of fluorine-protein interactions observed in crystal structures is presented. The CF, CF₂, and CF₃ groups present in fluorinated ligands found in the Protein Data Bank were classified according to their ¹⁹F NMR chemical shifts and their close intermolecular contacts with the protein atoms. Shielded fluorine atoms, i.e., those with increased electron density, are observed primarily in close contact to hydrogen bond donors within the protein structure, suggesting the possibility of intermolecular hydrogen bond formation. Deshielded fluorines are predominantly found in close contact with hydrophobic side chains and with the carbon of carbonyl groups of the protein backbone. Correlation between the ¹⁹F NMR chemical shift and hydrogen bond distance, both derived experimentally and computed through quantum chemical methods, is also presented. The proposed "rule of shielding" provides some insight into and guidelines for the judicious selection of appropriate fluorinated moieties to be inserted into a molecule for making the most favorable interactions with the receptor.
Article
Full-text available
Though prediction of protein secondary structures has been an active research issue in bioinformatics for quite a few years and many approaches have been proposed, a new challenge emerges as the sizes of contemporary protein structure databases continue to grow rapidly. The new challenge concerns how we can effectively exploit all the information implicitly deposited in the protein structure databases and deliver ever-improving prediction accuracy as the databases expand rapidly. The new challenge is addressed in this article by proposing a predictor designed with a novel kernel density estimation algorithm. One main distinctive feature of the kernel density estimation based approach is that the average execution time taken by the training process is in the order of O(n log n), where n is the number of instances in the training dataset. In the experiments reported in this article, the proposed predictor delivered an average Q3 (three-state prediction accuracy) score of 80.3% and an average SOV (segment overlap) score of 76.9% for a set of 27 benchmark protein chains extracted from the EVA server that are longer than 100 residues. The experimental results reported in this article reveal that we can continue to achieve higher prediction accuracy of protein secondary structures by effectively exploiting the structural information deposited in fast-growing protein structure databases. In this respect, the kernel density estimation based approach enjoys a distinctive advantage with its low time complexity for carrying out the training process.
Article
Full-text available
(1) Co-operation between a laboratory interested in developing the theory for protein secondary structure prediction methods and a laboratory interested in applying and comparing such methods has led to the development of a simple predictive algorithm.
(2) Four-state predictions, in which each residue is unambiguously assigned one conformational state of α-helix, extended chain, reverse turn or coil, predict 49% of residue states correctly (in a sample of 26 proteins) when the overall helix and extended-chain content is not taken into account.
(3) When the relative abundances of helix, extended chain, reverse turn and coil observed by X-ray crystallography are taken into account, a single constant for each protein and type of conformation can be used to bias the prediction. When predictions are optimized in this way, 63% of all residue states are unambiguously and correctly assigned.
(4) By analysing the nature of the bias required, proteins can be classified into helix-rich types, pleated-sheet-rich types, and so on. It is shown that, if the type of protein can be determined even approximately by circular dichroism, 57% of residue states can be correctly predicted without taking into account the X-ray structure. Further, comparable predictions can be obtained if, instead of circular dichroism, preliminary predictions are made to assess the protein type.
(5) It is emphasized that the numbers quoted here depend on the method used to assess accuracy, and the algorithm is shown to be at least as good as, and usually superior to, the reported prediction methods assessed in the same way.
(6) Ways of further enhancing predictions by the use of additional information from hydrophobic triplets and homologous sequences are also explored. Hydrophobic triplet information does not significantly improve predictive power and it is concluded that this information is used by proteins in the next stage of folding. On the other hand, the use of homologous sequences appears to be very promising.
(7) The implication of these results in protein folding is discussed.
Article
Full-text available
In recent host-guest studies, the helix-forming tendencies of amino acid residues have been quantified by three groups, each obtaining similar results [Padmanabhan, S., Marqusee, S., Ridgeway, T., Laue, T. M. & Baldwin, R. L. (1990) Nature (London) 344, 268-270; O'Neil, K. T. & DeGrado, W. F. (1990) Science 250, 646-651; Lyu, P. C., Liff, M. I., Marky, L. A. & Kallenbach, N. R. (1990) Science 250, 669-673]. Here, we explore the hypothesis that these measured helix-forming propensities are due primarily to conformational restrictions imposed upon residue side chains by the helix itself. This proposition is tested by calculating the extent to which the bulky helix backbone "freezes out" available degrees of freedom in helix side chains. Specifically, for a series of apolar residues, the difference in configurational entropy, ΔS, between each side chain in the unfolded state and in the alpha-helical state is obtained from a simple Monte Carlo calculation. These computed entropy differences are then compared with the experimentally determined values. Measured and calculated values are found to be in close agreement for naturally occurring amino acids and in total disagreement for non-natural amino acids. In the calculation, ΔS(Ala) = 0. The rank order of entropy loss for the series of natural apolar side chains under consideration is Ala < Leu < Trp < Met < Phe < Ile < Tyr < Val. Among these, none favor helix formation; Ala is neutral, and all remaining residues are unfavorable to varying degrees. Thus, applied to side chains, the term "helix preference" is a misnomer. While side chain-side chain interactions may modulate stability in some instances, our results indicate that the drive to form helices must originate in the backbone, consistent with Pauling's view of four decades ago [Pauling, L., Corey, R. B. & Branson, H. R. (1951) Proc. Natl. Acad. Sci. USA 37, 205-210].
Article
Full-text available
Several authors have proposed that predictions of protein secondary structure derived from statistical information about the known structures can be improved when information about neighboring residues participating in short and medium range interactions is included. A substantial improvement shown here indicates that current methods of including this information are not more successful than methods that do not. Evaluations of the Chou and Fasman method (Adv. Enzymol. 47 (1978) 45-148), that does not include information about interactions (except in averaging), have shown it to be about 49% correct for three states (helix, beta-sheet and undefined). In comparison, the method of Garnier et al. (J. Mol. Biol. 120 (1978) 97-120), that explicitly includes information about neighboring residues, has an accuracy of 57% residues correct for three states. However, we have obtained an 8% improvement for predictions of secondary structure based on the algorithm by Chou and Fasman. The improvements are obtained by eliminating many rules and by choosing the best decision constants for structure assignments. The simplified method described here is 57% correct for three states using preference values calculated in 1978.
Article
Full-text available
A predictive rule for protein folding is presented that involves two recurrent glycine-based motifs that cap the carboxyl termini of alpha helices. In proteins, helices that terminated in glycine residues were found predominantly in one of these two motifs. These glycine structures had a characteristic pattern of polar and apolar residues. Visual inspection of known helical sequences was sufficient to distinguish the two motifs from each other and from internal glycines that fail to terminate helices. These glycine motifs--in which the local sequence selects between available structures--represent an example of a stereochemical rule for protein folding.
Article
Full-text available
The pair-coupled amino acid composition is introduced to predict the secondary structure contents of a protein. Compared with the existing methods all based on singlewise amino acid composition as defined in a 20D (dimensional) space, this represents a step forward to the consideration of the sequence coupling effect. The test results indicate that the introduction of the pair-coupled amino acid composition can significantly improve the prediction quality. It is anticipated that the concept of the pair-coupled amino acid composition can be used to simplify the formulation of sequence coupling (or sequence order) effects and to study many other features of proteins as well.
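To make the contrast with the 20-dimensional single-residue composition concrete, the sketch below builds a 400-dimensional vector of adjacent-pair frequencies. The abstract does not give the exact formulation of the pair-coupled composition, so treating it as plain normalized counts of adjacent residue pairs is an assumption, and `pair_composition` is a hypothetical name.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Fixed ordering of the 400 ordered residue pairs (AA, AC, ..., YY).
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def pair_composition(seq):
    """Return a 400-element list of adjacent-pair frequencies for `seq`.

    Pairs containing non-standard residue codes are skipped. This is an
    illustrative reading of "pair-coupled composition", not the paper's
    exact definition.
    """
    seq = seq.upper()
    counts = dict.fromkeys(PAIRS, 0)
    n = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            n += 1
    if n == 0:
        return [0.0] * len(PAIRS)
    return [counts[p] / n for p in PAIRS]
```

Unlike the single-residue composition, this vector distinguishes sequences with the same residue content but different ordering, which is the sequence coupling effect the abstract refers to.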
Article
Full-text available
PDB-REPRDB is a database of representative protein chains from the Protein Data Bank (PDB). The previous version of PDB-REPRDB provided 48 representative sets, whose similarity criteria were predetermined, on the WWW. The current version is designed so that the user may obtain a quick selection of representative chains from PDB. The selection of representative chains can be dynamically configured according to the user’s requirement. The WWW interface provides a large degree of freedom in setting parameters, such as cut-off scores of sequence and structural similarity. One can obtain a representative list and classification data of protein chains from the system. The current database includes 20 457 protein chains from PDB entries (August 6, 2000). The system for PDB-REPRDB is available at the Parallel Protein Information Analysis system (PAPIA) WWW server (http://www.rwcp.or.jp/papia/).
Article
Full-text available
A database of 118 non-redundant proteins was examined to determine the preferences of amino acids for secondary structures: alpha-helix, beta-strand and coil conformations. To better understand how the physicochemical properties of amino acid side chains might influence protein folding, several new scales have been suggested for quantifying the electronic effects of amino acids. These include the pKa at the amino group, localized effect substituent constants (esigma), and a composite of these two scales (epsilon). Amino acids were also classified into 5 categories on the basis of their electronic properties: O (strong electron donor), U (weak donor), Z (ambivalent), B (weak electron acceptor), and X (strong acceptor). Certain categories of amino acid appeared to be critical for particular conformations, e.g., O and U-type residues for alpha-helix formation. Pairwise analysis of the database according to these categories revealed significant context effects in the structural preferences. In general, the propensity of an amino acid for a particular conformation was related to the electronic features of the side chain. Linear regression analyses revealed that the electronic properties of amino acids contributed about as much to the folding preferences as hydrophobicity, which is a well-established determinant of protein folding. A theoretical model has been proposed to explain how the electronic properties of the side chain groups might influence folding along the peptide backbone.
Article
Full-text available
A new method that uses support vector machines (SVMs) to predict protein secondary structure is described and evaluated. The study is designed to develop a reliable prediction method using an alternative technique and to investigate the applicability of SVMs to this type of bioinformatics problem. Binary SVMs are trained to discriminate between two structural classes. The binary classifiers are combined in several ways to predict multi-class secondary structure. The average three-state prediction accuracy per protein (Q(3)) is estimated by cross-validation to be 77.07 +/- 0.26% with a segment overlap (Sov) score of 73.32 +/- 0.39%. The SVM performs similarly to the 'state-of-the-art' PSIPRED prediction method on a non-homologous test set of 121 proteins despite being trained on substantially fewer examples. A simple consensus of the SVM, PSIPRED and PROFsec achieves significantly higher prediction accuracy than the individual methods.
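Several abstracts in this list report Q3 scores. A minimal sketch of the per-protein calculation follows, assuming observed and predicted structures are encoded as equal-length strings over three states (the letters H, E and C are an assumed encoding, and `q3_score` is a hypothetical name).

```python
def q3_score(observed, predicted):
    """Percentage of residues whose predicted three-state label matches the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)
```

A figure such as the per-protein average quoted above would then be the mean of `q3_score` over all test proteins, which can differ from pooling all residues into a single ratio.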
Article
Full-text available
This chapter elaborates protein structure prediction using Rosetta. Double-blind assessments of protein structure prediction methods have indicated that the Rosetta algorithm is perhaps the most successful current method for de novo protein structure prediction. In the Rosetta method, short fragments of known proteins are assembled by a Monte Carlo strategy to yield native-like protein conformations. Using only sequence information, successful Rosetta predictions yield models with typical accuracies of 3–6 Å Cα root mean square deviation (RMSD) from the experimentally determined structures for contiguous segments of 60 or more residues. For each structure prediction, many short simulations starting from different random seeds are carried out to generate an ensemble of decoy structures that have both favorable local interactions and protein-like global properties. This set is then clustered by structural similarity to identify the broadest free energy minima. The effectiveness of conformation modification operators for energy function optimization is also described in this chapter.
Article
Full-text available
In our previous approach, we proposed a hybrid method for protein secondary structure prediction called HYPROSP, which combined our proposed knowledge-based prediction algorithm PROSP and PSIPRED. The knowledge base constructed for PROSP contains small peptides together with their secondary structural information. The hybrid strategy of HYPROSP uses a global quantitative measure, match rate, to determine whether PROSP or PSIPRED is to be used for the prediction of a target protein. HYPROSP made slight improvement of Q(3) over PSIPRED because PROSP predicted well for proteins with match rate >80%. As the portion of proteins with match rate >80% is quite small and as the performance of PSIPRED also improves, the advantage of HYPROSP is diluted. To overcome this limitation and further improve the hybrid prediction method, we present in this paper a new hybrid strategy HYPROSP II that is based on a new quantitative measure called local match rate. Local match rate indicates the amount of structural information that each amino acid can extract from the knowledge base. With the local match rate, we are able to define a confidence level of the PROSP prediction results for each amino acid. Our new hybrid approach, HYPROSP II, is proposed as follows: for each amino acid in a target protein, we combine the prediction results of PROSP and PSIPRED using a hybrid function defined on their respective confidence levels. Two datasets in nrDSSP and EVA are used to perform a 10-fold cross validation. The average Q(3) of HYPROSP II is 81.8% and 80.7% on nrDSSP and EVA datasets, respectively, which is 2.0% and 1.1% better than that of PSIPRED. For local structures with match rate >80%, the average Q(3) improvement is 4.4% on the nrDSSP dataset. The use of local match rate improves the accuracy better than global match rate. There has been a long history of attempts to improve secondary structure prediction. 
We believe that HYPROSP II has greatly utilized the power of the peptide knowledge base and raised the prediction accuracy to a new high. The method we developed in this paper could have a profound effect on the general use of knowledge base techniques for various prediction algorithms. The Linux executable file of HYPROSP II, as well as both nrDSSP and EVA datasets, can be downloaded from http://bioinformatics.iis.sinica.edu.tw/HYPROSPII/.
Article
Full-text available
Electronic properties of amino acid side chains such as inductive and field effects have not been characterized in any detail. Quantum mechanics (QM) calculations and fundamental equations that account for substituent effects may provide insight into these important properties. PM3 analysis of electron distribution and polarizability was used to derive quantitative scales that describe steric factors, inductive effects, resonance effects, and field effects of amino acid side chains. These studies revealed that: (1) different semiempirical QM methods yield similar results for the electronic effects of side chain groups, (2) polarizability, which reflects molecular deformability, represents steric factors in electronic terms, and (3) inductive effects contribute to the propensity of an amino acid for alpha-helices. The data provide initial characterization of the substituent effects of amino acid side chains and suggest that these properties affect electron density along the peptide backbone.
Article
Full-text available
The accuracy of protein secondary structure prediction has steadily improved over the past 30 years. Now many secondary structure prediction methods routinely achieve an accuracy (Q3) of about 75%. We believe this accuracy could be further improved by including structure (as opposed to sequence) database comparisons as part of the prediction process. Indeed, given the large size of the Protein Data Bank (>35,000 sequences), the probability of a newly identified sequence having a structural homologue is actually quite high. We have developed a method that performs structure-based sequence alignments as part of the secondary structure prediction process. By mapping the structure of a known homologue (sequence ID >25%) onto the query protein's sequence, it is possible to predict at least a portion of that query protein's secondary structure. By integrating this structural alignment approach with conventional (sequence-based) secondary structure methods and then combining it with a "jury-of-experts" system to generate a consensus result, it is possible to attain very high prediction accuracy. Using a sequence-unique test set of 1644 proteins from EVA, this new method achieves an average Q3 score of 81.3%. Extensive testing indicates this is approximately 4-5% better than any other method currently available. Assessments using non sequence-unique test sets (typical of those used in proteome annotation or structural genomics) indicate that this new method can achieve a Q3 score approaching 88%. By using both sequence and structure databases and by exploiting the latest techniques in machine learning it is possible to routinely predict protein secondary structure with an accuracy well above 80%. A program and web server, called PROTEUS, that performs these secondary structure predictions is accessible at http://wishart.biology.ualberta.ca/proteus. For high throughput or batch sequence analyses, the PROTEUS programs, databases (and server) can be downloaded and run locally.
Article
Full-text available
The carboxy termini of proteins are a frequent site of activity for a variety of biologically important functions, ranging from post-translational modification to protein targeting. Several short peptide motifs involved in protein sorting roles and dependent upon their proximity to the C-terminus for proper function have already been characterized. As a limited number of such motifs have been identified, the potential exists for genome-wide statistical analysis and comparative genomics to reveal novel peptide signatures functioning in a C-terminal dependent manner. We have applied a novel methodology to the prediction of C-terminal-anchored peptide motifs involving a simple z-statistic and several techniques for improving the signal-to-noise ratio. We examined the statistical over-representation of position-specific C-terminal tripeptides in 7 eukaryotic proteomes. Sequence randomization models and simple-sequence masking were applied to the successful reduction of background noise. Similarly, as C-terminal homology among members of large protein families may artificially inflate tripeptide counts in an irrelevant and obfuscating manner, gene-family clustering was performed prior to the analysis in order to assess tripeptide over-representation across protein families as opposed to across all proteins. Finally, comparative genomics was used to identify tripeptides significantly occurring in multiple species. This approach has been able to predict, to our knowledge, all C-terminally anchored targeting motifs present in the literature. These include the PTS1 peroxisomal targeting signal (SKL*), the ER-retention signal (K/HDEL*), the ER-retrieval signal for membrane bound proteins (KKxx*), the prenylation signal (CC*) and the CaaX box prenylation motif. 
In addition to a high statistical over-representation of these known motifs, a collection of significant tripeptides with a high propensity for biological function exists between species, among kingdoms and across eukaryotes. Motifs of note include a serine-acidic peptide (DSD*) as well as several lysine enriched motifs found in nearly all eukaryotic genomes examined. We have successfully generated a high confidence representation of eukaryotic motifs anchored at the C-terminus. A high incidence of true-positives in our results suggests that several previously unidentified tripeptide patterns are strong candidates for representing novel peptide motifs of a widely employed nature in the C-terminal biology of eukaryotes. Our application of comparative genomics, statistical over-representation and the adjustment for protein family homology has generated several hypotheses concerning the C-terminal topology as it pertains to sorting and potential protein interaction signals. This approach to background reduction could be expanded for application to protein motif prediction in the protein interior. A parallel N-terminal analysis is presented as supplementary data.
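The over-representation test mentioned in this abstract can be illustrated with a simple binomial z-score. The abstract only says "a simple z-statistic", so the binomial form below, the background probability `p`, and the function name are all assumptions for illustration.

```python
import math


def tripeptide_z_score(observed, n_sequences, p):
    """Binomial z-score for a C-terminal tripeptide.

    observed     -- number of sequences ending in the tripeptide
    n_sequences  -- number of sequences examined
    p            -- assumed background probability of the tripeptide
    """
    if n_sequences <= 0 or not 0.0 < p < 1.0:
        raise ValueError("need n_sequences > 0 and 0 < p < 1")
    expected = n_sequences * p
    std_dev = math.sqrt(n_sequences * p * (1.0 - p))
    return (observed - expected) / std_dev
```

In practice `p` would come from a background model, such as the randomized sequences the abstract describes, and the counts would be taken after clustering gene families so that large families do not inflate them.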
Article
Full-text available
Recent sequencing projects and the growth of sequence data banks enable oligopeptide patterns to be characterized on a genome or kingdom level. Several studies have focused on kingdom or habitat classifications based on the abundance of short peptide patterns. There have also been efforts at local structural prediction based on short sequence motifs. Oligopeptide patterns undoubtedly carry valuable information content. Therefore, it is important to characterize these informational peptide patterns to shed light on possible new applications and the pitfalls implicit in neglecting bias in peptide patterns. We have studied four classes of pentapeptide patterns (designated POP, NEP, ORP and URP) in the kingdoms archaea, bacteria and eukaryotes. POP are highly abundant patterns statistically not expected to exist; NEP are patterns that do not exist but are statistically expected to; ORP are patterns unique to a kingdom; and URP are patterns excluded from a kingdom. We used two data sources: the de facto standard of protein knowledge Swiss-Prot, and a set of 386 completely sequenced genomes. For each class of peptides we looked at the 100 most extreme and found both known and unknown sequence features. Most of the known sequence motifs can be explained on the basis of the protein families from which they originate. We find an inherent bias of certain oligopeptide patterns in naturally occurring proteins that cannot be explained solely on the basis of residue distribution in single proteins, kingdoms or databases. We see three predominant categories of patterns: (i) patterns widespread in a kingdom such as those originating from respiratory chain-associated proteins and translation machinery; (ii) proteins with structurally and/or functionally favored patterns, which have not yet been ascribed this role; (iii) multicopy species-specific retrotransposons, only found in the genome set. 
These categories will affect the accuracy of sequence pattern algorithms that rely mainly on amino acid residue usage. Methods presented in this paper may be used to discover targets for antibiotics, as we identify numerous examples of kingdom-specific antigens among our peptide classes. The methods may also be useful for detecting coding regions of genes.
Article
Full-text available
The Protein Data Bank (PDB; http://www.rcsb.org/pdb/ ) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource. INTRODUCTION The Protein Data Bank (PDB) was established at Brookhaven National Laboratories (BNL) (1) in 1971 as an archive for biological macromolecular crystal structures. In the beginning the archive held seven structures, and with each year a handful more were deposited. In the 1980s the number of deposited structures began to increase dramatically. This was due to the improved technology for all aspects of the crystallographic process, the addition of structures determined by nuclear magnetic resonance (NMR) methods, and changes in the community views about data sharing. By the early 1990s the majority of journals required a PDB accession code and at le...
Article
For a successful analysis of the relation between amino acid sequence and protein structure, an unambiguous and physically meaningful definition of secondary structure is essential. We have developed a set of simple and physically motivated criteria for secondary structure, programmed as a pattern-recognition process of hydrogen-bonded and geometrical features extracted from x-ray coordinates. Cooperative secondary structure is recognized as repeats of the elementary hydrogen-bonding patterns “turn” and “bridge.” Repeating turns are “helices,” repeating bridges are “ladders,” connected ladders are “sheets.” Geometric structure is defined in terms of the concepts torsion and curvature of differential geometry. Local chain “chirality” is the torsional handedness of four consecutive Cα positions and is positive for right-handed helices and negative for ideal twisted β-sheets. Curved pieces are defined as “bends.” Solvent “exposure” is given as the number of water molecules in possible contact with a residue. The end result is a compilation of the primary structure, including SS bonds, secondary structure, and solvent exposure of 62 different globular proteins. The presentation is in linear form: strip graphs for an overall view and strip tables for the details of each of 10,925 residues. The dictionary is also available in computer-readable form for protein structure prediction work.
Article
A computer program that progressively evaluates the hydrophilicity and hydrophobicity of a protein along its amino acid sequence has been devised. For this purpose, a hydropathy scale has been composed wherein the hydrophilic and hydrophobic properties of each of the 20 amino acid side-chains is taken into consideration. The scale is based on an amalgam of experimental observations derived from the literature. The program uses a moving-segment approach that continuously determines the average hydropathy within a segment of predetermined length as it advances through the sequence. The consecutive scores are plotted from the amino to the carboxy terminus. At the same time, a midpoint line is printed that corresponds to the grand average of the hydropathy of the amino acid compositions found in most of the sequenced proteins. In the case of soluble, globular proteins there is a remarkable correspondence between the interior portions of their sequence and the regions appearing on the hydrophobic side of the midpoint line, as well as the exterior portions and the regions on the hydrophilic side. The correlation was demonstrated by comparisons between the plotted values and known structures determined by crystallography. In the case of membrane-bound proteins, the portions of their sequences that are located within the lipid bilayer are also clearly delineated by large uninterrupted areas on the hydrophobic side of the midpoint line. As such, the membrane-spanning segments of these proteins can be identified by this procedure. Although the method is not unique and embodies principles that have long been appreciated, its simplicity and its graphic nature make it a very useful tool for the evaluation of protein structures.
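The moving-segment procedure described above can be sketched in a few lines. The scale values below are those of the published Kyte-Doolittle hydropathy scale (they are not listed in this abstract), while the default window length of 9 and the function name are assumptions chosen for illustration.

```python
# Kyte-Doolittle hydropathy values per residue (positive = hydrophobic).
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=9):
    """Average hydropathy of each segment of length `window`, N- to C-terminus.

    Raises KeyError for residue codes outside the 20 standard amino acids.
    """
    seq = seq.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    scores = [KD_SCALE[aa] for aa in seq]
    return [sum(scores[i:i + window]) / window
            for i in range(len(seq) - window + 1)]
```

Plotting the returned values against segment position gives the kind of profile the abstract describes; longer windows smooth the profile and are the usual choice when looking for membrane-spanning segments.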
Article
Despite proline being assumed to be a helix-breaker, a large number of α-helices are found to contain Pro in globular as well as membrane proteins. Proline has no free NH group and therefore cannot form the conventional intra-helical NH⋯O = C hydrogen bond. An analysis of known protein structures has shown that the Cδ protons are involved in C—H⋯O hydrogen bonds, usually two, with the carbonyl groups in the preceding turn of the helix (four and three residues away). These interactions satisfy the hydrogen bond forming potential of the carbonyl groups, which would otherwise, in the case of membrane-bound helices, be unfavorably exposed to hydrophobic surroundings. Depending on the type (based on the location of the carbonyl group, usually three, four or five residues preceding Pro) of C—H⋯O interactions, the kink in the helix may be of different magnitude. The puckering (UP or DOWN) of the pyrrolidine ring of Pro residues is controlled by the type of the C—H⋯O bond present, and the form that provides a better hydrogen bond geometry is preferred.
Article
Helix-capping motifs are specific patterns of hydrogen bonding and hydrophobic interactions found at or near the ends of helices in both proteins and peptides. In an alpha-helix, the first four >N-H groups and last four >C=O groups necessarily lack intrahelical hydrogen bonds. Instead, such groups are often capped by alternative hydrogen bond partners. This review enlarges our earlier hypothesis (Presta LG, Rose GD. 1988. Helix signals in proteins. Science 240:1632-1641) to include hydrophobic capping. A hydrophobic interaction that straddles the helix terminus is always associated with hydrogen-bonded capping. From a global survey among proteins of known structure, seven distinct capping motifs are identified: three at the helix N-terminus and four at the C-terminus. The consensus sequence patterns of these seven motifs, together with results from simple molecular modeling, are used to formulate useful rules of thumb for helix termination. Finally, we examine the role of helix capping as a bridge linking the conformation of secondary structure to supersecondary structure.
Article
We describe predictions made using the Rosetta structure prediction methodology for the Eighth Critical Assessment of Techniques for Protein Structure Prediction. Aggressive sampling and all-atom refinement were carried out for nearly all targets. A combination of alignment methodologies was used to generate starting models from a range of templates, and the models were then subjected to Rosetta all atom refinement. For the 64 domains with readily identified templates, the best submitted model was better than the best alignment to the best template in the Protein Data Bank for 24 cases, and improved over the best starting model for 43 cases. For 13 targets where only very distant sequence relationships to proteins of known structure were detected, models were generated using the Rosetta de novo structure prediction methodology followed by all-atom refinement; in several cases the submitted models were better than those based on the available templates. Of the 12 refinement challenges, the best submitted model improved on the starting model in seven cases. These improvements over the starting template-based models and refinement tests demonstrate the power of Rosetta structure refinement in improving model accuracy.
Article
Protein secondary structure carries information about local structural arrangements. A significant majority of successful methods for predicting secondary structure are based on multiple sequence alignment. However, multiple alignment fails to achieve accurate results when a protein sequence is characterized by low homology. To this end, we propose a novel method for prediction of secondary structure content through comprehensive sequence representation. The method employs a support vector machine (SVM) regression system and adopts a different pseudo amino acid composition (PseAAC), which can partially take into account the sequence-order effects to represent protein samples. It was shown by both the self-consistency test and the independent-dataset test that the trained SVM has remarkable power in grasping the relationship between the PseAAC and the content of protein secondary structural elements, including alpha-helix, 3(10)-helix, pi-helix, beta-strand, beta-bridge, turn, bend and the remaining random coil. Results superior to or competitive with the popular methods have been obtained, which indicate that the present method may at least serve as an alternative to the existing predictors in this area.
Article
The method of secondary structure prediction was presented by Nagano (1973) on the basis of doublet analysis and tested for two proteins, adenylate kinase and phage lysozyme, when they were newly solved by X-ray crystallography. The result for phage lysozyme has revealed that the prediction of a helix which is composed of helix-indifferent residues, e.g. the region 149 to 154 (Val-Ile-Thr-Thr-Phe-Arg) of phage lysozyme (Matthews & Remington, 1974), would be very difficult by the methods based on either singlet or doublet information unless the helical wheel effect (Schiffer & Edmundson, 1967) or some triplet information was included. After such modifications have been made, it may be seen that the percentage of correct residues in helix prediction is not noticeably improved but that the strength of the helix prediction function seems to be improved as a whole, and leads to the reasonable explanation of the formation of supersecondary structures presented in the accompanying paper (Nagano, 1977). The percentages of agreement between prediction and observation for 36 proteins used in the present analysis are listed in Table 1. For the threshold values used for the analysis of super-secondary structures, the values of %res.cor. ( 40 and 41) and correlation coefficients C (Matthews, 1975) are as follows: 78·9% (C=0·528) for helix; 70·7% (C=0·396) for loop or β-turn; 80·8% (C=0·519) for β-structure.
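The correlation coefficient C (Matthews, 1975) quoted above can be computed from per-residue counts of correct and incorrect predictions for one state. The following is a minimal sketch assuming the standard definition of the coefficient; the function name and the zero-denominator convention are illustrative choices, not taken from the paper.

```python
import math

def matthews_c(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion counts for one state.

    tp/tn: residues correctly predicted in / not in the state
    fp/fn: residues wrongly predicted in / wrongly left out of the state
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a marginal count is empty
    return (tp * tn - fp * fn) / denom
```

A value of 1 means perfect agreement, 0 means no better than chance, and -1 means complete disagreement, which is why C is more informative than the raw percentage of correct residues when one state dominates.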
Article
A priori knowledge of secondary structure content can be of great use in theoretical and experimental determination of protein structure. We present a method that uses two computer-simulated neural networks placed in "tandem" to predict the secondary structure content of water-soluble, globular proteins. The first of the two networks, NET1, predicts a protein's helix and strand content given information about the protein's amino acid composition, molecular weight and heme presence. Because NET1 contained more adjustable parameters (network weights) than learning examples, this network experienced problems with memorization, which is the inability to generalize onto new, never-seen-before examples. To overcome this problem, we designed a second network, NET2, which learned to determine when NET1 was in a state of generalization. Together, these two networks produce prediction errors as low as 5.0% and 5.6% for helix and strand content, respectively, on a set of protein crystal structures bearing little homology to those used in network training. A comparison between three other methods including a multiple linear regression analysis, a non-hidden-node network analysis and a secondary structure assignment analysis reveals that our tandem neural network scheme is, indeed, the best method for predicting secondary structure content. The results of our analysis suggest that the knowledge of sequence information is not necessary for highly accurate predictions of protein secondary structure content.
Article
To study the influence of proline residues on three-dimensional structure, an analysis has been made of all proline residues and their local conformations extracted from the Brookhaven Protein Data bank. We have considered the conformation of the proline itself, the relative occurrence of cis and trans peptides preceding proline residues, the influence of proline on the conformation of the preceding residue and the conformations of various proline patterns (Pro-Pro, Pro-X-Pro, etc.). The results highlight the unique role of proline in determining local conformation.
Article
Amino acids have distinct conformational preferences that influence the stabilities of protein secondary and tertiary structures. The relative thermodynamic stabilities of each of the 20 commonly occurring amino acids in the alpha-helical versus random coil states have been determined through the design of a peptide that forms a noncovalent alpha-helical dimer, which is in equilibrium with a randomly coiled monomeric state. The alpha helices in the dimer contain a single solvent-exposed site that is surrounded by small, neutral amino acid side chains. Each of the commonly occurring amino acids was substituted into this guest site, and the resulting equilibrium constants for the monomer-dimer equilibrium were determined to provide a list of free energy difference (delta delta G degree) values.
Article
In this paper the latest protein database consisting of more than a million amino acids is analyzed to characterize the short range regularities in the primary structure. The amino acid distributions along the polypeptide chain and among the proteins have been studied first. Their influence on the amino acid pair statistics was taken into account. We are primarily interested in the distances of the covalent structure, where the amino acid pair frequencies show non-random characters. The amino acid pairs separated by at least 20 residues in the covalent structure exhibit an exact Gaussian distribution. We found that there is a range of non-random pairing in the covalent structure. We conclude that the pair preference characters are different for each of the 20 x 20 amino acid pairs. The range of the non-random pairing varies from pair to pair, and in most cases it does not extend beyond the 9th neighbour. The preferences of a certain pair in a certain position can not be derived from the character of that pair in another position. The preference values of 400 amino acid pairs are listed for up to the pairs in 9th neighbour position. Some fields of potential application of these data have also been discussed.
Article
The alpha helix, first proposed by Pauling and co-workers, is a hallmark of protein structure, and much effort has been directed toward understanding which sequences can form helices. The helix hypothesis, introduced here, provides a tentative answer to this question. The hypothesis states that a necessary condition for helix formation is the presence of residues flanking the helix termini whose side chains can form hydrogen bonds with the initial four helix >N-H groups and final four helix >C=O groups; these eight groups would otherwise lack intrahelical partners. This simple hypothesis implies the existence of a stereochemical code in which certain sequences have the hydrogen-bonding capacity to function as helix boundaries and thereby enable the helix to form autonomously. The three-dimensional structure of a protein is a consequence of the genetic code, but the rules relating sequence to structure are still unknown. The ensuing analysis supports the idea that a stereochemical code for the alpha helix resides in its boundary residues.
Article
A definition based on alpha-carbon positions and a sample of 215 alpha helices from 45 different globular protein structures were used to tabulate amino acid preferences for 16 individual positions relative to the helix ends. The interface residue, which is half in and half out of the helix, is called the N-cap or C-cap, whichever is appropriate. The results confirm earlier observations, such as asymmetrical charge distributions in the first and last helical turn, but several new, sharp preferences are found as well. The most striking of these are a 3.5:1 preference for Asn at the N-cap position, and a preference of 2.6:1 for Pro at N-cap + 1. The C-cap position is overwhelmingly dominated by Gly, which ends 34 percent of the helices. Hydrophobic residues peak at positions N-cap + 4 and C-cap - 4.
Article
We have re-evaluated the information used in the Garnier-Osguthorpe-Robson (GOR) method of secondary structure prediction with the currently available database. The framework of information theory provides a means to formulate the influence of local sequence upon the conformation of a given residue, in a rigorous manner. However, the existing database does not allow the evaluation of parameters required for an exact treatment of the problem. The validity of the approximations drawn from the theory is examined. It is shown that the first-level approximation, involving single-residue parameters, is only marginally improved by an increase in the database. The second-level approximation, involving pairs of residues, provides a better model. However, in this case the database is not big enough and this method might lead to parameters with deficiencies. Attention is therefore given to overcoming this lack of data. We have determined the significant pairs and the number of dummy observations necessary to obtain the best result for the prediction. This new version of the GOR method increases the accuracy of prediction by 7%, bringing the amount of residues correctly predicted to 63% for three states and 68 proteins, each protein to be predicted being removed from the database and the parameters derived from the other proteins. If the protein to be predicted is kept in the database the accuracy goes up to 69.7%.
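The single-residue parameters mentioned above rest on an information measure of the form I(S;R) = ln[P(S|R) / P(S)], which compares how often residue type R adopts conformation S with the overall frequency of S. The sketch below computes that quantity from raw counts; the function and argument names are hypothetical, and the published method's combination of such terms over a sequence window, as well as its pair terms and dummy observations, are not shown.

```python
import math

def residue_information(n_sr, n_r, n_s, n_total):
    """I(S;R) = ln[P(S|R) / P(S)] in nats, estimated from raw counts.

    n_sr:    residues of type R observed in state S
    n_r:     all residues of type R
    n_s:     all residues in state S
    n_total: all residues in the database
    """
    p_s_given_r = n_sr / n_r
    p_s = n_s / n_total
    return math.log(p_s_given_r / p_s)
```

Positive values indicate that the residue type favors the state and negative values that it disfavors it. With small databases the count `n_sr` can be zero, which is one reason the abstract discusses adding dummy observations.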
Article
The disrupting effect of a prolyl residue on an α-helix has been analyzed by means of conformational energy computations. In the preferred, nearly α-helical conformations of Ac-Ala4-Pro-NHMe and of Ac-Ala7-Pro-Ala7-NHMe, only the residue preceding Pro is not α-helical, while all other residues can occur in the α-helical A conformation; i.e., it is sufficient to introduce a conformational change of only one residue in order to accommodate proline in a distorted α-helix. Other low-energy conformations exist in which the conformational state of three residues preceding proline is altered considerably; on the other hand, another conformation in which these three residues retain the near-α-helical A-conformational state (with up to 26° changes of their dihedral angles ϕ and ψ, and a 48° change in one ω from those of the ideal α-helix) has a considerably higher energy. These conclusions are not altered by the substitution of other residues in the place of the Ala preceding Pro. The conformations of the peptide chain next to prolyl residues in or near an α-helix have been analyzed in 58 proteins of known structure, based on published atomic coordinates. Of 331 α-helices, 61 have a Pro at or next to their N-terminus, 21 have a Pro next to their C-terminus, and 30 contain a Pro inside the helix. Of the latter, 16 correspond to a break in the helix, 9 are located inside distorted first turns of the helix, and 5 are parts of irregular helices. Thus, the reported occurrence of prolyl residues next to or inside observed α-helices in proteins is consistent with the computed steric and energetic requirements of prolyl peptides.
Article
The occurrence of all di- and tripeptide segments of proteins was counted in a large data base containing about 119 000 residues. It was found that the abundance of the amino acids does not determine the frequency of the various di- and tripeptide segments. In addition, the frequency of the various tripeptides cannot be predicted from the observed pair-frequency values. The pair-frequency distribution of amino acids is highly asymmetrical, pairs formed from identical residues are generally preferred and amino acids cannot be clustered on the basis of their first neighbour preferences. These data indicate the existence of general short range regularities in the primary structure of proteins. The consequences of these short range regularities were studied by comparing Chou-Fasman parameters with analogous parameters determined from the results of conformational energy calculations of single amino acids. This comparison shows that Chou-Fasman parameters carry significant information about the environment of each amino acid. The success of the Chou-Fasman's prediction and the properties of the pair and triplet distribution of the amino acid residues suggest that every amino acid has a characteristic sequential residue environment in proteins. The observed preferences could be invoked, for example, in protein design or in the study of the evolutionary relationship of proteins.
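The comparison described above, between observed di- or tripeptide counts and what residue abundance alone would predict, can be sketched as follows. This assumes the simplest null model, in which residues are independent and the expected count of a segment is the product of its residue frequencies times the number of windows; the function name is illustrative and the paper's actual statistics may differ.

```python
from collections import Counter

def kmer_observed_expected(sequences, k):
    """Return {kmer: (observed, expected)} for every k-mer seen in `sequences`.

    `expected` assumes independent residues drawn with the pooled
    single-residue frequencies (an assumed null model).
    """
    residue_counts = Counter()
    kmer_counts = Counter()
    n_windows = 0
    for seq in sequences:
        residue_counts.update(seq)
        for i in range(len(seq) - k + 1):
            kmer_counts[seq[i:i + k]] += 1
            n_windows += 1
    total = sum(residue_counts.values())
    freq = {aa: c / total for aa, c in residue_counts.items()}
    result = {}
    for kmer, observed in kmer_counts.items():
        p = 1.0
        for aa in kmer:
            p *= freq[aa]
        result[kmer] = (observed, p * n_windows)
    return result
```

Segments whose observed count departs strongly from the expected one are the candidates for the short-range regularities the abstract refers to.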
Article
The folding types of 135 proteins, the three-dimensional structures of which are known, were analyzed in terms of the amino acid composition. The amino acid composition of a protein was expressed as a point in a multidimensional space spanned with 20 axes, on which the corresponding contents of 20 amino acids in the protein were represented. The distribution pattern of proteins in this composition space was examined in relation to five folding types, alpha, beta, alpha/beta, alpha + beta, and irregular type. The results show that amino acid compositions of the alpha, beta, and alpha/beta types are located in different regions in the composition space, thus allowing distinct separation of proteins depending on the folding types. The points representing proteins of the alpha + beta and irregular types, however, are widely scattered in the space, and the existing regions overlap with those of the other folding types. A simple method of utilizing the "distance" in the space was found to be convenient for classification of proteins into the five folding types. The assignment of the folding type with this method gave an accuracy of 70% in the coincidence with the experimental data.
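The composition-space idea above can be illustrated with a short sketch: each protein becomes a 20-component vector of amino acid fractions, and a query is assigned to the folding type whose reference point is nearest. The use of Euclidean distance and of one reference vector per type is an assumption made for illustration; the abstract does not specify the exact distance measure, and the names below are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each of the 20 standard amino acids in `seq`."""
    n = len(seq)
    return [seq.count(aa) / n for aa in AMINO_ACIDS]

def nearest_type(seq, reference_points):
    """Name of the reference composition closest to `seq` (Euclidean distance)."""
    comp = composition(seq)

    def distance(name):
        ref = reference_points[name]
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(comp, ref)))

    return min(reference_points, key=distance)
```

In practice the reference points would be averaged over proteins of known folding type; overlapping regions, as reported for the alpha + beta and irregular types, limit how well any distance rule of this kind can separate the classes.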
Article
During biosynthesis, a globular protein folds into a tight particle with an interior core that is shielded from the surrounding solvent. The hydrophobic effect is thought to play a key role in mediating this process: nonpolar residues expelled from water engender a molecular interior where they can be buried. Paradoxically, results of earlier quantitative analyses have suggested that the tendency for nonpolar residues to be buried within proteins is weak. However, such analyses merely classify residues as either "exposed" or "buried." In the experiment reported in this article proteins of known structure were used to measure the average area that each residue buries upon folding. This characteristic quantity, the average area buried, is correlated with residue hydrophobicity.
Article
Multiple regression is used to obtain relationships for predicting the amount of secondary structure in a protein molecule from a knowledge of its amino acid composition. We tested these relations using 18 proteins of known structure, but omitting the protein to be predicted. Independent predictions were made for the two subchains of hemoglobin and insulin. The average errors for these 20 chains or subchains are: helix ± 7.1%, beta-sheet ± 6.9%, turn ± 4.2%, and coil ± 5.7%. A second set of relations yielding somewhat inferior predictions is given for the case in which Asp and Asn, and Glu and Gln, are not differentiated. Predictions are also listed for 15 proteins for which the amino acid sequence or tertiary structure is unknown.
Article
A new predictive model for the secondary structure of globular proteins (α helix, β sheet, and β turns) is described utilizing the helix and β-sheet conformational parameters, Pα and Pβ, of the 20 amino acids computed in the preceding paper (Chou and Fasman, 1974). This simple and direct method, devoid of complex computer calculations, utilizes empirical rules for predicting the initiation and termination of helical and β regions in proteins. Briefly stated: when four helix formers out of six residues or three β formers out of five residues are found clustered together in any native protein segment, the nucleation of these secondary structures begins and propagates in both directions until terminated by a sequence of tetrapeptides, designated as breakers. These rules were successful in locating 88% of helical and 95% of β regions, as well as correctly predicting 80% of the helical and 86% of the β-sheet residues in the 19 proteins evaluated. The accuracy of predicting the three conformational states for all residues, helix, β, and coil, is 77% and shows great improvement over earlier prediction methods which considered only the helix and coil states. The β-turn conformational parameters, Pt, for all 20 amino acids are computed. Their use enables the prediction of chain reversal and tertiary folding in proteins. A procedure for predicting conformational changes in specific regions is also outlined. Despite some evidence of long-range interactions in stabilizing protein folding, the present predictive model illustrates that short-range interactions (i.e., single residue information as represented by Pα and Pβ) and medium-range interactions (i.e., neighboring residue information as represented by 〈Pα〉 and 〈Pβ〉) play the predominant role in determining protein secondary structure. Although the three-dimensional structures of only 19 proteins have been elucidated to date via X-ray studies, the amino acid sequences of hundreds of proteins have already been determined. 
Since the present predictive model is capable of delineating the helix, β, and coil regions of proteins of known sequence with 80% accuracy, application of this method will be of assistance to all those interested in studying the correlation between protein conformation and biological activity as well as an aid to crystallographers in interpreting X-ray data.
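The nucleation rule stated above (four helix formers out of six residues, or three β formers out of five) can be sketched as a sliding-window scan. This covers only the nucleation step; propagation and termination by breaker tetrapeptides are not shown. The set of former residues is passed in by the caller, and the set used in the test below is a placeholder for illustration, not the classification from the paper's tables.

```python
def nucleation_sites(seq, formers, window=6, needed=4):
    """Start indices of windows holding at least `needed` former residues.

    Defaults correspond to the helix rule (4 of 6); use window=5, needed=3
    for the beta rule. `formers` is any container of one-letter codes.
    """
    sites = []
    for i in range(len(seq) - window + 1):
        segment = seq[i:i + window]
        if sum(1 for aa in segment if aa in formers) >= needed:
            sites.append(i)
    return sites
```

Overlapping hits are returned individually; a fuller implementation would merge them into a single nucleus before extending it in both directions.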
Article
Algorithms are suggested for identifying α-helical and β-structural regions in native globular proteins. α-Helical and β-structural regions are predicted, with accuracy of ~80 and ~85% respectively, for 25 proteins, the three-dimensional structures of which have been determined by X-ray diffraction crystallography. Secondary structure is predicted in 25 proteins with unknown three-dimensional structure.
Article
The paper reveals the types of amino acid sequences of polypeptide chain regions of globular protein which form a regular (α or β) or irregular conformation in the native globule. The study was made taking into account general “architectural” principles of packing of polypeptide chains in globular proteins and considering the interactions of proteins with water molecules. An a priori theory is developed which permits the identification, in good agreement with experiment, of α-helical and β-structural regions in globular proteins from their primary structure.
Article
We describe a suite of programs, PROMOTIF, that analyzes a protein coordinate file and provides details about the structural motifs in the protein. The program currently analyzes the following structural features: secondary structure; beta- and gamma-turns; helical geometry and interactions; beta-strands and beta-sheet topology; beta-bulges; beta-hairpins; beta-alpha-beta units and psi-loops; disulphide bridges; and main-chain hydrogen bonding patterns. PROMOTIF creates postscript files showing the examples of each type of motif in the protein, and a summary page. The program can also be used to compare motifs in a group of related structures, such as an ensemble of NMR structures.
Article
The predictive limits of the amino acid composition for the secondary structural content (percentage of residues in the secondary structural states helix, sheet, and coil) in proteins are assessed quantitatively. For the first time, techniques for prediction of secondary structural content are presented which rely on the amino acid composition as the only information on the query protein. In our first method, the amino acid composition of an unknown protein is represented by the best (in a least square sense) linear combination of the characteristic amino acid compositions of the three secondary structural types computed from a learning set of tertiary structures. The second technique is a generalization of the first one and takes into account also possible compositional couplings between any two sorts of amino acids. Its mathematical formulation results in an eigenvalue/eigenvector problem of the second moment matrix describing the amino acid compositional fluctuations of secondary structural types in various proteins of a learning set. Possible correlations of the principal directions of the eigenspaces with physical properties of the amino acids were also checked. For example, the first two eigenvectors of the helical eigenspace correlate with the size and hydrophobicity of the residue types respectively. As learning and test sets of tertiary structures, we utilized representative, automatically generated subsets of Protein Data Bank (PDB) consisting of non-homologous protein structures at the resolution thresholds ≤ 1.8 Å, ≤ 2.0 Å, ≤ 2.5 Å, and ≤ 3.0 Å. We show that the consideration of compositional couplings improves prediction accuracy, albeit not dramatically. 
Whereas in the self-consistency test (learning with the protein to be predicted), a clear decrease of prediction accuracy with worsening resolution is observed, the jackknife test (leave the predicted protein out) yielded best results for the largest dataset (≤ 3.0 Å, almost no difference to the self-consistency test!), i.e., only this set, with more than 400 proteins, is sufficient for stable computation of the parameters in the prediction function of the second method. The average absolute errors in predicting the fraction of helix, sheet, and coil from amino acid composition of the query protein are 13.7, 12.6, and 11.4%, respectively, with r.m.s. deviations in the range of 8.6 to 11.8% for the 3.0 Å dataset in a jackknife test. The absolute precision of the average absolute errors is in the range of 1 to 3% as measured for other representative subsets of the PDB. Secondary structural content prediction methods found in the literature have been clustered in accordance with their prediction accuracies. To our surprise, much more complex secondary structure prediction methods utilized for the same purpose of secondary structural content prediction achieve prediction accuracies very similar to those of the present analytic techniques, implying that all the information beyond the amino acid composition is, in fact, mainly utilized for positioning the secondary structural state in the sequence but not for determination of the overall number of residues in a secondary structural type. This result implies that higher prediction accuracies cannot be achieved relying solely on the amino acid composition of an unknown query protein as prediction input. Our prediction program SSCP has been made available as a World Wide Web and E-mail service.
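The first method described above, representing a query composition as the best least-squares linear combination of characteristic compositions, can be sketched by solving the normal equations. This is an unconstrained fit written in plain Python for illustration; whether the published SSCP program constrains the weights (for example to be non-negative or to sum to one) is not stated in the abstract, and all names here are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting.

    Assumes the matrix is square and non-singular.
    """
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def content_weights(query, type_compositions):
    """Weights x minimizing |query - sum_k x[k] * type_compositions[k]|^2."""
    k = len(type_compositions)
    gram = [[sum(a * b for a, b in zip(type_compositions[i], type_compositions[j]))
             for j in range(k)] for i in range(k)]
    rhs = [sum(a * q for a, q in zip(type_compositions[i], query))
           for i in range(k)]
    return solve_linear(gram, rhs)
```

With three characteristic compositions (helix, sheet, coil) the returned weights play the role of the predicted content fractions. Real inputs would be 20-component vectors; the test uses 3-component toy vectors only to keep the arithmetic checkable.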
Article
The prediction of the secondary structure content (α-helix and β-strand content) of a globular protein may play an important complementary role in the prediction of the protein's structure. We propose a new prediction algorithm based on Chou's database [Chou (1995), Proteins Struct. Funct. Genet. 21, 319]. The new algorithm is an improved multiple linear regression method, taking the nonlinear and coupling terms of the frequencies of different amino acids into account. The prediction is also based on the structural classes of proteins. A resubstitution examination for the algorithm shows that the average errors are 0.040 and 0.033 for the prediction of α-helix content and β-strand content, respectively. The examination of cross-validation, the jackknife analysis, shows that the average errors are 0.051 and 0.044 for the prediction of α-helix content and β-strand content, respectively. Both examinations indicate the self-consistency and the extrapolative effectiveness of the new algorithm. Compared with the other methods available currently, our method has the merits of simplicity and convenience for use, as well as a high prediction accuracy. By incorporating the prediction of the structural classes, the only input of our method is the amino acid composition of the protein to be predicted.
Article
Much attention is being paid to protein databases as an important information source for proteome research. Although used extensively for similarity searches, protein databases themselves have not fully been characterized. In a systematic attempt to reveal protein-database characters that could contribute to revealing how protein chains are constructed, frequency distributions of all possible combinatorial sets of three, four, and five amino acids ("triplets," "quartets," and "pentats"; collectively called constituent sequences) have been examined in the nonredundant (nr) protein database, demonstrating the existence of nonrandom bias in their "availability" at the population level. Nonexistent short sequences of pentats were found that showed low availability in biological proteins against their expected probabilities of occurrence. Among them, six representative ones were successfully synthesized as peptides with reasonably high yields in a conventional Fmoc method, excluding the possibility that a putative physicochemical energy barrier in forming them could be a direct cause for the low availability. They were also expressed as soluble fusion proteins in a conventional Escherichia coli BL21Star(DE3) system with reasonably high yield, again excluding a possible difficulty in their biological synthesis. Together, these results suggest that information on three-dimensional structures and functions of proteins exists in the context of connections of short constituent sequences, and that proteins are composed of evolutionarily selected constituent sequences, which are reflected in their availability differences in the database. These results may have biological implications for protein structural studies.
Article
One of the interesting computational topics in bioinformatics is the prediction of the secondary structure of proteins. Over 30 years of research has been devoted to the topic, but we are still far away from having reliable prediction methods. A critical piece of information for accurate prediction of secondary structure is the helix and strand content of a given protein sequence. The ability to accurately predict the content of those two secondary structures has good potential to improve the accuracy of secondary structure prediction. Most of the existing methods use the composition vector to predict the content. Their underlying assumption is that the vector can be used to provide a functional mapping between primary sequence and helix/strand content. While this is true for small sets of proteins, we show that for larger protein sets such mappings are inconsistent, i.e., the same composition vectors correspond to different contents. To this end, we propose a method for prediction of helix/strand content from primary protein sequences that is fundamentally different from currently available methods.
Article
The prediction of protein structure from amino acid sequence is a grand challenge of computational molecular biology. By using a combination of improved low- and high-resolution conformational sampling methods, improved atomically detailed potential functions that capture the jigsaw puzzle–like packing of protein cores, and high-performance computing, high-resolution structure prediction (<1.5 angstroms) can be achieved for small protein domains (<85 residues). The primary bottleneck to consistent high-resolution prediction appears to be conformational sampling.
Article
Knowing protein structure and inferring its function from the structure are among the main issues of computational structural biology, and often the first step is studying protein secondary structure. There have been many attempts to predict protein secondary structure contents. Previous attempts assumed that the content of protein secondary structure can be predicted successfully using the information on the amino acid composition of a protein. Recent methods achieved remarkable prediction accuracy by using the expanded composition information. The overall average error of the most successful method is 3.4%. Here, we demonstrate that even if we use only the simple amino acid composition information, it is possible to improve the prediction accuracy significantly if the evolutionary information is included. The idea is motivated by the observation that evolutionarily related proteins share similar structures. After calculating the homolog-averaged amino acid composition of a protein, which can be easily obtained from the multiple sequence alignment by running PSI-BLAST, those 20 numbers are learned by a multiple linear regression, an artificial neural network and a support vector regression. The overall average error of the support vector regression method is 3.3%. It is remarkable that we obtain comparable accuracy without utilizing the expanded composition information such as pair-coupled amino acid composition. This work again demonstrates that the amino acid composition is a fundamental characteristic of a protein. It is anticipated that our novel idea can be applied to many areas of protein bioinformatics where the amino acid composition information is utilized, such as subcellular localization prediction, enzyme subclass prediction, domain boundary prediction, signal sequence prediction, and prediction of unfolded segments in a protein sequence, to name a few.
Article
There are 3,200,000 amino acid sequences of length 5 (penta-peptides). Statistically, we expect to see a distribution of penta-peptides that is determined by the frequency of the participating amino acids. We show, however, that not only are there thousands of such penta-peptides that are absent from all known proteomes, but many of them are coded for multiple times in the non-coding genomic regions. This suggests a strong selection process that prevents these peptides from being expressed. We also show that the characteristics of these forbidden penta-peptides vary among different phylogenetic groups (e.g., eukaryotes, prokaryotes, and archaea). Our analysis provides the first steps toward understanding the "grammar" of the forbidden penta-peptides.
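The figure of 3,200,000 follows from 20 amino acids at each of 5 positions (20^5). The sketch below checks that arithmetic and shows one way to list the short sequences of length k that never occur in a set of proteins; the function name is illustrative, and running it for k = 5 on a real proteome collection would need a more memory-conscious approach than this one.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def absent_kmers(sequences, k):
    """All length-k strings over the 20 standard residues not found in `sequences`."""
    seen = set()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            seen.add(seq[i:i + k])
    absent = []
    for letters in product(AMINO_ACIDS, repeat=k):
        kmer = "".join(letters)
        if kmer not in seen:
            absent.append(kmer)
    return absent

# Number of possible penta-peptides over 20 residue types.
N_PENTAPEPTIDES = len(AMINO_ACIDS) ** 5
```

Windows containing non-standard letters (such as X) are recorded in `seen` but never match an enumerated k-mer, so they do not affect the result.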
Article
Three-dimensional structure of a protein molecule is primarily determined by its amino acid sequence, and thus the elucidation of general rules embedded in amino acid sequences is of great importance in protein science and engineering. To extract valuable information from sequences, we propose an analytical method in which a protein sequence is considered to be constructed by serial superimpositions of short amino acid sequences of n amino acid sets, especially triplets (3-aa sets). Using the comprehensive nonredundant protein database, we first examined "availability" of all possible combinatorial sets of 8,000 triplet species. Availability score was mathematically defined as an indicator for the relative "preference" or "avoidance" for a given short constituent sequence to be used in protein chain. Availability scores of real proteins were clearly biased against those of randomly generated proteins. We found many triplet species that occurred in the database more than expected or less than expected. Such bias was extended to longer sets, and we found that some species of pentats (5-aa sets) that occurred reasonably frequently in the randomly generated protein population did not occur at all in any real proteins known today. Availability score was dependent on species, potentially serving as a phylogenetic indicator. Furthermore, we suggest possibilities of various biotechnological applications of characteristic short sequences such as human-specific and pathogen-specific short sequences obtained from availability analysis. Availability score was also dependent on secondary structures, potentially serving as a structural indicator. Availability analysis on triplets may be combined with a comprehensive data collection on the φ and ψ peptide-bond angles of the amino acid at the center of each triplet, i.e., a collection of Ramachandran plots for each triplet. 
These triplet characters, together with other physicochemical data, will provide us with basic information between protein sequence and structure, by which structure prediction and engineering may be greatly facilitated. Availability analysis may also be useful in identifying word processing units in amino acid sequences based on an analogy to natural languages. Together with other approaches, availability analysis will elucidate general rules hidden in the primary sequences and eventually contribute to rebuilding the paradigm of protein science.