Hindawi Publishing Corporation
Advances in Bioinformatics
Volume 2012, Article ID 849830, 5 pages
Research Article
CMD: A Database to Store the Bonding States of Cysteine Motifs
with Secondary Structures
Hamed Bostan,1 Naomie Salim,2 Zeti Azura Hussein,3
Peter Klappa,4 and Mohd Shahir Shamsir1
1Faculty of Biosciences and Bioengineering, Universiti Teknologi Malaysia, 81310 Johor Bahru, Johor, Malaysia
2Faculty of Computer Science and Information Systems, Universiti Teknologi Malaysia, 81310 Johor Bahru, Johor, Malaysia
3School of Bioscience and Biotechnology, Faculty of Science and Technology, Universiti Kebangsaan Malaysia,
43600 Bangi, Selangor, Malaysia
4School of Biosciences, University of Kent, Canterbury, Kent CT2 7NJ, UK
Correspondence should be addressed to Mohd Shahir Shamsir,
Received 30 July 2012; Accepted 6 September 2012
Academic Editor: Huixiao Hong
Copyright © 2012 Hamed Bostan et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Computational approaches to predicting the disulphide bonding state and its connectivity pattern are based on various descriptors.
One descriptor is the amino acid sequence motif flanking the cysteine residue. Despite the existence of disulphide bonding
information in many databases and applications, no complete reference with motif query is currently available. The cysteine
motif database (CMD) is the first online resource that stores all cysteine residues and their flanking motifs, together with
secondary structure and propensity value assignments derived from laboratory data. We extracted more than 3 million cysteine motifs
from PDB and UniProt data, annotated with secondary structure assignment, propensity value assignment, frequency of
occurrence, and the coefficient of their bonding status. Removal of redundancies generated 15875 unique flanking motifs that are
always bonded and 41577 unique patterns that are always nonbonded. Queries can be based on protein ID, FASTA sequence,
sequence motif, or secondary structure, individually or in batch format, using the provided APIs, which allow remote users to query
the database from third-party software and for high-throughput screening. CMD offers extensive information about
bonded and free cysteine residues and their motifs, allowing in-depth characterization of sequence motif composition.
1. Background
Disulphide bonds are formed by oxidation of two cysteine
residues in a protein and are significant to a protein's
conformational stability: they confer greater thermal and
chemical stability and stabilize structural intermediates
to ensure the correct folding pathway. However, the
connectivity of the disulphide bonds in protein sequences
can only be determined experimentally. Given this difficulty,
the ability to evaluate or predict the disulphide
bonding state and connectivity from the sequence would
prove highly valuable in engineering proteins for
biotechnological and medical applications. Computational
approaches towards disulphide connectivity prediction have
been based on various descriptors. One of these descriptors
is the sequence motif generated by combining the flanking
residues on either side of the cysteine residue [1, 2].
These immediate residues flanking the cysteine have
been shown to influence the cysteine's redox potential and
steric accessibility [3]. These sequence motifs
have been fed into various prediction methods [4], such as
machine learning approaches (e.g., statistical methods, neural
networks (NNs) [5], and support vector machines (SVMs) [6–8]),
including tools such as DiANNA [3], DISULFIND [9], DCON [10], and
CysView [11]. Currently, cysteine motifs are extracted
by parsing data from protein databases and feeding them into
the prediction tools. Motivated by the absence of a dedicated database
and by the usefulness of cysteine flanking motifs in predicting
the cysteine bonding state and connectivity, we
have developed the cysteine motif database (CMD) as a tool to
mine and store these motifs. The creation of CMD allows
motif extraction and facilitates the study of their secondary
structures, bonding and connectivity propensities. In this
paper, we present CMD as a publicly available tool that
complements existing prediction tools.
2. Construction and Content
2.1. Content. The CMD data was compiled from the Protein
Data Bank (PDB) and UniProt. For each databank, two different
datasets were created: a complete protein dataset and a
second dataset of nonhomologous unique sequences
(sequences with 100% similarity were omitted). CMD provides
both datasets for PDB and for UniProt, allowing
researchers to use the database in its entirety (73656
proteins for PDB and 531462 proteins for UniProt) or to
include only unique sequences (33874 for PDB and 140723
for UniProt). Using these datasets, we extracted 878,000
cysteine motifs based on the first to fifth flanking
residues on each side of the cysteine, as these immediate residues are
close enough to exert influence on the cysteine (Table 1).
The assignment of the bonding state of cysteine residues
and their bonding partners is based on the SSBOND and
DISULPHIDE BOND tags in the PDB and UniProt files, respectively.
The motifs were clustered according to the occurrence of the
bonding state, that is, always bonded, always nonbonded,
and both bonded and nonbonded (nonbonded state with
another cysteine or to other atoms such as metals). Each
bonded cysteine is also mapped to its inter- or intrachain
disulphide bond partner.
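As an illustration of how bonded cysteines and their partners might be read from the PDB side, the following Python sketch parses SSBOND records using the fixed column positions of the PDB file format and labels each bond as intrachain or interchain by comparing chain IDs. The names `DisulphideBond`, `parse_ssbond_records`, and `format_ssbond` are hypothetical and are not part of CMD; insertion codes and symmetry operators are ignored for brevity, and the residue numbers in the example are invented.

```python
from typing import List, NamedTuple


class DisulphideBond(NamedTuple):
    """One disulphide bond as listed in a PDB SSBOND record (hypothetical type)."""
    chain1: str
    resnum1: int
    chain2: str
    resnum2: int

    @property
    def kind(self) -> str:
        """'intrachain' if both cysteines share a chain ID, else 'interchain'."""
        return "intrachain" if self.chain1 == self.chain2 else "interchain"


def parse_ssbond_records(pdb_text: str) -> List[DisulphideBond]:
    """Collect the cysteine pairs from all SSBOND lines in a PDB file's text."""
    bonds = []
    for line in pdb_text.splitlines():
        # SSBOND records are fixed-width; skip other records and truncated lines.
        if not line.startswith("SSBOND") or len(line) < 35:
            continue
        bonds.append(
            DisulphideBond(
                chain1=line[15],           # column 16: chain ID of first cysteine
                resnum1=int(line[17:21]),  # columns 18-21: its residue number
                chain2=line[29],           # column 30: chain ID of second cysteine
                resnum2=int(line[31:35]),  # columns 32-35: its residue number
            )
        )
    return bonds


def format_ssbond(serial: int, chain1: str, res1: int, chain2: str, res2: int) -> str:
    """Build a minimal SSBOND line for examples (illustration helper only)."""
    return f"SSBOND {serial:>3} CYS {chain1} {res1:>4}    CYS {chain2} {res2:>4}"


if __name__ == "__main__":
    # Invented records: one bond within chain A, one between chains A and B.
    example = "\n".join([
        format_ssbond(1, "A", 6, "A", 127),
        format_ssbond(2, "A", 30, "B", 115),
    ])
    for bond in parse_ssbond_records(example):
        print(bond, bond.kind)
```

Any cysteine in the chain sequence that does not appear in such a record would then be treated as nonbonded under this reading.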
The motifs were categorized as inter- or intradomain,
and the secondary structure assignment for each
motif sequence (if available) was determined using secondary
structure reference files retrieved from PDB.
2.2. Construction. The data contained in CMD is stored
in a Microsoft SQL Server 2005 database.
Cysteine motif pattern tables are indexed on protein
ID, motif, chain number, and secondary structure to
improve query performance. Table-based
partitioning was used to increase the flexibility and
performance of the motif data tables. These tables store over
three million motifs, which can be queried and
processed. All preprocessing, data extraction, and insertion
of motif sequences and their secondary structures were
carried out on the .NET 4.0 platform using the C# programming
language. The web interface of CMD is based on ASP.NET
integrated with Ajax technology to provide a
robust, simple, and user-friendly environment for end users.
The web application is hosted on an Internet Information
Services (IIS) HTTP server, version 7.5.7600.16385. CMD
will be updated automatically with the latest data from PDB and
UniProt.
In addition, several APIs available in CMD enable
developers to query the database remotely and embed
the results in their own systems independently. A complete
list of available APIs, together with the method of inline
implementation, is available in the FAQ section of the CMD website.
Table 1
                     PDB   PDB (NH)   UniProt   UniProt (NH)
Proteins           73656      33874    531462         140723
Patterns          535544     230213   2509611         966374
Bonded motifs     148505      64246    189238         113365
Nonbonded motifs  387039     165967   2320373         853009
Intrachain         84591      36473
Interchain          4013       1900
NH: nonhomologous unique sequences, that is, the dataset after removal of
100% similar sequences.
3. Data Update
Using the RCSB and UniProt APIs, the software will retrieve
all the protein IDs available in these resources.
A query will list all the protein IDs already present in our local
dataset. New protein IDs will be identified by comparing the
two lists. Using the RCSB and UniProt FTP services, all
the newly identified protein files will be downloaded
by protein ID to our local server. As in our method of
preprocessing and data set preparation, all SEQRES and
All cysteine motifs based on the first to fifth
flanking residues on each side (neighboring
residues) will be captured and extracted as data records
with the cysteine in the middle. Each record contains the
motif sequence, chain ID, cysteine residue position in the
sequence, bonding status of the cysteine residue, and the protein
ID as the reference. Each record will be inserted into our
database. A log will be generated for each successful procedure
or any run-time error.
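The extraction step described above can be sketched in Python as follows. This is a minimal illustration rather than the CMD implementation (which was written in C#): the record type `MotifRecord`, the function `extract_motifs`, and the rule of skipping windows that would run past either end of the chain are assumptions made for the example, as are the protein ID, chain ID, and bonded positions used in the demonstration.

```python
from dataclasses import dataclass
from typing import Iterator, Set


@dataclass(frozen=True)
class MotifRecord:
    """One extracted motif with the fields listed in the text (hypothetical type)."""
    protein_id: str
    chain_id: str
    position: int  # 1-based position of the central cysteine in the chain
    flank: int     # number of flanking residues taken on each side (1 to 5)
    motif: str     # flanking residues with the cysteine in the middle
    bonded: bool   # bonding status of the central cysteine


def extract_motifs(
    protein_id: str,
    chain_id: str,
    sequence: str,
    bonded_positions: Set[int],
    max_flank: int = 5,
) -> Iterator[MotifRecord]:
    """Yield one record per cysteine and flank size with a complete window.

    Assumption: windows that would extend past either end of the chain are
    skipped; the source does not say how CMD treats cysteines near the termini.
    """
    for index, residue in enumerate(sequence):
        if residue != "C":
            continue
        position = index + 1
        for flank in range(1, max_flank + 1):
            start, end = index - flank, index + flank + 1
            if start < 0 or end > len(sequence):
                break  # larger flanks cannot fit either
            yield MotifRecord(
                protein_id=protein_id,
                chain_id=chain_id,
                position=position,
                flank=flank,
                motif=sequence[start:end],
                bonded=position in bonded_positions,
            )


if __name__ == "__main__":
    # Invented input: a short fragment with two cysteines, both marked bonded.
    for record in extract_motifs("EXAMPLE", "A", "APWCGHCKAL", {4, 7}):
        print(record)
```

With three flanking residues on each side, the fragment in the example yields the seven-residue motifs APWCGHC and CGHCKAL, the same window shape as the motifs shown in Table 2.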
4. Utility and Discussion
4.1. User Interface. The CMD website features an interactive
and comprehensive cysteine motif query engine that supports
different search keywords, such as protein IDs and motif
sequences in the FASTA format. Users can filter results according
to whether proteins are mutated or engineered. All
results can be downloaded as text and CSV for further
analysis (Figures 1, 2, and 3).
4.2. Utility: Example Applications. CMD facilitates studies
focused on cysteine disulphide bonding status prediction
and analysis by processing the data. Here we present two
applications of our system that illustrate the potential of
CMD in greater detail.
4.2.1. Application 1: Statistical Analysis of Bonding State.
To analyze the predictive power of CMD, we investigated
the cysteine bonding pattern of human protein disulphide
isomerase (PDII, P07237 [UniParc]). PDI catalyses the
formation (oxidation) and rearrangement (isomerisation)
of disulphide bonds during the folding of secretory and
Figure 1: Annotated diagram describing the search options for the
“Search By ID” section. (A) Users can choose either PDB or SwissProt.
(B) Users can enter single or multiple protein IDs separated
by commas (,) as keywords. (C) Users can choose which of the results
appear in the output.
membrane-bound proteins (for review see [12]), thus stabilising
the native structure of these proteins. PDI contains
two domains with high sequence homology to thioredoxin.
One of these thioredoxin motifs is found at positions 52–55,
while the second motif is located at positions 396–399.
The active site cysteine residues in the thioredoxin motifs
are essential for the oxidase/isomerase activity of PDI. In
each motif, the two cysteine residues within the sequence
WCGHC can potentially form a disulphide bond.
To investigate whether both thioredoxin motifs have
similar disulphide bond propensities, that is, whether both
thioredoxin motifs are in the same bonded form, we
analysed the disulphide bonding pattern with CMD
(Figure 4 and Table 2). Our analysis predicted that the first
thioredoxin motif around residues 52–55 indeed forms an
intradomain disulphide bond; the second cysteine residue
in the sequence CGHCKAL has a very high propensity of
forming a disulphide bond with the first cysteine residue.
However, the second thioredoxin motif is not predicted to
be disulphide bonded, since the second cysteine residue in
the sequence CGHCKQL has zero propensity of forming a
disulphide bond with the first cysteine residue in this motif.
We therefore predict that the two thioredoxin motifs in PDI
are in different bonding states: while the first WCGHC
motif is in the oxidized and thus disulphide-bonded form,
the second thioredoxin motif is in the reduced form. From
this analysis we conclude that the two thioredoxin motifs
in PDI have different reduction potentials. This result is in
excellent agreement with the findings of Chambers and co-workers
[13], who showed that the two thioredoxin motifs
react differently to Ero1α, the in vivo oxidant of PDI.
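The coefficient column of Table 2 is consistent with the ratio of bonded occurrences to total occurrences of a motif (for example, 5/12 ≈ 0.417). Assuming that reading, the propensities quoted in this analysis can be reproduced with a short Python sketch. The function name `bonding_coefficient` is hypothetical, and returning 0 for a motif with no occurrences is an assumption based on the all-zero rows of the table.

```python
def bonding_coefficient(bonded: int, total: int) -> float:
    """Fraction of a motif's occurrences in which the queried cysteine is bonded.

    Assumption: a motif that was never observed (total == 0) gets a
    coefficient of 0, mirroring the all-zero rows in Table 2.
    """
    if bonded < 0 or total < 0 or bonded > total:
        raise ValueError("expected 0 <= bonded <= total")
    if total == 0:
        return 0.0
    return bonded / total


if __name__ == "__main__":
    # (position, motif, total occurrences, bonded occurrences) from Table 2.
    table2_rows = [
        (52, "APWCGHC", 12, 5),
        (55, "CGHCKAL", 1, 1),
        (396, "APWCGHC", 12, 5),
        (399, "CGHCKQL", 2, 0),
    ]
    for position, motif, total, bonded in table2_rows:
        coefficient = bonding_coefficient(bonded, total)
        print(f"{position:>4} {motif} {coefficient:.3f}")
```

Note that the coefficients of 1 and 0 for CGHCKAL and CGHCKQL rest on one and two observations respectively, so they are best read as counts-based propensities rather than as precise probabilities.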
4.2.2. Application 2: Protein Identification and Motif Exploration.
The catalytic functions of some enzymes
depend on the oxidation and reduction state of
their cysteine residues. The oxidation of cysteine residues
and formation of disulphide bonds take place in an oxidizing
environment. In prokaryotes, disulphide bonds are mainly
formed in the periplasmic space outside the membrane.
In contrast, the formation of disulphide bonds takes place
in the endoplasmic reticulum (ER) in eukaryotes. As a result,
proteins with stable disulphide bonds rarely reside in the
Figure 2: Annotated diagram of the “Search By FASTA Sequence”
section showing all search options and filtering criteria. (A) Users
can choose either PDB or SwissProt. (B) Users can enter single or
multiple FASTA sequences to be investigated for the motifs they contain.
(C) Users can also upload a FASTA format file to be investigated.
(D) Users can choose the number of amino acid residues on each
side of the cysteine for the motif extraction process within the FASTA
sequence. (E) Users can filter the proteins in which the motif will be
investigated. Users can specify whether the protein was engineered or
mutated and choose whether the protein contains any DNA or RNA
link. They can also filter out similar proteins and keep only one
copy of each for advanced investigations.
Figure 3: Annotated diagram describing the result annotations
for the “Search By Molecule Name” section. (A) The
motifs, secondary structure, cysteine position in the sequence, and
chain name. (B) The propensity values of the motif
sequence. (C) The navigation pane, which facilitates access to protein IDs
with common and similar features. (D) Detailed list of the pair patterns
existing in the protein. (E) Summary of bonding for
the selected protein.
cytoplasm. This knowledge applies on a larger scale:
the local and global profile of each protein's environment,
its folding localization, and its classification could all
contribute to disulphide bonding prediction.
CMD offers the user a unique ability to identify and
mine all known proteins using a specific motif sequence and to
explore their classification, motif sequences, structure, and
bonding status. During the creation of the datasets, we
discovered 15875 unique motifs that are always bonded
Figure 4: Query for full-length human protein disulphide isomerase
(PDII, P07237 [UniParc]). (A) Screenshot of parameters for CMD.
Table 2: Edited output from (A). The bold rows indicate the
second active site cysteine residues in the respective thioredoxin
motif. Column 1 (Thioredoxin motif) was added for additional
clarification. The cysteine residue in italics indicates the queried
cysteine residue, the respective position of which is given in the
second column.
Thioredoxin motif  Position  Motif    Total  Bond  Coefficient
                                          0     0            0
1                  52        APWCGHC     12     5        0.417
1                  55        CGHCKAL      1     1            1
                                          0     0            0
                                          0     0            0
2                  396       APWCGHC     12     5        0.417
2                  399       CGHCKQL      2     0            0
(EATLRCWALGF with the highest occurrence) and 41577
unique patterns that are always nonbonded (ALSVPCSDSKA
with the highest occurrence) for the five flanking residues
that can be utilized for cysteine state prediction. The number
of these unique motifs is considerably higher than the
number of motifs previously used in cysteine bond prediction [3, 14]
and is not limited to specific genomes [15].
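The partition into always-bonded and always-nonbonded motifs reported above can be illustrated with a small Python sketch that groups observed motifs by how consistently their central cysteine is bonded. The function name `cluster_by_bonding_state`, the cluster labels, and the toy observations are hypothetical and only show the grouping logic, not the CMD implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def cluster_by_bonding_state(
    observations: Iterable[Tuple[str, bool]],
) -> Dict[str, Set[str]]:
    """Group motifs into always bonded, always nonbonded, or mixed.

    Each observation is (motif, bonded), one per occurrence of the motif.
    """
    counts = defaultdict(lambda: [0, 0])  # motif -> [bonded count, total count]
    for motif, bonded in observations:
        counts[motif][1] += 1
        if bonded:
            counts[motif][0] += 1

    clusters: Dict[str, Set[str]] = {
        "always_bonded": set(),
        "always_nonbonded": set(),
        "mixed": set(),
    }
    for motif, (bonded_count, total_count) in counts.items():
        if bonded_count == total_count:
            clusters["always_bonded"].add(motif)
        elif bonded_count == 0:
            clusters["always_nonbonded"].add(motif)
        else:
            clusters["mixed"].add(motif)
    return clusters


if __name__ == "__main__":
    # Invented observations for three short motifs.
    toy = [
        ("ACA", True), ("ACA", True),
        ("GCG", False),
        ("TCT", True), ("TCT", False),
    ]
    for label, motifs in cluster_by_bonding_state(toy).items():
        print(label, sorted(motifs))
```

Motifs in the first two groups are the ones that carry an unambiguous signal for cysteine state prediction; the mixed group would need additional descriptors to be resolved.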
4.3. Data Availability. The CMD databases are accessible
through a web portal. The
entire database with annotations is available for download
in SQL format, describing the relations between classes
and fragments. As an additional service for programmers and
third-party developers, all queries available in CMD are freely
accessible through web services and web application
programming interfaces (APIs). For automated high-throughput
querying, all information contained in the CMD
database can also be downloaded using FTP services.
5. Discussion
The combined CMD data on bonded and free cysteine motifs
aims to fill a gap in available query resources, allowing
in-depth characterization of composition propensity and its
role in determining the bonding state. Although bonding
information on cysteine residues in proteins is available
in many databases, and several applications focus on
predicting disulphide bridge formation, no complete
reference with a proper form of representation and analysis
is available at the moment. The database is automatically
updated from PDB and UniProt and currently contains
878000 cysteine motifs, with more than 77,000 unique
cysteine motifs and cysteine pairing motifs. The compilation
of these cysteine motifs together with their secondary
structures and propensity value assignments, and the ability
to query using protein IDs and motif sequences, is a novel
and significant feature compared with prior prediction work, which
used considerably smaller datasets [3]. In addition to the
motif query tool, CMD has several other novel features,
such as the inclusion of UniProt data, the distinction between
inter- and intrachain disulphide bonds and between inter- and intradomain
bonds, and application programming interfaces (APIs)
for interfacing with other bioinformatics tools.
6. Conclusion
CMD is useful for analyzing cysteine/disulphide
bond formation and motif sequence composition
by providing (1) a query tool for cysteine
motifs based on a comprehensive cysteine motif database
curated from PDB and UniProt, (2) secondary structure
and propensity value assignments for each motif sequence,
and (3) datasets with detailed information on the motifs, such
as occurrence frequency and amino acid propensity
values. We believe that CMD will be most useful as a query
tool that complements other protein 3D structural
databases and similar motif-based prediction tools.
Availability and Requirements
The CMD database is available to the public for free at Contact:
Ministry of Science, Technology and Innovation (MOSTI)
Grant no. 07-05-MGI-GMB007.
Conflict of Interests
The authors declare that they have no conflict of interests.
The authors would like to acknowledge Chew Teong Han for
the support throughout the development of CMD.
[1] S. M. Muskal, S. R. Holbrook, and S. H. Kim, “Prediction
of the disulfide-bonding state of cysteine in proteins,” Protein
Engineering, vol. 3, no. 8, pp. 667–672, 1990.
[2] M. H. Mucchielli-Giorgi, S. Hazout, and P. Tufféry, “Predicting
the disulfide bonding state of cysteines using protein descriptors,”
Proteins, vol. 46, no. 3, pp. 243–249, 2002.
[3] F. Ferrè and P. Clote, “DiANNA 1.1: an extension of the
DiANNA web server for ternary cysteine classification,” Nucleic
Acids Research, vol. 34, pp. W182–W185, 2006.
[4] R. Singh, “A review of algorithmic techniques for disulfide-bond
determination,” Briefings in Functional Genomics and
Proteomics, vol. 7, no. 2, pp. 157–172, 2008.
[5] J. Song, Z. Yuan, H. Tan, T. Huber, and K. Burrage, “Predicting
disulfide connectivity from protein sequence using multiple
sequence feature vectors and secondary structure,” Bioinformatics,
vol. 23, no. 23, pp. 3147–3154, 2007.
[6] Y. C. Chen, Y. S. Lin, C. J. Lin, and J. K. Hwang, “Prediction
of the bonding states of cysteines using the support vector
machines based on multiple feature vectors and cysteine state
sequences,” Proteins, vol. 55, no. 4, pp. 1036–1042, 2004.
[7] P. L. Martelli, P. Fariselli, and R. Casadio, “Prediction of
disulfide-bonded cysteines in proteomes with a hidden neural
network,” Proteomics, vol. 4, no. 6, pp. 1665–1671, 2004.
[8] C. H. Tsai, B. J. Chen, C. H. Chan, H. L. Liu, and C. Y. Kao,
“Improving disulfide connectivity prediction with sequential
distance between oxidized cysteines,” Bioinformatics, vol. 21,
no. 24, pp. 4416–4419, 2005.
[9] A. Ceroni, A. Passerini, A. Vullo, and P. Frasconi, “Disulfind:
a disulfide bonding state and cysteine connectivity prediction
server,” Nucleic Acids Research, vol. 34, pp. W177–W181, 2006.
[10] A. Vullo and P. Frasconi, “Disulfide connectivity prediction
using recursive neural networks and evolutionary information,”
Bioinformatics, vol. 20, no. 5, pp. 653–659, 2004.
[11] J. Lenffer, P. Lai, W. El Mejaber et al., “CysView: protein
classification based on cysteine pairing patterns,” Nucleic Acids
Research, vol. 32, supplement, pp. W350–W355, 2004.
[12] F. Hatahet and L. W. Ruddock, “Protein disulfide isomerase: a
critical evaluation of its function in disulfide bond formation,”
Antioxidants and Redox Signaling, vol. 11, no. 11, pp. 2807–
2850, 2009.
[13] J. E. Chambers, T. J. Tavender, O. B. V. Oka, S. Warwood,
D. Knight, and N. J. Bulleid, “The reduction potential of the
active site disulfides of human protein disulfide isomerase
limits oxidation of the enzyme by Ero1α,” Journal of Biological
Chemistry, vol. 285, no. 38, pp. 29200–29207, 2010.
[14] P. Baldi, J. Cheng, and A. Vullo, “Large-scale prediction of
disulphide bond connectivity,” Advances in Neural Information
Processing Systems, no. 17, pp. 97–104, 2005.
[15] B. D. O’Connor and T. O. Yeates, “GDAP: a web tool for
genome-wide protein disulfide bond prediction,” Nucleic Acids
Research, vol. 32, pp. W360–W364, 2004.