Vol. 26 no. 16 2010, pages 1965–1974
A highly accurate statistical approach for the prediction of
Thomas C. Freeman, Jr. and William C. Wimley∗
Department of Biochemistry, Tulane University Health Sciences Center, New Orleans, LA 70112, USA
Associate Editor: Anna Tramontano
Advance access publication June 10, 2010
special structural class of proteins predominately found in the
outer membranes of Gram-negative bacteria, mitochondria and
chloroplasts. TMBBs are surface-exposed proteins that perform a
variety of functions ranging from nutrient acquisition to osmotic
for use in vaccine or drug therapy development. However, membrane
proteins, such as TMBBs, are notoriously difficult to identify and
characterize using traditional experimental approaches and current
prediction methods are still unreliable.
Results: A prediction method based on the physicochemical
properties of experimentally characterized TMBB structures was
databases. The Freeman–Wimley prediction algorithm developed in
this study has an accuracy of 99% and MCC of 0.748 when using the
most efficient prediction criteria, which is better than any previously
Availability: The MS Windows-compatible application is available
for download at http://www.tulane.edu/∼biochem/WW/apps.html
Supplementary information: Supplementary data are available at
Received on December 6, 2009; revised on May 24, 2010; accepted
on June 4, 2010
The transmembrane β-barrel (TMBB) is one of two major structural
classes of membrane-spanning proteins; TM helical bundles are
the other. TMBBs are found in the outer membranes of Gram-
negative bacteria, mitochondria and chloroplasts, while TM helical
bundles are found in the cytoplasmic membranes of all living
organisms. Although genes that encode TMBBs are estimated to
represent at least 3% of all protein-coding genes in Gram-negative
bacteria, TMBBs represent only 1% of the solved protein structures
from Gram-negative organisms. As a rapidly expanding number of
genomic sequences become available, using in silico methods to
identify previously unknown TMBBs is an appealing alternative
to more difficult and time-consuming experimental methods such
as crystallography. Computational TMBB prediction methods
can identify candidate genes in order to perform experimental
validation or structural proteomics on a more focused population.
∗To whom correspondence should be addressed.
These methods also provide the opportunity to identify and
characterize TMBBs that may not be expressed under standard
screening methods such as proteomic analysis.
Computational prediction methods have been used to predict
TM helices with an accuracy of 99% for nearly a decade. TM
helices are simple stretches of 19–25 hydrophobic residues, which
can be predicted with near-perfect accuracy using experimentally
determined hydrophobicity scales; an example of such a program
is MPEX (Jayasinghe et al., 2001; Snider et al., 2009). However,
the prediction of TMBBs presents a more difficult challenge due
to the cryptic nature of the TMBB structure (Wimley, 2002). The
in a cylindrical geometry forming a structure that resembles a
barrel (Schulz, 2000). The TM β-strands of TMBBs consist of
∼10 amino acids arranged in an alternating, dyad repeat pattern
of hydrophobic and hydrophilic residues, where the hydrophobic
side-chains face the lipid environment and the hydrophilic side-
chains face the interior of the β-barrel. The β-hairpin, which is the
major structural unit of the TMBB, is a pair of anti-parallel TM
β-strands connected by a short loop of 3–7 residues (i.e., hairpin
turn). The β-hairpins are connected to each other by loops of varying
length. The complexities and irregularities in the structure including
the variations in loop length and composition, deviations from the
content (e.g., only five hydrophobic residues in a TM strand) make
the identification of TMBBs especially problematic (Wimley, 2003).
There is a wide variety of TMBB prediction algorithms that
The distinguishing variables, as interpreted by the algorithm, are
used as rules to classify a test sequence (Gromiha and Suwa, 2006).
Although these methods can yield reasonable TMBB prediction
accuracies (64–97%), their predictions are still less reliable than
those made for TM helical bundles (Gromiha and Suwa, 2006;
Hu and Yan, 2008). Besides achieving less than ideal prediction
accuracy, a major disadvantage of using a machine learning method
is that it cannot be used for hypothesis testing because the variables
used to make the predictions are either hidden or arbitrary, thus there
properties of the experimentally solved TMBB structures.
A TMBB prediction algorithm based on the physicochemical
algorithm is based on an analysis of the structure and composition
© The Author 2010. Published by Oxford University Press. All rights reserved. For Permissions, please email: firstname.lastname@example.org
of known TMBBs. The algorithm identifies the positions of TM
β-strands using a simple pattern-recognition scheme, which utilizes
the statistical amino acid abundance data derived from known
structures. The observed amino acid abundances from the TM
β-strands are compared to the expected genomic abundance, and
the difference between the two abundances yields information about
long β-strands with dyad repeat patterns. Next, adjacent β-strands
are scored for β-hairpin-forming potential, and the β-hairpin score
data is used in a function to give a protein sequence a single β-barrel
score. The β-barrel score is a rating of the overall propensity of the
sequence to fold into a TMBB.
The initial goal of this work was to rigorously evaluate the
performance of this algorithm since it was intended to make
predictions for genomic sequences, which will be listed in an
annotated database. The performance of the original algorithm was
evaluated using a non-redundant protein database (NRPDB) with
14238 proteins of known structure from the Protein Data Bank
(PDB; Berman et al., 2000). Each sequence was given a β-barrel
score, which was used as a threshold-dependent binary classifier
to identify each sequence as either a TMBB or non-TMBB. Using
the NRPDB as a stringent test set, the performances of the original
prediction algorithm, as well as other prediction algorithms, were
unsatisfactory because they had very large rates of false positive
The algorithm described in this work was developed to address
the specific weaknesses in the ability of the original algorithm to
discriminate against non-TMBBs. The modified algorithm, which
we call the Freeman–Wimley algorithm, showed a substantial
improvement in accuracy, from 87% to 99%, when analyzing the NRPDB.
The accuracy of the Freeman–Wimley algorithm is comparable to
the accuracy of TM helix prediction and exceeds the accuracy of
other TMBB prediction methods. Furthermore, an analysis of the
Escherichia coli genome has revealed that the Freeman–Wimley
algorithm is more efficient at distinguishing TMBBs from non-
TMBBs in genomic databases compared to the NRPDB. This work
represents significant progress in the computational identification of
genomic TMBB sequences.
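The two performance measures quoted in this work, accuracy and the Matthews correlation coefficient (MCC; Matthews, 1975), can both be computed from the four counts of a binary confusion matrix. The following is a minimal sketch of that calculation for a threshold-dependent classifier; the function names and the example counts are illustrative assumptions and are not taken from the study.

```python
import math


def classify(scores, threshold):
    """Call a sequence a TMBB when its score meets the threshold."""
    return [s >= threshold for s in scores]


def confusion_counts(predicted, actual):
    """Return (tp, tn, fp, fn) for paired boolean predictions and labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all sequences that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: a database with very few true positives.
# Calling every sequence a non-TMBB is 99% accurate but has an MCC of 0.
print(accuracy(0, 990, 0, 10), mcc(0, 990, 0, 10))
```

Because TMBBs are a small minority of any structural or genomic database, accuracy alone can look excellent for a classifier that finds none of them, which is why the MCC is a useful companion measure.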
An NRPDB was constructed from the seqres text file available on the ftp site
of the PDB (ftp://snapshots.rcsb.org/20080107/pub/pdb/derived_data/). The
corresponding 50% clustering file (ftp://snapshots.rcsb.org/20080107/pub/
pdb/derived_data/NR/) was used to select a set of protein sequences that
were 50% or less identical to all other proteins. The database was further
refined by the exclusion of proteins outside the chain length constraints of
the prediction algorithm, i.e. between 60 and 4000 residues long, limiting
the total number of members in the database to 14238.
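The database construction described above amounts to two filters: one representative per 50% identity cluster, followed by a chain-length cut. A minimal sketch is given below. It assumes the sequences have already been parsed into a dictionary keyed by chain identifier and the clustering file into a list of clusters, and that the first member of each cluster is taken as its representative; these data structures and that choice of representative are assumptions for illustration, not details taken from the PDB files.

```python
MIN_LEN = 60    # shortest chain accepted by the prediction algorithm
MAX_LEN = 4000  # longest chain accepted by the prediction algorithm


def build_nonredundant_set(sequences, clusters):
    """Keep one representative per cluster, then apply the length limits.

    sequences: dict mapping chain id -> amino acid sequence (assumed input)
    clusters:  list of lists of chain ids, one list per cluster (assumed input)
    """
    selected = {}
    for members in clusters:
        if not members:
            continue
        representative = members[0]  # assumption: first member represents the cluster
        seq = sequences.get(representative)
        if seq is not None and MIN_LEN <= len(seq) <= MAX_LEN:
            selected[representative] = seq
    return selected
```

For example, a cluster whose representative has only 10 residues contributes nothing to the final set, while a cluster of two 100-residue chains contributes exactly one entry.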
2.2 TMBB structural analysis and amino acid
A total of 22 non-redundant (≤40% identical) TMBBs were analyzed
for structural bioinformatic data (listed in Supplementary Table S1) as
was previously done by Wimley (2002). Briefly, transformation of PDB
coordinates to a bilayer plane was performed essentially as done by Wimley
except the software used was the Accelrys DS Viewer available as a free
Fig. 1. Analysis of TMBB structures. (A) The 3D coordinates of the
structures were transformed to a bilayer plane as described in methods.
The aromatic residues, shown in space-filling modeling, were used among
other cues to identify the TM domain. (B) The internal- and external-facing
residues were identified in each TM strand along with the respective distance
acids were calculated in 4 structural subdomains.
download. The hydrophobicity profile used to center the TM section of each
TMBB was generated by calculating the average hydrophobicity of the
external residues using the Wimley–White hydrophobicity scale (White and
Wimley, 1998). The average hydrophobicity within a 5-Å sliding window
was calculated along the Y-axis using the structural Y-coordinates of the
β-carbons (except for glycine where the α-carbon was used). The midpoint
of the hydrophobic surface was used to transform the XYZ coordinates of
a structure to a bilayer plane centered at 0 Å; the distance of the residues
from that center was used to determine if they were located in the core
region (0–6.5 Å) or in the interfacial region (>6.5–13.5 Å) (see Fig. 1).
The resulting raw abundance values were normalized by comparison to the
expected genome-wide abundance values (Supplementary Table S2). The
abundances determined in this analysis were averaged with those generated
by Wimley, weighting each group by the respective number of amino acids
that contributed to the value calculation.
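The geometric steps in this analysis can be illustrated with a short sketch: averaging hydrophobicity within a 5-Å window along the Y-axis, assigning a residue to the core or interfacial region from its distance to the bilayer center, and combining two abundance estimates weighted by the number of amino acids behind each. The function names and the input layout (a list of Y-coordinate and hydrophobicity pairs) are assumptions made for illustration; the original analysis was carried out with other software.

```python
CORE_MAX = 6.5        # core region: 0 to 6.5 Å from the bilayer center
INTERFACE_MAX = 13.5  # interfacial region: >6.5 to 13.5 Å from the center


def window_hydrophobicity(points, center, width=5.0):
    """Mean hydrophobicity of residues whose Y-coordinate lies in the window.

    points: list of (y_coordinate, hydrophobicity) pairs (assumed layout)
    """
    half = width / 2.0
    values = [h for y, h in points if abs(y - center) <= half]
    return sum(values) / len(values) if values else None


def classify_region(y):
    """Assign a residue to 'core', 'interface' or None from its Y-coordinate."""
    distance = abs(y)
    if distance <= CORE_MAX:
        return "core"
    if distance <= INTERFACE_MAX:
        return "interface"
    return None


def weighted_average(value_a, n_a, value_b, n_b):
    """Average two abundance values, weighting each by its residue count."""
    return (value_a * n_a + value_b * n_b) / (n_a + n_b)
```

With this weighting, an abundance of 1.0 derived from 100 residues combined with an abundance of 2.0 derived from 300 residues gives 1.75, so the larger data set dominates the averaged value.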
The TMBB prediction algorithm used was based on the method previously
published by this lab (Wimley, 2002) with some modifications. Sequences
shorter than 60 or longer than 4000 residues were excluded; these limits
were set because sequences with fewer than 60 residues most likely cannot
fold into TMBBs, which must have at least eight β-strands, and sequences
longer than 4000 residues are uncommon and unlikely to be TMBBs (all
of the known TMBBs are shorter than 1000 residues). Sequences were
assigned abundance values (Fig. 1) in an alternating (dyad repeat) pattern of
internal/external and external/internal using the core and interfacial values
for the respective surfaces resulting in two separate abundance assignments
(see Fig. 2 and Supplementary Fig. S1). The β-strand scores were calculated
with a 10-residue-long sliding window that steps through the sequence one
position at a time. Within the sliding window, the three anterior and posterior
were assigned core abundances. This differs from the original algorithm that
used an average of the core and interfacial values known as the whole or
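The sliding-window scheme described in this section can be sketched as follows. Several details are assumptions made only for illustration: the abundance values are stored in four lookup tables keyed by facing and region, the three residues at each end of the window take interfacial values while the central four take core values, and the window score is a simple sum of the per-residue values. The toy tables in the usage note are hypothetical and are not the published abundances.

```python
WINDOW = 10  # residues per sliding window
FLANK = 3    # residues at each end treated as interfacial (assumption)


def window_score(window, tables, first_faces_out):
    """Score one window for one phase of the dyad repeat.

    tables: dict keyed by (facing, region), e.g. ("external", "core"),
            mapping one-letter amino acid codes to abundance values
    first_faces_out: True if the first residue is assigned the external face
    """
    score = 0.0
    for i, aa in enumerate(window):
        region = "interface" if i < FLANK or i >= WINDOW - FLANK else "core"
        faces_out = (i % 2 == 0) == first_faces_out
        facing = "external" if faces_out else "internal"
        score += tables[(facing, region)].get(aa, 0.0)
    return score


def strand_scores(sequence, tables):
    """Step through the sequence one position at a time, scoring both phases."""
    results = []
    for start in range(len(sequence) - WINDOW + 1):
        window = sequence[start:start + WINDOW]
        results.append((window_score(window, tables, True),
                        window_score(window, tables, False)))
    return results
```

With toy tables in which leucine scores 1 only on the external face and serine scores 1 only on the internal face, the sequence LSLSLSLSLSL yields two windows whose high score alternates between the two phases, which is the behavior expected of a dyad repeat detector.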
We would like to acknowledge Aram Krauson and Jessica Marks
for their critical reading of this manuscript.
and Educational Enhancement Program RC/EEP-05(2007-10).
Conflict of Interest: none declared.
Akama,H. et al. (2004) Crystal structure of the drug discharge outer membrane protein,
OprM, of Pseudomonas aeruginosa: dual modes of membrane anchoring and
occluded cavity end. J. Biol. Chem., 279, 52816–52819.
folding. Adv. Protein Chem., 29, 205–300.
Berman,H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
Berven,F.S. et al. (2004) BOMP: a program to predict integral β-barrel outer membrane
proteins encoded within genomes of Gram-negative bacteria. Nucleic Acids Res.,
Chimento,D.P. et al. (2003) Substrate-induced transmembrane signaling in the
cobalamin transporter BtuB. Nat. Struct. Biol., 10, 394–401.
Fernandez,C. et al. (2001) Transverse relaxation-optimized NMR spectroscopy with the
outer membrane protein OmpX in dihexanoyl phosphatidylcholine micelles. Proc.
Natl Acad. Sci. USA., 98, 2358–2363.
Gromiha,M.M. and Suwa,M. (2006) Discrimination of outer membrane proteins using
machine learning algorithms. Proteins, 63, 1031–1037.
Hu,J. and Yan,C. (2008) A method for discovering transmembrane β-barrel proteins in
Gram-negative bacterial proteomes. Comput. Biol. Chem., 32, 298–301.
Jayasinghe,S. et al. (2001) Energetics, stability, and prediction of transmembrane
helices. J. Mol. Biol., 312, 927–934.
Koronakis,V. et al. (2000) Crystal structure of the bacterial membrane protein TolC
central to multidrug efflux and protein export. Nature, 405, 914–919.
Liu,Q. et al. (2003) Identification of β-barrel membrane proteins based on amino acid
composition properties and predicted secondary structure. Comput. Biol. Chem., 27,
Matthews,B.W. (1975) Comparison of the predicted and observed secondary structure
of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442–451.
Meng,G. et al. (2006) Structure of the outer membrane translocator domain of the
Haemophilus influenzae Hia trimeric autotransporter. EMBO J., 25, 2297–2304.
Mowat,C.G. et al. (2004) Octaheme tetrathionate reductase is a respiratory enzyme with
novel heme ligation. Nat. Struct. Mol. Biol., 11, 1023–1024.
Murzin,A.G. et al. (1995) SCOP: a structural classification of proteins database for the
investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
Oomen,C.J. et al. (2004) Structure of the translocator domain of a bacterial
autotransporter. EMBO J., 23, 1257–1266.
Ou,Y.Y. et al. (2008) TMBETADISC-RBF: discrimination of β-barrel membrane
proteins using RBF networks and PSSM profiles. Comput. Biol. Chem., 32,
Parsiegla,G. et al. (2000) Crystal structures of the cellulase Cel48F in complex with
inhibitors and substrates give insights into its processive action. Biochemistry, 39,
Pautsch,A. and Schulz,G.E. (1998) Structure of the outer membrane protein A
transmembrane domain. Nat. Struct. Biol., 5, 1013–1017.
Remaut,H. et al. (2008) Fiber formation across the bacterial outer membrane by the
chaperone/usher pathway. Cell, 133, 640–652.
Rey,S. et al. (2005) PSORTdb: a protein subcellular localization database for bacteria.
Nucleic Acids Res., 33, D164–D168.
Rutten,L. et al. (2006) Crystal structure and catalytic mechanism of the LPS 3-O-
deacylase PagL from Pseudomonas aeruginosa. Proc. Natl Acad. Sci. USA., 103,
Sandgren,M. et al. (2001) The X-ray crystal structure of the Trichoderma reesei family
12 endoglucanase 3, Cel12A, at 1.9 Å resolution. J. Mol. Biol., 308, 295–310.
Schulz,G.E. (2000) β-Barrel membrane proteins. Curr. Opin. Struct. Biol., 10, 443–447.
Snider,C. et al. (2009) MPEx: a tool for exploring membrane proteins. Protein Sci., 18,
Song,L. et al. (1996) Structure of staphylococcal α-hemolysin, a heptameric
transmembrane pore. Science, 274, 1859–1866.
Vandeputte-Rutten,L. et al. (2001) Crystal structure of the outer membrane protease
OmpT from Escherichia coli suggests a novel catalytic site. EMBO J., 20,
Wakarchuk,W.W. et al. (1994) Mutational and crystallographic analyses of the active
site residues of the Bacillus circulans xylanase. Protein Sci., 3, 467–475.
Wang,Y.F. et al. (1997) Channel specificity: structural basis for sugar discrimination
and differential flux rates in maltoporin. J. Mol. Biol., 272, 56–63.
White,S.H. and Wimley,W.C. (1998) Hydrophobic interactions of peptides with
membrane interfaces. Biochim. Biophys. Acta, 1376, 339–352.
Wimley,W.C. (2002) Toward genomic identification of β-barrel membrane proteins:
composition and architecture of known structures. Protein Sci., 11, 301–312.
Wimley,W.C. (2003) The versatile β-barrel membrane protein. Curr. Opin. Struct. Biol.,
Tsx. EMBO J., 23, 3187–3195.