© 2000 Oxford University Press Nucleic Acids Research, 2000, Vol. 28, No. 1 235–242
The Protein Data Bank
Helen M. Berman[1,2,*], John Westbrook[1,2], Zukang Feng[1,2], Gary Gilliland[1,3], T. N. Bhat[1,3], Helge Weissig[1,4], Ilya N. Shindyalov[4] and Philip E. Bourne[1,4,5,6]
[1] Research Collaboratory for Structural Bioinformatics (RCSB),
[2] Department of Chemistry, Rutgers University, 610 Taylor Road, Piscataway, NJ 08854-8087, USA,
[3] National Institute of Standards and Technology, Route 270, Quince Orchard Road, Gaithersburg, MD 20899, USA,
[4] San Diego Supercomputer Center, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0505, USA,
[5] Department of Pharmacology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0500, USA and
[6] The Burnham Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037, USA
Received September 20, 1999; Revised and Accepted October 17, 1999
ABSTRACT
The Protein Data Bank (PDB; http://www.rcsb.org/pdb/ )
is the single worldwide archive of structural data of
biological macromolecules. This paper describes the
goals of the PDB, the systems in place for data deposition
and access, how to obtain further information,
and near-term plans for the future development of the
resource.
INTRODUCTION
The Protein Data Bank (PDB) was established at Brookhaven
National Laboratory (BNL) (1) in 1971 as an archive for
biological macromolecular crystal structures. In the beginning
the archive held seven structures, and with each year a handful
more were deposited. In the 1980s the number of deposited
structures began to increase dramatically. This was due to the
improved technology for all aspects of the crystallographic
process, the addition of structures determined by nuclear
magnetic resonance (NMR) methods, and changes in the
community views about data sharing. By the early 1990s the
majority of journals required a PDB accession code and at least
one funding agency (National Institute of General Medical
Sciences) adopted the guidelines published by the International
Union of Crystallography (IUCr) requiring data deposition for
all structures.
The mode of access to PDB data has changed over the years
as a result of improved technology, notably the availability of
the WWW replacing distribution solely via magnetic media.
Further, the need to analyze diverse data sets required the
development of modern data management systems.
Initial use of the PDB had been limited to a small group of
experts involved in structural research. Today depositors to the
PDB have varying expertise in the techniques of X-ray crystal
structure determination, NMR, cryoelectron microscopy and
theoretical modeling. Users are a very diverse group of
researchers in biology and chemistry, computer scientists,
educators, and students at all levels. The tremendous influx of
data soon to be fueled by the structural genomics initiative, and
the increased recognition of the value of the data toward
understanding biological function, demand new ways to
collect, organize and distribute the data.
In October 1998, the management of the PDB became the
responsibility of the Research Collaboratory for Structural
Bioinformatics (RCSB). In general terms, the vision of the
RCSB is to create a resource based on the most modern
technology that facilitates the use and analysis of structural
data and thus creates an enabling resource for biological
research. Specifically, in this paper we describe the current
procedures for the deposition, processing and distribution of
PDB data by the RCSB. In addition, we address
the issues of data uniformity. We conclude with some current
developments of the PDB.
DATA ACQUISITION AND PROCESSING
A key component of creating the public archive of information
is the efficient capture and curation of the data—data processing.
Data processing consists of data deposition, annotation and
validation. These steps are part of the fully documented and
integrated data processing system shown in Figure 1.
In the present system (Fig. 2), data (atomic coordinates,
structure factors and NMR restraints) may be submitted via
email or via the AutoDep Input Tool (ADIT; http://pdb.rutgers.edu/adit/ )
developed by the RCSB. ADIT, which is also used to process the
entries, is built on two components: the mmCIF dictionary, an
ontology of 1700 terms that define the macromolecular structure
and the crystallographic experiment (2,3), and a data processing
program called MAXIT (MAcromolecular EXchange Input Tool). This
integrated system helps to ensure that the data submitted are
consistent with the mmCIF dictionary, which defines data types,
enumerates ranges of allowable values where possible and
describes allowable relationships between data values.
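As a rough illustration of what dictionary-driven checking means, the sketch below validates a few data items against a toy dictionary that records a type, an optional enumeration and an optional numeric range for each item. The item names and constraints are hypothetical stand-ins chosen for readability; they are not taken from the mmCIF dictionary, and the code does not represent how ADIT or MAXIT are implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ItemDefinition:
    """One dictionary entry: an expected type plus optional constraints."""
    value_type: type
    allowed_values: Optional[frozenset] = None  # enumeration, if any
    minimum: Optional[float] = None             # inclusive lower bound
    maximum: Optional[float] = None             # inclusive upper bound


# Toy dictionary: names and limits are illustrative assumptions only.
TOY_DICTIONARY = {
    "experiment.method": ItemDefinition(
        str, allowed_values=frozenset({"X-RAY", "NMR", "THEORETICAL MODEL"})
    ),
    "cell.angle_alpha": ItemDefinition(float, minimum=0.0, maximum=180.0),
    "refine.resolution_high": ItemDefinition(float, minimum=0.0),
}


def validate_entry(entry: dict) -> list:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    for name, raw in entry.items():
        definition = TOY_DICTIONARY.get(name)
        if definition is None:
            problems.append(f"{name}: not defined in the dictionary")
            continue
        try:
            value = definition.value_type(raw)
        except (TypeError, ValueError):
            problems.append(
                f"{name}: {raw!r} is not a valid {definition.value_type.__name__}"
            )
            continue
        if definition.allowed_values is not None and value not in definition.allowed_values:
            problems.append(f"{name}: {value!r} is not an allowed value")
        if definition.minimum is not None and value < definition.minimum:
            problems.append(f"{name}: {value} is below {definition.minimum}")
        if definition.maximum is not None and value > definition.maximum:
            problems.append(f"{name}: {value} is above {definition.maximum}")
    return problems
```

Keeping the constraints in data rather than in code is the design point: a new data item or a tightened range only requires a dictionary change, not a change to the checking routine.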
After a structure has been deposited using ADIT, a PDB
identifier is sent to the author automatically and immediately
(Fig. 1, Step 1). This is the first stage in which information
about the structure is loaded into the internal core database (see
section on the PDB Database Resource). The entry is then
annotated as described in the validation section below. This
process involves using ADIT to help diagnose errors or
inconsistencies in the files. The completely annotated entry as
it will appear in the PDB resource, together with the validation
information, is sent back to the depositor (Step 2). After
reviewing the processed file, the author sends any revisions
(Step 3). Depending on the nature of these revisions, Steps 2
and 3 may be repeated. Once approval is received from the
author (Step 4), the entry and the tables in the internal core
database are ready for distribution. The schema of this core
database is a subset of the conceptual schema specified by the
mmCIF dictionary.
*To whom correspondence should be addressed at: Department of Chemistry, Rutgers University, 610 Taylor Road, Piscataway, NJ 08854-8087, USA.
Tel: +1 732 445 4667; Fax: +1 732 445 4320; Email: berman@rcsb.rutgers.edu
All aspects of data processing, including communications
with the author, are recorded and stored in the correspondence
archive. This makes it possible for the PDB staff to retrieve
information about any aspect of the deposition process and to
closely monitor the efficiency of PDB operations.
Current status information, comprising a list of authors,
title and release category, is stored for each entry in the core
database and is made accessible for query via the WWW interface
(http://www.rcsb.org/pdb/status.html ). Entries before release
are categorized as ‘in processing’ (PROC), ‘in depositor
review’ (WAIT), ‘to be held until publication’ (HPUB) or ‘on
hold until a depositor-specified date’ (HOLD).
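The four pre-release categories listed above map naturally onto an enumeration. The following is a minimal sketch of how a client might turn a status code into its description; only the four codes and their meanings come from the text, while the class and function names are hypothetical.

```python
from enum import Enum


class ReleaseStatus(Enum):
    """Pre-release categories as named in the text."""
    PROC = "in processing"
    WAIT = "in depositor review"
    HPUB = "to be held until publication"
    HOLD = "on hold until a depositor-specified date"


def describe_status(code: str) -> str:
    """Translate a status code (case-insensitive) into a readable label."""
    try:
        status = ReleaseStatus[code.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown status code: {code!r}") from None
    return f"{status.name}: {status.value}"
```

Rejecting unknown codes with an explicit error, rather than passing them through, keeps a typo in a status field from being mistaken for a fifth category.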
Content of the data collected by the PDB
All the data collected from depositors by the PDB are considered
primary data. Primary data contain, in addition to the coordinates,
general information required for all deposited structures and
information specific to the method of structure determination.
Table 1 contains the general information that the PDB collects
for all structures as well as the additional information collected for
those structures determined by X-ray methods. The additional
items listed for the NMR structures are derived from the
International Union of Pure and Applied Chemistry (IUPAC)
recommendations (4) and will be implemented in the near future.
The information content of data submitted by the depositor is
likely to change as new methods for data collection, structure
determination and refinement evolve and advance. In addition,
the ways in which these data are captured are likely to change
as the software for structure determination and refinement
produces the necessary data items as part of its output. ADIT,
the data input system for the PDB, has been designed so as to
easily incorporate these likely changes.
Figure 1. The steps in PDB data processing. Ellipses represent actions and rectangles define content.
Figure 2. The integrated tools of the PDB data processing system.
Table 1. Content of data in the PDB
Validation
Validation refers to the procedure for assessing the quality of
deposited atomic models (structure validation) and for
assessing how well these models fit the experimental data
(experimental validation). The PDB validates structures using
accepted community standards as part of ADIT’s integrated
data processing system. The following checks are run and are
summarized in a letter that is communicated directly to the
depositor:
Covalent bond distances and angles. Proteins are compared
against standard values from Engh and Huber (5); nucleic acid
bases are compared against standard values from Clowney
et al. (6); sugar and phosphates are compared against standard
values from Gelbin et al. (7).
Stereochemical validation. All chiral centers of proteins and
nucleic acids are checked for correct stereochemistry.
Atom nomenclature. The nomenclature of all atoms is checked
for compliance with IUPAC standards (8) and is adjusted if
necessary.
Close contacts. The distances between all atoms within the
asymmetric unit of crystal structures and the unique molecule
of NMR structures are calculated. For crystal structures,
contacts between symmetry-related molecules are checked as
well.
Ligand and atom nomenclature. Residue and atom nomenclature
is compared against the PDB dictionary
(ftp://ftp.rcsb.org/pub/pdb/data/monomers/het_dictionary.txt ) for all ligands
as well as standard residues and bases. Unrecognized ligand
groups are flagged and any discrepancies in known ligands are
listed as extra or missing atoms.
Sequence comparison. The sequence given in the PDB SEQRES
records is compared against the sequence derived from the
coordinate records. This information is displayed in a table
where any differences or missing residues are marked. During
structure processing, the sequence database references given
by DBREF and SEQADV are checked for accuracy. If no
reference is given, a BLAST (9) search is used to find the best
match. Any conflict between the PDB SEQRES records and
the sequence derived from the coordinate records is resolved
by comparison with various sequence databases.
Distant waters. The distances between all water oxygen atoms
and all polar atoms (oxygen and nitrogen) of the macromolecules,
ligands and solvent in the asymmetric unit are calculated.
Distant solvent atoms are repositioned using crystallographic
symmetry such that they fall within the solvation sphere of the
macromolecule.
In almost all cases, serious errors detected by these checks are
corrected through annotation and correspondence with the authors.
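To make the sequence comparison check concrete, the sketch below builds the kind of table described above, marking residues that differ or are missing from the coordinates. It assumes the sequences have already been read from the file and that coordinate residue numbers run in step with the SEQRES order from a known starting number; real entries often need a proper alignment instead. The function names and flag labels are hypothetical.

```python
def compare_sequences(seqres, observed, first_number=1):
    """Compare the declared sequence with residues seen in the coordinates.

    seqres       -- residue names in SEQRES order, e.g. ["MET", "ALA"]
    observed     -- mapping of residue number to residue name from coordinates
    first_number -- residue number assumed to match the first SEQRES residue
    """
    rows = []
    for index, expected in enumerate(seqres):
        number = first_number + index
        found = observed.get(number)
        if found is None:
            flag = "MISSING"       # declared but absent from the coordinates
        elif found != expected:
            flag = "DIFFERENT"     # present but with another residue name
        else:
            flag = ""
        rows.append((number, expected, found or "---", flag))
    return rows


def format_table(rows):
    """Render the comparison as a fixed-width text table."""
    lines = [f"{'resSeq':>6} {'SEQRES':<6} {'coords':<6} note"]
    for number, expected, found, flag in rows:
        lines.append(f"{number:>6} {expected:<6} {found:<6} {flag}".rstrip())
    return "\n".join(lines)
```

A call such as `print(format_table(compare_sequences(seqres, observed)))` gives an annotator a quick view of where the two sources disagree before the sequence databases are consulted.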
It is also possible to run these validation checks against
structures before they are deposited. A validation server
(http://pdb.rutgers.edu/validate/ ) has been made available for
this purpose. In addition to the summary report letter, the
server also provides output from PROCHECK (10), NUCheck
(Rutgers University, 1998) and SFCHECK (11). A summary
atlas page and molecular graphics are also produced.
The PDB will continually review the checking methods used
and will integrate new procedures as they are developed by the
PDB and members of the scientific community.
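The close-contact check listed earlier in this section amounts to a pairwise distance scan. The sketch below shows the idea for atoms within a single asymmetric unit; the 2.2 Å cutoff, the atom labels and the way bonded pairs are excluded are assumptions made for illustration, and contacts between symmetry-related molecules are not handled.

```python
import math
from itertools import combinations


def close_contacts(atoms, cutoff=2.2, bonded=frozenset()):
    """Return atom pairs closer than `cutoff` (in angstroms), nearest first.

    atoms  -- list of (label, (x, y, z)) tuples
    bonded -- set of frozenset({label_a, label_b}) pairs to skip, since
              covalently bonded atoms are expected to be close
    """
    hits = []
    for (label_a, xyz_a), (label_b, xyz_b) in combinations(atoms, 2):
        if frozenset((label_a, label_b)) in bonded:
            continue
        distance = math.dist(xyz_a, xyz_b)
        if distance < cutoff:
            hits.append((label_a, label_b, round(distance, 3)))
    return sorted(hits, key=lambda hit: hit[2])
```

The all-pairs loop is quadratic in the number of atoms, which is acceptable for a sketch; a production check on large structures would normally use a spatial grid or tree to limit the pairs examined.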
Other data deposition centers
The PDB is working with other groups to set up deposition
centers. This enables people at other sites to more easily
deposit their data via the Internet. Because it is critical that the
final archive is kept uniform, the content and format of the
final files as well as the methods used to check them must be
the same. At present, the European Bioinformatics Institute
(EBI) processes data that are submitted to them via AutoDep
(http://autodep.ebi.ac.uk/ ). Once these data are processed they
are sent to the RCSB in PDB format for inclusion in the central
archive. Before this system was put in place it was tested to
ensure consistency among entries in the PDB archive. In the
future, the data will be exchanged in mmCIF format using a
common exchange dictionary, which along with standardized
annotation procedures will ensure a high degree of uniformity
in the archival data. Structures deposited and processed at the
EBI represent ~20% of all data deposited.
Data deposition will also soon be available from an ADIT
Web site at The Institute for Protein Research at Osaka
University in Japan. At first, structures deposited at this site
will be processed by the PDB staff. In time, the staff at Osaka
will complete the data processing for these entries and send the
files to the PDB for release.
NMR data
The PDB staff recognizes that NMR data need a special
development effort. Historically these data have been retrofitted
into a PDB format defined around crystallographic information.
As a first step towards improving this situation, the
PDB did an extensive assessment of the current NMR holdings
and presented its findings to a Task Force consisting of a
cross section of NMR researchers. The PDB is working with
this group, the BioMagResBank (BMRB) (12), as well as other
members of the NMR community, to develop an NMR data
dictionary along with deposition and validation tools specific
for NMR structures. This dictionary contains among other
items descriptions of the solution components, the experimental
conditions, enumerated lists of the instruments used, as well as
information about structure refinement.
Data processing statistics
Production processing of PDB entries by the RCSB began on
January 27, 1999. The median time from deposition to the
completion of data processing including author interactions is
less than 10 days. The number of structures with a HOLD
release status remains at ~22% of all submissions; 28% are
held until publication; and 50% are released immediately after
processing.
When the RCSB became fully responsible there were about
900 structures that had not been completely processed. These
included so called Layer 1 structures that had been processed
by computer software but had not been fully annotated. All of
these structures have now been processed and are being
released after author review.
The breakdown of the types of structures in the PDB is
shown in Table 2. As of September 14, 1999, the PDB
contained 10 714 publicly accessible structures with another
1169 entries on hold. Of the publicly accessible structures, 8789 (82%) were determined
by X-ray methods, 1692 (16%) were determined by NMR and
233 (2%) were theoretical models. Overall, 35% of the entries
have deposited experimental data.
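The percentages quoted above follow directly from the counts, as this short check shows. The numbers are those given in the text; the variable names are arbitrary.

```python
# Counts of publicly accessible structures by method, as of September 14, 1999.
counts = {"X-ray": 8789, "NMR": 1692, "theoretical model": 233}
on_hold = 1169

total = sum(counts.values())                      # publicly accessible entries
shares = {method: round(100 * n / total) for method, n in counts.items()}
all_entries = total + on_hold                     # public plus on-hold entries

for method, percent in shares.items():
    print(f"{method}: {counts[method]} of {total} ({percent}%)")
```

The three method counts add up to the 10 714 public structures, so the percentages are shares of the released archive and exclude the entries on hold.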
Data uniformity
A key goal of the PDB is to make the archive as consistent and
error-free as possible. All current depositions are reviewed
carefully by the staff before release. Tables of features are
generated from the internal data processing database and
checked. Errors found subsequent to release by authors and
PDB users are addressed as rapidly as possible. Corrections
and updates to entries should be sent to deposit@rcsb.rutgers.edu
for the changes to be implemented and re-released
into the PDB archive.
One of the most difficult problems that the PDB now faces is
that the legacy files are not uniform. Historically, existing data
(‘legacy data’) comply with several different PDB formats and
variation exists in how the same features are described for
different structures within each format. The introduction of the
advanced querying capabilities of the PDB makes it critical to
accelerate the data uniformity process for these data. We are
now at a stage where the query capabilities surpass the quality
of the underlying data. The data uniformity project is being
approached in two ways. Families of individual structures are
being reprocessed using ADIT. The strategy of processing data
files as groups of similar structures facilitates the application
of biological knowledge by the annotators. In addition, we are
examining particular records across all entries in the archive.
As an example, we have recently completed examining and
correcting the chemical descriptions of all of the ligands in the
PDB. These corrections are being entered in the database. The
practical consequence of this is that soon it will be possible to
accurately find all the structures in the PDB bound to a particular
ligand or ligand type. In addition to the efforts of the PDB to
remediate the older entries, the EBI has also corrected many of
the records in the PDB as part of their ‘clean-up’ project. The
task of integrating all of these corrections done at both sites is
very large and it is essential that there is a well-defined
exchange format to do this; mmCIF will be used for this
purpose.
THE PDB DATABASE RESOURCE
The database architecture
In recognition of the fact that no single architecture can fully
express and efficiently make available the information content
of the PDB, an integrated system of heterogeneous databases
has been created that stores and organizes the structural data. At
present there are five major components (Fig. 3):
The core relational database managed by Sybase (Sybase
SQL server release 11.0, Emeryville, CA) provides the
central physical storage for the primary experimental and
coordinate data described in Table 1. The core PDB relational
database contains all deposited information in a tabular form
that can be accessed across any number of structures.
Table 2. Demographics of data in the PDB
Figure 3. The integrated query interface to the PDB.
The final curated data files (in PDB and mmCIF formats)
and data dictionaries are the archival data and are present as
ASCII files in the ftp archive.
The POM (Property Object Model)-based databases, which
consist of indexed objects containing native (e.g., atomic
coordinates) and derived properties (e.g., calculated secondary
structure assignments and property profiles). Some properties
require no derivation, for example, B factors; others must be
derived, for example, exposure of each amino acid residue
(13) or Cα contact maps. Properties requiring significant
computation time, such as structure neighbors (14), are
pre-calculated when the database is incremented to save considerable
user access time.
The Biological Macromolecule Crystallization Database
(BMCD; 15) is organized as a relational database within
Sybase and contains three general categories of literature
derived information: macromolecular, crystal and summary
data.
The Netscape LDAP server is used to index the textual
content of the PDB in a structured format and provides
support for keyword searches.
It is critical that the intricacies of the underlying physical
databases be transparent to the user. In the current implementation,
communication among databases has been accomplished using
the Common Gateway Interface (CGI). An integrated Web
interface dispatches a query to the appropriate database(s),
which then execute the query. Each database returns the PDB
identifiers that satisfy the query, and the CGI program integrates
the results. Complex queries are performed by repeating the
process and having the interface program perform the appropriate
Boolean operation(s) on the collection of query results. A
variety of output options are then available for use with the
final list of selected structures.
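The integration step described above, in which each database returns a set of PDB identifiers and the interface program applies Boolean operations to them, can be sketched with ordinary set operations. The back-end names, query terms and identifiers below are invented placeholders, not real PDB content, and the real system dispatches over CGI rather than through in-memory dictionaries.

```python
from functools import reduce

# Hypothetical stand-ins for the separate databases: each maps a query
# term to the set of identifiers that satisfy it.
BACKENDS = {
    "relational": {"resolution<2.0": {"1AAA", "1BBB", "1CCC"}},
    "keyword": {"protease": {"1BBB", "1CCC", "1DDD"}},
    "pom": {"helix-rich": {"1CCC", "1EEE"}},
}


def run_subquery(backend, term):
    """Ask one back-end for the identifiers matching a term."""
    return set(BACKENDS[backend].get(term, set()))


def integrate(subqueries, operator="AND"):
    """Dispatch (backend, term) pairs and combine the identifier sets."""
    result_sets = [run_subquery(backend, term) for backend, term in subqueries]
    if not result_sets:
        return set()
    if operator == "AND":
        return reduce(set.intersection, result_sets)
    if operator == "OR":
        return reduce(set.union, result_sets)
    raise ValueError(f"unsupported operator: {operator!r}")
```

Because every back-end answers in the same currency (a set of identifiers), the interface never needs to know how any individual database stores its data, which is what keeps the underlying heterogeneity hidden from the user.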
The CGI approach [and in the future a CORBA (Common
Object Request Broker Architecture)-based approach] will
permit other databases to be integrated into this system, for
example extended data on different protein families. The same
approach could also be applied to include NMR data found in
the BMRB or data found in other community databases.
Database query
Three distinct query interfaces are available for the query of data
within the PDB: Status Query (http://www.rcsb.org/pdb/status.html ),
SearchLite (http://www.rcsb.org/pdb/searchlite.html ) and
SearchFields (http://www.rcsb.org/pdb/queryForm.cgi ). Table 3
summarizes the current query and analysis capabilities of the
PDB. Figure 4 illustrates how the various query options are
organized.
SearchLite, which provides a single form field for keyword
searches, was introduced in February 1999. All textual information
within the PDB files as well as dates and some experimental
data are accessible via simple or structured queries. SearchFields,
accessible since May 1999, is a customizable query
form that allows searching over many different data items
including compound, citation authors, sequence (via a FASTA
search; 16) and release or deposition dates.
Two user interfaces provide extensive information for result
sets from SearchLite or SearchFields queries. The ‘Query
Result Browser’ interface allows for access to some general
information, more detailed information in tabular format, and
the possibility to download whole sets of data files for result
sets consisting of multiple PDB entries. The ‘Structure
Explorer’ interface provides information about individual
structures as well as cross-links to many external resources for
macromolecular structure data (Table 4). Both interfaces are
accessible to other data resources through the simple CGI
application programmer interface (API) described at
http://www.rcsb.org/pdb/linking.html .
Figure 4. The various query options that are available for the PDB.
Table 3. Current query capabilities of the PDB
The website usage has climbed dramatically since the system
was first introduced in February 1999 (Table 5). As of
November 1, 1999, the main PDB site receives, on average,
greater than one hit per second and greater than one query per
minute.
DATA DISTRIBUTION
The PDB distributes coordinate data, structure factor files and
NMR constraint files. In addition it provides documentation
and derived data. The coordinate data are distributed in PDB
and mmCIF formats. Currently, the PDB file is created as the
final product of data annotation; the program pdb2cif (17) is
used to generate the mmCIF data. This program is used to
accommodate the legacy data. In the future, both the mmCIF and PDB
format files created during data annotation will be distributed.
Data are distributed to the community in the following ways:
From primary PDB Web and ftp sites at UCSD, Rutgers and
NIST that are updated weekly.
From complete Web-based mirror sites that contain all databases,
data files, documentation and query interfaces updated
weekly.
From ftp-only mirror sites that contain a complete or subset
copy of data files, updated at intervals defined by the mirror
site. The steps necessary to create an ftp-only mirror site are
described in http://www.rcsb.org/pdb/ftpproc.final.html
Quarterly CD-ROM.
Data are distributed once per week. New data officially
become available at 1 a.m. PST each Wednesday. This follows
the tradition developed by BNL and has minimized the impact
of the transition on existing mirror sites. Since May 1999, two
ftp archives have been provided: ftp://ftp.rcsb.org , a reorganized
and more logical organization of all PDB data, software, and
documentation; and ftp://bnlarchive.rcsb.org , a near-identical
copy of the original BNL archive which is maintained for
purposes of backward compatibility. RCSB-style PDB mirrors
have been established in Japan (Osaka University), Singapore
(National University Hospital) and in the UK (the Cambridge
Crystallographic Data Centre). Plans call for operating mirrors in
Brazil, Australia, Canada, Germany, and possibly India.
The first PDB CD-ROM distribution by the RCSB contained
the coordinate files, experimental data, software and documentation
as found in the PDB on June 30, 1999. Data are currently
distributed as compressed files using the compression utility
program gzip. Refer to http://www.rcsb.org/pdb/cdrom.html
for details of how to order CD-ROM sets. There is presently no
charge for this service.
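Since the files are distributed gzip-compressed, a consumer can read them without unpacking them to disk first. The sketch below counts coordinate records in a compressed PDB-format file held in memory; ATOM and HETATM are record names of the PDB format, while the function name and the use of an in-memory buffer are choices made here for illustration.

```python
import gzip
import io


def count_atom_records(compressed: bytes) -> int:
    """Count ATOM and HETATM records in a gzip-compressed PDB-format file."""
    with gzip.open(io.BytesIO(compressed), mode="rt", encoding="ascii") as handle:
        # PDB-format record names occupy the first six columns of each line.
        return sum(1 for line in handle if line.startswith(("ATOM  ", "HETATM")))
```

For a file on disk, passing the path to `gzip.open` in place of the `BytesIO` buffer works the same way, and reading line by line keeps memory use low even for large entries.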
DATA ARCHIVING
The PDB is establishing a central Master Archiving facility.
The Master Archive plan is based on five goals: reconstruction
of the current archive in case of a major disaster; duplication of
the contents of the PDB as it existed on a specific date; preservation
of software, derived data, ancillary data and all other computerized
and printed information; automatic archiving of all depositions
and the PDB production resource; and maintenance of the PDB
correspondence archive that documents all aspects of deposition.
During the transition period, all physical materials including
electronic media and hard copy materials were inventoried and
stored, and are being catalogued.
MAINTENANCE OF THE LEGACY BNL SYSTEM
One of the goals of the PDB has been to provide a smooth
transition from the system at BNL to the new system. Accordingly,
AutoDep, which was developed by BNL (18) for data deposition,
has been ported to the RCSB site and enables depositors to
complete in-progress depositions as well as to make new
depositions. In addition, the EBI accepts data using AutoDep.
Similarly, the programs developed at BNL for data query and
distribution (PDBLite, 3DBbrowser, etc.) are being maintained
by the remaining BNL-style mirrors. The RCSB provides data
in a form usable by these mirrors. Finally, the style and format
of the BNL ftp archive are being maintained at
ftp://bnlarchive.rcsb.org .
A multitude of resources and programs depend upon their
links to the PDB. To eliminate the risk of interruption to these
services, links to the PDB at BNL were automatically redirected to
the RCSB after BNL closed operations on June 30, 1999 using
a network redirect implemented jointly by RCSB and BNL
staff. While this redirect will be maintained, external resources
linking to the PDB are advised to change any URLs from
http://www.pdb.bnl.gov/ to http://www.rcsb.org/ .
Table 4. Static cross-links to other data resources currently provided by the PDB
Table 5. Web query statistics for the primary RCSB site (http://www.rcsb.org )
CURRENT DEVELOPMENTS
In the coming months, the PDB plans to continue to improve
and develop all aspects of data processing. Deposition will be
made easier, and annotation will be more automated. In addition,
software for data deposition and validation will be made available
for in-laboratory use.
The PDB will also continue to develop ways of exchanging
information between databases. The PDB is leading the Object
Management Group Life Sciences Initiative’s efforts to define
a CORBA interface definition for the representation of
macromolecular structure data. This is a standard developed under a
strict procedure to ensure maximum input by members of
various academic and industrial research communities. At this
stage, proposals for the interface definition, including a
working prototype that uses the standard, are being accepted.
For further details refer to
http://www.omg.org/cgi-bin/doc?lifesci/99-08-15 . The finalized standard interface will facilitate the query
and exchange of structural information not just at the level of
complete structures, but at finer levels of detail. The standard
being proposed by the PDB will conform closely to the mmCIF
standard. It is recognized that other forms of data representation
are desirable, for example using eXtensible Markup Language
(XML). The PDB will continue to work with mmCIF as the
underlying standard from which CORBA and XML representations
can be generated as dictated by the needs of the
community.
The PDB will also develop the means and methods of
communications with the broad PDB user community via the
Web. To date we have developed prototype protein documentaries
(19) that explore this new medium in describing structure–
function relationships in proteins. It is also possible to develop
educational materials that will run using a recent Web browser
(20).
Finally it is recognized that structures exist both in the public
and private domains. To this end we are planning on providing
a subset of database tools for local use. Users will be able to
load both public and proprietary data and use the same search
and exploratory tools used at PDB resources.
The PDB does not exist in isolation, rather each structure
represents a point in a spectrum of information that runs from
the recognition of an open reading frame to a fully understood
role of the single or multiple biological functions of that molecule.
The available information that exists on this spectrum changes
over time. Recognizing this, the PDB has developed a scheme
for the dynamic update of a variety of links on each structure to
whatever else can be automatically located on the Internet.
This information is itself stored in a database and can be
queried. This feature will appear in the coming months to
supplement the existing list of static links to a small number of
the more well known related Internet resources.
PDB ADVISORY BOARDS
The PDB has several advisory boards. Each member institution
of the RCSB has its own local PDB Advisory Committee. Each
institution is responsible for implementing the recommendations
of those committees, as well as the recommendations of an
International Advisory Board. Initially, the RCSB presented a
report to the Advisory Board previously convened by BNL. At
their recommendation, a new Board has been approached
which contains previous members and new members. The goal
was to have the Board accurately reflect the depositor and user
communities and thus include experts from many disciplines.
Serious issues of policy are referred to the major scientific
societies, notably the IUCr. The goal is to make decisions
based on input from a broad international community of
experts. The IUCr maintains the mmCIF dictionary as the data
standard upon which the PDB is built.
FOR FURTHER INFORMATION
The PDB seeks to keep the community informed of new developments
via weekly news updates to the Web site, quarterly
newsletters, and a soon-to-be-initiated annual report. Users can
request information at any time by sending mail to info@rcsb.org .
Finally, the pdb-l@rcsb.org listserver provides a community
forum for the discussion of PDB-related issues. Changes to PDB
operations that may affect the community, for example, data
format changes, are posted here and users have 60 days to
discuss the issue before changes are made according to major
consensus. Table 6 indicates how to access these resources.
Table 6. PDB information sources
CONCLUSION
These are exciting and challenging times to be responsible for
the collection, curation and distribution of macromolecular
structure data. Since the RCSB assumed responsibility for data
deposition in February 1999, the number of depositions has
averaged approximately 50 per week. However, with the
advent of a number of structural genomics initiatives worldwide
this number is likely to increase. We estimate that the
PDB, which at this writing contains approximately 10 500
structures, could triple or quadruple in size over the next
5 years. This presents a challenge to timely distribution while
maintaining high quality. The PDB’s approach of using
modern data management practices should permit us to scale to
accommodate a large data influx.
The maintenance and further development of the PDB are
community efforts. The willingness of others to share ideas,
software and data provides a depth to the resource not obtainable
otherwise. Some of these efforts are acknowledged below.
New input is constantly being sought and the PDB invites you
to make comments at any time by sending electronic mail to
info@rcsb.org.
ACKNOWLEDGEMENTS
The Research Collaboratory for Structural Bioinformatics (RCSB) is a
consortium consisting of three institutions: Rutgers University, the
San Diego Supercomputer Center at the University of California,
San Diego, and the National Institute of Standards and Technology.
The current RCSB PDB staff include the authors indicated and
Kyle Burkhardt, Anke Gelbin, Michael Huang, Shri Jain,
Rachel Kramer, Nate Macapagal, Victoria Colflesh, Bohdan
Schneider, Kata Schneider, Christine Zardecki (Rutgers);
Phoebe Fagan, Diane Hancock, Narmada Thanki, Michael
Tung, Greg Vasquez (NIST); Peter Arzberger, John Badger,
Douglas S. Greer, Michael Gribskov, John Kowalski, Glen
Otero, Shawn Strande, Lynn F. Ten Eyck, Kenneth Yoshimoto
(UCSD). The continuing support of Ken Breslauer (Rutgers),
John Rumble (NIST) and Sid Karin (SDSC) is gratefully
acknowledged. Current collaborators contributing to the future
development of the PDB are the BioMagResBank, the
Cambridge Crystallographic Data Centre, the HIV Protease
Database Group, the Institute for Protein Research at Osaka
University, the National Center for Biotechnology Information,
the ReLiBase developers, and the Swiss Institute for
Bioinformatics/Glaxo. We are especially grateful to Kim Henrick
of the EBI and Steve Bryant at NCBI, who have reviewed our
files and sent back constructive criticisms. This has helped the
PDB to continuously improve its procedures for producing
entries. The cooperation of the BNL PDB staff is gratefully
acknowledged. Portions of this article will appear in Volume F
of the International Tables for Crystallography. This work is
supported by grants from the National Science Foundation, the
Office of Biology and Environmental Research at the Department
of Energy, and two units of the National Institutes of Health:
the National Institute of General Medical Sciences and the
National Library of Medicine.
REFERENCES
1. Bernstein,F.C., Koetzle,T.F., Williams,G.J., Meyer,E.F.,Jr, Brice,M.D.,
Rodgers,J.R., Kennard,O., Shimanouchi,T. and Tasumi,M. (1977)
J. Mol. Biol., 112, 535–542.
2. Bourne,P., Berman,H.M., Watenpaugh,K., Westbrook,J.D. and
Fitzgerald,P.M.D. (1997) Methods Enzymol., 277, 571–590.
3. Westbrook,J. and Bourne,P.E. (2000) Bioinformatics, in press.
4. Markley,J.L., Bax,A., Arata,Y., Hilbers,C.W., Kaptein,R., Sykes,B.D.,
Wright,P.E. and Wüthrich,K. (1998) J. Biomol. NMR, 12, 1–23.
5. Engh,R.A. and Huber,R. (1991) Acta Crystallogr., A47, 392–400.
6. Clowney,L., Jain,S.C., Srinivasan,A.R., Westbrook,J., Olson,W.K. and
Berman,H.M. (1996) J. Am. Chem. Soc., 118, 509–518.
7. Gelbin,A., Schneider,B., Clowney,L., Hsieh,S.-H., Olson,W.K. and
Berman,H.M. (1996) J. Am. Chem. Soc., 118, 519–528.
8. IUPAC–IUB Joint Commission on Biochemical Nomenclature (1983)
Eur. J. Biochem., 131, 9–15.
9. Zhang,J., Cousens,L.S., Barr,P.J. and Sprang,S.R. (1991) Proc. Natl
Acad. Sci. USA, 88, 3446–3450.
10. Laskowski,R.A., MacArthur,M.W., Moss,D.S. and Thornton,J.M. (1993)
J. Appl. Crystallogr., 26, 283–291.
11. Vaguine,A.A., Richelle,J. and Wodak,S.J. (1999) Acta Crystallogr., D55,
191–205.
12. Ulrich,E.L., Markley,J.L. and Kyogoku,Y. (1989) Protein Seq. Data Anal.,
2, 23–37.
13. Lee,B. and Richards,F.M. (1971) J. Mol. Biol., 55, 379–400.
14. Shindyalov,I.N. and Bourne,P.E. (1998) Protein Eng., 11, 739–747.
15. Gilliland,G.L. (1988) J. Cryst. Growth, 90, 51–59.
16. Pearson,W.R. and Lipman,D.J. (1988) Proc. Natl Acad. Sci. USA, 85,
2444–2448.
17. Bernstein,H.J., Bernstein,F.C. and Bourne,P.E. (1998) J. Appl.
Crystallogr., 31, 282–295.
18. Brookhaven National Laboratory (1998) AutoDep, version 2.1. Upton, NY.
19. Quinn,G., Taylor,A., Wang,H.-P. and Bourne,P.E. (1999) Trends
Biochem. Sci., 24, 321–324.
20. Quinn,G., Wang,H.-P., Martinez,D. and Bourne,P.E. (1999)
Pacific Symp. Biocomput., 380–391.
21. Siddiqui,A. and Barton,G. (1996) Perspectives on Protein Engineering
1996, 2, (CD-ROM edition; Geisow,M.J. ed.) BIODIGM Ltd (UK).
ISBN 0-9529015-0-1.
22. Orengo,C.A., Michie,A.D., Jones,S., Jones,D.T., Swindells,M.B. and
Thornton,J.M. (1997) Structure, 5, 1093–1108.
23. Kabsch,W. and Sander,C. (1983) Biopolymers, 22, 2577–2637.
24. Holm,L. and Sander,C. (1998) Nucleic Acids Res., 26, 316–319.
25. Nayal,M., Hitz,B.C. and Honig,B. (1999) Protein Sci., 8, 676–679.
26. Dodge,C., Schneider,R. and Sander,C. (1998) Nucleic Acids Res., 26,
313–315.
27. Sühnel,J. (1996) Comput. Appl. Biosci., 12, 227–229.
28. Hogue,C., Ohkawa,H. and Bryant,S. (1996) Trends Biochem. Sci., 21,
226–229.
29. Berman,H.M., Olson,W.K., Beveridge,D.L., Westbrook,J., Gelbin,A.,
Demeny,T., Hsieh,S.H., Srinivasan,A.R. and Schneider,B. (1992)
Biophys. J., 63, 751–759.
30. Weissig,H., Shindyalov,I.N. and Bourne,P.E. (1998) Acta Crystallogr.,
D54, 1085–1094.
31. Laskowski,R.A., Hutchinson,E.G., Michie,A.D., Wallace,A.C.,
Jones,M.L. and Thornton,J.M. (1997) Trends Biochem. Sci., 22, 488–490.
32. Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995)
J. Mol. Biol., 247, 536–540.
33. Neshich,G., Togawa,R., Vilella,W. and Honig,B. (1998) Protein Data
Bank Quarterly Newsletter, 84.
34. Westhead,D., Slidel,T., Flores,T. and Thornton,J. (1999) Protein Sci., 8,
897–904.
35. Gibrat,J.-F., Madej,T. and Bryant,S.H. (1996) Curr. Opin. Struct. Biol., 6,
377–385.
36. Hooft,R.W.W., Sander,C. and Vriend,G. (1996) J. Appl. Crystallogr., 29,
714–716.
... We use the Q-BioLiP database [28] as the complex corpus. To prevent potential overfitting to a limited portion of the chemical space represented by the Q-BioLiP dataset, we additionally incorporate the PCQM4Mv2 dataset [32], which has been widely used for 3D molecular pretraining [33,34], and extract potential pockets on proteins from the Protein Data Bank [35]. To ensure the scalability of the pre-training process, we propose unified corrupt-then-denoise objectives applicable to various domain data (Figure 1d). ...
... We use the Q-BioLiP database [28] as the complex corpus. To prevent potential overfitting to a limited portion of the chemical space represented by the Q-BioLiP dataset, we additionally incorporate the PCQM4Mv2 dataset [32], which has been widely used for 3D molecular pre-training [33,34], and extract potential pockets on proteins from the Protein Data Bank [35]. ...
... These molecules are characterized by their 3D structures at equilibrium, calculated using density functional theory. For pocket data, we apply P2Rank [65] to detect potential ligand binding sites on proteins from the Protein Data Bank [35], which contains 0.2M proteins with experimentally-determined 3D structures, and collect a dataset of 2M pockets. ...
Preprint
Full-text available
Structure-based drug discovery (SBDD) is a systematic scientific process that develops new drugs by leveraging the detailed physical structure of the target protein. Recent advancements in pre-trained models for biomolecules have demonstrated remarkable success across various biochemical applications, including drug discovery and protein engineering. However, in most approaches, the pre-trained models primarily focus on the characteristics of either small molecules or proteins, without delving into their binding interactions which are essential cross-domain relationships pivotal to SBDD. To fill this gap, we propose a general-purpose foundation model named BIT (an abbreviation for Biomolecular Interaction Transformer), which is capable of encoding a range of biochemical entities, including small molecules, proteins, and protein-ligand complexes, as well as various data formats, encompassing both 2D and 3D structures. Specifically, we introduce Mixture-of-Domain-Experts (MoDE) to handle the biomolecules from diverse biochemical domains and Mixture-of-Structure-Experts (MoSE) to capture positional dependencies in the molecular structures. The proposed mixture-of-experts approach enables BIT to achieve both deep fusion and domain-specific encoding, effectively capturing fine-grained molecular interactions within protein-ligand complexes. Then, we perform cross-domain pre-training on the shared Transformer backbone via several unified self-supervised denoising tasks. Experimental results on various benchmarks demonstrate that BIT achieves exceptional performance in downstream tasks, including binding affinity prediction, structure-based virtual screening, and molecular property prediction.
... RAD51-BRC4 molecular docking calculations. The structure of RAD51 (PDB: 7EJC[62]) was retrieved from the Protein Data Bank[63]. Subsequently, only the 'B' chain was selected for further analysis, undergoing preparation using the Protein Preparation Wizard tool (Schrödinger Release 2022-1: Protein Preparation Wizard; Epik, Schrödinger, LLC, New York, NY, 2022; Impact, Schrödinger, LLC, New York, NY; Prime, Schrödinger, LLC, New York, NY, 2022). ...
Article
Full-text available
The RAD51-BRCA2 interaction is central to DNA repair through homologous recombination. Emerging evidence indicates RAD51 overexpression and its correlation with chemoresistance in various cancers, suggesting RAD51-BRCA2 inhibition as a compelling avenue for intervention. We previously showed that combining olaparib (a PARP inhibitor (PARPi)) with RS -35d (a BRCA2-RAD51 inhibitor) was efficient in killing pancreatic ductal adenocarcinoma (PDAC) cells. However, RS -35d impaired cell viability even when administered alone, suggesting potential off-target effects. Here, through multiple, integrated orthogonal biological approaches in different 2D and 3D PDAC cultures, we characterised RS -35d enantiomers, in terms of mode of action and single contributions. By differentially inhibiting both RAD51-BRCA2 interaction and sensor kinases ATM, ATR and DNA-PK, RS -35d enantiomers exhibit a ‘within-pathway synthetic lethality’ profile. To the best of our knowledge, this is the first reported proof-of-concept single small molecule capable of demonstrating this built-in synergism. In addition, RS -35d effect on BRCA2 -mutated, olaparib-resistant PDAC cells suggests that this compound may be effective as an anticancer agent possibly capable of overcoming PARPi resistance. Our results demonstrate the potential of synthetic lethality, with its diversified applications, to propose new and concrete opportunities to effectively kill cancer cells while limiting side effects and potentially overcoming emerging drug resistance.
... The structures of these eflornithine analogues were further prepared within UCSF Chimera to ensure proper formatting and energy minimization. Similarly, the crystal structure of ODC (PDB ID: 1NJJ) was retrieved from RCSB Protein Data Bank [51][52][53]. The structure of this enzyme was then imported into UCSF Chimera and pre-processed to remove water molecules and co-crystalized ligands, as well as to add hydrogen atoms and assign appropriate protonation states. ...
Article
Full-text available
Human African trypanosomiasis (HAT), a neglected tropical disease endemic to sub-Saharan Africa, faces treatment challenges due to drug-resistant Trypanosoma brucei strains and the need for safer and more effective therapeutic options. This study employed computational approaches, including 3D-similarity search, ADMET predictions, molecular docking simulation, and molecular dynamics (MD) simulation, to identify potential ornithine decarboxylase (ODC) inhibitors. Screening ChEMBL, DrugBank, and ZINC databases yielded seven eflornithine analogues as initial hit candidates. Structural optimization of these hit candidates led to the discovery of a novel compound with improved binding affinity, ADMET properties, safety profiles, and medicinal chemistry friendliness. MD simulation confirmed the stability of the protein–ligand complex involving ODC and the new compound. This study identifies the new compound as a promising candidate for the development of alternative ODC inhibitors for HAT treatment.
... Then, we use TrRosettaRNA, which is a deep learning-based RNA structure prediction method that predicts 3D RNA structure by leveraging evolutionary information and deep neural networks trained on RNA structural constraints [42]. We obtain the ground-truth 3D protein structure from the PDB database [43]. Using these structures, we conduct docking simulations with ZDOCK and HDOCKLite to evaluate the RNA-protein binding potential. ...
Preprint
Full-text available
Protein-RNA interactions are essential in gene regulation, splicing, RNA stability, and translation, making RNA a promising therapeutic agent for targeting proteins, including those considered undruggable. However, designing RNA sequences that selectively bind to proteins remains a significant challenge due to the vast sequence space and limitations of current experimental and computational methods. Traditional approaches rely on in vitro selection techniques or computational models that require post-generation optimization, restricting their applicability to well-characterized proteins. We introduce RNAtranslator, a generative language model that formulates protein-conditional RNA design as a sequence-to-sequence natural language translation problem for the first time. By learning a joint representation of RNA and protein interactions from large-scale datasets, RNAtranslator directly generates binding RNA sequences for any given protein target without the need for additional optimization. Our results demonstrate that RNAtranslator produces RNA sequences with natural-like properties, high novelty, and enhanced binding affinity compared to existing methods. This approach enables efficient RNA design for a wide range of proteins, paving the way for new RNA-based therapeutics and synthetic biology applications. The model and the code is released at github.com/ciceklab/RNAtranslator .
... Adding to the complexity is the inherent diversity of scientific data formats across various domains. Materials are typically represented using chemical formulas and periodic lattice structures [10], molecules are characterized by SMILES strings and three-dimensional conformations [11], and proteins are described by amino acid sequences and spatial folding patterns [12]. Each of these representational paradigms encodes unique, complex relationships between sequence and structure, demanding specialized approaches to accurately capture their intricacies. ...
Preprint
Full-text available
Unified generation of sequence and structure for scientific data (e.g., materials, molecules, proteins) is a critical task. Existing approaches primarily rely on either autoregressive sequence models or diffusion models, each offering distinct advantages and facing notable limitations. Autoregressive models, such as GPT, Llama, and Phi-4, have demonstrated remarkable success in natural language generation and have been extended to multimodal tasks (e.g., image, video, and audio) using advanced encoders like VQ-VAE to represent complex modalities as discrete sequences. However, their direct application to scientific domains is challenging due to the high precision requirements and the diverse nature of scientific data. On the other hand, diffusion models excel at generating high-dimensional scientific data, such as protein, molecule, and material structures, with remarkable accuracy. Yet, their inability to effectively model sequences limits their potential as general-purpose multimodal foundation models. To address these challenges, we propose UniGenX, a unified framework that combines autoregressive next-token prediction with conditional diffusion models. This integration leverages the strengths of autoregressive models to ease the training of conditional diffusion models, while diffusion-based generative heads enhance the precision of autoregressive predictions. We validate the effectiveness of UniGenX on material and small molecule generation tasks, achieving a significant leap in state-of-the-art performance for material crystal structure prediction and establishing new state-of-the-art results for small molecule structure prediction, de novo design, and conditional generation. Notably, UniGenX demonstrates significant improvements, especially in handling long sequences for complex structures, showcasing its efficacy as a versatile tool for scientific data generation.
... In terms of functionality, LEO constituents comprised hydrocarbons (30), alcohols (20), esters (3), ethers (3) and aldehydes (1). REO had similar functional groups but included ketones instead of aldehyde and additionally, a carboxylic acid, hydrocarbons (25), alcohols (19), ketones (6), ether (4), ester (4) and carboxylic acid (1). ...
Research
Full-text available
This study investigates the chemical composition and antifungal potential of Aframomum atewae’s essential oils, extracted from its leaves and rhizomes. Using GC-MS analysis, we identified a complex mixture dominated by monoterpene hydrocarbons—notably sabinene, β-pinene, and terpinen-4-ol. In vitro assays revealed that the rhizome oil exhibited significantly stronger antifungal activity against Candida albicans, Candida guilliermondii, and Saccharomyces cerevisiae compared to the leaf oil. Furthermore, molecular docking studies provided insights into how key oil constituents interact with fungal protein targets, suggesting that these compounds could serve as natural antifungal agents. Overall, the paper highlights the potential of Aframomum atewae as a promising source for biodegradable fungicides.
Article
Full-text available
Inhibition of pancreatic alpha-amylase and alpha-glucosidase is a common strategy to manage type 2 diabetes. This study focuses on the ability of compounds present in commercially available herbs and spices to inhibit pancreatic alpha-amylase and alpha-glucosidase. In silico molecular docking was performed to evaluate the binding affinity of the compounds present in herbs and spices. Molecular dynamics was performed with acarbose and rutin which had the best docking scores for pancreatic alpha-amylase and alpha-glucosidase. Six compounds (rutin, caffeic acid, p-coumaric acid, vanillin, ethyl gallate, and oxalic acid) with a range of docking scores were subjected to in vitro enzyme kinetic studies using pancreatic alpha-amylase and alpha-glucosidase biochemical assays. Acarbose, a prescribed alpha-amylase and alpha-glucosidase inhibitor, was used as a positive control. Ligands that interacted strongly with the amino acids at a particular site, were conformationally stable and had good docking scores. There was a correlation between the in silico and in vitro binding affinity. Caffeic acid, vanillin, ethyl gallate, and p-coumaric acid had inhibition constant (Ki) values that were not significantly different (p > 0.05) from the Ki of acarbose for pancreatic alpha-amylase. Rutin, caffeic acid, vanillin, and p-coumaric acid had Ki values that were not significantly different (p ˃ 0.05) from the Ki of acarbose for alpha-glucosidase. The cell viability of these compounds was assessed with the sulforhodamine B (SRB) assay in Caco2 cells. Caffeic acid, p-coumaric acid, rutin, and vanillin had Caco2 IC50 values that were not significantly different (p ˃ 0.05) from that of acarbose. The evaluated compounds present in herbs and spices can potentially reduce hyperglycemia associated with type 2 diabetes. Herbs and spices with high levels of these compounds were identified and these were common verbena, sweet basil, tarragon, pepper, parsley, sorrel, and vanilla. 
These herbs and spices may reduce the required dose of prescription drugs, such as acarbose, thereby reducing costs and drug-associated side effects.
Preprint
Full-text available
As AlphaFold achieves high-accuracy tertiary structure prediction for most single-chain proteins (monomers), the next 2 frontier in protein structure prediction lies in accurately modeling multi-chain protein complexes (multimers). We developed MULTICOM4, the latest version of the MULTICOM system, to improve protein complex structure prediction by integrating transformer-based AlphaFold2, diffusion model based AlphaFold3, and our in-house techniques. These include protein complex stoichiometry prediction, diverse multiple sequence alignment (MSA) generation leveraging both sequence and structure comparison, modeling exception handling, and deep learning-based model quality assessment. MULTICOM4 was blindly evaluated in the 16th community-wide Critical Assessment of Techniques for Protein Structure Prediction (CASP16) in 2024. In Phase 0 of CASP16, where stoichiometry information was unavailable, MULTICOM predictors performed best, with MULTICOM_human achieving a TM-score of 0.752 and a DockQ score of 0.584 for top-ranked predictions on average. In Phase 1 of CASP16, with stoichiometry information provided, MULTICOM_human remained among the top predictors, attaining a TM-score of 0.797 and a DockQ score of 0.558 on average. The CASP16 results demonstrate that integrating complementary AlphaFold2 and 3 with enhanced MSA inputs, comprehensive model ranking, exception handling, and accurate stoichiometry prediction can effectively improve protein complex structure prediction.
Preprint
Large language models have made remarkable progress in the field of molecular science, particularly in understanding and generating functional small molecules. This success is largely attributed to the effectiveness of molecular tokenization strategies. In protein science, the amino acid sequence serves as the sole tokenizer for LLMs. However, many fundamental challenges in protein science are inherently structure-dependent. The absence of structure-aware tokens significantly limits the capabilities of LLMs for comprehensive biomolecular comprehension and multimodal generation. To address these challenges, we introduce a novel framework, ProTeX, which tokenizes the protein sequences, structures, and textual information into a unified discrete space. This innovative approach enables joint training of the LLM exclusively through the Next-Token Prediction paradigm, facilitating multimodal protein reasoning and generation. ProTeX enables general LLMs to perceive and process protein structures through sequential text input, leverage structural information as intermediate reasoning components, and generate or manipulate structures via sequential text output. Experiments demonstrate that our model achieves significant improvements in protein function prediction, outperforming the state-of-the-art domain expert model with a twofold increase in accuracy. Our framework enables high-quality conformational generation and customizable protein design. For the first time, we demonstrate that by adopting the standard training and inference pipelines from the LLM domain, ProTeX empowers decoder-only LLMs to effectively address diverse spectrum of protein-related tasks.
Preprint
Full-text available
AlphaFold 2 has revolutionized protein structure prediction, yet systematic evaluations of its performance against experimental structures for specific protein families remain limited. Here we present the first comprehensive analysis comparing AlphaFold 2-predicted and experimental nuclear receptor structures, examining root-mean-square deviations, secondary structure elements, domain organization, and ligand-binding pocket geometry. While AlphaFold2 achieves high accuracy in predicting stable conformations with proper stereochemistry, it shows limitations in capturing the full spectrum of biologically relevant states, particularly in flexible regions and ligand-binding pockets. Statistical analysis reveals significant domain-specific variations, with ligand-binding domains showing higher structural variability (CV = 29.3\%) compared to DNA-binding domains (CV = 17.7\%). Notably, Alphafold 2 systematically underestimates ligand-binding pocket volumes and captures only single conformational states in homodimeric receptors where experimental structures show functionally important asymmetry. These findings provide critical insights for structure-based drug design targeting nuclear receptors and establish a framework for evaluating Alphafold 2 predictions across other protein families.
Article
Full-text available
pdb2cif is a new version of an awk script originally written by P. E. Bourne in 1993 to translate from the 1992 Protein Data Bank (PDB) format to the then-emerging macromolecular Crystallographic Information File (mmCIF) definition. This new version of pdb2cif translates from all current PDB formats, including the 1992 PDB format and the 1996 PDB Atomic Coordinate Entry Format, Version 2.0, to the 1997 mmCIF format as defined in the mmCIF dictionary 1.0.00. The program is provided as an m4 script from which both perl and awk versions can be produced. The program identifies mmCIF entities implicitly by sequence homology among PDB SEQRES records. With minor additions to the dictionary, the resultant mmCIF data-sets are substantially compliant with the mmCIF 1.0.00 dictionary.
Article
Full-text available
The topology of a protein structure is a highly simplified description of its fold including only the sequence of secondary structure elements, and their relative spatial positions and approximate orientations. This information can be embodied in a two-dimensional diagram of protein topology, called a TOPS cartoon. These cartoons are useful for the understanding of particular folds and making comparisons between folds. Here we describe a new algorithm for the production of TOPS cartoons, which is more robust than those previously available, and has a much higher success rate. This algorithm has been used to produce a database of protein topology cartoons that covers most of the data bank of known protein structures.
Article
Full-text available
We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.
Article
For a successful analysis of the relation between amino acid sequence and protein structure, an unambiguous and physically meaningful definition of secondary structure is essential. We have developed a set of simple and physically motivated criteria for secondary structure, programmed as a pattern-recognition process of hydrogen-bonded and geometrical features extracted from x-ray coordinates. Cooperative secondary structure is recognized as repeats of the elementary hydrogen-bonding patterns “turn” and “bridge.” Repeating turns are “helices,” repeating bridges are “ladders,” connected ladders are “sheets.” Geometric structure is defined in terms of the concepts torsion and curvature of differential geometry. Local chain “chirality” is the torsional handedness of four consecutive Cα positions and is positive for right-handed helices and negative for ideal twisted β-sheets. Curved pieces are defined as “bends.” Solvent “exposure” is given as the number of water molecules in possible contact with a residue. The end result is a compilation of the primary structure, including SS bonds, secondary structure, and solvent exposure of 62 different globular proteins. The presentation is in linear form: strip graphs for an overall view and strip tables for the details of each of 10.925 residues. The dictionary is also available in computer-readable form for protein structure prediction work.
Article
GRASS (Graphical Representation and Analysis of Structures Server), a new web-based server, is described. GRASS exploits many of the features of the GRASP program and is designed to provide interactive molecular graphics and quantitative analysis tools with a simple interface over the World-Wide Web. Using GRASS, it is now possible to view many surface features of biological macromolecules on either standard workstations used in macromolecular analysis or personal computers. The result is a World-Wide Web-based, platform-independent, easily used tool for macromolecular visualization and structure analysis.
Article
The Protein Data Bank is a computer-based archival file for macromolecular structures. The Bank stores in a uniform format atomic co-ordinates and partial bond connectivities, as derived from crystallographic studies. Text included in each data entry gives pertinent information for the structure at hand (e.g. species from which the molecule has been obtained, resolution of diffraction data, literature citations and specifications of secondary structure). In addition to atomic co-ordinates and connectivities, the Protein Data Bank stores structure factors and phases, although these latter data are not placed in any uniform format. Input of data to the Bank and general maintenance functions are carried out at Brookhaven National Laboratory. All data stored in the Bank are available on magnetic tape for public distribution, from Brookhaven (to laboratories in the Americans), Tokyo (Japan), and Cambridge (Europe and worldwide). A master file is maintained at Brookhaven and duplicate copies are stored in Cambridge and Tokyo. In the future, it is hoped to expand the scope of the Protein Data Bank to make available co-ordinates for standard structural types (e.g. α-helix, RNA double-stranded helix) and representative computer programs of utility in the study and interpretation of macromolecular structures.
Article
To facilitate understanding of, and access to, the information available for protein structures, we have constructed the Structural Classification of Proteins (scop) database. This database provides a detailed and comprehensive description of the structural and evolutionary relationships of the proteins of known structure. It also provides for each entry Links to co-ordinates, images of the structure, interactive viewers, sequence data and literature references. Two search facilities are available. The homology search permits users to enter a sequence and obtain a list of any structures to which it has significant levels of sequence similarity The key word search finds, for a word entered by the user, matches from both the text of the scop database and the headers of Brookhaven Protein Databank structure files. The database is freely accessible on World Wide Web (WWW) with an entry point to URL http://scop.mrc-lmb.cam.ac.uk/scop/ scop: an old English poet or minstrel (Oxford English Dictionary); ckon: pile, accumulation (Russian Dictionary).
Article
A crystallization database, the Biological Macromolecule Crystallization Database, containing crystal data and the crystallization conditions for more than 1000 crystal forms of over 600 biological macromolecules, has been compiled from the scientific literature. Data for proteins, protein: protein complexes, nucleic acids, nucleic-acid: nucleic-acid complexes, protein: nucleic-acid complexes and viruses have been included. The general information catalogued for each macromolecule. The crystal data molecular weight, the subunit composition, the presence of prosthetic group(s), and the source of the macromolecule. The crystal data include the unit cell parameters, space group, crystal density, crystal habit and size, and diffraction limit and lifetime. The crystallization data consist of the crystallization method, chemical additions to the crystal growth medium, macromolecule concentration, temperature, pH, and growth time. A result of the compilation of the crystallization data was the development of a general strategy for the crystallization of soluble proteins. The strategy employs vapor diffusion experiments with the most frequently used crystallization agents and microdialysis against low ionic strength to maximize the possiblity of obtaining crystals. A detailed outline of this strategy is presented.
Article
We believe the need exists for an organized and accessible repository of protein nuclear magnetic resonance (NMR) data. The structure and dynamics of hundreds of proteins are being investigated by NMR. NMR data currently are spread throughout the world and often are not published for lack of journal space. Difficulties in locating, obtaining, and correlating these data with protein structures limit their usefulness to the scientific community. In time, the data may become lost or ignored. To provide a collection point for the results of protein NMR studies and a uniform means of distributing these data, we propose that a data bank be created consisting of two databases: a comprehensive and thoroughly indexed database for the NMR literature and a data repository for the storage of extensive protein NMR data sets. Our current specifications for the types of information to be stored in the NMR databases and their organization for dissemination are defined. The design is intended to be flexible, capable of expanding to include new techniques, new forms of data, and biopolymers other than proteins. This is a proposal to the community of NMR spectroscopists. Only through the active cooperation and support of those in the field of NMR spectroscopy can the proposed data bank succeed.