Nucleic Acids Research, 1993, Vol. 21, No. 13 3093-3096
The SWISS-PROT protein sequence data bank, recent
Amos Bairoch and Brigitte Boeckmann¹
Department of Medical Biochemistry, University of Geneva, 1 rue Michel Servet, 1211 Geneva 4,
Switzerland and ¹European Molecular Biology Laboratory, Heidelberg, Germany
SWISS-PROT is an annotated protein sequence database
established in 1986 and maintained collaboratively, since 1988,
by the Department of Medical Biochemistry of the University
of Geneva and the EMBL Data Library. The SWISS-PROT
protein sequence data bank consists of sequence entries. Sequence
entries are composed of different line types, each with its own
format. For standardization purposes the format of SWISS-PROT
follows as closely as possible that of the EMBL Nucleotide
Sequence Database. A sample SWISS-PROT entry is shown in
Figure 1.
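The idea of an entry built from typed lines can be illustrated with a short Python sketch. It assumes, as in the EMBL-style layout, that each line starts with a two-character line code, that the data begin at column 6, and that an entry ends with a '//' line. The sample entry, the accession number and the function name group_by_line_type are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical, abbreviated entry used only for illustration; it is not a
# real SWISS-PROT record.
SAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN   STANDARD;   PRT;   12 AA.
AC   P00000;
DE   HYPOTHETICAL EXAMPLE PROTEIN.
OS   HOMO SAPIENS (HUMAN).
SQ   SEQUENCE   12 AA;
     MTESMIEDVE LA
//
"""


def group_by_line_type(entry_text):
    """Map each two-character line code to the list of its line contents."""
    groups = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):  # assumed entry terminator
            break
        # Sequence data lines carry a blank line code; label them "SEQ" here.
        code = line[:2].strip() or "SEQ"
        groups[code].append(line[5:].rstrip())
    return dict(groups)


groups = group_by_line_type(SAMPLE_ENTRY)
for code, contents in groups.items():
    print(code, contents)
```

Grouping by line code first keeps the format-specific parsing of each line type separate from the reading of the entry as a whole.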
The SWISS-PROT database distinguishes itself from other
protein sequence databases by three distinct criteria:
In SWISS-PROT, as in most other sequence databases, two
classes of data can be distinguished: the core data and the
annotation. For each sequence entry the core data consists of the
sequence data, the citation information (bibliographical
references) and the taxonomic data (description of the biological
source of the protein) while the annotation consists of the
description of the following items:
Function(s) of the protein
Domains and sites
Similarities to other proteins
Disease(s) associated with deficiencie(s) in the protein
Sequence conflicts, variants, etc.
We try to include as much annotation information as possible
in SWISS-PROT. To obtain this information we use, in addition
to the publications that report new sequence data, review articles
to periodically update the annotations of families or groups of
proteins. We also make use of external experts, who have been
recruited to send us their comments and updates concerning
specific groups of proteins.
Many sequence databases contain, for a given protein sequence,
separate entries which correspond to different literature reports.
In SWISS-PROT we try as much as possible to merge all these
data so as to minimize the redundancy of the database. If conflicts
exist between various sequencing reports, they are indicated in
the feature table of the corresponding entry.
Integration with other databases
It is important to provide the users of biomolecular databases
with a degree of integration between the three types of
sequence-related databases (nucleic acid sequences, protein sequences and
protein tertiary structures) as well as with specialized data
collections. SWISS-PROT is currently cross-referenced with
twelve different databases. Cross-references are provided in the
form of pointers to information related to SWISS-PROT entries
and found in data collections other than SWISS-PROT. For
example, the sample sequence shown in Figure 1 contains DR
(Data bank Reference) lines that point to EMBL, PIR, PDB,
OMIM, and PROSITE. In this particular example it is therefore
possible to retrieve the nucleic acid sequence(s) that encode
that protein (EMBL), the tertiary structure
coordinates (PDB), the description of genetic disease(s) associated
with that protein (OMIM), or the pattern specific for that family
of proteins (PROSITE).
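As an illustration of how such pointers could be followed by a program, the Python sketch below collects the DR lines of an entry into a table keyed by database name. It assumes that a DR line holds semicolon-separated fields (database name, primary identifier, optional secondary identifier) and ends with a period. The identifiers in the sample and the function name parse_dr_lines are hypothetical.

```python
def parse_dr_lines(entry_lines):
    """Return a dict mapping database name -> list of (primary, secondary) ids."""
    xrefs = {}
    for line in entry_lines:
        if not line.startswith("DR "):
            continue
        # Drop the line code and the trailing period, then split the fields.
        fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
        if len(fields) < 2:
            continue  # not enough fields to be a usable cross-reference
        database, primary = fields[0], fields[1]
        secondary = fields[2] if len(fields) > 2 else None
        xrefs.setdefault(database, []).append((primary, secondary))
    return xrefs


# Hypothetical identifiers, for illustration only.
SAMPLE_LINES = [
    "DE   HYPOTHETICAL EXAMPLE PROTEIN.",
    "DR   EMBL; X00000; HSEXAMPLE.",
    "DR   EMBL; M00000; HSEXAMPLEB.",
    "DR   PROSITE; PS00000; EXAMPLE_PATTERN.",
]

xrefs = parse_dr_lines(SAMPLE_LINES)
print(xrefs)
```

A table of this kind makes it straightforward to ask, for a given entry, which identifiers to request from each of the linked data collections.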
Integration of information from 2D gel databases
Enormous progress has been made in two-dimensional (2D) gel
techniques in the last few years. One of the consequences of this
evolution has been the development of databases that contain
master gels from a variety of mammalian tissues or from bacterial
sources. These databases will play an increasingly important role
in the analysis of genomes and of molecular diseases. 2D gel
databases generally contain one or more master images of the
gels that correspond to the tissue or organism studied; spots on
these images are attributed an identification code and a variable
percentage of these spots are linked to known proteins. The
identification of a protein on a 2D gel is generally carried out
using antibodies or by microsequencing. Microsequencing of 2D
gel spots also produces partial sequences and physico-chemical
data for a number of yet uncharacterized proteins.
We have started a collaboration with a number of groups developing 2D gel
databases. Since last year, cross-references to the gene-protein
database of Escherichia coli K-12 (ECO2DBASE) have been
available and, symmetrically, that database now contains
cross-references to SWISS-PROT. As a second step we have expanded
our links to 2D gel databases by integrating data from the
following two databases:
The Human 2D gel protein database of the Faculty of Medicine
of the University of Geneva (known as SWISS-2DPAGE).
© 1993 Oxford University Press
Figure 1. A sample entry from SWISS-PROT (human tumor necrosis factor precursor).
SWISS-2DPAGE currently contains data concerning plasma
and liver proteins, but will soon include additional
The Human keratinocyte 2D gel protein database from the
universities of Aarhus and Ghent.
For both of the above databases we provide:
corresponding to known or unknown microsequenced
b) We have created new entries for microsequences that
correspond to novel, yet unidentified, proteins.
c) In some cases we have entered the extent of the
microsequences for already known proteins. This was done
for proteins which are not yet well characterized. The
availability of such microsequences makes it possible, for
example, to confirm the position of a signal sequence cleavage
site or the correctness of a translated genomic sequence.
In the near future the collaboration with the group of Denis
Hochstrasser which produces the SWISS-2DPAGE database will
be expanded in the following directions:
a) The MELANIE software package, which is a complete
system for the analysis of 2D gels and which is developed
by the group of Hochstrasser, will allow its users to navigate
back and forth between SWISS-2DPAGE and SWISS-PROT.
b) A file server will be set up that will allow anyone with a
network connection to obtain annotated graphic files
containing the region of the gels that correspond to a
selected SWISS-PROT entry linked to SWISS-2DPAGE.
Integration of secondary and tertiary structure data
Thanks to recent advances in experimental techniques there has
been a significant increase in the number of protein sequences
that have been characterized at the level of their tertiary structure
either by X-ray crystallography or by NMR-based methods. A
particular effort has been made to provide access to this category
of information from inside SWISS-PROT. This effort is
reflected in the following features:
a) Thanks to a collaboration with the group of Chris Sander
at EMBL, the feature table of sequence entries of proteins
whose tertiary structure is known experimentally contains
the secondary structure information corresponding to that
protein. The secondary structure assignment is made
according to the Dictionary of Secondary Structure of
Proteins (DSSP) and the information is extracted from
the coordinate data sets of the Protein Data Bank (PDB).
In the feature table three types of secondary structure
are specified: helices (key 'HELIX'), beta-strands (key
'STRAND') and turns (key 'TURN'). Residues not
specified in one of these classes are in a 'loop' or 'random-coil'
structure.
b) Cross-references are available to entries in both sections
of the PDB database (annotated and preliminary). In
addition the protein sequence entries that are linked to PDB
contain the keyword '3D-STRUCTURE'.
c) We try to include in SWISS-PROT as many bibliographical
references as possible to papers dealing with structural data
that originate from X-ray crystallography or NMR studies.
These references are marked by RP lines such as those
shown below:
RP X-RAY CRYSTALLOGRAPHY (n.n ANGSTROMS).
RP STRUCTURE BY NMR.
RP 3D-STRUCTURE MODELLING.
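The secondary structure information described in a) lends itself to a simple per-residue representation. The Python sketch below reads feature table lines with the keys 'HELIX', 'STRAND' and 'TURN' and marks every residue not covered by one of them as loop. It assumes that each FT line gives the key followed by a 1-based, inclusive start and end position separated by white space. The one-letter symbols (H, E, T, -), the sample lines and the function name secondary_structure_string are choices made for this illustration.

```python
def secondary_structure_string(ft_lines, length):
    """Build a per-residue string: H (helix), E (strand), T (turn), - (loop)."""
    symbols = {"HELIX": "H", "STRAND": "E", "TURN": "T"}
    states = ["-"] * length  # residues not assigned below stay 'loop'
    for line in ft_lines:
        parts = line.split()
        if len(parts) < 4 or parts[0] != "FT" or parts[1] not in symbols:
            continue  # skip other line types and other feature keys
        start, end = int(parts[2]), int(parts[3])
        for pos in range(start, end + 1):  # positions are 1-based, inclusive
            if 1 <= pos <= length:
                states[pos - 1] = symbols[parts[1]]
    return "".join(states)


# Hypothetical feature table lines for a 15-residue sequence.
SAMPLE_FT = [
    "FT   HELIX         3      6",
    "FT   STRAND        9     11",
    "FT   TURN         12     13",
    "FT   VARIANT       5      5       A -> V.",
]

structure = secondary_structure_string(SAMPLE_FT, 15)
print(structure)
```

The 'VARIANT' line in the sample is ignored, which shows that the same feature table can carry several kinds of annotation side by side.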
Human genetic diseases
An increasing number of human genetic diseases are being
characterized at the molecular level. We have integrated
information concerning these diseases in SWISS-PROT. In
particular we provide:
a) Cross-references to OMIM, the on-line version of the book
'Mendelian Inheritance in Man'. This database
provides a wealth of data on mapped and sequenced human
genes including a full description of the phenotype of known
Mendelian disorders as well as information relative to
known allelic variants. Currently there are more than 1700
human protein sequence entries in SWISS-PROT which are
cross-referenced to OMIM. A document file
(MIMTOSP.TXT) is distributed with SWISS-PROT that
lists these entries and their corresponding OMIM
entries.
b) When a human protein is known to be involved in a genetic
disorder a brief description of that disease is available in
the comments section (CC lines) of that entry. As shown
in the example below the 'DISEASE' topic is used for such
a purpose:
CC   -!- DISEASE: DEFECTS IN SOD1 ARE THE CAUSE OF FAMILIAL AMYOTROPHIC
CC       LATERAL SCLEROSIS (FALS), A DEGENERATIVE DISORDER OF MOTOR
CC       NEURONS IN THE CORTEX, BRAINSTEM AND SPINAL CORD.
c) Point mutations that affect a single amino acid and which
are linked with the occurrence of a disease are indicated
in the feature table (FT lines) of the relevant entry. As
shown in the example below the 'VARIANT' key is used
for such a purpose:
D -> G (ALABAMA; MODERATE
Q -> P (NEW LONDON; SEVERE
C -> R (BASEL; SEVERE
D -> N (OXFORD-D1; SEVERE
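The disease annotation described in b) can be extracted mechanically from the comment lines. The Python sketch below gathers the CC lines of an entry into (topic, text) pairs, assuming that each comment block opens with '-!-' followed by a topic name and a colon, and that the following CC lines without '-!-' continue the same block. The sample comments and the function name parse_cc_topics are hypothetical.

```python
def parse_cc_topics(entry_lines):
    """Return a list of (topic, text) pairs from the CC lines of an entry."""
    topics = []
    for line in entry_lines:
        if not line.startswith("CC"):
            continue
        body = line[2:].strip()
        if body.startswith("-!-"):
            # A new comment block: split the topic name from its text.
            topic, _, text = body[3:].partition(":")
            topics.append([topic.strip(), text.strip()])
        elif topics and body:
            # A continuation line: append it to the current block.
            topics[-1][1] += " " + body
    return [(topic, text) for topic, text in topics]


# Hypothetical comment lines, for illustration only.
SAMPLE_CC = [
    "CC   -!- FUNCTION: HYPOTHETICAL EXAMPLE ENZYME.",
    "CC   -!- DISEASE: DEFECTS IN THIS HYPOTHETICAL PROTEIN CAUSE AN",
    "CC       EXAMPLE DISORDER.",
]

topics = parse_cc_topics(SAMPLE_CC)
diseases = [text for topic, text in topics if topic == "DISEASE"]
print(diseases)
```

Selecting on the topic name is what allows all entries with a 'DISEASE' comment to be retrieved without reading the free text itself.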
Escherichia coli as a model organism
Thanks to a very fruitful collaboration with Ken Rudd of the
National Center for Biotechnology Information (NCBI), protein
sequences that originate from the chromosome of Escherichia
coli K12 are considered to be a paradigm for what we want to
achieve in terms of the completeness and quality of the data in
SWISS-PROT. The hallmarks of this undertaking are listed
below:
a) These entries are cross-referenced to the EcoGene section
of the EcoSeq/EcoMap integrated Escherichia coli database
and also, as described in subsection 2a above, to the
gene-protein 2D gel database of Escherichia coli K-12.
b) New Escherichia coli sequences are entered and annotated
on a weekly basis and are immediately made available to
the scientific community.
c) Existing Escherichia coli sequence entries are constantly
updated to add data concerning their functions, to resolve
sequence conflicts, to add references and comments, to
update gene designations, etc.
d) We have implemented a
nomenclature for unnamed Escherichia coli hypothetical
proteins and proteins of unknown function. They are
assigned gene names based upon their position on the
genomic physical map. They all begin with the letter 'Y'.
The next two letters designate which 1/100th of the map
(starting at the thr locus) contains the ORF, in the order Yaa,
Yab,..Yaj, Yba, Ybb,..Ybj,..., Yja,..Yjj. ORFs within
any one of these 100 intervals are given a fourth letter (a-z)
that serves to distinguish them but is not meant to convey
e) We provide a document file (ECOLI.TXT) that specifically
lists all the E.coli K12 chromosomal sequence entries in
SWISS-PROT along with their primary and synonymous
the EcoGene gene name
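The naming scheme of item d) can be written down as a small Python sketch. It assumes that the 100 map intervals are numbered 0 to 99, with interval 0 starting at the thr locus, so that the tens and units digits of the interval number select the second and third letters from 'a' to 'j'. How the fourth letter is assigned within an interval is not specified beyond its range (a-z), so the index used here is purely hypothetical, as are the function names.

```python
import string


def orf_gene_prefix(interval):
    """Three-letter prefix for a map interval numbered 0..99 (assumed numbering)."""
    if not 0 <= interval <= 99:
        raise ValueError("interval must be between 0 and 99")
    tens, units = divmod(interval, 10)
    letters = string.ascii_lowercase[:10]  # 'a'..'j'
    return "Y" + letters[tens] + letters[units]


def orf_gene_name(interval, index_in_interval):
    """Four-letter name; the fourth letter (a-z) only distinguishes ORFs."""
    if not 0 <= index_in_interval < 26:
        raise ValueError("at most 26 ORFs can be named per interval")
    return orf_gene_prefix(interval) + string.ascii_lowercase[index_in_interval]


for interval in (0, 1, 9, 10, 99):
    print(interval, orf_gene_prefix(interval))
```

With this numbering the prefixes run Yaa, Yab, ..., Yaj, Yba, ..., Yjj, in the order given in the text.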
Content of the current release
Release 25.0 of SWISS-PROT (April 1993) contains 29,955
sequence entries, comprising 10,214,020 amino acids abstracted
from 29,176 references. The data file (sequences and annotations)
requires 52 Mb of disk storage space. The database is distributed
with 17 documentation and index files (user's manual, release
notes, list of organisms, citation index, keyword index, etc.) that
require about 14 Mb of disk space.
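Two derived figures follow directly from the numbers quoted above and may help to put the release in perspective. The short Python calculation below uses only those numbers; the variable names are chosen for this illustration.

```python
# Figures quoted for release 25.0 (April 1993).
entries = 29_955
amino_acids = 10_214_020
data_mb = 52   # sequences and annotations
docs_mb = 14   # documentation and index files (approximate)

average_length = amino_acids / entries
total_mb = data_mb + docs_mb

print(f"average sequence length: {average_length:.1f} residues")
print(f"total disk space: about {total_mb} Mb")
```

The average entry is thus about 341 residues long, and a complete installation needs about 66 Mb of disk space.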
How to obtain SWISS-PROT
SWISS-PROT is distributed on magnetic tape and on CD-ROM
by the EMBL Data Library. The CD-ROM contains both SWISS-PROT
and the EMBL Nucleotide Sequence Database as well as
other data collections and some database query and retrieval
software for MS-DOS and Apple Macintosh computers. For all
enquiries regarding the subscription and distribution of SWISS-PROT
one should contact:
EMBL Data Library
European Molecular Biology Laboratory
Postfach 10.2209, Meyerhofstrasse 1
6900 Heidelberg, Germany
Telephone:(+49 6221) 387 258
Telefax: (+49 6221) 387 519 or 387 306
Electronic network address: datalib@EMBL-heidelberg.de
Individual sequence entries can be obtained from the EMBL File
Server. Detailed instructions on how to make the best use
of this service, and in particular on how to obtain protein
sequences, can be obtained by sending to the network address
netserv@EMBL-heidelberg.de the following message:
If you have access to a computer system linked to the Internet
you can obtain SWISS-PROT using FTP (File Transfer Protocol),
from the following file servers:
EMBL anonymous FTP server
NCBI Repository (National Library of Medicine, NIH,
Washington D.C., U.S.A.)
Internet address: ncbi.nlm.nih.gov (22.214.171.124)
Basel Biozentrum Biocomputing server (EMBnet SWISS node)
Internet address: bioftp.unibas.ch (or 126.96.36.199)
ExPASy (Expert Protein Analysis System server, University
of Geneva, Switzerland)
Internet address: expasy.hcuge.ch (188.8.131.52)
National Institute of Genetics (Japan) FTP server
Internet address: ftp.nig.ac.jp (184.108.40.206)
You can also obtain SWISS-PROT entries using various
Internet Gopher servers that specialize in biosciences (biogophers).
Gopher is a distributed document delivery service that
allows a neophyte user to access various types of data residing
on multiple hosts in a seamless fashion.
No restrictions are placed on use or redistribution of the data.
The present distribution frequency is four releases per year.
Weekly updates are also available by anonymous FTP.
Three files are updated every week:
Contains all the new entries since the last full release.
Contains the entries for which the sequence data has been updated since
the last release.
Contains the entries for which one or more annotation fields have been
updated since the last release.
These files are available on the EMBL, NCBI, EMBnet Swiss
node and ExPASy servers, whose Internet addresses are listed
above.
References
1. Bairoch A., Boeckmann B. Nucleic Acids Res. 20:2019-2022(1992).
2. Higgins D.G., Fuchs R., Stoehr P.J., Cameron G.N. Nucleic Acids Res.
3. Bairoch A. SWISS-PROT protein sequence data bank user manual, Release
25 of April 1993.
4. VanBogelen R.A., Sankar P., Clark R.L., Bogan J.A., Neidhardt F.C.
5. Hughes G.J., Frutiger S., Paquet N., Ravier F., Pasquali C., Sanchez J.-C.,
James R., Tissot J.-D., Bjellqvist B., Hochstrasser D.F. Electrophoresis
6. Hochstrasser D.F., Frutiger S., Paquet N., Bairoch A., Ravier F., Pasquali
C., Sanchez J.-C., Tissot J.-D., Bjellqvist B., Vargas R., Appel R.D., Hughes
G.J. Electrophoresis 13:992-1001(1992).
7. Celis J.E., Rasmussen H.H., Madsen P., Leffers H., Honore B., Dejgaard
K., Gesser B., Olsen E., Gromov P., Hoffmann H.J., Nielsen M., Celis
A., Basse B., Lauridsen J.B., Ratz G.P., Nielsen H., Andersen A.H.,
Walbum E., Kjaergaard I., Puype M., Van Damme J., Vandekerckhove
J. Electrophoresis 13:893-959(1992).
8. Appel R., Hochstrasser D.F., Funk M., Vargas J.R., Pellegrini C., Muller
A.F., Scherrer J.-R. Electrophoresis 12:722-735(1991).
9. Kabsch W., Sander C. Biopolymers 22:2577-2637(1983).
10. Koetzle T. CODATA Bulletin 23:83-84(1991).
11. McKusick V.A. Mendelian Inheritance in Man. Catalogs of autosomal
dominant, autosomal recessive, and X-linked phenotypes; Tenth edition; Johns
Hopkins University Press, Baltimore, (1991).
12. Rudd K.E., Miller W., Werner C., Ostell J., Tolstoshev C., Satterfield S.G.
Nucleic Acids Res. 19:637-647(1991).
13. Stoehr P.J., Omond R.A. Nucleic Acids Res. 17:6763-6764(1989).
14. Gilbert D. Trends Biochem. Sci. 18:107-108(1993).