Diversity of Retrotransposable Elements
Cytogenet Genome Res 110:462–467 (2005)
Repbase Update, a database of eukaryotic repetitive elements
J. Jurka, V.V. Kapitonov, A. Pavlicek, P. Klonowski, O. Kohany,
Genetic Information Research Institute, Mountain View, CA (USA)
Manuscript received 16 October 2003; accepted in revised form for publication by J.-N. Volff 6 April 2004.
This work was supported by a grant from the National Institutes of Health (2P41 LM
Request reprints from: Dr. Jerzy Jurka, Genetic Information Research Institute
2081 Landings Drive, Mountain View, CA 94043 (USA)
telephone: +1-650-961-4480; fax: +1-650-961-4473; e-mail: firstname.lastname@example.org
© 2005 S. Karger AG, Basel
Abstract. Repbase Update is a comprehensive database of
repetitive elements from diverse eukaryotic organisms. Cur-
rently, it contains over 3600 annotated sequences representing
different families and subfamilies of repeats, many of which are
unreported anywhere else. Each sequence is accompanied by a
short description and references to the original contributors.
Repbase Update includes Repbase Reports, an electronic jour-
nal publishing newly discovered transposable elements, and the
Transposon Pub, a web-based browser of selected chromosom-
al maps of transposable elements. Sequences from Repbase
Update are used to screen and annotate repetitive elements
using programs such as Censor and RepeatMasker. Repbase
Update is available on the worldwide web at http://www.gir-
A substantial portion of eukaryotic genomes is composed of
multiple DNA copies referred to as “repetitive DNA”, which
can be divided into two major groups. The first group is com-
posed of tandem repeats generated primarily by the host repli-
cation machinery or recombination processes. This group in-
cludes microsatellites, minisatellites, and satellites (Jurka et al.,
2003). The second group includes interspersed repeats derived
from so-called “selfish elements” such as retroelements and
DNA transposons, which are collectively known as transposable
elements (TEs) (Jurka et al., 2003).
Generation of repetitive DNA in eukaryotic genomes is an
ongoing process that is probably as old as eukaryotes them-
selves (Jurka, 1998). Most repetitive DNA is lost from the
genome over time or is mutated beyond recognition. Yet, for
example, over 40% of the human genome is still composed of
recognizable interspersed repeats (Lander et al., 2001), some of
which are over 200 million years old (Kapitonov and Jurka,
Analysis of repetitive DNA has become an integral part of
eukaryotic sequence studies. The essential step in analyzing
repeats is identifying and classifying them. It is primarily done
by sequence comparison to well-characterized reference se-
quences. The first reference collection of repetitive elements
contained 53 representative sequences from the human ge-
nome, and after addition of other mammalian repeats became
known as “Repbase” (Jurka et al., 1992). In 1997 an expanded
version of Repbase was posted on the worldwide web, and it has
since been developed and updated monthly under
its current name “Repbase Update” (RU) (Jurka, 2000). RU is
widely used to annotate repetitive DNA in newly sequenced
eukaryotic genomes. Consequently, the repeat classification
based on RU is used in many other databases including the
human genome annotations at the UCSC genome database
(Karolchik et al., 2003), Ensembl annotation (Clamp et al.,
2003), and in secondary databases of repetitive elements such
as the human endogenous retrovirus database HERVd (Paces
et al., 2004).
The database description
Repbase Update is a database of representative eukaryotic
repetitive sequences and other biologically relevant informa-
tion derived from printed journal articles, electronic journals,
and public databases including a selected subset of TEs from a
database for Triticeae repetitive sequences (Wicker et al.,
2002). RU is not a simple compilation of known repetitive
sequences. Numerous entries were reconstructed from scattered,
unclassified repeats, which were often mutated beyond
recognition by standard screening routines. In many instances
only construction of quality consensus sequences permitted
their meaningful classification and made them useful as refer-
ence data (e.g. Kapitonov and Jurka, 2001). The reconstruction
process has been described elsewhere (Jurka, 1994, 1998; Kapi-
tonov and Jurka, 2003). It strongly depends on the number of
complete repeat copies from the same family/subfamily used
for the reconstruction, as well as on their divergence. Therefore,
the quality of different consensus sequences is determined by
the availability of raw sequence data.
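To illustrate why the number and divergence of copies matter, the following is a minimal sketch of a majority-rule consensus built from copies that are assumed to be already aligned to equal length. It is not the reconstruction procedure used for RU, which is described in the cited papers; the function name, the gap symbol and the `min_fraction` threshold are illustrative assumptions.

```python
from collections import Counter


def majority_consensus(aligned_copies, gap="-", min_fraction=0.5):
    """Majority-rule consensus of equal-length, pre-aligned repeat copies.

    A column contributes its most common symbol only if that symbol is
    found in more than `min_fraction` of the copies; otherwise "N" is
    emitted. Columns where a gap is the most common symbol are skipped.
    """
    if not aligned_copies:
        raise ValueError("need at least one aligned copy")
    length = len(aligned_copies[0])
    if any(len(copy) != length for copy in aligned_copies):
        raise ValueError("all aligned copies must have the same length")

    consensus = []
    for column in zip(*aligned_copies):
        counts = Counter(symbol.upper() for symbol in column)
        symbol, count = counts.most_common(1)[0]
        if symbol == gap:
            continue  # position absent from most copies
        supported = count / len(column) > min_fraction
        consensus.append(symbol if supported else "N")
    return "".join(consensus)


# Four hypothetical copies of one family, each carrying a private mutation.
copies = [
    "ACGTTGCA-T",
    "ACGATGCA-T",
    "ACGTTGGACT",
    "ACCTTGCA-T",
]
print(majority_consensus(copies))  # ACGTTGCAT
```

With only two divergent copies, a mismatched column cannot be resolved and is reported as N, which is the sense in which consensus quality is limited by the available raw data.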
Until 2001 RU also served as an electronic journal for the rapid
publication of most of the new TEs discovered during development
of the database (Jurka, 2000). Since then, the publication
of all original data was transferred to Repbase Reports, the
accompanying electronic journal described below. Some of the
original data published in RU by several contributors prior to
2001 have been reviewed in several articles (Smit, 1999; Jurka
et al., 2003; Mager and Medstrand, 2003; Kapitonov et al.,
As of March 2004, RU contained over 3600 sequences
representing different families of repetitive and transposable
elements. The number of entries continues to grow at the rate of
over five hundred new entries per year. Since 1997, the number
of sequences added to RU has doubled every three years (Fig. 1).
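As a purely arithmetic illustration of what a three-year doubling time implies, the sketch below converts a doubling time into a yearly growth factor. The function names are hypothetical, and the second call is an extrapolation under the assumption of a constant doubling time, not a reported figure.

```python
def annual_growth_factor(doubling_time_years):
    """Yearly multiplication factor implied by a fixed doubling time."""
    return 2 ** (1.0 / doubling_time_years)


def entries_after(current_entries, years, doubling_time_years=3.0):
    """Entries expected after `years` if the doubling time stays constant."""
    return current_entries * 2 ** (years / doubling_time_years)


print(round(annual_growth_factor(3.0), 2))  # 1.26, about 26% growth per year
print(entries_after(3600, 3))               # 7200.0
```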
Repbase Update includes a growing number of reference
collections, or libraries, containing representative sequences
either from model organisms or a taxonomic group of euka-
ryotes. The list of current collections is presented in Fig. 2B. In
addition, there are more specialized collections that include
simple repeats (simple.ref), subfamilies of abundant or closely
related repeats (humsub.ref, invsub.ref, mamsub.ref) and a
collection of pseudogenes (pseudo.ref). RU also contains appendix
files (e.g. humapp.ref, rodapp.ref, etc.) that document se-
quences removed from the database during previous updates.
These files are archived in a separate directory and are not used
in routine repeat analyses.
The RU collections are available in EMBL format and in
two variant FASTA formats. Only the variant in EMBL format
contains a detailed annotation of every entry, including key-
words, definitions, classifications, description of characteristic
structural features and basic credits to contributors. This infor-
mation continues to be updated but for historical reasons some
problems persist. Primarily, they include inconsistencies in
nomenclature, which will be addressed by dedicated work-
shops. For the time being, alternative nomenclature is listed in
the keywords and relevant information can be extracted by
browsing the keyword section (see Availability section below).
Fig. 1. Numbers of unique entries in Repbase
Update since 1997.
RU is not a relational database, but the EMBL-like indices
accompanying every repetitive element allow quick sorting,
extraction of specific information using web browsers, and fast
transformation into commonly used relational databases. The
FASTA-formatted libraries contain only the names of repeti-
tive elements and their DNA sequences. A second FASTA ver-
sion contains information extracted from RU and has been
manually rearranged to conform to the software-specific re-
quirements of RepeatMasker (http://www.repeatmasker.org).
The RepeatMasker version is less frequently updated and also
does not include many diverse elements from plants, insects
and fungi. As an alternative, any sequence of choice can be
screened against the up-to-date version of RU using the on-line
Censor server (http://www.girinst.org/Censor_Server.html).
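The transformation of EMBL-style entries into a relational database mentioned above can be sketched as follows. This is a minimal example that assumes only the generic EMBL flat-file conventions (two-letter line codes, an `SQ` line preceding the sequence, `//` closing each record). The sample entry is made up, and the table layout and function names are assumptions rather than part of RU.

```python
import sqlite3

# Made-up entry for illustration only; it is not a real Repbase Update record.
SAMPLE = """\
ID   EXAMPLE1
DE   Hypothetical element used only for illustration.
KW   DNA transposon; hAT; EXAMPLE1.
SQ   Sequence 12 BP;
     acgtacgtac gt                                                    12
//
"""


def parse_embl_records(text):
    """Yield one dict per '//'-terminated record of an EMBL-style flat file."""
    record, seq_parts, in_sequence = {}, [], False
    for line in text.splitlines():
        if line.startswith("//"):  # record terminator
            record["sequence"] = "".join(seq_parts)
            yield record
            record, seq_parts, in_sequence = {}, [], False
        elif in_sequence:
            # Keep letters only; this drops spaces and the position counter.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("SQ"):
            in_sequence = True
        elif len(line) > 5:
            code, value = line[:2], line[5:].strip()
            record.setdefault(code, []).append(value)


def load_into_sqlite(records):
    """Store name, keywords and sequence length in an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE elements (name TEXT, keywords TEXT, length INTEGER)")
    for rec in records:
        name = rec["ID"][0].split()[0]
        keywords = " ".join(rec.get("KW", []))
        conn.execute(
            "INSERT INTO elements VALUES (?, ?, ?)",
            (name, keywords, len(rec["sequence"])),
        )
    conn.commit()
    return conn


records = list(parse_embl_records(SAMPLE))
conn = load_into_sqlite(records)
rows = conn.execute(
    "SELECT name, length FROM elements WHERE keywords LIKE ?", ("%hAT%",)
).fetchall()
print(rows)  # [('EXAMPLE1', 12)]
```

Keeping the keywords in a searchable column mirrors the use of the keyword section for alternative nomenclature.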
Until 2001 RU published hundreds of transposable elements
that were unreported anywhere else, many of
which remain unacknowledged in subsequent publications describing
the same data. This creates an obvious rift between the
public demand for rapid release of quality scientific information
in databases and the inadequate credit these releases receive; adequate
credit could help to determine their impact and the need for continuing public support.
To better document original contributions to RU, an elec-
tronic journal, Repbase Reports (RR), was established in 2001
(http://www.girinst.org/Repbase_Reports.html), which is dedi-
cated to publishing partial yet valuable data on TEs that may or
may not be included in a scientific article at some future time.
Prior to publication in RR, submitted entries are reviewed by
members of the Editorial Board. Unlike RU, which can be modified
as needed, this electronic publication is designed as a permanent
archive of individual contributions and, like all other
periodic publications and scientific journals, it is registered
under its own ISSN (1534-830X). To date, over 850
families have been published in RR and subsequently compiled
in RU. A sample entry to RR is shown in Fig. 2E.
Repbase Update and detection of repetitive DNA
There are two basic approaches for detecting repetitive ele-
ments in biological sequences: de novo and by comparison to
known repeats. De novo detection of unknown TEs is based on
their defining property, i.e. their presence in multiple copies.
Typically, such detection employs clustering of similar se-
quences obtained by similarity searches. De novo methods do
not require any prior knowledge of repeats and they can potentially
detect all repeats, provided that they are present in a sufficient
number of copies. In addition to the number of copies,
these methods are also dependent on the length and complete-
ness of the analyzed sequences. However, the approach does
not provide any information about the type of repeats; it will
cluster all tandem repeats, interspersed repeats, and even seg-
mental duplications, and further analyses are required to infer
the nature of detected repeats. De novo methods are popular
and efficient in detection of tandem repeats. They can detect
virtually all possible arrays of tandem sequences, and several
independent programs were developed for this purpose, e.g. Tandem
Repeats Finder (Benson, 1999); Satellites (Sagot and
Myers, 1998); Etandem, Equicktandem (Rice et al., 2000);
TROLL (Castelo et al., 2002). However, in the case of interspersed
repeats the use of de novo methods is basically limited
to detecting unknown repeats in newly sequenced genomes
(Recon, Bao and Eddy, 2002; RepeatFinder, Volfovsky et al., 2001). A
special type of de novo approach is searching for characteristic
motifs present in a particular class of repeats. For example,
LTR_STRUC (McCarthy and McDonald, 2003) detects new
LTR retrotransposons by checking for characteristic long terminal
repeats flanking the elements.
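The defining property used by de novo methods, presence in multiple copies, can be illustrated with a deliberately simplified sketch that counts exact k-mers. The programs cited above rely on similarity searches and clustering and tolerate mutations; exact k-mer counting does not, so this only shows the principle. The function name, parameters and test sequence are assumptions.

```python
from collections import defaultdict


def repeated_kmers(sequence, k=12, min_copies=3):
    """Map each k-mer found at least `min_copies` times to its start positions."""
    positions = defaultdict(list)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) >= min_copies}


# Hypothetical sequence: one 12-bp unit inserted three times between spacers.
unit = "GATTACAGATTC"
genome = "AAAC" + unit + "CCGGTTAA" + unit + "TGCATGCA" + unit + "GGGT"
print(repeated_kmers(genome, k=12, min_copies=3))
# {'GATTACAGATTC': [4, 24, 44]}
```

As noted in the text, the output says nothing about the type of repeat: tandem arrays, interspersed copies and duplicated segments would all be reported the same way.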
Fig. 2. Schematic representation of data charts related to Repbase
Update. Arrows indicate information links. (A) Screenshot of the Genetic
Information Research Institute website with a list of major resources for
repeat analysis. (B) List of reference collections in Repbase Update. The
collection names, corresponding species, and the numbers of annotated families
are listed in columns 1–3, respectively. (C) The home page of Repbase
Reports. Links to its eight sections are underlined. (D) Content page of May,
2003 issue of Repbase Reports. (E) Partial screenshot of a selected repetitive
element, PegasusA, reported in the issue. Links to the nucleotide sequences
of PegasusA in IG, EMBL, and FASTA formats are underlined. Basic biological
information is summarized in the abstract section.
The second approach to repeat detection is based on similarity
to previously known repeats. Similarity-based methods
typically compare a library of known repetitive sequences
against genomic DNA using general programs for sequence
similarity search such as Blast (www.ncbi.nlm.nih.gov/BLAST),
Wublast (http://blast.wustl.edu), or Crossmatch
(www.phrap.org). The repetitive library can be made of any
custom set of repeats; however, specialized programs often
depend on consensus sequences from RU. The three most commonly
used programs for similarity detection, namely Censor
(see below), RepeatMasker (http://www.repeatmasker.org) and
MaskerAid (Bedell et al., 2000), use RU libraries for known
genomes by default. RepeatMasker and its accelerated extension
MaskerAid detect repeats by crossmatch or wublast nucleotide
searches. Programs for similarity-based detection remain
the main tools for detecting interspersed repeats, but they
can also detect the main classes of simple repeats, particularly
short microsatellites. A detailed technical description will be
The on-line classification of Alu elements and the automatic
screening of repetitive DNA were first performed by the Pythia
server (Jurka and Milosavljevic, 1991; Jurka et al., 1992), fol-
lowed by a systematically maintained and upgraded Censor
server (www.girinst.org/Censor_Server.html). The Censor serv-
er permits screening repeats in DNA sequences from all euka-
ryotic species represented in the database by comparing them
to the most recent version of RU and mailing the output back
to the user. It also provides de novo detection of simple repeats
(Milosavljevic and Jurka, 1993), including simple repeats com-
posed of several sequence motifs. The input sequence provided
by the user must be in ASCII format and can be either pasted
into the on-line “Data Entry Form” or loaded from a local file.
Detailed information is presented in the “Help/Information”
menu for the Censor server (www.girinst.org/Censor_Server./
The original downloadable version of Censor (Jurka et al.,
1996) is outdated and has been superseded by Censor4.1 (Kohany
and colleagues, personal communication), which is available
for installation on local computers from our website
(www.girinst.org). It is powered by WU-blast 2.0, which needs
to be obtained separately from Washington University
(http://blast.wustl.edu). Due to its dependence on WU-blast, Censor
can be run only on Unix machines and Mac OS X. It takes
advantage of multiple processors. Censor can perform translated
search (tBlastx) or compare DNA to proteins (tBlastn).
Translated searches are particularly suitable for detecting very
old repeats or for repeat annotation in new genomes. Simple
repeats are pre-screened using the WU-blast filter. This and other
options can be turned on or off, depending on specific needs. For
technical details see the Censor help menu, which is displayed
automatically if the program is run without any options.
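The translated searches mentioned above compare sequences at the protein level, which requires translating DNA in all six reading frames. The sketch below shows that translation step only, using the standard genetic code; it is not Censor or WU-blast code, and the function names are illustrative assumptions.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return translations of the three forward and three reverse frames."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for strand, seq in (("+", forward), ("-", reverse)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


print(translate("ATGGCCTAA"))               # MA*
print(six_frame_translations("ATGGCC"))
```

Because synonymous substitutions leave the protein sequence unchanged, a comparison at this level can still recognize coding regions of repeats whose DNA has diverged considerably.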
The Transposon Pub
Identifying repetitive elements in different genomes is an
ongoing process. It was relatively straightforward in the case of
human and mouse genomes since the corresponding collections
of TEs were catalogued in RU before the genomes were
sequenced. In other organisms such as A. thaliana, C. elegans,
and D. melanogaster, the corresponding collections were orga-
nized after the sequencing was completed. These genomes were
re-annotated using an updated version of Censor (Jurka et al.,
1996), and up-to-date reference collections from RU. The
annotation was posted on the web site under “Transposon Pub”.
Transposon Pub provides a web interface to the genomic locations
of transposable elements in the form of repeat maps, which
illustrate chromosomal distributions of transposable elements
(Table 1). These maps are also available as density plots showing
the percentage of transposable elements calculated in a 25-kb
sliding window moved along individual chromosomes.
Table 1. A partial map of transposable
elements in the Arabidopsis thaliana
chromosome I from the Transposon Pub (www.
Chromosome I TE Positions Strand Identity (%)
Columns from left to right: (1) chromosomal coordinates of the identified TEs; (2) the names of TEs as listed in
the Repbase Update; (3) beginning and end of similarity relative to the reference sequence; (4) repeat orientation:
direct (d) or complementary (c); (5) percent of sequence identity to the reference sequence. Columns 1 and 2 are
linked to the corresponding sequences, which can be retrieved.
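The density plots described above can be approximated with the following minimal sketch, which computes the percentage of each window covered by annotated TEs. The 25-kb window comes from the text; the step size, the 0-based half-open coordinates and the function names are assumptions, since the source does not specify them.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals so bases are counted once."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def te_density(te_intervals, chromosome_length, window=25_000, step=25_000):
    """Return (window_start, percent_TE) for windows slid along a chromosome.

    Coordinates are 0-based and half-open. The step size is an assumption.
    """
    if chromosome_length <= 0:
        raise ValueError("chromosome_length must be positive")
    merged = merge_intervals(te_intervals)
    densities = []
    last_start = max(chromosome_length - window, 0)
    for win_start in range(0, last_start + 1, step):
        win_end = min(win_start + window, chromosome_length)
        covered = sum(
            max(0, min(end, win_end) - max(start, win_start))
            for start, end in merged
        )
        densities.append((win_start, 100.0 * covered / (win_end - win_start)))
    return densities


# Toy example scaled down to a 100-bp "chromosome" with a 50-bp window.
annotations = [(10, 20), (15, 30), (60, 100)]
print(te_density(annotations, 100, window=50, step=25))
# [(0, 40.0), (25, 40.0), (50, 80.0)]
```

Merging the intervals first prevents nested or overlapping annotations from being counted twice within a window.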
Repbase Update can be downloaded in compressed
(*.tar.gz) and uncompressed ASCII format. Furthermore, indi-
vidual sequences in EMBL format can be browsed by name
or keyword (www.girinst.org/Repbase_Update-Browser.html)
and copied to local directories. It is updated monthly and is
available to academic and non-profit researchers, free of
charge. The user name and password are issued individually
upon registration. The same password applies to all the remain-
ing resources including Censor, Repbase Reports and the
Repbase Update continues to impact a wide range of biological
studies. Apart from its role in routine screening and the specific
applications reviewed above (see also Jurka, 2000), RU
serves as a major reference in systematics and evolutionary
studies of TEs (Feschotte et al., 2002; Robertson, 2002; Mager
and Medstrand, 2003). There is a growing list of genes derived
from TEs during eukaryotic evolution. In the human genome
alone there are over two dozen such genes derived from DNA
transposons, eight from extinct gypsy-like retroelements, and
two from retroviruses (Kapitonov et al., 2004). Based on bio-
chemical studies, the origin of the immune system in jawed ver-
tebrates has also been linked to DNA transposition (Roth and
Craig, 1998) but not to any specific transposon family. The sys-
tematic reconstruction of TEs deposited in RU may eventually
reveal the actual DNA transposon(s) involved in evolution of
the immune system and possibly other biologically important
In addition to biological studies of TEs and of the derived
genes, another major line of research focuses on their role in
genome stability and evolution. The genome itself is a highly
conservative structure preserving the existing information.
Throughout the history of eukaryotes, active TEs added new
DNA and caused genomic mutations driving a complex host-
TE co-evolution process. Details of the process continue to be
revealed and debated (Brosius, 1999; Hurst and Werren, 2001;
Kidwell and Lisch, 2001). Recent evidence indicates that TEs
may be involved in generating segmental duplications in the
human (Bailey et al., 2003; Jurka et al., 2004) and other euka-
ryotic genomes (Hughes et al., 2003). Moreover, such duplica-
tions appear to be more likely in gene-rich than gene-poor
regions (Jurka et al., 2004). These results link proliferation of
TEs to a fundamental mode of evolution by gene duplication (Ohno, 1970).
As genome sequencing continues to expand, so does the
wealth of new information on the TE world. This information
needs a coherent database structure to be efficiently applied in
diverse biological studies. Given the enormity of the task, a
growing involvement of the research community in the devel-
opment of Repbase will be essential. In addition to the ongoing
contributions to RR, much effort will be needed in the area of
systematics of TEs, which will be a focus of future workshops
devoted to Repbase.
Bailey JA, Liu G, Eichler EE: An Alu transposition
model for the origin and expansion of human seg-
mental duplications. Am J Hum Genet 73:823–
Bao Z, Eddy SR: Automated de novo identification of
repeat sequence families in sequenced genomes.
Genome Res 12:1269–1276 (2002).
Bedell JA, Korf I, Gish W: MaskerAid: a performance
enhancement to RepeatMasker. Bioinformatics
Benson G: Tandem repeats finder: a program to ana-
lyze DNA sequences. Nucleic Acids Res 27:573–
Brosius J: RNAs from all categories generate retrose-
quences that may be exapted as novel genes or reg-
ulatory elements. Gene 238:115–134 (1999).
Castelo AT, Martins W, Gao GR: TROLL – tandem
repeat occurrence locator. Bioinformatics 18:634–
Clamp M, Andrews D, Barker D, Bevan P, Cameron G,
Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down
T, Durbin R, Eyras E, Gilbert J, Hammond M,
Hubbard T, Kasprzyk A, Keefe D, Lehvaslaiho H,
Iyer V, Melsopp C, Mongin E, Pettett R, Potter S,
Rust A, Schmidt E, Searle S, Slater G, Smith J,
Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A,
Vastrik I, Birney E: Ensembl 2002:
accommodating comparative genomics. Nucleic
Acids Res 31:38–42 (2003).
Feschotte C, Zhang X, Wessler SR: Miniature
inverted-repeat transposable elements and their
relationship to established DNA transposons, in Craig
NL, Craigie R, Gellert M, Lambowitz AM (eds):
Mobile DNA II, pp 1147–1158 (ASM Press,
Washington, DC 2002).
Hughes AL, Friedman R, Ekollu V, Rose JR: Non-ran-
dom association of transposable elements with du-
plicated genomic blocks in Arabidopsis thaliana.
Mol Phylogenet Evol 29:410–416 (2003).
Hurst GD, Werren JH: The role of selfish genetic ele-
ments in eukaryotic evolution. Nat Rev Genet
Jurka J: Approaches to identification and analysis of
interspersed repetitive DNA sequences, in Adams
MD, Fields C, Venter JC (eds): Automated DNA
Sequencing and Analysis, pp 294–298 (Academic
Press, San Diego 1994).
Jurka J: Repeats in genomic DNA: mining and mean-
ing. Curr Opin Struct Biol 8:333–337 (1998).
Jurka J: Repbase Update: a database and an electronic
journal of repetitive elements. Trends Genet 16:
Jurka J, Milosavljevic A: Reconstruction and analysis
of human Alu genes. J Mol Evol 32:105–121
Jurka J, Walichiewicz J, Milosavljevic A: Prototypic
sequences for human repetitive DNA. J Mol Evol
Jurka J, Klonowski P, Dagman V, Pelton P: CENSOR
– a program for identification and elimination of
repetitive elements from DNA sequences. Comput
Chem 20:119–121 (1996).
Jurka J, Kapitonov VV, Smit AF: Repetitive elements:
detection, in Cooper DN (ed): Nature Encyclope-
dia of the Human Genome, pp 9–14 (Nature Pub-
lishing Group, London 2003).
Jurka J, Kohany O, Pavlicek A, Kapitonov VV, Jurka
MV: Duplication, coclustering, and selection of
human Alu retrotransposons. Proc Natl Acad Sci
USA 101:1268–1272 (2004).
Kapitonov VV, Jurka J: Rolling-circle transposons in
eukaryotes. Proc Natl Acad Sci USA 98:8714–
Kapitonov VV, Jurka J: The esterase and PHD do-
mains in CR1-like non-LTR retrotransposons. Mol
Biol Evol 20:38–46 (2003).
Kapitonov VV, Pavlicek A, Jurka J: Anthology of
human repetitive DNA, in Meyers RA (ed): Ency-
clopedia of Molecular Cell Biology and Molecular
Medicine, volume 1, pp 251–305 (Wiley, Hobok-
en, NJ 2004).
Karolchik D, Baertsch R, Diekhans M, Furey TS, Hin-
richs A, Lu YT, Roskin KM, Schwartz M, Sugnet
CW, Thomas DJ, Weber RJ, Haussler D, Kent WJ:
The UCSC Genome Browser Database. Nucleic
Acids Res 31:51–54 (2003).
Kidwell MG, Lisch DR: Perspective: transposable ele-
ments, parasitic DNA, and genome evolution. Evo-
lution Int J Org Evolution 55:1–24 (2001).
Lander ES, et al; International Human Genome Se-
quencing Consortium: Initial sequencing and anal-
ysis of the human genome. Nature 409:860–921
Mager D, Medstrand P: Endogenous retroviruses, in
Cooper DN (ed): Nature Encyclopedia of the Hu-
man Genome, pp 57–63 (Nature Publishing
Group, London 2003).
McCarthy EM, McDonald JF: LTR_STRUC: a novel
search and identification program for LTR retro-
transposons. Bioinformatics 19:362–367 (2003).
Milosavljevic A, Jurka J: Discovering simple DNA
sequences by the algorithmic significance method.
Computer Appl Biosci 9:407–411 (1993).
Ohno S: Evolution by gene duplication (Springer Ver-
lag, New York 1970).
Paces J, Pavlicek A, Zika R, Kapitonov VV, Jurka J,
Paces V: HERVd: the Human Endogenous Retro-
Viruses Database: update. Nucleic Acids Res
Rice P, Longden I, Bleasby A: EMBOSS: the European
Molecular Biology Open Software Suite. Trends
Genet 16:276–277 (2000).
Robertson HM: Evolution of DNA transposons in eu-
karyotes, in Craig NL, Craigie R, Gellert M, Lam-
bowitz AM (eds): Mobile DNA II, pp 1093–1110
(ASM Press, Washington, DC 2002).
Roth DB, Craig NL: VDJ Recombination: A transpo-
sase goes to work. Cell 94:411–414 (1998).
Sagot MF, Myers EW: Identifying satellites and peri-
odic repetitions in biological sequences. J Comput
Biol 5:539–553 (1998).
Smit AF: Interspersed repeats and other mementos of
transposable elements in mammalian genomes.
Curr Opin Genet Dev 9:657–663 (1999).
Volfovsky N, Haas BJ, Salzberg SL: A clustering meth-
od for repeat analysis in DNA sequences. Genome
Biol 2:Research0027 (2001).
Wicker T, Matthews DE, Keller B: TREP: a database of
Triticeae repetitive elements. Trends Plant Sci