Genome Biology 2007, 8:R61
2007 Kunin et al. Volume 8, Issue 4, Article R61
Evolutionary conservation of sequence and secondary structures in
Victor Kunin¤, Rotem Sorek¤ and Philip Hugenholtz
Address: DOE Joint Genome Institute, 2800 Mitchell Drive, Walnut Creek, CA 94598, USA.
¤ These authors contributed equally to this work.
Correspondence: Victor Kunin. Email: email@example.com
© 2007 Kunin et al.; licensee BioMed Central Ltd.
This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which
permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The categorisation and structural analysis of clustered regularly interspaced short palindromic repeat (CRISPR) sequences from 195 microbial genomes show that repeats from diverse organisms can be grouped based on sequence similarity, and that some groups have pronounced secondary structures with compensatory base changes.
Background: Clustered regularly interspaced short palindromic repeats (CRISPRs) are a novel
class of direct repeats, separated by unique spacer sequences of similar length, that are present in
approximately 40% of bacterial and most archaeal genomes analyzed to date. More than 40 gene
families, called CRISPR-associated sequences (CASs), appear in conjunction with these repeats and
are thought to be involved in the propagation and functioning of CRISPRs. It has been recently
shown that CRISPR provides acquired resistance against viruses in prokaryotes.
Results: Here we analyze CRISPR repeats identified in 195 microbial genomes and show that they
can be organized into multiple clusters based on sequence similarity. Some of the clusters present
stable, highly conserved RNA secondary structures, while others lack detectable structures. Stable
secondary structures exhibit multiple compensatory base changes in the stem region, indicating
evolutionary and functional conservation.
Conclusion: We show that the repeat-based classification corresponds to, and expands upon, a
previously reported CAS gene-based classification, including specific relationships between CRISPR
and CAS subtypes.
Clustered regularly interspaced short palindromic repeats
(CRISPRs) are repetitive structures in Bacteria and Archaea
composed of exact repeat sequences 24 to 48 bases long
(herein called repeats) separated by unique spacers of similar
length (herein called spacers) [1,2]. The CRISPR sequences
appear to be among the most rapidly evolving elements in the
genome, to the point that closely related species and strains,
sometimes more than 99% identical at the DNA level, differ in
their CRISPR composition [3,4].
Up to 45 gene families, called CRISPR-associated sequences
(CASs), appear in conjunction with these repeats and are
hypothesized to be responsible for CRISPR propagation and
functioning [2,5,6]. It has been proposed that CASs can be
divided into seven or eight subtypes, according to their
operon organization and gene phylogeny [5,6]. Phylogenetic
analysis additionally indicates that CASs have undergone
extensive horizontal gene transfer, as very similar CAS genes
are found in distantly related organisms [6,7]. CRISPRs and
CASs have been found on mobile genetic elements, such as
plasmids, skin mobile elements, and even prophages, suggesting
a possible distribution mechanism for the system [7-
Published: 18 April 2007
Genome Biology 2007, 8:R61 (doi:10.1186/gb-2007-8-4-r61)
Received: 9 October 2006
Revised: 24 January 2007
Accepted: 18 April 2007
The electronic version of this article is the complete one and can be
found online at http://genomebiology.com/2007/8/4/R61
CRISPRs have been suggested to play roles in replicon parti-
tioning , DNA repair , regulation  and chromosomal
rearrangement . It was recently reported that the spacers
are often highly similar to fragments of extrachromosomal
DNA, such as phage or plasmid DNA [3,12]. It was suggested
that the CRISPR/CAS system participates in an antiviral
response, probably by an RNA interference-like mechanism.
The proposed mechanism for this CRISPR function involves
sampling and maintaining a record of invasive DNA ele-
ments, and inhibition of gene functions necessary for inva-
sion . Indeed, it was recently shown that CRISPRs provide
acquired resistance against viruses in prokaryotes .
Despite in-depth analyses of CASs, the nature of the repeat
sequences has not been examined closely. This is presumably
because repeats, as short DNA sequences, have less compara-
tive potential than protein-coding genes. Previous studies
have noted only that repeats are highly variable, and do not
appear to be similar between organisms [2,7]. However, we
show that repeats from diverse organisms can be grouped
into clusters based on sequence similarity, and that some
clusters have pronounced secondary structures with compen-
satory base changes. We further show that there is a clear cor-
respondence between CAS subtypes and repeat clusters. Our
findings have important implications for CRISPR function
To obtain a set of CRISPR arrays we ran the PILER-CR
program [14] on the 439 bacterial and archaeal genomes
then available in IMG version 1.50 [15]. We found 561
arrays, ranging in size from 3 to 220 repeats, in 195 genomes
(44% of the genomes tested). These results are in agreement
with the results of Godde et al. [7], who found CRISPR arrays
in 40% of the genomes they tested. Overall, our set of
CRISPRs contained 561 repeat sequences (as repeats are gen-
erally identical within an array) and 13,372 spacers.
Repeats were first noticed to be palindromic by Mojica et al.
, a feature that was subsequently incorporated into the
acronym CRISPR . We hypothesized that the palindromic
signature might be indicative of a functional RNA secondary
structure within the repeat. This hypothesis is supported by
the experimental demonstration that CRISPRs are tran-
scribed and processed into non-messenger RNAs in several
Archaea, indicating that they are active through an RNA
intermediate.
To assess the possibility that CRISPR repeats form stable
RNA secondary structures, we used the RNAfold software
 (see Materials and methods) to predict the intramolecu-
lar RNA structure for each of the repeats in our set. This soft-
ware provides a bit-score that reflects the stability of each
secondary structure. We compared the stability of the pre-
dicted secondary structure of repeats and spacers to that of
similarly sized sequences selected randomly from bacterial
genomes (Figure 1a). We found that the folding-score distri-
bution of repeats deviates from the scores for random
sequences, indicating a tendency of repeats to form stable
secondary structures.
The trimodal pattern of the RNA folding distribution for
CRISPR repeats (Figure 1a) suggests that they are not homo-
geneous, and that a large subset form stable secondary struc-
tures, in contrast to spacers and random sequences. To
identify repeat subtypes we first attempted to align each of
the 561 repeats in our set to all other repeats using the Smith-
Waterman algorithm . The sequence similarity results
were then clustered using the MCL algorithm  (see Mate-
rials and methods). This procedure generated 33 clusters, 12
of which contained 10 or more members, with the largest
cluster (cluster 1) containing 94 repeat sequences. Some
clusters contained repeats from organisms as distantly
related as Archaea and Bacteria, supporting the inference that
CRISPR/CAS systems can be horizontally transferred
between microorganisms [5-7].
Figure 1. Distributions of folding scores of (a) all CRISPR repeats and all spacers, as
compared to random sequences and (b) individual repeat clusters. X-axis,
negative folding scores; Y-axis, fraction (percent) of total.
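To make the clustering step described above more concrete, the following is a minimal sketch of the Markov clustering idea (alternating expansion and inflation of a column-stochastic matrix) in plain Python. It is not the MCL program used in this study: the function name `mcl_sketch`, the inflation value of 2.0, the self-loop weighting and the toy similarity scores are all assumptions made for illustration.

```python
def mcl_sketch(similarity, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster a weighted, undirected similarity graph.

    similarity maps (node_a, node_b) pairs to positive scores.
    Returns a sorted list of clusters, each a list of node labels.
    """
    nodes = sorted({n for pair in similarity for n in pair})
    if not nodes:
        return []
    index = {n: k for k, n in enumerate(nodes)}
    size = len(nodes)

    # Symmetric weight matrix; each node gets a self-loop equal to its
    # strongest edge (an assumption of this sketch).
    m = [[0.0] * size for _ in range(size)]
    for (a, b), w in similarity.items():
        m[index[a]][index[b]] = float(w)
        m[index[b]][index[a]] = float(w)
    for k in range(size):
        m[k][k] = max(m[k])

    def normalize_columns(mat):
        for c in range(size):
            total = sum(mat[r][c] for r in range(size))
            for r in range(size):
                mat[r][c] /= total
        return mat

    m = normalize_columns(m)
    for _ in range(max_iter):
        # Expansion: square the matrix (two-step random walks).
        expanded = [[sum(m[r][k] * m[k][c] for k in range(size))
                     for c in range(size)] for r in range(size)]
        # Inflation: raise entries to a power, then renormalize columns.
        inflated = normalize_columns(
            [[v ** inflation for v in row] for row in expanded])
        change = max(abs(inflated[r][c] - m[r][c])
                     for r in range(size) for c in range(size))
        m = inflated
        if change < tol:
            break

    # Each row that retains probability mass defines a cluster: the
    # columns with non-negligible entries in that row.
    clusters = set()
    for r in range(size):
        members = tuple(nodes[c] for c in range(size) if m[r][c] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(list(c) for c in clusters)


# Toy input: two groups of hypothetical repeats with no edges between them.
toy_edges = {
    ("r1", "r2"): 10, ("r2", "r3"): 10, ("r1", "r3"): 10,
    ("r4", "r5"): 12, ("r5", "r6"): 12, ("r4", "r6"): 12,
}
clusters = mcl_sketch(toy_edges)
```

With real data the edge weights would be the pairwise repeat similarity scores; the inflation parameter controls how finely the graph is partitioned.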
As an independent measure for the validity of the clustering,
we examined the RNA stability scores in each of the MCL-
defined clusters (note that RNA stability was not taken into
account in the clustering procedure). As seen in Figure 1b,
clusters 2 and 3 comprise repeats with consistently high fold-
ing scores, indicating pronounced secondary structure. By
contrast, clusters 1, 6, 7, 9, 10 and 11 contain repeats with con-
sistently poor folding scores. Clusters 4, 5, 8 and 12 show
intermediate folding scores, suggesting they have weaker sec-
ondary structures. Together, these groups explain the trimo-
dal distribution observed in Figure 1a. The homogeneity of
RNA structure stability scores within each cluster, along with
the dramatic difference in scores between clusters, suggests
that our clustering method is valid.
To further explore the observation that repeats form stable
RNA secondary structures, we examined sequence align-
ments of the repeat clusters. CRISPR repeats are generally
considered to be highly dissimilar to each other , except for
similar repeats in strains of the same species or in closely
related species . However, repeats within the clusters we
generated, although often containing sequences from vastly
different phylogenetic groups, were generally more similar to
each other and hence alignable. Figure 2a presents a multiple
alignment of a subset of the repeats in cluster 3. A highly sta-
ble stem-loop structure was consistently predicted for repeats
in this cluster by RNAfold  (Figure 1b). Notably, substitu-
tions in the predicted stem structure are consistently accom-
panied by compensatory changes that preserve the base
pairing (Figure 2a). This mutational pattern, together with
the presence of G:U base pairs (Figure 2a), is typical of con-
served RNA secondary structures and highlights the impor-
tance of the stem-loop in the repeats for the functionality of
A summary of the repeat similarity space is presented in Fig-
ure 3. As with cluster 3 (Figure 2), repeats in other clusters
with high and intermediate folding scores also form stem-
loop structures (Figure 3) and display compensatory muta-
tions, suggesting stable structures. While the stem-loop motif
is seen in all of these clusters, the actual sequence, as well as
the length of the stem, its position relative to the unstructured
region, and the size of the unstructured sequence varies
between clusters. For example, while the stem in cluster 4 is
typically 5 bp long and is found in the middle of the repeat, the
stem in cluster 3 is typically 7 bp long, and is found towards
the 5' end of the repeat (Figures 2 and 3). The difference in
calculated folding scores between clusters with high and
intermediate scores is likely to be due to the stem length and
the frequency of GC as opposed to AT base pairings. Consistent
with previous reports, many repeat clusters have a
conserved 3' terminus of GAAA(C/G), possibly acting as a
binding site for one of the conserved CAS proteins.
Figure 2. Evidence for secondary structure in cluster 3. (a) Multiple alignment of a subset (for clarity) of repeats in cluster 3. Numbers 1 to 7 and 7 to 1 indicate the
residues involved in stem base-pairing; some compensatory mutations in the stem are highlighted with circles. Note G:U base pairing at position 5 in
Xanthomonas oryzae and relaxed conservation of loop residues, typical of RNA secondary structure in which the structure is functional rather than the
sequence. (b) Sequence logo for all repeats in cluster 3. (c) Predicted secondary structure of Syntrophus aciditrophicus repeat using RNAfold. Stem
positions are numbered in accordance with the alignment.
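The conserved GAAA(C/G) 3' terminus noted above can be tested for with a short pattern match, and the same test can suggest which strand of a repeat to report. The sketch below is illustrative only: the function names and the two example sequences are invented for this example and are not repeats from the data set.

```python
import re

# GAAA followed by C or G at the very 3' end of the repeat.
SIGNATURE = re.compile(r"GAAA[CG]$")

# Complement table for DNA (U is tolerated and treated like T).
COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def has_3prime_signature(repeat):
    """True if the repeat ends with GAAAC or GAAAG."""
    return bool(SIGNATURE.search(repeat.upper()))


def orient_by_signature(repeat):
    """Return the strand that ends with the signature.

    Returns None when neither strand, or both strands, carry it,
    because the signature alone cannot decide in those cases.
    """
    flipped = reverse_complement(repeat)
    forward_hit = has_3prime_signature(repeat)
    reverse_hit = has_3prime_signature(flipped)
    if forward_hit and not reverse_hit:
        return repeat
    if reverse_hit and not forward_hit:
        return flipped
    return None


# Invented example: the second string is the reverse complement of the first.
example_forward = "CCGGTTAACCGGTTGAAAC"
example_flipped = "GTTTCAACCGGTTAACCGG"
oriented = orient_by_signature(example_flipped)
```

Returning None for undecidable cases keeps the signature as a tie-breaker only, which matches its secondary role in determining repeat orientation.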
Two recent studies identified between 20 and 45 gene fami-
lies of CASs [5,6]. Based on the tendency of CAS genes to
appear together, Haft et al. [5] defined eight CAS subtypes
(named Ecoli, Ypest, Nmeni, Dvulg, Tneap, Hmari, Apern
and Mtube). We sought to determine whether our CRISPR
repeat clusters corresponded to particular CAS subtypes. For
this, we searched 20 kb of sequence flanking each side of the
repeat array for CAS genes using the TIGRFAM hidden Markov
models (HMMs) for the 45 CAS families defined by Haft
et al. [5].
We found that the Ecoli CAS subtype genes appear exclusively
in the proximity of structured repeat cluster 2, and, similarly,
the Dvulg and Ypest CAS subtypes correspond strictly to our
structured clusters 3 and 4, respectively (Table 1 and Table S1
in Additional data file 1). Presumably, specific and different
sets of genes are needed in order to recognize, bind and proc-
ess the different repeat types. Despite the overall pronounced
correspondence between the CAS subtypes and repeat clus-
ters, particularly for structured clusters, there are notable
exceptions. For example, the reported frequent co-occurrence
of the Mtube subtype with other CAS subtypes  is consist-
ent with its promiscuous association with numerous repeat
clusters (Table 1). Another interesting exception is the co-
occurrence of the Tneap and Apern subtypes in the Thermo-
coccus kodakaraensis genome with cluster 6, which is appar-
ently due to a fusion of the Tneap and Apern subtypes (Figure
S1 and Table S1 in Additional data file 1). This genome con-
tains three CRISPR arrays, all with identical repeat sequences
classified as cluster 6 (Table S1 in Additional data file 1). In
some cases the CAS subtype for one or more repeat cluster
members differs from the consensus for that cluster (Table S1
in Additional data file 1), suggesting that the association
between CRISPR repeat subtypes and CAS subtypes is some-
We also identified a repeat cluster (cluster 5) that is not asso-
ciated with any of the recognized CAS subtypes. We found
that it is associated with most of the core CASs (cas1-4 and
cas6), but lacks any of the additional type-defining genes.
Cluster 5 occurs exclusively in genomes that contain other
CRISPR repeat subtypes and it is possible that it employs at
least part of their CAS machinery.
This study shows that CRISPR repeats are not structurally
homogeneous and can be divided into distinct types based on
sequence similarity and ability to form stable secondary
structures. This explains why previous attempts to align all
repeats resulted in a poorly defined consensus sequence .
We observed compensatory base changes in the stems of the
structured repeat clusters, including G:U base pairs,
indicating that the CRISPR system likely functions through
an RNA intermediate.
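The reasoning behind compensatory base changes can be illustrated with a small check: given the positions that form the two arms of a stem, a change on one arm is compensated if the partner position still pairs with it, by Watson-Crick or G:U wobble pairing. The following sketch uses invented 14-nucleotide sequences with a 5 bp stem; the function names, the stem coordinates and the sequences are assumptions made for illustration, not data from this study.

```python
# Watson-Crick pairs plus G:U wobble pairs, written as (5' base, 3' base).
VALID_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"),
               ("C", "G"), ("G", "U"), ("U", "G")}


def to_rna(seq):
    """Upper-case the sequence and read T as U."""
    return seq.upper().replace("T", "U")


def stem_is_paired(seq, stem_5p, stem_3p):
    """True if every listed position pair can base-pair.

    stem_5p and stem_3p are 0-based indices of partner positions,
    given in matching order.
    """
    rna = to_rna(seq)
    return all((rna[i], rna[j]) in VALID_PAIRS
               for i, j in zip(stem_5p, stem_3p))


def covarying_pairs(aligned_seqs, stem_5p, stem_3p):
    """Return indices of stem pairs whose identity varies across the
    alignment while pairing is preserved in every sequence."""
    rnas = [to_rna(s) for s in aligned_seqs]
    hits = []
    for k, (i, j) in enumerate(zip(stem_5p, stem_3p)):
        observed = {(s[i], s[j]) for s in rnas}
        if len(observed) > 1 and observed <= VALID_PAIRS:
            hits.append(k)
    return hits


# Invented, gap-free toy alignment: 5 bp stem closed by a 4 nt loop.
stem_5p = [0, 1, 2, 3, 4]
stem_3p = [13, 12, 11, 10, 9]
toy_alignment = [
    "GGGACUUCGGUCCC",  # pair 3 is A:U
    "GGGGCUUCGGCCCC",  # pair 3 is G:C (both sides changed)
    "GGGGCUUCGGUCCC",  # pair 3 is G:U (wobble)
]
varying = covarying_pairs(toy_alignment, stem_5p, stem_3p)
```

A single-sided change that breaks pairing, such as an A opposite a C, makes `stem_is_paired` return False, which is the pattern that would argue against a conserved structure.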
Some clusters, such as clusters 2, 3 and 4, are discrete in the
sequence similarity space, whereas the boundaries of others,
such as clusters 1, 6 and 7, were not clearly defined. The dis-
crete clusters were generally composed of structure-forming
repeats, and the less well-defined clusters were composed of
unstructured repeats. This may be a reflection of the greater
evolutionary constraints on the stem structure.
The inference of stem-loop formation within individual
CRISPR repeats is in contrast to the speculation that pairs of
repeats form duplexes, and are subsequently cleaved to
release spacers. Such hypothesized duplexing would be
unlikely to require the ubiquitous presence of the less conserved
interior nucleotides, which would form a loop in the single
repeat folding model (Figure 2) and an unpaired bulge in the
duplex repeat folding model. A CRISPR array in Sulfolobus is
transcribed and processed into 60 nucleotide long non-mes-
senger RNAs, a size consistent with a single repeat-spacer
unit [17,21], supporting the argument that transcribed spac-
ers remain associated with their repeats. The repeats may
serve to mediate contact between the spacer-targeted foreign
RNA or DNA and CAS-encoded proteins. A stem-loop struc-
ture of some repeats may have evolved to facilitate recogni-
tion  by RNA-binding CAS-encoded proteins, although
unstructured Sulfolobus repeats (in cluster 7; Figure 3) have
been shown to bind via a sequence-specific interaction to a
genus-specific protein . This may partly explain the
sequence conservation observed in unstructured repeats.
A previous report suggested that spacer regions contribute to
the formation of secondary structures in CRISPR arrays .
However, we could not detect a significant deviation of spacer
secondary structures from random sequences (Figure 1),
indicating that spacers are unlikely to be selected based on
their secondary structure. In fact, the spacers appear to have
slightly weaker structures than random sequences. This is
probably due to the AT richness of spacers (46% GC) relative
to average bacterial genomic sequences (53% GC), as AT base
pairs form less stable structures than GC pairs. The lower
spacer GC content is consistent with a proposed viral origin of
spacer sequences , as viruses are, on average, 7% lower in
GC content than bacteria.
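The GC comparison above rests on a simple base count. A minimal sketch is given below; because the text does not state whether the spacer average was pooled over all bases or averaged per sequence, both variants are shown, and the function names and toy sequences are assumptions for illustration.

```python
def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T, U)."""
    bases = [b for b in seq.upper() if b in "ACGTU"]
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    return sum(b in "GC" for b in bases) / len(bases)


def pooled_gc_fraction(seqs):
    """GC fraction over all bases of all sequences taken together."""
    return gc_fraction("".join(seqs))


def mean_gc_fraction(seqs):
    """Unweighted mean of the per-sequence GC fractions."""
    values = [gc_fraction(s) for s in seqs]
    return sum(values) / len(values)


# Invented toy "spacers" of unequal length.
toy_spacers = ["GGCC", "AT", "AAAATTTT"]
pooled = pooled_gc_fraction(toy_spacers)   # 4 GC bases out of 14
per_seq = mean_gc_fraction(toy_spacers)    # (1.0 + 0.0 + 0.0) / 3
```

The two variants differ whenever sequence lengths vary, so the choice should be stated when reporting an average GC content.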
Previous attempts to classify CRISPR/CAS systems were
based on CAS gene content and phylogeny (mostly of cas1)
[5,6]. We add a further dimension to this classification by
showing that the repeat sequence itself is also a classifying
feature. This can be advantageous in instances where CRISPR
arrays occur in the absence of CAS genes. For example,
Thermoplasma acidophilum contains a CRISPR array but lacks
CAS genes, so it cannot be classified based on CASs. Our
clustering indicates that the T. acidophilum repeat belongs to
(euryarchaeal) cluster 6 (Figure 3; Table S1 in Additional data
file 1). In some instances, the repeat classification was able to
provide higher resolution than the existing CAS classification.
For example, the Nmeni subtype was reported to have an
optional gene csn2. Our clustering divides this subtype
into three clusters (10, 16 and 22). The csn2 gene is invariably
present in one cluster (cluster 10) and absent in the other two.
The finding of a repeat cluster (cluster 5) that cannot be readily
resolved by associated CAS genes (see Results) further
demonstrates the power of CRISPR-based classification.
Figure 3. The sequence similarity space of CRISPR repeats visualized with the BioLayout (Java) program [26]. Dots denote individual repeat sequences; connecting
lines represent Smith-Waterman similarities, such that closer dots represent more similar sequences. Dot colors denote cluster association as derived
from MCL clustering. The 12 largest clusters are indicated by circles together with their sequence logos, coarse phylogenetic composition, and sample
secondary structures where applicable.
Table 1. Occurrence of CAS subtypes in the proximity (± 20 kb) of the 12 largest repeat clusters
CAS subtype   1   2   3   4   5   6   7   8   9   10   11   12
CAS subtypes are as defined in [5]. Associations are indicated by an X. An instance of a putative fusion between two CAS subtypes is indicated by an
The significant differences between CRISPR/CAS subtypes,
both in CRISPR repeat sequence and structure, and in CAS
gene content and phylogeny, raise the possibility that the
subtypes also differ functionally. This hypothesis is supported
by the frequent occurrence of several CRISPR/CAS subtypes
in the same genome, and by the fact that at least four functions
have been hypothesized for these elements (host cell
defense , regulation , chromosomal segregation  and
rearrangement ). The study of CRISPRs is in its infancy,
and their mode of action and function are still highly speculative. Our
results provide another step toward a comprehensive under-
standing of these intriguing elements.
Materials and methods
Identification of CRISPR arrays
All genome sequences available through the IMG database
version 1.50 [15] were analyzed for CRISPR arrays using the
PILER-CR program [14].
Delineation of repeat clusters
Pairwise similarities between repeats were calculated using
an in-house implementation of the Smith-Waterman algorithm [19].
For each repeat pair, the better-scoring of the two possible
orientations was retained, and only scores >7 were used for
further analysis. Clustering of pairwise similarities was performed
using the MCL program with default parameters [20].
Multiple alignments were performed using MUSCLE [24],
and the alignments were manually curated, including
removal of outliers. Sequence logos for each cluster were generated
using WebLogo [25]. The similarity space of repeats
was visualized using BioLayout (Java) [26]. The sequences of
the repeats, the assignments to clusters and the multiple
alignments are provided as Additional data file 1.
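The pairwise similarity step can be sketched as follows: a Smith-Waterman local alignment score is computed for both relative orientations of a repeat pair, the higher score is kept, and only pairs scoring above the threshold of 7 become edges for clustering. The in-house scoring scheme is not given in the text, so the match, mismatch and gap values below are assumptions, as are the function names and the toy sequences.

```python
from itertools import combinations

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty.

    Only the score is kept, so two rows of the dynamic programming
    table are enough.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def best_orientation_score(a, b):
    """Higher of the scores for b as given and b reverse-complemented."""
    return max(smith_waterman_score(a, b),
               smith_waterman_score(a, reverse_complement(b)))


def similarity_edges(repeats, threshold=7):
    """Map (name1, name2) to score for all pairs scoring above threshold."""
    edges = {}
    for n1, n2 in combinations(sorted(repeats), 2):
        score = best_orientation_score(repeats[n1], repeats[n2])
        if score > threshold:
            edges[(n1, n2)] = score
    return edges


# Invented toy repeats: r2 is the reverse complement of r1.
toy_repeats = {
    "r1": "ACGTTTGACC",
    "r2": "GGTCAAACGT",
    "r3": "TTTTTTTTTT",
}
edges = similarity_edges(toy_repeats)
```

The resulting dictionary of scored pairs is the kind of weighted edge list that a graph clustering program takes as input.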
Determining orientation of repeats
The PILER-CR program provides an arbitrary orientation for
the repeats. To determine the correct orientation, we com-
pared each repeat to the ones found experimentally to be
transcribed into RNA [17,21], assuming that the transcribed
direction is the 'correct' direction. The direction most similar
to the transcribed repeats (using Smith-Waterman similarity scores
[19]) was selected as the correct one. We also used the
GAAA(C/G) signature at the end of some repeats in cases
where the Smith-Waterman similarity scores were ambiguous. It is
possible, therefore, that some repeats may be presented in the
Determination of repeat secondary structures
Structural predictions were performed using the RNA Vienna
Package  downloaded from the Vienna Package server
[28,29]. Folding scores for all repeats or individual repeat
clusters were divided into bins of 2 score units and plotted as
percentages. Random sequence strings with the same length
distribution as repeats were generated from the analyzed
genomes. The average GC contents were calculated for
archaeal, bacterial and viral genomes in the IMG database,
version 1.50, and the average GC content was calculated for
all spacers in all genomes.
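The binning and the random background described in this section can be sketched in a few lines. The code below groups scores into bins of 2 units and reports percentages, and draws random genomic windows that reproduce a given list of repeat lengths. The function names, the toy scores and the toy genome string are assumptions for illustration; the folding scores themselves would come from the structure prediction software.

```python
import random
from collections import Counter


def score_histogram(scores, bin_width=2.0):
    """Map the lower edge of each bin to the percentage of scores in it."""
    if not scores:
        return {}
    counts = Counter(int(s // bin_width) * bin_width for s in scores)
    total = len(scores)
    return {edge: 100.0 * n / total for edge, n in sorted(counts.items())}


def sample_random_windows(genome, lengths, rng):
    """Draw one random substring of the genome per requested length,
    so the sample has the same length distribution as the repeats."""
    windows = []
    for n in lengths:
        start = rng.randrange(0, len(genome) - n + 1)
        windows.append(genome[start:start + n])
    return windows


# Invented negated folding scores (larger means more stable).
toy_scores = [0.0, 1.5, 2.0, 3.9, 10.2, 11.0, 11.9, 20.0]
histogram = score_histogram(toy_scores)

# Invented toy genome; a fixed seed keeps the sample reproducible.
toy_genome = "ACGT" * 10
windows = sample_random_windows(toy_genome, [5, 7, 5], random.Random(0))
```

Matching the length distribution matters because longer sequences can form more base pairs and therefore tend to receive more extreme folding scores.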
CAS gene identification
The HMMs for CAS genes described in [5] were obtained
from the TIGRFAM database, version 6.0 [30]. To identify
CAS genes, all coding sequences within 20 kb of the identified
CRISPR arrays were searched with the CAS HMMs using
hmmpfam with thresholds of an e-value <0.001 and
a positive score.
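The filtering logic of this step (a 20 kb flank on each side of the array, an e-value below 0.001 and a positive score) can be written as a small predicate over HMM hits. The sketch below assumes the hits have already been parsed into records; the `Hit` record, its field names, the rule that partial overlap with the window counts, and the example coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One HMM match to a coding sequence (hypothetical record layout)."""
    family: str
    gene_start: int
    gene_end: int
    evalue: float
    score: float


def near_array(hit, array_start, array_end, flank=20_000):
    """True if the gene overlaps the array extended by the flank on both sides."""
    return (hit.gene_end >= array_start - flank
            and hit.gene_start <= array_end + flank)


def cas_hits_near_array(hits, array_start, array_end,
                        flank=20_000, max_evalue=1e-3):
    """Keep hits within the window that pass both thresholds."""
    return [h for h in hits
            if near_array(h, array_start, array_end, flank)
            and h.evalue < max_evalue
            and h.score > 0]


# Invented example: an array at 50,000-52,000 and five candidate hits.
toy_hits = [
    Hit("cas1", 31_000, 32_000, 1e-10, 50.0),  # kept
    Hit("cas3", 10_000, 11_000, 1e-20, 80.0),  # too far from the array
    Hit("cas4", 60_000, 61_000, 0.01, 5.0),    # e-value too high
    Hit("cas6", 71_000, 73_000, 1e-5, -3.0),   # score not positive
    Hit("cas2", 71_500, 73_000, 1e-5, 12.0),   # kept (overlaps window edge)
]
kept = cas_hits_near_array(toy_hits, 50_000, 52_000)
kept_families = [h.family for h in kept]
```

Whether a gene that only partly falls inside the window should count is a design choice; requiring full containment would be a one-line change to `near_array`.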
Additional data files
The following additional data are available with the online
version of this paper. Additional data file 1 contains several
files showing alignments of clusters 1-12, the arrangement of
the CAS cassette in the Thermococcus kodakaraensis
genome, and CAS genes in the neighborhood of CRISPR
arrays as predicted by TIGRFAM, as well as an index of
organisms used in the study, a sequence fasta file containing
all repeats, and a description of automatic assignment of
repeats to clusters with MCL. Some files may be mac-formatted.
Additional data file 1 contains the following files.
Readme.txt contains a description of the files in the archive.
Alignments is a directory containing manually curated fasta alignments of clusters 1-12.
FigureS1.png contains a figure showing the arrangement of the CAS cassette in the Thermococcus kodakaraensis genome. Chromosomal coordinates are given at the top of the figure. A CRISPR array is shown to the left of the figure as red vertical lines (1 line = 5 repeats). Core CAS genes are shown in black, Apern subtype genes are shown in blue and Tneap subtype genes in red as predicted by TIGRFAM analysis (see Materials and methods).
TableS1.xls is an Excel-formatted table containing CAS genes in the neighborhood of CRISPR arrays, as predicted by TIGRFAM (see Materials and methods). Core and type-specific genes are indicated, and each genome is given both with its full name and an IMG accession code. IMG gene OIDs are given for each protein.
Organisms.index is a table containing an index of organisms used in the study.
Repeats.fasta is a sequence fasta file containing all repeats.
Repeats.mcl describes automatic assignment of repeats to clusters with MCL. Each line contains a cluster number followed by space-separated member repeats.
We thank two anonymous reviewers for their detailed and informative
feedback on this manuscript. This work was performed under the auspices
of the US Department of Energy's Office of Science, Biological and Environ-
mental Research Program, and by the University of California, Lawrence
Livermore National Laboratory under Contract No. W-7405-Eng-48, Law-
rence Berkeley National Laboratory under contract No. DE-AC02-
05CH11231 and Los Alamos National Laboratory under contract No. DE-
1. Mojica FJ, Ferrer C, Juez G, Rodriguez-Valera F: Long stretches of short tandem repeats are present in the largest replicons of the Archaea Haloferax mediterranei and Haloferax volcanii and could be involved in replicon partitioning. Mol Microbiol
2. Jansen R, Embden JD, Gaastra W, Schouls LM: Identification of genes that are associated with DNA repeats in prokaryotes. Mol Microbiol 2002, 43:1565-1575.
3. Pourcel C, Salvignol G, Vergnaud G: CRISPR elements in Yersinia pestis acquire new repeats by preferential uptake of bacteriophage DNA, and provide additional tools for evolutionary studies. Microbiology 2005, 151:653-663.
4. Bolotin A, Quinquis B, Renault P, Sorokin A, Ehrlich SD, Kulakauskas S, Lapidus A, Goltsman E, Mazur M, Pusch GD, et al.: Complete sequence and comparative genome analysis of the dairy bacterium Streptococcus thermophilus. Nat Biotechnol 2004,
5. Haft DH, Selengut J, Mongodin EF, Nelson KE: A guild of 45 CRISPR-associated (Cas) protein families and multiple CRISPR/Cas subtypes exist in prokaryotic genomes. PLoS Comput Biol 2005, 1:e60.
6. Makarova KS, Grishin NV, Shabalina SA, Wolf YI, Koonin EV: A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol Direct 2006, 1:7.
7. Godde JS, Bickerton A: The repetitive DNA elements called CRISPRs and their associated genes: evidence of horizontal transfer among prokaryotes. J Mol Evol 2006, 62:718-729.
8. Sebaihia M, Wren BW, Mullany P, Fairweather NF, Minton N, Stabler R, Thomson NR, Roberts AP, Cerdeno-Tarraga AM, Wang H, et al.: The multidrug-resistant human pathogen Clostridium difficile has a highly mobile, mosaic genome. Nat Genet 2006,
9. Greve B, Jensen S, Brugger K, Zillig W, Garrett RA: Genomic comparison of archaeal conjugative plasmids from Sulfolobus. Archaea 2004, 1:231-239.
10. Makarova KS, Aravind L, Grishin NV, Rogozin IB, Koonin EV: A DNA repair system specific for thermophilic Archaea and bacteria predicted by genomic context analysis. Nucleic Acids Res 2002,
11. DeBoy RT, Mongodin EF, Emerson JB, Nelson KE: Chromosome evolution in the Thermotogales: large-scale inversions and strain diversification of CRISPR sequences. J Bacteriol 2006,
12. Mojica FJ, Diez-Villasenor C, Garcia-Martinez J, Soria E: Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J Mol Evol 2005, 60:174-182.
13. Barrangou R, Fremaux C, Deveau H, Richards M, Boyaval P, Moineau S, Romero DA, Horvath P: CRISPR provides acquired resistance against viruses in prokaryotes. Science 2007, 315:1709-1712.
14. Edgar RC: PILER-CR: Fast and accurate identification of CRISPR repeats. BMC Bioinformatics 2007, 8:18.
15. Markowitz VM, Ivanova N, Palaniappan K, Szeto E, Korzeniewski F, Lykidis A, Anderson I, Mavrommatis K, Kunin V, Garcia Martin H, et al.: An experimental metagenome data management and analysis system. Bioinformatics 2006, 22:e359-367.
16. Mojica FJ, Diez-Villasenor C, Soria E, Juez G: Biological significance of a family of regularly spaced repeats in the genomes of Archaea, Bacteria and mitochondria. Mol Microbiol 2000,
17. Tang TH, Bachellerie JP, Rozhdestvensky T, Bortolin ML, Huber H, Drungowski M, Elge T, Brosius J, Huttenhofer A: Identification of 86 candidates for small non-messenger RNAs from the archaeon Archaeoglobus fulgidus. Proc Natl Acad Sci USA 2002,
18. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer S, Tacker M, Schuster P: Fast folding and comparison of RNA secondary structures. Monatshefte f Chemie 1994, 125:167-188.
19. Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147:195-197.
20. Van Dongen S: Graph clustering by flow simulation. In PhD thesis University of Utrecht; 2000.
21. Tang TH, Polacek N, Zywicki M, Huber H, Brugger K, Garrett R, Bachellerie JP, Huttenhofer A: Identification of novel non-coding RNAs as potential antisense regulators in the archaeon Sulfolobus solfataricus. Mol Microbiol 2005, 55:469-481.
22. Cusack S: RNA-protein complexes. Curr Opin Struct Biol 1999,
23. Peng X, Brugger K, Shen B, Chen L, She Q, Garrett RA: Genus-specific protein binding to the large clusters of DNA repeats (short regularly spaced repeats) present in Sulfolobus genomes. J Bacteriol 2003, 185:2410-2417.
24. Edgar RC: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004,
25. Crooks GE, Hon G, Chandonia JM, Brenner SE: WebLogo: a sequence logo generator. Genome Res 2004, 14:1188-1190.
26. Goldovsky L, Cases I, Enright AJ, Ouzounis CA: BioLayout(Java): versatile network visualisation of structural and functional relationships. Appl Bioinformatics 2005, 4:71-74.
27. Mathews DH, Sabina J, Zuker M, Turner DH: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol 1999,
28. RNA Vienna Package [http://rna.tbi.univie.ac.at/cgi-bin/RNA
29. Hofacker IL: Vienna RNA secondary structure server. Nucleic Acids Res 2003, 31:3429-3431.
30. TIGRFAMs Home Page [http://www.tigr.org/TIGRFAMs/]