DECIPHER, a Search-Based Approach to Chimera Identification for
16S rRNA Sequences
Erik S. Wright, L. Safak Yilmaz, and Daniel R. Noguera
Department of Civil and Environmental Engineering, University of Wisconsin—Madison, Madison, Wisconsin, USA
DECIPHER is a new method for finding 16S rRNA chimeric sequences by the use of a search-based approach. The method is
based upon detecting short fragments that are uncommon in the phylogenetic group where a query sequence is classified but
frequently found in another phylogenetic group. The algorithm was calibrated for full sequences (fs_DECIPHER) and short se-
quences (ss_DECIPHER) and benchmarked against WigeoN (Pintail), ChimeraSlayer, and Uchime using artificially generated
chimeras. Overall, ss_DECIPHER and Uchime provided the highest chimera detection for sequences 100 to 600 nucleotides long
(79% and 81%, respectively), but Uchime’s performance deteriorated for longer sequences, while ss_DECIPHER maintained a
high detection rate (89%). Both methods had low false-positive rates (1.3% and 1.6%). The more conservative fs_DECIPHER,
benchmarked only for sequences longer than 600 nucleotides, had an overall detection rate lower than that of ss_DECIPHER
(75%) but higher than those of the other programs. In addition, fs_DECIPHER had the lowest false-positive rate among all the
benchmarked programs (<0.20%). DECIPHER was outperformed only by ChimeraSlayer and Uchime when chimeras were
formed from closely related parents (less than 10% divergence). Given the differences in the programs, it was possible to detect
over 89% of all chimeras with just the combination of ss_DECIPHER and Uchime. Using fs_DECIPHER, we detected between
1% and 2% additional chimeras in the RDP, SILVA, and Greengenes databases from which chimeras had already been removed
with Pintail or Bellerophon. DECIPHER was implemented in the R programming language and is directly accessible through a
webpage or by downloading the program as an R package (http://DECIPHER.cee.wisc.edu).
The small subunit (SSU) rRNA molecule has been used exten-
sively as a phylogenetic marker since the late 1980s (18), and
nowadays, 16S rRNA sequences are essential for microbial identi-
fication (2, 19). As the number of publicly available SSU rRNA
sequences has increased, several repositories that curate and align
the sequences have emerged. These databases also provide useful
tools for data analysis and interpretation. For instance, the Ribo-
somal Database Project (RDP) (4) is a major repository of bacte-
rial and archaeal 16S rRNA sequences and offers tools for brows-
ing, classification, probe checking, and sequence matching,
among others. The SILVA rRNA database (12) contains small sub-
unit (SSU) and large subunit (LSU) rRNA sequences of bacteria,
archaea, and eukarya, while Greengenes (6) offers a variety of
sequence analysis tools.
One of the challenges in maintaining the ever-expanding
rRNA databases is in the implementation of strategies to ensure
that they are populated only with good-quality sequences. The
most common type of sequence anomaly is the chimera, which is
composed of two or more distinct sequences concatenated into a
single one. The presence of chimeras in a database artificially in-
creases measurements of diversity (10, 13, 16), and when chimeras
are composed of sequences from different lineages, they could be
misinterpreted as representing novel lines of descent. In addition,
the presence of chimeras in database repositories can lead to erro-
neous interpretations of specificity when using the database for
the purpose of primer or probe design.
In 2005, Ashelford et al. (3) created Pintail, a program for de-
tection of sequence abnormalities (mainly chimeras) and used it
to estimate that about 5% of 16S rRNA sequences held in public
repositories had substantial anomalies. Subsequently, RDP and
SILVA implemented quality control filters based on Pintail to pre-
vent populating their databases with unchecked anomalous se-
quences. Greengenes uses a chimera check based on Bellerophon,
another algorithm created for this purpose (9). More recently,
ChimeraSlayer (8) was introduced as a program with improved
chimera detection, especially for chimeras created from two
closely related parent sequences, and Uchime (7) has been de-
scribed as having a higher detection rate than ChimeraSlayer.
Eliminating chimeric sequences is now understood by the re-
search community as an important step to take before submitting
sequences to public databases or before assembling sequences
from short fragments produced by next-generation sequencing
approaches (8, 10, 13). It is now common practice to use one or
more programs to remove chimeras before sequence submission.
In addition, sequence repositories such as RDP (5), SILVA (12),
and Greengenes (6) provide a second line of defense by flagging
possible chimeras after submission.
Nevertheless, when routinely using RDP’s “good quality” da-
tabase for probe design, we still encounter chimeric sequences that
affect data interpretation. Therefore, we developed a novel ap-
proach for chimera detection in the 16S databases. Chimeric re-
gions within a query sequence are identified by detecting short
sequence fragments that are uncommon within a reference phy-
logenetic group where the sequence is classified but much more
common in another phylogenetic group.
Using this approach, we have confirmed that a number of chi-
meric sequences have evaded existing sequence anomaly detection
methods and are presently populating the 16S databases. Here we
describe the method and the results of chimera detection in the
main 16S rRNA repositories and benchmark the new method
against Pintail, ChimeraSlayer, and Uchime. In addition, to help
in the detection of chimeras before sequence submission, we in-
troduce DECIPHER (http://DECIPHER.cee.wisc.edu), a publicly
available web-based tool specific for detection of chimeric 16S
rRNA sequences by the use of the novel search-based approach.
For standalone implementations, the DECIPHER R package,
source code, and associated documentation are available for
download under the terms of the GNU General Public License.
Received 11 August 2011 Accepted 13 November 2011
Published ahead of print 18 November 2011
Address correspondence to Daniel R. Noguera, noguera@engr.wisc.edu.
Supplemental material for this article may be found at http://aem.asm.org/.
Copyright © 2012, American Society for Microbiology. All Rights Reserved.
doi:10.1128/AEM.06516-11
0099-2240/12/$12.00 Applied and Environmental Microbiology p. 717–725 aem.asm.org
MATERIALS AND METHODS
16S reference phylogenetic groups. The data set of “good quality” un-
aligned sequences available from RDP was used for creating a higher-
quality reference data set free of detectable chimeras. The downloaded set
of sequences (RDP release 10, update 22) contained 1,251,070 bacterial
and 62,055 archaeal sequences. A total of 280 reference phylogenetic
groups (see Table S1 in the supplemental material) were created from this
sequence set by combining sequences from similar hierarchical levels. The
goal of this step was to establish reference groups with a sufficiently high
number of related sequences (i.e., greater than 500) so that the search for
common and uncommon sequence fragments was meaningful as a chi-
mera detection approach.
Reference groups were limited to a maximum of 10,000 sequences to
facilitate computational optimization of the search algorithm, except for
genera or unclassified groups that already had more than 10,000 se-
quences (e.g., Staphylococcus genus or the “unclassified_Bacteria” group;
see Table S1 in the supplemental material). Some of the reference groups
had to be defined as having fewer than 500 sequences, because their combi-
nation with other groups at the same hierarchical level was not logical. For
instance, the phyla Korarchaeota and Nanoarchaeota within the domain
Archaea are both represented in the database by single genera with fewer
than 500 sequences each. A potential combination of these phyla with
other groups at the same hierarchical level would require inclusion of the
Crenarchaeota and Euryarchaeota phyla. Such a diverse group would pre-
vent detection of chimeras within the entire Archaea domain by the use of
the search-based algorithm. Thus, Korarchaeota and Nanoarchaeota ap-
pear as individual groups in the reference group set. The resulting 280
reference phylogenetic groups are presented in Table S1 in the supple-
mental material, along with a list of the genera or unclassified groups
contained in each reference group.
Chimera detection method. The evaluation of whether a query se-
quence is a chimera takes the following steps.
First, the query sequence is classified with 51% confidence using the
RDP Classifier software (17) and then assigned to one of the 280 reference
phylogenetic groups (see Table S1 in the supplemental material) based
upon this classification.
Then, a set of 30-nucleotide-long overlapping fragments is formed
from the sequence, beginning every fifth nucleotide and continuing for
the length of the sequence (Fig. 1). For instance, a sequence of 1,400
nucleotides would result in 275 fragments 30 nucleotides long that begin
at nucleotide positions 1, 6, 11, etc. Fragments containing wild-card char-
acters (i.e., N) are excluded.
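The fragmentation step can be illustrated with a minimal Python sketch. This is not the DECIPHER R code: the function name make_fragments and its parameters are hypothetical, and only the 30-nucleotide length, the 5-nucleotide step, and the exclusion of fragments containing N are taken from the text.

```python
def make_fragments(seq, length=30, step=5):
    """Return (start, fragment) pairs for overlapping fragments.

    Starts are 1-based (1, 6, 11, ...) to match the description in the text.
    Fragments containing the wild-card character N are skipped.
    """
    seq = seq.upper()
    fragments = []
    for i in range(0, len(seq) - length + 1, step):
        frag = seq[i:i + length]
        if "N" not in frag:
            fragments.append((i + 1, frag))
    return fragments


example = "ACGT" * 350  # a placeholder sequence of 1,400 nucleotides
frags = make_fragments(example)
print(len(frags))  # 275, as in the example given in the text
```

The last fragment begins at position 1,371, which is why a 1,400-nucleotide sequence yields 275 fragments.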
The presence or absence of these fragments within other sequences in
the classified reference group (i.e., in-group search) and outside the clas-
sified group (i.e., out-of-group search) forms the basis for detection of
chimeras in this algorithm. That is, if the query sequence is a chimera, then
some fragments are likely to have very few matches within their own
reference phylogenetic group but a large number of matches to another
reference group.
Thus, a search for each fragment is conducted in the set of all se-
quences within the classified reference group. The number of sequences
containing the 30-mer fragment, allowing a maximum of one mismatch,
is recorded as the number of hits in-group. Fragments with 5 or more hits
in-group are excluded from further analysis, as such large numbers of hits
in-group are not indicative of a potential chimeric fragment. A second
search through the remaining fragments is done while allowing two mis-
matches, which is a more lenient evaluation. Those fragments with more
than 9 lenient hits in-group are also excluded, leaving a set of 30-mer
fragments that are rare within their own phylogenetic reference group.
This is the initial set of fragments suspected to correspond to a chimeric
region (Fig. 1).
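The two in-group filters can be sketched in Python as follows. This is an illustrative brute-force version under stated assumptions: the function names are hypothetical, mismatches are counted as simple substitutions, and the exhaustive window scan stands in for the dictionary-matching search used in the actual implementation. Only the thresholds (5 or more strict hits, more than 9 lenient hits) come from the text.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def count_hits(fragment, sequences, max_mismatch):
    """Count sequences containing the fragment with at most max_mismatch mismatches."""
    k = len(fragment)
    hits = 0
    for s in sequences:
        if any(hamming(fragment, s[i:i + k]) <= max_mismatch
               for i in range(len(s) - k + 1)):
            hits += 1
    return hits


def rare_in_group(fragment, group_seqs, strict_limit=5, lenient_limit=9):
    """Apply the strict and lenient in-group filters.

    Returns the strict (one-mismatch) hit count if the fragment is rare
    within its own reference group, otherwise None.
    """
    strict = count_hits(fragment, group_seqs, max_mismatch=1)
    if strict >= strict_limit:      # 5 or more hits: not indicative of a chimera
        return None
    lenient = count_hits(fragment, group_seqs, max_mismatch=2)
    if lenient > lenient_limit:     # more than 9 lenient hits: also excluded
        return None
    return strict
```

Returning the strict hit count keeps the value that the out-of-group comparison needs later.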
To determine whether these uncommon fragments within the classi-
fied group are more common in another phylogenetically different group,
a search for the presence of these fragments (with a maximum of one
mismatch allowed) is then conducted in each of the other phylogenetic
groups in the reference set. If the number of hits found in another group
exceeds 20 times the number of hits detected in-group for a fragment,
then it is considered suspect. If a fragment has no hits in-group (i.e., a
similar fragment was not found in the reference group) and it has more
than 20 hits out-of-group, then it is also considered suspect.
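The out-of-group decision rule can be written as a short Python sketch. The function name is_suspect and the dictionary input are hypothetical conveniences; the factor of 20 and the threshold of more than 20 hits when there are no in-group hits are taken from the text.

```python
def is_suspect(in_group_hits, out_group_hits, ratio=20, min_out_hits=20):
    """Decide whether a fragment that is rare in-group is suspect.

    out_group_hits maps the name of each other reference group to its hit
    count (one mismatch allowed). The fragment is suspect if any single
    group has more than `ratio` times the in-group hits or, when there are
    no in-group hits, more than `min_out_hits` hits.
    """
    best = max(out_group_hits.values(), default=0)
    if in_group_hits == 0:
        return best > min_out_hits
    return best > ratio * in_group_hits


# Hypothetical counts: 2 in-group hits, 41 hits in one other group.
print(is_suspect(2, {"group_A": 41, "group_B": 3}))
```

Comparing against the best single group, rather than the sum over all groups, follows the wording "in another group".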
When the RDP classifier tool places the query sequence in an unclas-
sified group, other groups sharing the same line of descent are skipped in
the out-of-group search, because unclassified groups may not be phylo-
genetically coherent. For instance, if a query sequence is classified as un-
classified_Actinobacteria, then the search for out-of-group hits excludes
all reference groups within Actinobacteria but does not exclude classified
or unclassified reference groups in other lineages. Likewise, if the refer-
ence group where a query sequence is classified has a reference group of
unclassified organisms at the same hierarchical level, then this group is
also skipped in the out-of-group search. An important consequence of the
former restriction is that sequences assigned to the unclassified_Bacteria
or unclassified_Archaea group cannot be evaluated by DECIPHER. Using
RDP’s classifier tool with a 51% confidence level ensures that this is not a
common problem. In the data sets analyzed in this study, less than 5% of
sequences fell into this category.
The set of suspect fragments resulting from the steps described above
must meet several more criteria in order for a query sequence to be iden-
tified as a chimera (Fig. 1). Two different sets of criteria were defined,
depending on whether DECIPHER is used to evaluate assembled,
nearly complete 16S sequences or short sequences. For full sequences
(fs_DECIPHER), there must be six or more suspect fragments belonging
to a sequence. In addition, the identified chimeric region, defined as the
distance between the start of the first fragment and end of the last frag-
ment, must be at least 70 nucleotides long, and the chimeric region must
have at least one nucleotide overlapping the first or last 200 nucleotides of
the sequence. Finally, the combined ranges of all suspect fragments be-
longing to another reference group must cover more than 60% of the total
chimeric region.
FIG 1 Steps used by DECIPHER to determine whether a sequence is a chi-
mera. The differently shaded lines represent the different pieces of a chimeric
sequence, with the darker color representing the chimeric region. Sequence
fragments exhibiting low in-group frequency and high out-of-group fre-
quency are marked as suspect fragments. These suspect fragments must meet
the additional rules shown in the figure in order for the sequence to be deemed
a chimera.
For short sequences (ss_DECIPHER), the criteria were relaxed such
that a chimeric region of at least 40 nucleotides that included at least two
suspect fragments was required. The rules of overlapping the first or last
200 nucleotides and the 60% coverage are also kept in ss_DECIPHER, but
these rules become less important as the sequence length decreases. Al-
though both DECIPHER options can handle sequences of any length,
fs_DECIPHER is a conservative option with a very low rate of false posi-
tives and moderately lower chimera detection capabilities, while ss_
DECIPHER detects more chimeras, albeit with a higher level of false-
positive detections. Based on the rates of false-positive and false-negative
detections (see Fig. S1 in the supplemental material), ss_DECIPHER is
recommended for sequences of any length, while fs_DECIPHER is rec-
ommended only for sequences longer than 600 nucleotides.
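The two sets of criteria can be compared side by side in a small Python sketch. It assumes, for simplicity, that all suspect fragments passed in point to the same out-of-group reference group and that positions are 1-based; the function name meets_criteria is hypothetical. The thresholds (six fragments and 70 nucleotides for fs_DECIPHER, two fragments and 40 nucleotides for ss_DECIPHER, the 200-nucleotide end rule, and the 60% coverage rule) are those stated in the text.

```python
def meets_criteria(starts, seq_len, mode="fs", frag_len=30):
    """Check the additional rules on a list of 1-based suspect fragment starts."""
    min_frags, min_region = (6, 70) if mode == "fs" else (2, 40)
    if len(starts) < min_frags:
        return False
    # Chimeric region: start of the first fragment to end of the last fragment.
    first = min(starts)
    last = max(starts) + frag_len - 1
    region_len = last - first + 1
    if region_len < min_region:
        return False
    # The region must overlap the first or last 200 nucleotides of the sequence.
    if first > 200 and last < seq_len - 199:
        return False
    # Combined fragment ranges must cover more than 60% of the region.
    covered = set()
    for s in starts:
        covered.update(range(s, s + frag_len))
    return len(covered) / region_len > 0.60


starts = [1, 6, 11, 16, 21, 26]  # six suspect fragments spanning positions 1 to 55
print(meets_criteria(starts, 1400, mode="fs"))  # False: region shorter than 70
print(meets_criteria(starts, 1400, mode="ss"))  # True
```

The example shows the same set of suspect fragments being rejected under the conservative rules and accepted under the relaxed ones.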
The specific values of the different parameters used by the two
versions of DECIPHER were determined by comparing results ob-
tained with hundreds of manually checked sequences originating from
the RDP database, as well as sets of artificial chimeras. To manually
check each query sequence, we used the Probe Match tool of RDP to find
sequences that matched 40- to 60-nucleotide-long fragments taken
from different locations within the query sequence. If the sequences
that matched the different fragments belonged to different genera,
then we aligned the query sequence with the matched sequences by the
use of BioEdit (http://www.mbio.ncsu.edu/BioEdit/bioedit.html) and
visually inspected the alignments to identify potential breakpoints
within the query sequence. If breakpoints were evident by visual in-
spection, then the sequence was determined to be a chimera. The cal-
ibration parameters were primarily aimed at providing a very low rate
of false-positive detections with a high efficiency of chimera detection.
For instance, the length of the overlapping fragments was set to 30
nucleotides to maximize detection of very short chimeric regions
within a sequence, and the selections of the thresholds for in-group
and out-of-group hits were calibrated to minimize false-positive iden-
tifications. Other parameters, such as the length of the identifying
fragment and its distance to the end of the sequence as they affect the
rates of false-positive and false-negative identifications, are described
in more detail in supplementary documentation using a set of artifi-
cially generated chimeras (see Fig. S2 and S3 in the supplemental ma-
terial).
Implementation. The method described above was implemented in
the R programming language (14). The slow search speed was one of the
main challenges in applying this method, since each suspect fragment is
compared to a reference database that contains more than one million
sequences. To speed up the search, we made use of the Aho-Corasick
dictionary-matching algorithm (1) implemented as part of Biostrings
(11). In this algorithm, a trusted band of known nucleotides is defined for
each fragment to be searched. We defined the trusted band as the first five
nucleotides of each fragment when performing the in-group search that
allowed 1 mismatch, and then the trusted band was shifted by five nucle-
otides for the search allowing 2 mismatches. For the out-of-group search,
the trusted band was returned to the first 5 nucleotides of each fragment.
Additionally, the trusted band cannot contain ambiguity characters (e.g.,
S for C or G), so fragments with a trusted band containing these characters
were excluded from the search. Ambiguity characters outside the trusted
band were determined to be perfect matches or mismatches according to
conventional IUPAC notation.
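The handling of ambiguity characters can be illustrated with a short Python sketch. It is an interpretation, not the Biostrings implementation: it assumes that reference sequences contain only A, C, G, and T, and that an ambiguity code in the fragment counts as a match when the reference base is one of the bases the code represents. The function names are hypothetical; the 5-nucleotide trusted band and its shift by five nucleotides come from the text.

```python
# Standard IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(fragment, target):
    """Count positions where the fragment's code does not include the target base."""
    return sum(t not in IUPAC[f] for f, t in zip(fragment, target))


def trusted_band_ok(fragment, start=0, width=5):
    """True if the trusted band contains only unambiguous nucleotides."""
    return all(c in "ACGT" for c in fragment[start:start + width])


print(count_mismatches("ASGT", "ACGT"))  # 0: S stands for C or G
print(count_mismatches("ASGT", "AAGT"))  # 1: A is not covered by S
print(trusted_band_ok("SACGTACGTA"))            # False: band at positions 1-5 has S
print(trusted_band_ok("SACGTACGTA", start=5))   # True: shifted band is unambiguous
```

A fragment that fails the trusted-band check for one search may still pass once the band is shifted, as the last two lines show.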
RESULTS AND DISCUSSION
Preparation of a chimera-free reference sequence set. We used
the more conservative version of DECIPHER (fs_DECIPHER) to
determine how many chimeras could be found in the RDP data-
base of “good quality” sequences (i.e., the RDP database already
screened with Pintail) while minimizing the erroneous flagging of
nonchimeric sequences. The goal was to produce a higher-quality
data set that could be used as a reference for the detection of
chimeras. As a starting point, the downloaded sequences were
used as the reference database and every sequence in the database
was tested. A total of 12,470 sequences were determined to be
chimeras. These chimeras were then removed from the reference
database, and the search process was repeated. Since the search
results changed between runs, it was possible to continue finding
additional chimeras by successively updating the reference data-
base and searching again for potential chimeras not detected in
earlier runs. The number of chimeras found in each subsequent
run was much smaller than the number in the previous run, until
the process approached zero newly detected chimeras. Overall, 12
runs were made, yielding a total of 18,484 chimeras, correspond-
ing to 1.41% of the “good quality” RDP database. With a false-
positive detection rate of 0.15% (see below), we estimate that
about 2,000 of these chimeras could potentially represent false-
positive detections. The resulting database, free of chimeras de-
tectable with fs_DECIPHER, became the reference database for
the rest of the study.
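The repeated screening of the reference database can be sketched as a simple loop in Python. The detector is passed in as a hypothetical callback (detect), since the sketch is only meant to show the control flow; it also runs until a pass flags nothing, whereas the study stopped after 12 runs as the number of new detections approached zero.

```python
def iterative_clean(reference, detect):
    """Remove flagged sequences repeatedly until a run finds nothing new.

    reference: iterable of sequence identifiers
    detect(query_id, reference) -> True if the query is flagged as a chimera
        when evaluated against the current reference set
    Returns (cleaned reference set, number of sequences flagged in each run).
    """
    reference = set(reference)
    per_run = []
    while True:
        flagged = {q for q in reference if detect(q, reference)}
        per_run.append(len(flagged))
        if not flagged:
            return reference, per_run
        reference -= flagged


# Toy detector (purely illustrative): flags the largest identifier above 3.
cleaned, counts = iterative_clean(range(1, 6), lambda q, ref: q == max(ref) and q > 3)
print(sorted(cleaned), counts)  # [1, 2, 3] [1, 1, 0]
```

For the figures reported above, 18,484 flagged sequences out of the 1,313,125 downloaded (1,251,070 bacterial plus 62,055 archaeal) is 1.41%, and a false-positive rate of 0.15% applied to the same total gives about 1,970 sequences, consistent with the estimate of about 2,000.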
A list of all the sequences identified as chimeras can be found in
the supplementary documentation (see Table S2 in the supple-
mental material). Of all the newly identified chimeras from RDP,
59.4% were formed between two sequences in the same phylum,
40.4% corresponded to chimeras formed between two sequences
in different phyla, and 0.18% were formed between sequences
belonging to the domains Bacteria and Archaea. The low percent-
age of chimeras formed between sequences in different domains
reflects the relatively low number of studies that have used univer-
sal primer sets simultaneously targeting both domains. The higher
number of chimeras formed between sequences in the same phy-
lum compared to those formed from sequences of different phyla
is more difficult to explain because of the multitude of factors that
may influence these observations. For instance, sequencing exper-
iments targeting a specific group of organisms produce chimeras
only from sequences within the same phylum, thus contributing
to the greater percentage of in-phylum chimeras. It is also possible
that chimeras are more likely to be formed between closely related
parents (8), which would also contribute to a higher number of
in-phylum chimeras. However, it is also important to note that the anal-
ysis was done with the RDP database that had already been
screened by Pintail, so Pintail’s chimera detection also influ-
enced the percentages observed. Regardless, the high percentage
of chimeras identified as being formed from two different phyla
(i.e., distantly related parents) is an important indicator that these
types of chimeras are also likely to be formed and should be tar-
geted by chimera detection programs.
Comparing databases. In addition to evaluating the good-
quality data set downloaded from RDP, fs_DECIPHER was also
used to investigate the presence of chimeras in the SILVA and
Greengenes databases. SILVA had 1,096,710 bacterial and ar-
chaeal sequences with a Pintail score of 100 (perfect score) in their
database in October 2010. In these sequences, we detected 12,231
chimeras (1.1%). In the set of Greengenes sequences that Bellero-
phon had declared to be “not chimeric” (377,150 sequences), we
found 7,136 chimeras (1.9%). Thus, all three 16S repositories contain
relatively similar percentages of chimeras that were undetected
by Pintail (RDP and SILVA) or Bellerophon (Greengenes).
February 2012 Volume 78 Number 3 aem.asm.org
Evaluation of DECIPHER with artificially generated chime-
ras. We compared ss_DECIPHER and fs_DECIPHER to Pintail,
ChimeraSlayer, and Uchime. Bellerophon was excluded from
analysis due to a high rate of false positives detected in preliminary
tests (13%), which agrees with the observations of Haas et al. (8).
To evaluate Pintail, we used the WigeoN reimplementation (8),
and for Uchime, we used its implementation in mothur (15). For
ChimeraSlayer and Uchime runs, we used the gold data set de-
scribed by Haas et al. (8) as the reference set of sequences.
Simple two-parent and more complex three- and four-parent
chimeras were formed by combining segments from different par-
ent sequences joined at random breakpoints. In the generation of
all the artificial chimeras, each parent sequence was required to be
represented by a minimum of 30 nucleotides. In addition, the
length of the artificial sequences was randomly adjusted from 80
nucleotides to full length, with some data sets restricted to se-
quences shorter than either 600 or 300 nucleotides. Most chimeras
were formed from aligned sequences randomly chosen from the
7,451 type-strain bacterial sequences available from RDP (release
10, update 22), while one data set was formed from the 284 aligned
type-strain archaeal sequences found in RDP. Each set of artificial
chimeras contained 1,000 sequences.
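A two-parent artificial chimera of the kind described above could be generated along the following lines. This Python sketch is an assumption-laden illustration, not the procedure used in the study: the function name is hypothetical, the parents are assumed to be aligned strings of equal length with "-" as the gap character, and the random trimming of overall sequence length is omitted. Only the random breakpoint and the 30-nucleotide minimum per parent come from the text.

```python
import random


def make_two_parent_chimera(parent_a, parent_b, min_part=30, rng=random):
    """Join the start of parent_a to the end of parent_b at a random breakpoint.

    The breakpoint is an alignment column. It is redrawn until each parent
    contributes at least min_part nucleotides once gaps are removed.
    Returns (chimera, number of nucleotides contributed by parent_a).
    """
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must be aligned to the same length")
    for _ in range(1000):
        bp = rng.randrange(1, len(parent_a))
        left = parent_a[:bp].replace("-", "")
        right = parent_b[bp:].replace("-", "")
        if len(left) >= min_part and len(right) >= min_part:
            return left + right, len(left)
    raise ValueError("no valid breakpoint found")


# Placeholder parents; a seeded generator makes the example repeatable.
chimera, left_len = make_two_parent_chimera("A" * 100, "C" * 100,
                                            rng=random.Random(0))
print(len(chimera), left_len)
```

Cutting both parents at the same alignment column keeps the chimera in register with the alignment, which is the reason for starting from aligned sequences.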
When evaluated with the data set of simple two-parent chime-
ras, ss_DECIPHER and fs_DECIPHER detected 88% and 75% of
the chimeras, while Uchime, ChimeraSlayer, and WigeoN de-
tected 73%, 56%, and 47%, respectively. The 2,000 parent se-
quences in the data set were also evaluated to estimate the rate of
false-positive detections (assuming that all type-strain sequences
are not chimeric), with fs_DECIPHER having the lowest false-
positive rate (0.15%), followed by ChimeraSlayer (0.70%),
WigeoN (0.85%), Uchime (1.5%), and ss_DECIPHER (1.6%).
Since all the parent sequences were part of the reference data set, a
more conservative calculation of false positives for fs_DECIPHER
and ss_DECIPHER was performed after removing all type-strain
sequences from the reference data set. This had no effect on the
false positives seen with fs_DECIPHER but increased the ss_
DECIPHER false-positive rate to 2.1%.
The most significant parameter influencing the rate of detec-
tion was the chimeric range, defined in the simple artificial chime-
ras as the shorter of the two segments contributed by the parents
to the chimera. Figure 2 shows a comparison of chimera detection
results as a function of the chimeric range and identifies ss_
DECIPHER as the only method capable of a high rate of detection
of chimeras (85%) with chimeric ranges between 40 and 125 nu-
cleotides. Chimeras with smaller chimeric ranges, between 30 and
40 nucleotides long, were more difficult to detect, with ss_
DECIPHER detecting the most (36%), followed by Uchime (11%)
and ChimeraSlayer (8%), while fs_DECIPHER and WigeoN did
not detect any chimeras. Uchime and fs_DECIPHER followed ss_
DECIPHER in overall performance. They had poor detection of
chimeras with chimeric fragments shorter than 100 nucleotides
but a relatively high detection rate when the chimeric range was
between 100 and 250 nucleotides. ChimeraSlayer and WigeoN
were not suitable for detection of chimeras with chimeric ranges
below 200 nucleotides. For chimeric ranges of more than 400 nu-
cleotides, Uchime and WigeoN had the highest detection efficien-
cies (99% and 96%, respectively), followed by ss_DECIPHER, fs_
DECIPHER, and ChimeraSlayer, with 89%, 88%, and 87%,
respectively.
FIG 2 Chimera detection by ss_DECIPHER, fs_DECIPHER, ChimeraSlayer, WigeoN (Pintail), and Uchime as a function of chimeric range length. The artificial
chimera set (see the Two_Part-1 dataset in the supplemental materials) contained a total of 1,000 simple chimeras formed by combining two parent sequences.
The chimeras in this data set were binned according to the length of the chimeric range, and the average of the chimeric range in each bin is shown in this figure.
The high rate of detection of short chimeric ranges by ss_
DECIPHER is possible because the fundamental unit of detection
in DECIPHER is 30 nucleotides and because ss_DECIPHER re-
quires only an identifying chimeric range of at least 40 nucleotides
(Fig. 1). Importantly, this basic unit is independent of the overall
length of the sequence. The conservative fs_DECIPHER method
requires the presence of an identifying chimeric range of at least 70
nucleotides in length; hence its poorer performance with shorter
chimeric ranges. The low efficiency of chimera detection in
ChimeraSlayer when the chimeric range is short can be explained
by the fact that this algorithm uses 30% of the query sequence at
each end of the sequence as the basic fragments to perform the
searches in the reference database (8). Since these basic fragments
are of various sizes, ChimeraSlayer’s detection efficiency depends
on the overall length of the sequence. Thus, if a short chimeric
range (e.g., 40 nucleotides long) is present in a nearly full sequence
(e.g., 1,500 nucleotides long), ChimeraSlayer is likely to miss it,
since the majority of the 30% end (450 nucleotides) that contains
the very small chimeric range would correspond to the same par-
ent represented by the other end of the sequence. This problem is
not as significant for shorter sequences, in which the small chime-
ric range has a stronger influence in the evaluation of potential
parent sequences. For example, out of 33 artificial chimeras with
chimeric ranges between 41 and 50 nucleotides long, ChimeraSlayer
missed every chimera whose overall sequence was
more than 300 nucleotides long (i.e., 27 chimeras). On the other
hand, ChimeraSlayer detected 4 out of 6 chimeras with overall
sequence lengths between 164 and 281 nucleotides, for which 30%
of the sequence closely matched the size of the chimeric fragment.
Uchime seems to have the same problem as ChimeraSlayer, al-
though it is not as severe. For the same subset of sequences with
short chimeric ranges, Uchime failed to detect any of the chimeras in
sequences longer than 523 nucleotides (i.e., 18 chimeras) but de-
tected 10 out of the 15 chimeras in the shorter sequences. The
improvement of Uchime compared to ChimeraSlayer could have
resulted from the fact that Uchime uses the entire query sequence,
split into four nonoverlapping sections, for searches in the refer-
ence database (7) rather than using only the sequence ends as
ChimeraSlayer does. Nevertheless, the lengths of Uchime’s
searchable sections are also dependent on the length of the query
sequence, and therefore, Uchime has the same difficulty as
ChimeraSlayer in detecting the short chimeric ranges within long
sequences.
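The length dependence discussed above can be made concrete with a simplified Python calculation. It is a rough model, not a description of how any of the programs score a query: it assumes that ChimeraSlayer's search unit is 30% of the sequence, that Uchime's four nonoverlapping sections are of equal length, and that the chimeric range lies entirely within one search unit. The function names are hypothetical.

```python
def search_unit_lengths(seq_len):
    """Approximate length of the basic search unit used by each method."""
    return {
        "ChimeraSlayer": seq_len * 30 // 100,  # 30% of the query at each end
        "Uchime": seq_len // 4,                # four equal sections (assumed)
        "DECIPHER": 30,                        # fixed 30-nucleotide fragments
    }


def chimeric_share(chimeric_range, unit_len):
    """Fraction of one search unit occupied by the chimeric range (capped at 1)."""
    return min(1.0, chimeric_range / unit_len)


for seq_len in (200, 1500):
    for method, unit in search_unit_lengths(seq_len).items():
        share = chimeric_share(40, unit)
        print(f"{seq_len:>5} nt  {method:<13} unit={unit:>3}  share={share:.2f}")
```

Under these assumptions, a 40-nucleotide chimeric range fills about two-thirds of a ChimeraSlayer search unit in a 200-nucleotide sequence but less than a tenth of it in a 1,500-nucleotide sequence, while it always spans at least one full DECIPHER fragment.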
Other variables such as overall sequence length and the diver-
gence between the parents of the chimera were also analyzed. Fig-
ure 3 shows the comparison as a function of sequence length for
the same set of artificial chimeras used in Fig. 2 (Two_Part-1 data
set). Except for sequences between 100 and 300 nucleotides long,
ss_DECIPHER outperformed the other methods. Uchime was the
best method for sequences less than 300 nucleotides long (95%
detection), followed by ss_DECIPHER (80% detection). Interest-
ingly, Uchime’s performance deteriorated as the total sequence
length increased, reaching a detection rate as low as 51% for se-
quences longer than 1,400 nucleotides. In contrast, ss_DECIPHER
maintained a high level of detection for long sequences (82 to
93%). ChimeraSlayer and WigeoN had the lowest performance,
averaging 55% and 60% detection rates, respectively, for
sequences longer than 800 nucleotides (Fig. 3). For shorter
sequences, ChimeraSlayer maintained a similar level of detection,
except for sequences shorter than 200 nucleotides, whereas
WigeoN was unable to detect chimeras in sequences shorter than
400 nucleotides. The more conservative fs_DECIPHER option,
evaluated only for sequences longer than 600 nucleotides, also
outperformed ChimeraSlayer, WigeoN, and Uchime in the level
of detection of full-length chimeric sequences.
The general trend of these results was confirmed with an inde-
pendent set of 1,000 artificial chimeras (Two_Part-2 data set)
formed from two parent sequences (see Fig. S4 in the supplemen-
tal material). The overall detection rates in this second set were
88% for ss_DECIPHER, 75% for fs_DECIPHER, 71% for
Uchime, 55% for ChimeraSlayer, and 49% for WigeoN, in agree-
ment with the results from the first set of artificial chimeras (Fig.
3). The false positives in this second set, estimated from the eval-
uation of the 2,000 parent sequences, were similar to the data from
the first set, with 0.2% for fs_DECIPHER, 0.6% for Chimera-
Slayer, 1.1% for WigeoN, 1.4% for ss_DECIPHER, and 1.6% for
Uchime. The more conservative calculation using a reference data
set that did not contain any type-strain sequences had no effect on
false positives with fs_DECIPHER but increased ss_DECIPHER’s
rate to 2.2%.
Effect of parent divergence. A comparison of DECIPHER with
the other chimera detection methods was also made as a function
of parent divergence (Fig. 4; see also Fig. S5 in the supplemental
material), since this is a fundamental parameter used in the cre-
ation and benchmarking of ChimeraSlayer and Uchime (7, 8). By
definition, DECIPHER does not detect chimeras formed from a
combination of parents belonging to the same genus, whereas
ChimeraSlayer and Uchime were designed to detect this type of
chimera. In the benchmarking data set used (Two_Part-1), parent
FIG 3 Comparison of chimera detection by ss_DECIPHER, fs_DECIPHER,
ChimeraSlayer, WigeoN (Pintail), and Uchime as a function of sequence
length for simple chimeras formed by combining two parent sequences at a
random breakpoint. The artificial chimera set (Two_Part-1) contained a total
of 1,000 chimeras of random length, which were binned according to sequence
length for this figure. All sequences were analyzed with ss_DECIPHER, while
only sequences longer than 600 nucleotides were analyzed with fs_DECIPHER.
FIG 4 Comparison of chimera detection by ss_DECIPHER, ChimeraSlayer,
WigeoN (Pintail), and Uchime as a function of parent divergence for simple
chimeras formed by combining two parent sequences at a random breakpoint
(Two_Part-1 data set). The artificial chimera set contained a total of 1,000
chimeras of random length, binned according to the divergence between par-
ents. The comparison with fs_DECIPHER is shown in Fig. S5 in the supple-
mental material.
Search-Based Chimera Identification
February 2012 Volume 78 Number 3 aem.asm.org 721
divergence ranged from 7% to 41%. For parents diverging less
than 20%, Uchime outperformed the other methods, while ss_
DECIPHER had the best performance when parent divergence
was greater than 20% (Fig. 4). This higher detection with ss_
DECIPHER was again the result of DECIPHER’s superior ability
to detect chimeras with short chimeric ranges as discussed previ-
ously (Fig. 2). To further evaluate the effect of parent divergence
for the range of divergences described in the ChimeraSlayer
benchmark study (8), an additional artificial chimera data set was
created by restricting the parent divergence to less than 10%. In
this case (see Fig. S6 in the supplemental material), Uchime and
ChimeraSlayer outperformed DECIPHER as expected, although
also with low detection rates that reflected the limitations of these
programs seen with short chimeric fragments as described previ-
ously (Fig. 2). WigeoN also had low detection rates, which was in
agreement with observations described in the ChimeraSlayer
benchmark study (8).
Evaluation of chimeras with random mutations. In the
benchmarking of ChimeraSlayer, Haas et al. (8) prepared a set of
artificial chimeras created from parents with various degrees of
divergence. The effect of mutations (i.e., substitutions, insertions,
and deletions) was evaluated with sets that had 1% to 5% random
mutations introduced into a set of artificially generated chimeras.
Thus, we also used these data sets to evaluate the effect of random
mutations on DECIPHER’s ability to detect chimeras. For the
subset with parent divergence greater than or equal to 20% (Table
1), a decrease in detection was observed as the percentage of mu-
tations increased, with a small drop in detection for up to 2%
mutations and with the most significant drop resulting from 4%
and 5% insertion or deletion mutations. The variability was not as
high with lower parent divergence, but as expected, the overall
detection rate of chimeras formed from closely related parents was
low (data not shown).
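As a rough illustration of how mutated test sets of this kind can be produced, the following Python sketch applies substitutions, insertions, or deletions at a fixed per-site rate. The function name `mutate`, the uniform choice of sites and bases, and the fixed random seed are assumptions made for illustration; the exact procedure used by Haas et al. (8) may differ.

```python
import random

BASES = "ACGT"
KINDS = ("substitution", "insertion", "deletion")

def mutate(seq, rate, kind, rng):
    """Return a copy of seq with mutations applied independently at each site.

    rate is the per-site mutation probability (e.g., 0.01 to 0.05 for
    the 1% to 5% sets); kind selects the type of mutation.
    """
    if kind not in KINDS:
        raise ValueError(f"unknown mutation type: {kind}")
    out = []
    for base in seq:
        if rng.random() >= rate:
            out.append(base)                      # site left unchanged
        elif kind == "substitution":
            out.append(rng.choice([b for b in BASES if b != base]))
        elif kind == "insertion":
            out.append(base)
            out.append(rng.choice(BASES))         # extra base after this site
        # for a deletion, the base is simply not copied
    return "".join(out)

rng = random.Random(0)                            # fixed seed for repeatability
parent = "ACGT" * 250                             # hypothetical 1,000-nt sequence
substituted = mutate(parent, 0.05, "substitution", rng)
inserted = mutate(parent, 0.05, "insertion", rng)
deleted = mutate(parent, 0.05, "deletion", rng)
```

Substitutions preserve sequence length, whereas insertions and deletions shift every downstream position, which is one plausible reason that indels affect fragment matching more strongly than substitutions do.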
The effect of a large mutation rate on DECIPHER’s ability to
detect chimeras can be explained by the mutations resulting in
sequences that are not properly classified by the RDP classifier tool
and by mutations affecting the number of chimeric fragments
identified. These problems did not occur with lower rates of arti-
ficially simulated mutations. By definition, DECIPHER allows the
presence of mismatches in the identification of in-group and out-
of-group fragments, likely contributing to the adequate detection
of chimeras with low rates of mutations. In addition, by using
taxonomic groupings made from the RDP database, the natural
variability of sequences, due to mutations or sequencing errors, is
inherently included in the DECIPHER analysis. Nevertheless,
since the benchmark data set of Haas et al. uses mutations at ran-
dom locations, which may not represent natural sequence vari-
ability present in the RDP database, the larger drop in detection as
the mutation rate increased was anticipated.
Evaluation of complex chimeras. The overall rates of detec-
tion of three-parent chimeras (Fig. 5) were 94% with ss_
DECIPHER, 87% with Uchime, 85% with fs_DECIPHER, 76%
with WigeoN, and 65% with ChimeraSlayer. For four-parent chi-
meras (Fig. 5), 92% were detected with ss_DECIPHER, 90% with
Uchime, 89% with WigeoN, 85% with fs_DECIPHER, and 63%
with ChimeraSlayer. Clearly, ChimeraSlayer had the lowest per-
formance with complex chimeras, reflecting specific limitations in
the design of the algorithm, which uses only the ends of the query
sequence for comparisons to the reference (8). On the other hand,
because DECIPHER works with the principle of identifying 30-
mer fragments that do not belong to the group where the sequence
is classified, it is more effective at detecting the smaller chimeric
regions that would be present in the complex chimeras. Likewise,
because Uchime works with the entire sequence divided into four
segments, it is also efficient at detecting multiple-part chimeras (7).
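The fragment-based principle described above can be sketched in a few lines of Python. This is a simplified illustration and not the DECIPHER implementation: it uses exact matching (DECIPHER tolerates mismatches), a toy fragment length of 5 instead of 30, and a hypothetical rule that flags any fragment absent from the query's own group but present in another group. All function and group names are assumptions.

```python
from collections import defaultdict

def kmers(seq, k):
    """Yield (start, k-mer) for every window of length k in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]

def build_index(reference, k):
    """Map each k-mer to the set of groups whose sequences contain it.

    reference: dict of group name -> list of reference sequences.
    """
    index = defaultdict(set)
    for group, seqs in reference.items():
        for s in seqs:
            for _, km in kmers(s, k):
                index[km].add(group)
    return index

def suspicious_fragments(query, query_group, index, k):
    """Return (start, k-mer, groups) for k-mers that are missing from the
    query's own group but occur in at least one other group."""
    hits = []
    for i, km in kmers(query, k):
        groups = index.get(km, set())
        if groups and query_group not in groups:
            hits.append((i, km, sorted(groups)))
    return hits

# Toy reference with two made-up groups
reference = {
    "GenusA": ["AAAAACCCCCAAAAACCCCC"],
    "GenusB": ["GGGGGTTTTTGGGGGTTTTT"],
}
index = build_index(reference, k=5)
query = "AAAAACCCCC" + "GGGGGTTTTT"   # left half from GenusA, right half from GenusB
hits = suspicious_fragments(query, "GenusA", index, k=5)
```

Because each short window is tested on its own, a foreign region only slightly longer than the fragment length can still produce hits, which is consistent with the sensitivity to short chimeric regions discussed above.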
The random sets of complex chimeras used in the analysis described
above resulted in a majority of the sequences (>90%)
having lengths greater than 600 nucleotides. Thus, in order to
FIG 5 DECIPHER, Uchime, ChimeraSlayer, and WigeoN (Pintail) detection
of chimeras formed from (A) three-parent and (B) four-parent sequences. The
artificial chimera sets contained 1,000 chimeras of random length. All sequences
were analyzed with ss_DECIPHER, while only sequences longer than 600 nucleotides
were analyzed with fs_DECIPHER (926 three-parent and 983
four-parent chimeras).
TABLE 1 Chimera detection rate by DECIPHER as a function of
random mutations, using the benchmark data set of Haas et al. (8)
                            % detection at indicated evolution rate (%)^a
Type of evolution             1     2     3     4     5
Nucleotide substitution      76    74    70    67    61
Insertion or deletion        73    69    54    45    36
Insertion                    73    66    55    42    34
Deletion                     72    66    54    43    31
^a The detection rate for 0% evolution was 76%. Only chimeras with parent divergence
greater than or equal to 20% were used.
Wright et al.
722 aem.asm.org Applied and Environmental Microbiology
further evaluate the effect of sequence length, new sets were gen-
erated, but the length was restricted to either less than 600 (see Fig.
S7 in the supplemental material) or less than 300 (see Fig. S8 in the
supplemental material) nucleotides. These complex chimera sets
confirmed ss_DECIPHER and Uchime as having the best performance
and ChimeraSlayer’s and WigeoN’s inability to detect
multiple-part chimeras in short sequences. For the complex chimeras
(three and four parent) restricted to less than 600 nucleotides long,
ss_DECIPHER detected 89% to 91%, Uchime detected 85% to
89%, ChimeraSlayer detected 45% to 56%, and WigeoN detected
26% to 28%. The detection rate with the complex chimeras with
less than 300 nucleotides was 74% to 75% with ss_DECIPHER,
61% to 76% with Uchime, 15% to 29% with ChimeraSlayer, and
0% with WigeoN.
Evaluation of archaeal sequences. An additional evaluation
was performed using a two-parent chimera data set formed exclu-
sively with type-strain archaeal sequences. As shown in Fig. 6,
DECIPHER’s chimera detection rate was much higher than the
rates of the other methods, achieving overall detections of 71%
and 63% with ss_DECIPHER and fs_DECIPHER, respectively,
while detection rates with the other methods were between 32% and
48%. Such a significant difference can be traced to the effect of the
reference data set used for the other methods (i.e., the gold data set
described by Haas et al. [8] for ChimeraSlayer and also recom-
mended for Uchime [7]), which contains 5,181 total sequences
but only 33 archaeal sequences. When we changed to a reference
data set containing all 284 parent sequences used in the generation
of the artificial chimeras, Uchime’s detection rate increased to
76% (data not shown), illustrating that the results are dependent
on the reference data set used.
Since DECIPHER takes the comprehensive approach of using
the entire 16S rRNA database as the reference set of good se-
quences (after removing detectable chimeras with fs_DECIPHER
as discussed previously), detection is not dependent on the scope
of the reference data set. Furthermore, the database can be up-
dated as RDP grows in size or if changes are made to the sequences
marked as “suspect quality” in RDP. These updates are important
to ensure that novel lineages or recently defined genera, currently
represented in the database with a small number of sequences, are
better represented as the database is populated with additional
related sequences.
Comparative abilities of the different chimera-detection
methods. Table 2 presents a qualitative summary of the chimera-
finding abilities of the different programs tested. With the recent
benchmarks by Haas et al. (8), Edgar et al. (7), and this study,
as well as the recent release of ChimeraSlayer, Uchime, and
DECIPHER, it is evident that there is currently not a single pro-
gram capable of accurate detection of all possible types of chime-
ras. Furthermore, next-generation sequencing produces large
data sets of shorter sequences compared with traditional Sanger
sequencing, and therefore, the characteristics of data sets to be
screened for chimeras are also changing, making earlier chimera-
finding algorithms obsolete. For instance, Bellerophon has now
been shown to have a rate of false positives that is too high com-
pared to other methods, and both Bellerophon and Pintail have
been demonstrated to perform poorly with short sequences (8).
ChimeraSlayer was introduced as a method suitable for shorter
sequences and optimized to detect chimeras formed from closely
related parents, even parents within the same genus, but these
advantages were quickly overshadowed by Uchime (7). Further-
more, ChimeraSlayer’s development focused only on simple chi-
meras formed by the combination of two parent sequences, while
the developers of Uchime considered the possibility of more
complex chimeras, and as a result, Uchime also outperforms
ChimeraSlayer in this regard. Our benchmark of DECIPHER con-
firms the known limitations of earlier algorithms and reveals ad-
FIG 6 Comparison of chimera detection by ss_DECIPHER, fs_DECIPHER,
ChimeraSlayer, WigeoN (Pintail), and Uchime as a function of sequence
length for archaeal chimeras formed by combining two parent sequences at a
random breakpoint. The artificial chimera set contained a total of 1,000 chi-
meras of random length constructed from 284 archaeal type-strain sequences.
All sequences were analyzed with ss_DECIPHER, while only sequences longer than 600
nucleotides were analyzed with fs_DECIPHER.
TABLE 2 Qualitative summary of chimera-detection characteristics of the benchmarked programs
Characteristic; Chimera detection rate^a
fs_DECIPHER ss_DECIPHER Uchime ChimeraSlayer Pintail (WigeoN)
Detection in short sequences (100 < length < 400) +++ ++++ +
Detection in midrange sequences (400 < length < 800) ++++ +++ ++
Detection in long sequences (length > 800) +++ ++++ +++ ++ ++
Detection of short chimeric regions ++ ++++ ++ ++
Detection of complex chimeras ++++ ++++ ++++ + +++
Detection of chimeras from low-divergence parents ++++ +++
Independence from reference data set^b +++ +++ ++ ++ ++
Low false positives ++++ ++ ++ +++ ++
^a ++++, very high rate of detection; +++, high rate of detection; ++, low rate of detection; , very low rate of detection.
^b DECIPHER depends on the RDP taxonomy, while the other methods depend on the reference data set provided by the user.
ditional limitations (Table 2). WigeoN (Pintail) is unable to detect
chimeras in short sequences (<400 nucleotides) and is inefficient
with the midrange sequences between 400 and 800 nucleotides in
length. ChimeraSlayer does relatively well with midrange se-
quences but is inefficient with short sequences. ChimeraSlayer is
also inefficient at detecting complex, multiple-parent chimeras, in
agreement with the findings of Edgar et al. (7). DECIPHER’s sim-
ple approach for chimera detection makes it less efficient at de-
tecting chimeras from closely related parents and has limitations
when sequences belong to unclassified groups according to the
RDP classifier tool. However, the most salient observation in this
study is the inefficiency of all the earlier methods at detecting
chimeras when the chimeric range is very short (e.g., 30 to 100
nucleotides long), as shown in Fig. 2. This creates a significant
limitation in Uchime, particularly when the very short chimeric
ranges are in long sequences (Fig. 3). The short-sequence (ss)
option of DECIPHER provides the means to detect these chime-
ras, although with the caveat that DECIPHER does not detect
chimeras formed from sequences classified within the same genus.
Nevertheless, our analysis of the sequences present in RDP’s good-
quality database showed that 40.4% of the detected chimeras were
formed between sequences from different phyla, and therefore,
chimera detection should not be limited to finding chimeras
formed from closely related parents.
Interestingly, we find that the highest possible rate of chimera
detection in our benchmark data sets is achieved simply by com-
bining the chimera-detection advantages of ss_DECIPHER and
Uchime, reaching overall detection rates of 89% to 99%, except
for the data set of chimeras formed from closely related parents
(Fig. 7). The chimeras uniquely detected by ss_DECIPHER mostly
corresponded to those with short chimeric ranges, while the ones
uniquely detected by Uchime corresponded to those formed from
closely related parents and sequences that were indecipherable
because they were classified by RDP’s classifier tool (17) as unclas-
sified_Bacteria or unclassified_Archaea. Nevertheless, there was
surprisingly minimal sequence overlap between the false-positive
detections of ss_DECIPHER and Uchime, and therefore, one may
expect a higher rate of false-positive detections when using both
programs to assess a single set of sequences.
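The effect of combining two detectors can be expressed with simple set operations. In the sketch below, a sequence counts as flagged if either program flags it. The function name `combined_rates` and all sequence identifiers are hypothetical, and the toy values are not data from this study.

```python
def combined_rates(flagged_a, flagged_b, chimeras, parents):
    """Detection and false-positive rates when a sequence is flagged by either tool.

    flagged_a, flagged_b: sets of sequence IDs flagged by each program.
    chimeras: IDs of known chimeras; parents: IDs of known nonchimeric sequences.
    """
    flagged = flagged_a | flagged_b                 # union of the two sets of calls
    detection = len(flagged & chimeras) / len(chimeras)
    false_positive = len(flagged & parents) / len(parents)
    return detection, false_positive

# Toy example with made-up identifiers
chimeras = {"c1", "c2", "c3", "c4"}
parents = {"p1", "p2", "p3", "p4"}
tool_a = {"c1", "c2", "p1"}                         # flags two chimeras and one parent
tool_b = {"c2", "c3", "p2"}                         # flags two chimeras and a different parent
detection, false_positive = combined_rates(tool_a, tool_b, chimeras, parents)
```

In this toy case each tool alone detects half of the chimeras with one false positive, while the union detects three of four chimeras with two false positives, because the false-positive calls do not overlap.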
Online DECIPHER tool and standalone implementation. It
is essential not only to identify chimeras present in the public
databases but also to prevent more from entering the databases.
To this end, DECIPHER has been implemented as a web tool and
is publicly available online (http://DECIPHER.cee.wisc.edu). On
the website, a user can submit a FASTA file of unaligned or aligned
16S rRNA sequences to be checked for chimeras and select
whether the analysis is done with the full-sequence or short-
sequence option. DECIPHER results are returned via email. The
online tool is limited to submissions of files of less than 10 Mb,
which corresponds to approximately 6,500 full sequences (1,500
nucleotides long) or 20,000 short sequences (500 nucleotides
long). In its current implementation on a Dell Precision Worksta-
tion T5400 with two Xeon 5405 2.00 GHz processors, 10,000 se-
quences of approximately 400 nucleotides were processed in 120
min (0.7 s per sequence), and 4,000 full-length sequences (1,200
nucleotides) were processed in 113 min (1.7 s per sequence).
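The per-sequence timings and the file-size limit quoted above follow from simple arithmetic, reproduced in the sketch below. The helper names are hypothetical, and the size estimate assumes one byte per nucleotide, 10 Mb taken as 10,000,000 bytes, and no allowance for FASTA header lines or line breaks.

```python
def seconds_per_sequence(minutes, n_sequences):
    """Average processing time per sequence, in seconds."""
    return minutes * 60 / n_sequences

def max_sequences(limit_bytes, seq_length):
    """Upper bound on the number of sequences of seq_length that fit in limit_bytes."""
    return limit_bytes // seq_length

short_time = seconds_per_sequence(120, 10_000)   # about 0.7 s per short sequence
full_time = seconds_per_sequence(113, 4_000)     # about 1.7 s per full-length sequence

limit = 10_000_000                               # assumed byte count for 10 Mb
full_count = max_sequences(limit, 1_500)         # upper bound; headers reduce this
short_count = max_sequences(limit, 500)
```

The full-length bound comes out slightly above the approximately 6,500 sequences quoted in the text, as expected once header lines are counted against the limit.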
For users interested in processing much bigger data sets, it is
possible to submit their data sets split into multiple files or to use
the standalone tool that is available to run in the R programming
environment. In the workstation described above, it took approx-
imately 72 h to analyze the entire RDP database of 1.2 million
sequences (0.25 s per sequence). The database was split into eight
separate sections, each with its own processor core, which explains
its speed improvement over the sequence sets. Splitting a large
sequence set across multiple processes allows parallel processing,
albeit without the shared memory configuration that is consistent
with true parallelization.
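Splitting a large sequence set into separate files, one per process, can be sketched as follows. This Python example is only an illustration of the partitioning step: the function names and the round-robin assignment are assumptions, and it does not reproduce how the standalone R tool reads or distributes its input.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def split_records(records, n_parts):
    """Distribute records round-robin into n_parts lists of near-equal size."""
    parts = [[] for _ in range(n_parts)]
    for i, record in enumerate(records):
        parts[i % n_parts].append(record)
    return parts

fasta_text = """>seq1
ACGT
ACGT
>seq2
GGCC
>seq3
TTAA
"""
records = read_fasta(fasta_text.splitlines())
parts = split_records(records, n_parts=2)   # e.g., n_parts=8 for eight cores
```

Each part can then be written to its own file and analyzed by an independent process, so no memory needs to be shared between the processes.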
The DECIPHER R package, source code, and supporting doc-
umentation and the 16S reference database are all available online
(http://DECIPHER.cee.wisc.edu). Sequence classification re-
quires the RDP Multiclassifier tool (17), which is available from
the RDP website (http://rdp.cme.msu.edu/classifier/).
With either the online or standalone version, the results file
indicates for each detected chimera the start and end positions of
the identifying chimeric region along with the corresponding ref-
erence group or groups where the chimeric region is commonly
found. In its current version, DECIPHER does not provide a con-
fidence evaluation score as found in other chimera detection pro-
grams. In some cases, the detected chimeric region corresponds to
only one reference group, but it is also common to find chimeric
regions associated with multiple reference groups, which is an
indication that the chimeric region is conserved in some branches
of the phylogenetic tree.
ACKNOWLEDGMENTS
Support for this research was provided by the Water Research Foundation
(WaterRF Project 4291). The University of Wisconsin—Madison grate-
fully acknowledges that WaterRF is the joint owner of the technical infor-
mation upon which this publication is based. The University of Wiscon-
sin—Madison thanks the Water Research Foundation for its financial,
technical, and administrative assistance in funding and managing the
project through which this information was discovered.
Finally, we thank Daewook Kang for his helpful evaluations of the
online tool.
REFERENCES
1. Aho AV, Corasick MJ. 1975. Efficient string matching—aid to biblio-
graphic search. Commun. ACM 18:333–340.
2. Amann R, Fuchs BM. 2008. Single-cell identification in microbial com-
munities by improved fluorescence in situ hybridization techniques. Nat.
Rev. Microbiol. 6:339–348.
FIG 7 Comparison of chimera detection with ss_DECIPHER and Uchime for
all the artificial chimera sets used in this study. All chimera sets had a total of
1,000 chimeras.
3. Ashelford KE, Chuzhanova NA, Fry JC, Jones AJ, Weightman AJ. 2005.
At least 1 in 20 16S rRNA sequence records currently held in public repos-
itories is estimated to contain substantial anomalies. Appl. Environ. Mi-
crobiol. 71:7724–7736.
4. Cole JR, et al. 2007. The ribosomal database project (RDP-II): introduc-
ing myRDP space and quality controlled public data. Nucleic Acids Res.
35:D169–D172.
5. Cole JR, et al. 2009. The Ribosomal Database Project: improved alignments
and new tools for rRNA analysis. Nucleic Acids Res.
37:D141–D145.
6. DeSantis TZ, et al. 2006. Greengenes, a chimera-checked 16S rRNA gene
database and workbench compatible with ARB. Appl. Environ. Microbiol.
72:5069–5072.
7. Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R. 23 June 2011.
UCHIME improves sensitivity and speed of chimera detection. Bioinformatics
[Epub ahead of print.] doi:10.1093/bioinformatics/btr381.
8. Haas BJ, et al. 2011. Chimeric 16S rRNA sequence formation and detec-
tion in Sanger and 454-pyrosequenced PCR amplicons. Genome Res. 21:
494–504.
9. Huber T, Faulkner G, Hugenholtz P. 2004. Bellerophon: a program to
detect chimeric sequences in multiple sequence alignments. Bioinformat-
ics 20:2317–2319.
10. Huse SM, Welch DM, Morrison HG, Sogin ML. 2010. Ironing out the
wrinkles in the rare biosphere through improved OTU clustering.
Environ. Microbiol. 12:1889–1898.
11. Pages H, Aboyoun P, Gentleman R, DebRoy S. 2010. Biostrings: string
objects representing biological sequences, and matching algorithms. R
package version 2.16.9. R Foundation for Statistical Computing, Vienna,
Austria. http://www.R-project.org.
12. Pruesse E, et al. 2007. SILVA: a comprehensive online resource for quality
checked and aligned ribosomal RNA sequence data compatible with ARB.
Nucleic Acids Res. 35:7188–7196.
13. Quince C, et al. 2009. Accurate determination of microbial diversity from
454 pyrosequencing data. Nat. Methods 6:639–641.
14. R Development Core Team. 2010. R: a language and environment for
statistical computing. R Foundation for Statistical Computing, Vienna,
Austria. http://www.R-project.org.
15. Schloss PD, et al. 2009. Introducing mothur: open-source, platform-
independent, community-supported software for describing and compar-
ing microbial communities. Appl. Environ. Microbiol. 75:7537–7541.
16. von Wintzingerode F, Gobel UB, Stackebrandt E. 1997. Determination
of microbial diversity in environmental samples: pitfalls of PCR-based
rRNA analysis. FEMS Microbiol. Rev. 21:213–229.
17. Wang Q, Garrity GM, Tiedje JM, Cole JR. 2007. Naive Bayesian classifier
for rapid assignment of rRNA sequences into the new bacterial taxonomy.
Appl. Environ. Microbiol. 73:5261–5267.
18. Woese CR. 1987. Bacterial evolution. Microbiol. Rev. 51:221–271.
19. Yarza P, et al. 2008. The All-Species Living Tree project: a 16S rRNA-
based phylogenetic tree of all sequenced type strains. Syst. Appl. Micro-
biol. 31:241–250.
Search-Based Chimera Identification
February 2012 Volume 78 Number 3 aem.asm.org 725
... Objective clustering (Meier et al., 2016) was carried out following Wang et al. (2018) to group sequence reads into molecular operational taxonomic units (MOTUs) based on a 1% distance threshold for 18S (Stoeck et al., 2010) and 3% for COI (Ip et al., 2019;Ip et al., 2021b). We screened for potential chimerasin 18S MOTUs with DECIPHER v2.20 web server (Wright et al., 2012). For COI MOTUs, we screened for pseudogenes and chimeras through translation checks with Seqkit v2.1 (Shen et al., 2016). ...
Article
Coral reefs are among the richest marine ecosystems on Earth, but there remains much diversity hidden within cavities of complex reef structures awaiting discovery. While the abundance of corals and other macroinvertebrates are known to influence the diversity of other reef‐associated organisms, much remains unknown on the drivers of cryptobenthic diversity. A combination of standardised sampling with 12 units of Autonomous Reef Monitoring Structures (ARMS) and high‐throughput sequencing was utilised to uncover reef cryptobiome diversity across the equatorial reefs in Singapore. DNA barcoding and metabarcoding of mitochondrial cytochrome c oxidase subunit I, nuclear 18S and bacterial 16S rRNA genes revealed the taxonomic composition of the reef cryptobiome, comprising 15,356 microbial ASVs from over 50 bacterial phyla, and 971 MOTUs across 15 metazoan and 19 non‐metazoan eukaryote phyla. Environmental factors across different sites were tested for relationships with ARMS diversity. Differences among reefs in diversity patterns of metazoans and other eukaryotes, but not microbial communities, were associated with biotic (coral cover) and abiotic (distance, temperature and sediment) environmental variables. In particular, ARMS deployed at reefs with higher coral cover had greater metazoan diversity and encrusting plate cover, with larger‐sized non‐coral invertebrates influencing spatial patterns among sites. Our study shows that DNA barcoding and metabarcoding of ARMS constitute a valuable tool for quantifying cryptobenthic diversity patterns and can provide critical information for the effective management of coral reef ecosystems.
... (Callahan et al., 2016), with custom scripts presented in Supplementary File 5. Briefly, reads were filtered, trimmed, denoised, and merged to yield sequences from 251 to 253 nucleotides long. Chimeras were then removed with the function removeBimeraDenovo in DADA2 and putative nonchimeric sequences were assigned taxonomy and aligned with the functions IdTaxa and AlignSeqs in Decipher v 2.20.0 (Wright et al., 2012). Amplicon sequence variants (ASVs) and taxonomic identifications were merged to create a phylogseqclass object-available as Supplementary File 6-with functions in phyloseq v 1.36.0 (McMurdie and Holmes, 2013). ...
Article
Full-text available
Extreme weather events can temporarily alter the structure of coastal systems and generate floodwaters that are contaminated with fecal indicator bacteria (FIB); however, every coastal system is unique, so identification of trends and commonalities in these episodic events is challenging. To improve our understanding of the resilience of coastal systems to the disturbance of extreme weather events, we monitored water quality, FIB at three stations within Clear Lake, an estuary between Houston and Galveston, and three stations in bayous that feed into the estuary. Water samples were collected immediately before and after Hurricane Harvey (HH) and then throughout the fall of 2017. FIB levels were monitored by culturing E. coli and Enterococci. Microbial community structure was profiled by high throughput sequencing of PCR-amplified 16S rRNA gene fragments. Water quality and FIB data were also compared to historical data for these water body segments. Before HH, salinity within Clear Lake ranged from 9 to 11 practical salinity units (PSU). Immediately after the storm, salinity dropped to < 1 PSU and then gradually increased to historical levels over 2 months. Dissolved inorganic nutrient levels were also relatively low immediately after HH and returned, within a couple of months, to historical levels. FIB levels were elevated immediately after the storm; however, after 1 week, E. coli levels had decreased to what would be acceptable levels for freshwater. Enterococci levels collected several weeks after the storm were within the range of historical levels. Microbial community structure shifted from a system dominated by Cyanobacteria sp. before HH to a system dominated by Proteobacteria and Bacteroidetes immediately after. Several sequences observed only in floodwater showed similarity to sequences previously reported for samples collected following Hurricane Irene. These changes in beta diversity corresponded to salinity and nitrate/nitrite concentrations. 
Differential abundance analysis of metabolic pathways, predicted from 16S sequences, suggested that pathways associated with virulence and antibiotic resistance were elevated in floodwater. Overall, these results suggest that floodwater generated from these extreme events may have high levels of fecal contamination, antibiotic resistant bacteria and bacteria rarely observed in other systems.
... accessed on 12 December 2021), using universal primers: 785F (GGATTAGATACCCTGGTA) and 907R (CCGTCAATTCMTT-TRAGTTT). The sequenced gene was analyzed via BioEdit software v. 7.2 [11], and chimera sequences were removed using DECIPHER software [12]. A gene homology search was performed on the BLASTn search engine [13], and sequence data of closely related strains were retrieved from the NCBI database. ...
Article
Full-text available
Proteases that can remain active under extreme conditions such as high temperature, pH, and salt concentration are widely applicable in the commercial sector. Majority of the proteases are rendered useless under harsh conditions. Therefore, there is a need to search for new proteases that can tolerate and function in harsh conditions, thus improving their commercial value. In this study, 142 bacterial isolates were isolated from diverse alkaline soil habitats. The two highest protease-producing bacterial isolates were identified as Bacillus subtilis S1 and Bacillus amyloliquefaciens KSM12, respectively, based on 16S rRNA sequencing. Optimal protease production was detected at pH 8, 37°C, 48 h for Bacillus subtilis S1 (99.8 U/mL) and pH 9, 37°C, 72 h, 10% (w/v) NaCl for Bacillus amyloliquefaciens KSM12 (94.6 U/mL). The molecular weight of these partially purified proteases was then assessed on SDS-PAGE (17 kDa for Bacillus subtilis S1 and 65 kDa for Bacillus amyloliquefaciens KSM12), respectively. The maximum protease activity for Bacillus subtilis S1 was detected at pH 8, 40°C, and for Bacillus amyloliquefaciens KSM12 at pH 9, 60°C. These results suggest that the proteases secreted by Bacillus subtilis S1 and Bacillus amyloliquefaciens KSM12 are suitable for industries working under a highly alkaline environment.
... Then, 16S rRNA gene sequences obtained in this study were checked for the presence of chimeric sequences using DECIPHER's Find Chimeras web tool [11] and submitted to the GenBank database under the accession numbers ON028648-ON028654. ...
Article
Full-text available
The sulfur cycle participates significantly in life evolution. Some facultatively autotrophic microorganisms are able to thrive in extreme environments with limited nutrient availability where they specialize in obtaining energy by oxidation of reduced sulfur compounds. In our experiments focused on the characterization of halophilic bacteria from a former salt mine in Solivar (Presov, Slovakia), a high diversity of cultivable bacteria was observed. Based on ARDRA (Amplified Ribosomal DNA Restriction Analysis), at least six groups of strains were identified with four of them showing similarity levels of 16S rRNA gene sequences lower than 98.5% when compared against the GenBank rRNA/ITS database. Heterotrophic sulfur oxidizers represented ~34% of strains and were dominated by Halomonas and Marinobacter genera. Autotrophic sulfur oxidizers represented ~66% and were dominated by Guyparkeria and Hydrogenovibrio genera. Overall, our results indicate that the spatially isolated hypersaline deep subsurface habitat in Solivar harbors novel and diverse extremophilic sulfur-oxidizing bacteria.
... For the sequence of DH-B6 T , a short contig was discarded, as it was present as well in the larger genome contig with high sequence identity. Before submitting the sequences to NCBI GenBank a chimaera check was performed with DECIPHER (version 2.17.1, [27]), and no chimaera was detected. ...
Article
Modified atmosphere (MA) packaging plays an important role in improving food quality and safety. By using different gas mixtures and packaging materials the shelf life of fresh produce can significantly be increased. A Gram-negative-staining, rod-shaped, orange-pigmented strain DH-B6 T , has been isolated from MA packed raw pork sausage (20% CO 2 , 80% O 2 ). The strain produced biofilms and showed growth at high CO 2 levels of up to 40%. Complete 16S rRNA gene and whole-genome sequences revealed that strain DH-B6 T belongs to the genus Chryseobacterium , being closely related to strain Chryseobacterium indologenes DSM 16777 T (98.4%), followed by Chryseobacterium gleum NCTC11432 T (98.3%) and Chryseobacterium lactis KC1864 T (98.2%). Average nucleotide identity value between DH-B6 T and C. indologenes DSM 16777 T was 81.1% and digital DNA–DNA hybridisation was 24.9%, respectively. The DNA G+C content was 35.51 mol%. Chemotaxonomical analysis revealed the presence of the rare glycine lipid cytolipin, the serine-glycine lipid flavolipin and the sulfonolipid sulfobacin A, as well as phosphatidylethanolamine, monohexosyldiacylglycerol and ornithine lipid, including the hydroxylated forms. Major fatty acids were iC 15 : 0 (50.7%) and iC 17 : 1 cis 9 (28.7%), followed by iC 15 : 0 2-OH (7.0%) and iC 17 : 0 3-OH (6.2%). The isolated strain contained MK-6 as the only respiratory quinone and flexirubin-like pigments were detected as the major pigments. Based on the phenotypic, chemotaxonomic and phylogenetic characteristics, the strain DH-B6 T (=DSM 110542 T =LMG 31915 T ) represents a novel species of the genus Chryseobacterium , for which the name Chryseobacterium capnotolerans sp. nov. is proposed. Emended descriptions of the genus Chryseobacterium and eight species of this genus based on polar lipid characterisation are also proposed.
... The 16S rRNA gene sequences were aligned and edited with the program BioEdit 7.2.5 (Hall, 1999) using ClustalW (Thompson, 1994). Chimeric sequences were removed using the program Decipher (Wright et al., 2012). The phylogenetic tree was built using SeaView (Galtier et al., 1996). ...
Thesis
Oceanic hydrothermal vents are characterized by the formation of massive sulfide deposits with complex mineral assemblages around the zone where the hydrothermal fluid is emitted. The base of the food chain in these ecosystems is sustained by the chemosynthetic primary production of microorganisms specific to these ecosystems. The aim of this thesis was to identify and functionally characterize these microbial communities interacting at the interface between the biosphere and the geosphere. To this end, innovative approaches for the ex situ colonization of natural massive sulfide substrates were carried out through mesocosm incubations. Five incubations in a gas-lift bioreactor were performed using chimneys and fluids sampled from three Atlantic hydrothermal fields (Lucky Strike, Snake Pit and TAG). The use of unfiltered fluid as the mineral base allowed the enrichment of novel microorganisms not previously detected in molecular inventories. Overall, the minerals associated with these incubations were colonized by hyperthermophilic communities whose population structure is similar to that of the liquid fractions. In addition, a new mechanism of tolerance to H2, previously considered an inhibitor of growth in the Thermococcales, was described. This original metabolic trait helps explain their ubiquitous distribution in hydrothermal ecosystems. This work extends our knowledge of hyperthermophilic diversity in these ecosystems.
... Quality filtering on joined sequences was performed, and sequences that did not fulfill the following criteria were discarded: sequence length <200 bp, no ambiguous bases, and mean quality score ≥20. Then, the sequences were compared with the reference database (RDP Gold database) using the UCHIME algorithm to detect the sequences, and chimeric sequences were removed (16). ...
Article
Microbes and microbiota dysbiosis are correlated with the development of lung cancer; however, the airway taxa characteristics and bacterial topography in synchronous multiple primary lung cancer (sMPLC) are not fully understood. The present study aimed to investigate the microbiota taxa distribution and characteristics in the airways of patients with sMPLC and clarify specimen acquisition modalities in these patients. Using the precise positioning of electromagnetic navigation bronchoscopy (ENB), we analyzed the characteristics of the respiratory microbiome in samples collected from different sites and using different sampling methods. Microbiome predictor variables were bacterial DNA burden and bacterial community composition based on 16S rRNA. Eight non-smoking patients with sMPLC in the same pulmonary lobe were included in this study. Compared with other sampling methods, bacterial burden and diversity were higher in surface areas sampled by bronchoalveolar lavage (BAL). Bacterial topography data provided evidence of specific colonizing bacteria in segments with sMPLC lesions. After taxonomic annotation, we identified 4863 phylotypes belonging to 185 genera and 10 different phyla. The four most abundant specific bacterial community members detected in the airway containing sMPLC lesions were Clostridium, Actinobacteria, Fusobacterium, and Rothia, which all peaked at the segments with sMPLC lesions. This study begins to define the bacterial topography of the respiratory tract in patients with sMPLC and provides an approach to specimen acquisition for sMPLC, namely BAL fluid obtained from segments where lesions are located.
... Paired-end sequences were clustered into Amplicon Sequence Variants (ASVs) following the DADA2 pipeline (version 1.16) [61]. After filtering the sequences and removing the chimeras, the taxonomy assignment was carried out comparing our data against the SILVA NR99rel138 standard database of bacteria [62] using "DECIPHER" R package (version 2.18.1) [63] as implementation of DADA2 (SSU version 138 available at: http://www2.decipher.codes/Downloads.html, last accessed on 21 December 2021). ...
Article
The taxonomic assemblage and functions of the plant bacterial community are strongly influenced by soil and host plant genotype. Crop breeding, especially after the massive use of nitrogen fertilizers which led to varieties responding better to nitrogen fertilization, has implicitly modified the ability of the plant root to recruit an effective bacterial community. Among the priorities for harnessing the plant bacterial community, plant genotype-by-microbiome interactions are stirring attention. Here, we analyzed the effect of plant variety and fertilization on the rhizosphere bacterial community. In particular, we clarified the presence in the bacterial community of a varietal effect of N and P fertilization treatment. 16S rRNA gene amplicon sequence analysis of rhizospheric soil, collected from four wheat varieties grown under four N-P fertilization regimes, and quantification of functional bacterial genes involved in the nitrogen cycle (nifH; amoA; nirK and nosZ) were performed. Results showed that variety played the most important role and that treatments did not affect either bacterial community diversity or bacterial phyla abundance. Variety-specific response of rhizosphere bacterial community was detected, both in relation to taxa (Nitrospira) and metabolic functions. In particular, the changes related to amino acid and aerobic metabolism and abundance of genes involved in the nitrogen cycle (amoA and nosZ), suggested that plant variety may lead to functional changes in the cycling of the plant-assimilable nitrogen.
... cee.wisc.edu; Wright et al., 2012) was used to remove chimeras. Afterwards, the remaining sequences in FASTA format were compared with those deposited in the GenBank database (http://www.ncbi.nlm.nih. ...
Article
Given the drift to improve economic and ecological sustainability of the aquaculture sector, novel ingredients fulfilling these requirements are sought. Hermetia illucens, commonly called black soldier fly (Diptera: Stratiomyidae; H), is a promising dietary protein source but its effect on fish gut microbiota is still to be clarified. The aim of the present study was to increase the knowledge of the effect of dietary full-fat H meal on rainbow trout (Oncorhynchus mykiss) microbiota and, in particular, on intestinal mucosa-adherent microbiota by applying a dual approach based on polymerase chain reaction-denaturing gradient gel electrophoresis and high-throughput sequencing of the 16S rRNA gene. Rainbow trout (initial body weight of 137.3 ± 10.5 g) were fed for 98 days with a control diet (H0) containing fishmeal and protein-rich vegetable ingredients and an experimental diet (H50) where 50% of the fishmeal had been replaced by full-fat H meal rich in saturated fatty acids. Proteobacteria, Firmicutes and Actinobacteria were generally present in all samples, although the core microbiota (relative prevalence higher than or equal to 80% in all samples) only consisted of the proteobacteria Caulobacter, Delftia, Agrobacterium and Ochrobactrum. In addition, Streptococcus infantis and a member of the Cytophagaceae family were part of the core taxa of mucosa samples. Tenericutes were abundant in pyloric caeca samples and, among them, Mycoplasmataceae seemed to increase in the group fed the high saturated fatty acid diet containing H meal; the connection between this bacterial group and the dietary lipid content should be considered. Dietary treatment did not clearly affect alpha-diversity metrics, but mucosa samples tended to be more resilient to dietary changes than content samples. Permutational analysis of variance showed significantly different β-diversities between diets (p < 0.05) but principal coordinates analysis did not confirm this result.
Diets for rainbow trout containing full-fat H meal determined interesting modifications in the gut microbiota with patterns similar to the ones found in the literature. The dietary lipids can exert an effect on microbiota. Nonetheless, research data on this topic are still scarce and further studies are highly encouraged.
... Sequences were assembled with the DNABaser software 3.5.3 and prior to phylogenetic analysis, vector sequences flanking the 16S rRNA gene inserts were identified using the VecScreen tool (NCBI) and removed. Clone sequences were checked for chimeras (Wright et al., 2012), aligned with SINA (v1.2.11; Pruesse et al., 2012), and added to a database of over 230,000 homologous prokaryotic 16S rRNA gene primary structures by using the merging tool of the ARB program package (Ludwig et al., 2004). Sequences were then manually corrected with the alignment tool of the same software and added by parsimony to the tree generated in the Living Tree Project (LTP; Yarza et al., 2008). ...
Article
In acid drainage environments, biosulfidogenesis by sulfate-reducing bacteria (SRB) attenuates the extreme conditions by enabling the precipitation of metals as their sulfides, and the neutralization of acidity through proton consumption. So far, only a handful of moderately acidophilic SRB species have been described, most of which are merely acidotolerant. Here, a novel species within a novel genus of moderately acidophilic SRB is described, Acididesulfobacillus acetoxydans gen. nov. sp. nov. strain INE, able to grow at pH 3.8. Bioreactor studies with strain INE at optimum (5.0) and low (3.9) pH for growth showed that strain INE alkalinized its environment, and that this was more pronounced at lower pH. These studies also showed the capacity of strain INE to completely oxidize organic acids to CO2, which is uncommon among acidophilic SRB. Since organic acids are mainly in their protonated form at low pH, which increases their toxicity, their complete oxidation may be an acid stress resistance mechanism. Comparative proteogenomic and membrane lipid analysis further indicated that the presence of saturated ether-bound lipids in the membrane, and their relative increase at lower pH, was a protection mechanism against acid stress. Interestingly, other canonical acid stress resistance mechanisms, such as a Donnan potential and increased active charge transport, did not appear to be active.
Article
A 16S rRNA gene database (http://greengenes.lbl.gov) addresses limitations of public repositories by providing chimera screening, standard alignment, and taxonomic classification using multiple published taxonomies. It was found that there is incongruent taxonomic nomenclature among curators even at the phylum level. Putative chimeras were identified in 3% of environmental sequences and in 0.2% of records derived from isolates. Environmental sequences were classified into 100 phylum-level lineages in the Archaea and Bacteria.
Article
The Ribosomal Database Project (RDP) provides researchers with quality-controlled bacterial and archaeal small subunit rRNA alignments and analysis tools. An improved alignment strategy uses the Infernal secondary structure aware aligner to provide a more consistent higher quality alignment and faster processing of user sequences. Substantial new analysis features include a new Pyrosequencing Pipeline that provides tools to support analysis of ultra high-throughput rRNA sequencing data. This pipeline offers a collection of tools that automate the data processing and simplify the computationally intensive analysis of large sequencing libraries. In addition, a new Taxomatic visualization tool allows rapid visualization of taxonomic inconsistencies and suggests corrections, and a new class Assignment Generator provides instructors with a lesson plan and individualized teaching materials. Details about RDP data and analytical functions can be found at http://rdp.cme.msu.edu/.
Article
This paper describes a simple, efficient algorithm to locate all occurrences of any of a finite number of keywords in a string of text. The algorithm consists of constructing a finite state pattern matching machine from the keywords and then using the pattern matching machine to process the text string in a single pass. Construction of the pattern matching machine takes time proportional to the sum of the lengths of the keywords. The number of state transitions made by the pattern matching machine in processing the text string is independent of the number of keywords. The algorithm has been used to improve the speed of a library bibliographic search program by a factor of 5 to 10.
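The keyword-matching machine described in this abstract can be sketched in a few lines of Python. This is a minimal illustration of the general technique (a keyword trie plus failure links, scanned in one pass over the text), not the original authors' code and not the implementation used by DECIPHER; the function names `build_automaton` and `find_all` are hypothetical.

```python
from collections import deque


def build_automaton(keywords):
    """Build the goto, failure and output tables for a set of keywords."""
    goto = [{}]   # goto[state][char] -> next state (state 0 is the root)
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> keywords that end at this state

    # Phase 1: insert every keyword into a trie.
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)

    # Phase 2: breadth-first pass to compute failure links.
    # Depth-1 states keep the default failure link to the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the fallback state.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, keywords):
    """Return (start_index, keyword) for every occurrence in text."""
    goto, fail, out = build_automaton(keywords)
    state = 0
    matches = []
    for i, ch in enumerate(text):
        # Follow failure links until a transition on ch exists (or root).
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            matches.append((i - len(word) + 1, word))
    return matches


if __name__ == "__main__":
    print(find_all("ushers", ["he", "she", "his", "hers"]))
```

The same function works on nucleotide strings, for example `find_all(sequence, fragments)` with a list of short DNA fragments, since it only assumes that the text and keywords are Python strings.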
Article
Chimeric DNA sequences often form during polymerase chain reaction amplification, especially when sequencing single regions (e.g. 16S rRNA or fungal Internal Transcribed Spacer) to assess diversity or compare populations. Undetected chimeras may be misinterpreted as novel species, causing inflated estimates of diversity and spurious inferences of differences between populations. Detection and removal of chimeras is therefore of critical importance in such experiments. We describe UCHIME, a new program that detects chimeric sequences with two or more segments. UCHIME either uses a database of chimera-free sequences or detects chimeras de novo by exploiting abundance data. UCHIME has better sensitivity than ChimeraSlayer (previously the most sensitive database method), especially with short, noisy sequences. In testing on artificial bacterial communities with known composition, UCHIME de novo sensitivity is shown to be comparable to Perseus. UCHIME is >100× faster than Perseus and >1000× faster than ChimeraSlayer. Contact: robert@drive5.com. Source, binaries and data: http://drive5.com/uchime. Supplementary data are available at Bioinformatics online.
Article
Bacterial diversity among environmental samples is commonly assessed with PCR-amplified 16S rRNA gene (16S) sequences. Perceived diversity, however, can be influenced by sample preparation, primer selection, and formation of chimeric 16S amplification products. Chimeras are hybrid products between multiple parent sequences that can be falsely interpreted as novel organisms, thus inflating apparent diversity. We developed a new chimera detection tool called Chimera Slayer (CS). CS detects chimeras with greater sensitivity than previous methods, performs well on short sequences such as those produced by the 454 Life Sciences (Roche) Genome Sequencer, and can scale to large data sets. By benchmarking CS performance against sequences derived from a controlled DNA mixture of known organisms and a simulated chimera set, we provide insights into the factors that affect chimera formation such as sequence abundance, the extent of similarity between 16S genes, and PCR conditions. Chimeras were found to reproducibly form among independent amplifications and contributed to false perceptions of sample diversity and the false identification of novel taxa, with less-abundant species exhibiting chimera rates exceeding 70%. Shotgun metagenomic sequences of our mock community appear to be devoid of 16S chimeras, supporting a role for shotgun metagenomics in validating novel organisms discovered in targeted sequence surveys.
Article
Deep sequencing of PCR amplicon libraries facilitates the detection of low-abundance populations in environmental DNA surveys of complex microbial communities. At the same time, deep sequencing can lead to overestimates of microbial diversity through the generation of low-frequency, error-prone reads. Even with sequencing error rates below 0.005 per nucleotide position, the common method of generating operational taxonomic units (OTUs) by multiple sequence alignment and complete-linkage clustering significantly increases the number of predicted OTUs and inflates richness estimates. We show that a 2% single-linkage preclustering methodology followed by an average-linkage clustering based on pairwise alignments more accurately predicts expected OTUs in both single and pooled template preparations of known taxonomic composition. This new clustering method can reduce the OTU richness in environmental samples by as much as 30-60% but does not reduce the fraction of OTUs in long-tailed rank abundance curves that defines the rare biosphere.