
A novel, single-tube enzymatic fragmentation and library construction method enables fast turnaround times and improved data quality for microbial whole-genome sequencing

Abstract

Next-generation whole-genome sequencing of microbes demands rapid, robust, and scalable library construction workflows, capable of generating high-quality sequence data across a wide range of genome sizes, complexities and genomic GC content. In this Application Note, we describe a streamlined library preparation method that yields minimal bias and high, uniform coverage, and facilitates de novo assembly of microbial genomes.
Microbial whole-genome sequencing
Application Note
Figure 1. Bacterial species used for library preparation and sequencing: Bordetella pertussis (68% GC), Escherichia coli (51% GC), and Clostridium difficile (29% GC).
INTRODUCTION
Whole-genome sequencing (WGS) of microbes employing next-generation sequencing (NGS) technologies enables pathogen identification, differentiation, and surveillance on an unprecedented scale and level of resolution, thereby profoundly impacting diagnostic microbiology and public health. To fully capitalize on the benefits of greater sequencing capacity, faster sequencing technologies and lower per-genome costs, rapid and robust NGS library construction workflows are needed to support both de novo and re-sequencing applications.
A major focus area in NGS library construction for microbial WGS has been the elimination of mechanical shearing, which requires expensive, specialized equipment and consumables, and is both laborious and difficult to scale. Alternative, enzymatic fragmentation solutions based on transposases (“tagmentation”) or mixtures of DNA endonucleases and nicking enzymes offer significant benefits in terms of throughput and turnaround times. However, these often come at a cost to reproducibility, control over fragment length, and sequence data quality (coverage depth and uniformity), particularly for organisms with extreme (highly GC- or AT-rich) genomic content.
The KAPA HyperPlus Kit is a robust and versatile kit for the construction of DNA libraries for Illumina sequencing from a range of sample types and inputs (1 ng – 1 µg). The streamlined, one-tube workflow, which includes enzymatic fragmentation with a novel enzyme cocktail, offers the speed and convenience of tagmentation-based methods, but the control and performance of ligation-based library construction from Covaris-sheared DNA.
Authors
Bronwen Miller
Victoria van Kets
R & D Scientists
Beverley van Rooyen
Senior Support & Applications
Scientist
Heather Whitehorn
Support & Applications Scientist
Piet Jones
Bioinformatics Scientist
Martin Ranik
R & D Team Leader
Adriana Geldart
Support & Applications Manager
Eric van der Walt
R & D Manager
Maryke Appel
Technical Director
Kapa Biosystems
Cape Town, South Africa and
Wilmington, MA, USA
Nguyet Kong
Research Associate
Carol Huang
Research Associate
Dylan Storey
Post-doctoral Fellow
Bart C. Weimer
Professor and Director
UC Davis School of Veterinary
Medicine & 100K Pathogen
Genome Project
Davis, CA, USA
EXPERIMENTAL DESIGN
Current Illumina® library construction methods employing non-mechanical solutions for DNA fragmentation have three key limitations: (i) poor control over fragment length, which is related to sensitivity with respect to DNA input; (ii) low library construction efficiency; and (iii) sequence biases introduced during fragmentation and/or compulsory library amplification. Combined, these factors limit the amount and alter the representation of input DNA that is converted to usable reads, ultimately affecting coverage depth and uniformity, and the quality and completeness of de novo genomes.
To address these concerns and illustrate the benefits of the KAPA HyperPlus workflow for the production of high-quality libraries for microbial WGS, we sequenced the genomic DNA of three bacteria from whole genome shotgun libraries prepared using four different fast library construction strategies. The four methods were compared with respect to key library construction, sequencing, and de novo assembly metrics.
The bacterial species (Figure 1), Clostridium difficile (29% GC), Escherichia coli (51% GC), and Bordetella pertussis (68% GC), are all relevant for human health and were selected to represent a wide range of genomic GC content.
Library preparation methods are summarized in Table 1. The KAPA Hyper Prep Kit with Covaris-sheared DNA represents the industry standard for high-efficiency DNA library preparation. The KAPA HyperPlus Kit contains the novel KAPA Frag reagent for enzymatic fragmentation, developed by Kapa Biosystems (Wilmington, MA) to overcome the drawbacks of current non-mechanical fragmentation solutions, and to work synergistically with the KAPA Hyper Prep chemistry to improve library construction efficiency. Fragmentase from New England Biolabs (Ipswich, MA) employs a combination of a dsDNA nicking enzyme and an endonuclease. Both the KAPA Hyper Prep and NEBNext Ultra kits offer streamlined, single-tube, ligation-based library preparation protocols. The Nextera XT DNA Library Preparation Kit from Illumina (San Diego, CA) is based on tagmentation technology.
To demonstrate the practical utility and benefits of the KAPA HyperPlus workflow for large-scale microbial genome projects, sequencing metrics for selected draft genomes released by the 100K Human Pathogen Genome Project (UC Davis, Davis, CA) are included at the end of this Note.
Table 1. Library construction methodologies used in this study.
| Abbreviation | Fragmentation method/kit | Library preparation kit | Prep time |
|---|---|---|---|
| Hyper Prep | Covaris shearing | KAPA Hyper Prep Kit | 4 h |
| HyperPlus | KAPA Frag reagent for Enzymatic Fragmentation | KAPA Hyper Prep Kit | 3 h |
| NEBNext | NEBNext dsDNA Fragmentase | NEBNext Ultra DNA Library Prep Kit for Illumina | 4 h |
| Nextera | Tagmentation | Nextera XT DNA Library Preparation Kit | 2.5 h |
MATERIALS AND METHODS
Comparative Library Construction
Libraries were prepared in duplicate from 1 ng of bacterial genomic DNA (the optimal input for the Nextera XT chemistry), obtained from the American Type Culture Collection (ATCC; Manassas, VA). Strains and accession numbers were as follows: C. difficile (Hall and O’Toole) Prevot, strain 630 (BAA-1382); E. coli (Migula) Castellani and Chalmers, strain MG1655 (700926); and B. pertussis (Bergey, et al.) Moreno-Lopez, strain Tohama 1 (BAA-589).
Unless indicated otherwise, library construction was performed with reagents supplied in the respective library preparation kits, following standard protocols.
Hyper Prep workflow: Input DNA was sheared in 130 µL microtubes with a Covaris E220 Focused Ultrasonicator (Covaris; Woburn, MA), using parameters optimized for a median fragment length of 500 bp. Fragmented DNA was used directly for library construction using the KAPA Hyper Prep Kit (Kapa Biosystems; Wilmington, MA).
HyperPlus workflow: Libraries were prepared with the KAPA HyperPlus Library Preparation Kit (Kapa Biosystems), with enzymatic fragmentation for 5 min at 37°C.
Dual-indexed adapter oligos used for both the Hyper Prep and HyperPlus methods were obtained from Integrated DNA Technologies (IDT; Coralville, IA). For both workflows, post-ligation size selection (0.5 – 0.7X) was performed with Agencourt AMPure XP reagent (Beckman Coulter; Beverly, MA). Libraries were amplified for 14 cycles.
NEBNext workflow: Input DNA was digested with NEBNext dsDNA Fragmentase (New England Biolabs; Ipswich, MA) for 32.5 min at 37°C, followed by library preparation with the NEBNext Ultra DNA Library Prep Kit for Illumina. Post-ligation size selection was performed with parameters recommended for an insert size range of 500 – 700 bp. Libraries were amplified for 15 cycles.
Nextera workflow: Libraries were prepared according to the standard protocol, which includes no size selection and 12 cycles of library amplification.
All libraries were quantified after the post-amplification cleanup with the qPCR-based KAPA Library Quantification Kit for Illumina platforms (Kapa Biosystems). Library size distributions were confirmed with a 2100 Bioanalyzer instrument and Agilent High Sensitivity DNA Kit (Agilent Technologies; Santa Clara, CA).
Libraries were normalized and combined into four separate pools for 2 x 300 bp paired-end sequencing on a MiSeq Desktop Sequencer, using a MiSeq Reagent Kit v3 (Illumina; San Diego, CA).
Adapter and quality trimming were performed using Trimmomatic v. 0.30. GC bias was calculated using Picard v. 1.128, and coverage with Bedtools genomecov v. 2.22. For reference genome assembly, reads were trimmed and aligned with BWA MEM v. 0.7.12 and down-sampled to the lowest common number of reads (~900,000). De novo assembly was performed using SPAdes v. 3.5, and metrics collected using QUAST v. 2.3.
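The down-sampling step can be illustrated with a short sketch. This is not the pipeline used in the study; it only shows the idea of reducing every library to the lowest common number of read pairs so that coverage comparisons are not driven by sequencing depth. The function names, the toy read identifiers, and the choice to sample whole pairs with a fixed seed are all assumptions made for illustration.

```python
import random


def lowest_common_count(libraries):
    """Smallest number of read pairs found in any library."""
    return min(len(pairs) for pairs in libraries.values())


def downsample_pairs(read_pairs, target, seed=0):
    """Return a reproducible random subset of `target` read pairs.

    Pairs are sampled together so read 1 and read 2 stay matched.
    """
    if target > len(read_pairs):
        raise ValueError("target exceeds the number of available read pairs")
    rng = random.Random(seed)
    # Sample indices, then sort them so the original read order is preserved.
    keep = sorted(rng.sample(range(len(read_pairs)), target))
    return [read_pairs[i] for i in keep]


# Toy data (hypothetical identifiers): library name -> list of (read1, read2).
libraries = {
    "hyper_prep": [(f"hp{i}/1", f"hp{i}/2") for i in range(12)],
    "hyperplus": [(f"hpl{i}/1", f"hpl{i}/2") for i in range(9)],
    "nebnext": [(f"neb{i}/1", f"neb{i}/2") for i in range(15)],
}

target = lowest_common_count(libraries)
downsampled = {
    name: downsample_pairs(pairs, target) for name, pairs in libraries.items()
}
```

In practice the same logic would be applied to FASTQ or BAM records rather than in-memory tuples; the fixed seed simply makes the subset reproducible.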
100K Pathogen Genome Project Workflow
High-molecular weight genomic DNA was extracted from cultured bacterial isolates, and DNA concentration and quality were assessed using previously described methods (Kong, et al., Agilent Technologies Application Note 5991-3722EN; Jeannotte, et al., Agilent Technologies Application Note 5991-4003EN). Input into library construction ranged between 200 – 400 ng.
Libraries were prepared with the KAPA HyperPlus Library Preparation Kit according to the manufacturer’s instructions. DNA was fragmented enzymatically for 10 min at 37°C. NEXTflex-96 DNA Barcodes (Bioo Scientific; Austin, TX) were used for adapter ligation. Dual-SPRI size selection (0.6 – 0.8X) was performed after the post-ligation cleanup. Libraries were amplified with KAPA HiFi HotStart ReadyMix and KAPA Library Amplification Primer Mix for Illumina (Kapa Biosystems), using 8 cycles of amplification.
Library size distributions were confirmed with a 2100 Bioanalyzer instrument and Agilent High Sensitivity DNA Kit. The KAPA Library Quantification Kit for Illumina platforms was used for qPCR-based library quantification, prior to normalization and pooling (96 libraries/lane), for 2 x 100 bp paired-end sequencing on a HiSeq 2500 Sequencer, using v4 HiSeq SBS and Cluster Kits (Illumina; San Diego, CA).
Reads were de-multiplexed and basic quality control performed. De novo assembly and annotation were carried out using ABySS and Prokka, respectively.
RESULTS AND DISCUSSION
Comparative Library Construction Metrics
Average yields of purified, amplified libraries for the Hyper Prep and HyperPlus workflows ranged between 190 – 290 ng for all three bacterial species, whereas yields for the NEBNext and Nextera workflows were significantly lower (20 – 150 ng) and more variable (Table 2). When taking the number of amplification cycles into account, the NEBNext workflow performed worst. Yields obtained with the Kapa workflows were much higher than needed for library QC and multiplexed sequencing (theoretically, ~30 ng of each library would have sufficed), indicating that the number of amplification cycles could have been reduced by 2 – 3 cycles for these workflows. Higher consistency across species suggests that the Hyper Prep and HyperPlus workflows are more robust, and better suited for high-throughput pipelines than the NEBNext and Nextera methods.
Table 2. Final library yields: average yield (ng) and number of amplification cycles
| Species and GC content | Hyper Prep (14 cycles) | HyperPlus (14 cycles) | NEBNext (15 cycles) | Nextera (12 cycles) |
|---|---|---|---|---|
| C. difficile (29%) | 267 | 223 | 85 | 22 |
| E. coli (51%) | 279 | 290 | 148 | 97 |
| B. pertussis (68%) | 264 | 189 | 25 | 54 |
| Average (all species) | 270 ± 8 | 237 ± 51 | 43 ± 37 | 58 ± 38 |
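One way to take the number of amplification cycles into account is to back-calculate how much material each yield would correspond to before PCR. The sketch below does this for the per-species yields in Table 2 under a strong simplifying assumption that is not stated in the Note: perfect doubling in every cycle. Real amplification is less than 100% efficient and plateaus, so the absolute numbers are not meaningful; only the ranking is of interest. The function and variable names are hypothetical.

```python
# Per-species yields (ng) and cycle numbers transcribed from Table 2.
CYCLES = {"Hyper Prep": 14, "HyperPlus": 14, "NEBNext": 15, "Nextera": 12}
YIELDS_NG = {
    "C. difficile": {"Hyper Prep": 267, "HyperPlus": 223, "NEBNext": 85, "Nextera": 22},
    "E. coli": {"Hyper Prep": 279, "HyperPlus": 290, "NEBNext": 148, "Nextera": 97},
    "B. pertussis": {"Hyper Prep": 264, "HyperPlus": 189, "NEBNext": 25, "Nextera": 54},
}


def pre_pcr_equivalent_ng(yield_ng, n_cycles):
    """Pre-amplification mass implied by a yield, ASSUMING perfect doubling."""
    return yield_ng / 2 ** n_cycles


def least_efficient_workflow(row, cycles):
    """Workflow with the smallest cycle-corrected yield for one species."""
    return min(row, key=lambda wf: pre_pcr_equivalent_ng(row[wf], cycles[wf]))


worst = {
    species: least_efficient_workflow(row, CYCLES)
    for species, row in YIELDS_NG.items()
}
```

Under this idealized correction, NEBNext has the lowest cycle-corrected yield for each of the three species, which is consistent with the statement in the text.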
Electropherograms of final libraries generated with each of the four workflows are given in Figure 2 on the next page. Mode fragment lengths from the electrophoretic analysis vs. mode insert sizes calculated from trimmed, aligned reads are summarized in Table 3.
Fragment lengths determined with the Bioanalyzer for size-selected libraries prepared with ligation-based methods (Hyper Prep, HyperPlus and NEBNext) were within the expected range of 600 – 800 bp, and very similar for all three of the bacterial species. In contrast, Nextera libraries had a mode library fragment length >1 kb, and displayed a wide variation across bacteria. Since long library molecules are not expected to cluster and sequence efficiently, the effective yield of sequenceable library achieved with the Nextera workflow is lower than reflected in Table 2.
Figure 2. Size distribution of final libraries
[Electropherograms, fluorescence (FU) vs. fragment size (bp), for C. difficile (29% GC), E. coli (51% GC), and B. pertussis (68% GC); traces: Hyper Prep, HyperPlus, NEBNext, Nextera.]
Libraries prepared from C. difficile, E. coli, and B. pertussis gDNA using the Hyper Prep, HyperPlus, NEBNext and Nextera workflows were analyzed using a 2100 Bioanalyzer instrument and High Sensitivity DNA Kit (Agilent Technologies). Peak sizes do not correspond with mass-based library yields given in Table 2, as libraries were recovered in different final volumes for different workflows. Nextera libraries were analyzed without dilution; NEBNext libraries were diluted 1/5, whereas Hyper Prep and HyperPlus libraries were diluted to 5 ng/µL for analysis.
Sequencing Metrics
The four library construction methods used in this study were compared with respect to three key sequencing metrics, namely start site bias, GC bias, and coverage uniformity. Biases associated with fragmentation (which has traditionally been a concern with non-mechanical methods) and bias introduced during library amplification are two major factors that impact the depth and uniformity with which genomes are covered. Typically, genomic regions with a balanced GC content are “easy” to sequence, resulting in surplus coverage for these regions at the expense of AT- and GC-rich regions, which are underrepresented or often absent from draft genomes.
Table 3. Mode library fragment lengths (bp), determined by electrophoretic analysis (BioA) or from sequence data (Seq).
| Species and GC content | Hyper Prep BioA | Hyper Prep Seq | HyperPlus BioA | HyperPlus Seq | NEBNext BioA | NEBNext Seq | Nextera BioA | Nextera Seq |
|---|---|---|---|---|---|---|---|---|
| C. difficile (29%) | 659 | 385 | 712 | 358 | 694 | 478 | 872 | 566 |
| E. coli (51%) | 650 | 361 | 749 | 383 | 774 | 438 | 1563 | 444 |
| B. pertussis (68%) | 629 | 361 | 683 | 374 | 788 | 452 | 1905 | 532 |
| Average (all species) | 646 | 369 | 715 | 371 | 752 | 456 | 1447 | 514 |
| Std dev (bp) | 13 | 11 | 27 | 10 | 41 | 16 | 430 | 51 |
Mode fragment lengths determined by electrophoretic analysis of final, amplified libraries are inclusive of adapters, whereas mode lengths calculated from sequencing metrics are not.
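The summary rows of Table 3 can be recomputed from the per-species values. The sketch below does so with Python's standard library; it assumes the standard deviation is the population form (dividing by n), which is an inference from the printed numbers rather than something the Note states. Recomputed values may differ from the printed ones by about 1 bp, presumably because the table entries are themselves rounded.

```python
from statistics import mean, pstdev

# Mode lengths (bp) transcribed from Table 3, in the order
# C. difficile, E. coli, B. pertussis.
MODE_LENGTHS_BP = {
    ("Hyper Prep", "BioA"): [659, 650, 629],
    ("Hyper Prep", "Seq"): [385, 361, 361],
    ("HyperPlus", "BioA"): [712, 749, 683],
    ("HyperPlus", "Seq"): [358, 383, 374],
    ("NEBNext", "BioA"): [694, 774, 788],
    ("NEBNext", "Seq"): [478, 438, 452],
    ("Nextera", "BioA"): [872, 1563, 1905],
    ("Nextera", "Seq"): [566, 444, 532],
}

# (rounded mean, rounded population standard deviation) per workflow/method.
summary = {
    key: (round(mean(values)), round(pstdev(values)))
    for key, values in MODE_LENGTHS_BP.items()
}

for (workflow, method), (avg, sd) in summary.items():
    print(f"{workflow:10s} {method:4s} mean={avg} bp  sd={sd} bp")
```

The much larger spread for Nextera BioA lengths than for the ligation-based workflows is the quantitative counterpart of the "wide variation across bacteria" described in the text.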
Start site complexity plots (Figure 3) show the nucleotide content of all aligned reads in a 40-bp window (-10 to +30 bp) relative to the alignment start. As expected, the Hyper Prep workflow, which employed mechanical shearing, displayed the least start site bias for all three bacteria, whereas enzymatic fragmentation methods all displayed varying degrees of start site bias. The KAPA Frag reagent for Enzymatic Fragmentation used in the HyperPlus workflow performed significantly better than Fragmentase (NEBNext workflow) and the tagmentation-based Nextera workflow.
Start site bias potentially impacts library diversity (number of unique reads representing each genome position). Other library construction parameters that impact library diversity are the amount and quality of input DNA (identical for all four methods in this study), and the efficiency with which sequenceable adapter-flanked molecules are generated. Library amplification only creates duplicates (but is necessary to complete adapter sequences and/or generate a sufficient amount of material for QC and sequencing if the input into library construction is low), and can profoundly skew the ratio in which unique adapter-flanked fragments are represented in the final library due to intrinsic biases of DNA polymerases. Besides factors intrinsic to the sequencing platform, these are the primary determinants of coverage depth and uniformity, and ultimately the amount of sequencing that has to be done.
All of the enzymatic fragmentation methods displayed more start site bias than Covaris shearing (the current industry standard). Nevertheless, coverage uniformity plots (Figure 4) and GC bias plots (Figure 5) indicated the following:
• Bias associated with enzymatic fragmentation in the HyperPlus workflow had no impact on overall coverage depth and uniformity, or GC bias, which were virtually identical for Hyper Prep and HyperPlus. In the HyperPlus method, minor start site bias is offset by the integrated workflow (which eliminates the physical transfer of material between fragmentation and library construction, and the associated loss of input DNA), and synergy between the fragmentation and library construction chemistries.
• The NEBNext workflow struggles with both AT- and GC-rich genomes, whereas Nextera performs poorly with AT-rich sequence, presumably as the result of more biased fragmentation and library amplification. This results in lumpy coverage and coverage hotspots, i.e., over-representation of “easy” (more GC-balanced) regions and under-representation of “difficult” (AT- and GC-rich) regions. With these methods, more sequencing has to be performed to achieve the requisite coverage for these regions, which increases cost and turnaround times.
Integrative Genomics Viewer (IGV) plots (Figure 6 on p. 6) illustrate the impact of different library construction methodologies on a more local level, for specific regions of the genome. Coverage tracks for 7 – 8 kb portions of the C. difficile toxin genes (tcdA and tcdB), and for an 11 kb genomic region of B. pertussis spanning genes encoding the pertussis toxin, are given in these examples. With the Hyper Prep and HyperPlus workflows, similar and highly uniform coverage was achieved across each region. The NEBNext and Nextera methods yielded a much more uneven distribution of aligned reads, especially for the AT-rich C. difficile toxin genes. The NEBNext data have two gaps in the GC-rich B. pertussis toxin-encoding region, for which virtually no reads are available.
Figure 3. Start site complexity plots
[Twelve panels: Hyper Prep, HyperPlus, NEBNext, and Nextera for each of B. pertussis, E. coli, and C. difficile. x-axis: Position (-10 to +30); y-axis: nucleotide content (%), 0 – 50; traces: A, C, G, T.]
Nucleotide content over a 40 bp window (-10 to +30 bp relative to read alignment start) for C. difficile (29% GC), E. coli (51% GC), and B. pertussis (68% GC), for libraries prepared with the Hyper Prep, HyperPlus, NEBNext and Nextera workflows.
If all three library construction processes (fragmentation, adapter addition, and library amplification) were completely sequence non-specific, each base (A, C, G, and T) would be represented by a perfectly flat, horizontal line with a y-axis value corresponding to the average prevalence of that nucleotide in the genome. For example, the A and T plots for C. difficile (29% genomic GC content) would be superimposed, and have a value of ~35% for each position, whereas the C and G plots would both have a value of ~15% at each position.
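The flat-line expectation described in the Figure 3 caption follows directly from the genomic GC content. The sketch below computes that expectation and, for comparison, the observed per-position composition of a set of read starts. It assumes A and T (and likewise G and C) are equally frequent, which is an approximation; the function names and the toy sequences are hypothetical and not taken from the study.

```python
from collections import Counter


def expected_base_percent(gc_percent):
    """Flat-line expectation per base, assuming A = T and G = C."""
    at_each = (100.0 - gc_percent) / 2
    gc_each = gc_percent / 2
    return {"A": at_each, "T": at_each, "G": gc_each, "C": gc_each}


def per_position_percent(sequences):
    """Percent A/C/G/T at each position across equal-length sequences."""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")
    profile = []
    for pos in range(length):
        counts = Counter(seq[pos] for seq in sequences)
        total = sum(counts[base] for base in "ACGT")
        profile.append({base: 100.0 * counts[base] / total for base in "ACGT"})
    return profile


# C. difficile, 29% genomic GC content.
expected = expected_base_percent(29)

# Toy window of read starts (hypothetical data, two positions only).
observed = per_position_percent(["AC", "AG", "TC", "AC"])
```

For 29% GC the expectation is 35.5% for A and T and 14.5% for C and G, matching the approximate values quoted in the caption. Start site bias shows up as positions where the observed profile departs from these flat lines.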
Figure 5. GC-bias plots
[Three panels, one per bacterium. x-axis: GC% of 100 bp window bins; y-axes: normalized coverage and number of windows (x 10^3); traces: Hyper Prep, HyperPlus, NEBNext, Nextera.]
Plots were generated with Picard CollectGCBiasMetrics. Gray histograms represent the distribution of genomic GC content for each bacterium, calculated for the reference sequence in 100 bp bins. GC bias was assessed by plotting normalized coverage for each bin, for the Hyper Prep, HyperPlus, NEBNext, and Nextera workflows. If all sample-to-data processes (fragmentation, adapter addition, library amplification, cluster amplification, sequencing, and data analysis) were completely unbiased, all bins would be equally represented, i.e., the plot for each workflow would be a horizontal distribution centered on a normalized coverage of 1.
For C. difficile, near-perfect data was obtained for both the Hyper Prep and HyperPlus workflows. In contrast, bins with a more balanced GC content (30 – 50% GC) were over-represented in the NEBNext and Nextera C. difficile data, at the expense of bins with extremely low GC content (<30% GC). The NEBNext and Nextera workflows generally performed better in balanced and slightly GC-rich regions (40 – 65% GC), as compared to AT-rich regions. All workflows performed poorly with respect to the extremely GC-rich (>70% GC) bins of B. pertussis, where limitations inherent to the sequencing technology start to dominate.
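The idea of normalized coverage per GC bin can be sketched in a few lines. This is a simplified illustration of the concept behind the plots, not Picard's implementation: each reference window is assigned a GC percentage and a read count, and the mean read count of each GC bin is divided by the genome-wide mean. The function name and the toy input are hypothetical.

```python
from collections import defaultdict


def normalized_coverage_by_gc(windows):
    """Normalized coverage per GC bin.

    `windows` is an iterable of (gc_percent, read_count) tuples, one per
    fixed-size reference window. A value of 1.0 means the bin is covered
    at the genome-wide average; values below 1.0 indicate under-representation.
    """
    reads_per_bin = defaultdict(int)
    windows_per_bin = defaultdict(int)
    for gc_percent, read_count in windows:
        reads_per_bin[gc_percent] += read_count
        windows_per_bin[gc_percent] += 1

    total_windows = sum(windows_per_bin.values())
    if total_windows == 0:
        raise ValueError("no windows supplied")
    genome_mean = sum(reads_per_bin.values()) / total_windows

    return {
        gc: (reads_per_bin[gc] / windows_per_bin[gc]) / genome_mean
        for gc in sorted(windows_per_bin)
    }


# Toy example: AT-rich windows receive half the reads of balanced windows.
biased = normalized_coverage_by_gc([(30, 10), (30, 10), (50, 20), (50, 20)])
# Unbiased case: every bin sits at 1.0.
flat = normalized_coverage_by_gc([(30, 10), (50, 10)])
```

In the biased toy example the 30% GC bin comes out at about 0.67 and the 50% GC bin at about 1.33, the same pattern of over- and under-representation the caption describes for the NEBNext and Nextera C. difficile data.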
Figure 6. Selected IGV plots
[Three IGV panels: B. pertussis toxin genes (11 kb region), C. difficile tcdA, and C. difficile tcdB; tracks in each panel: Hyper Prep, HyperPlus, NEBNext, Nextera, Sequence, Gene.]
Coverage tracks generated with the Hyper Prep, HyperPlus, NEBNext, and Nextera workflows, for 7 – 8 kb portions of the two C. difficile toxin genes, tcdA and tcdB, and an 11 kb region containing pertussis toxin-encoding genes. Each magenta (read 1) and purple (read 2) line represents a trimmed, aligned read. Gray plots represent coverage depth, whereas the colored track at the bottom of each plot represents the DNA sequence (A = green, C = blue, G = yellow and T = red). Areas of low or lumpy coverage for the AT-rich C. difficile toxin genes with the NEBNext and Nextera workflows are highlighted, as are two regions of the GC-rich pertussis toxin genomic region for which virtually no reads were obtained with NEBNext.
Figure 4. Coverage uniformity plots
[Three panels: C. difficile, E. coli, B. pertussis. x-axis: depth of coverage (0 – 100); y-axis: fraction of genome; traces: Hyper Prep, HyperPlus, NEBNext, Nextera.]
Data for all libraries were down-sampled to ~900,000 reads and coverage calculated using Bedtools. The Hyper Prep and HyperPlus workflows yielded highly similar coverage profiles, with a sharp peak and negligible tails for all three bacteria, indicating uniform coverage. In contrast, the NEBNext and Nextera workflows yielded a broader distribution for the genomes with unbalanced GC content, and/or lower mode coverage depth.
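A coverage uniformity plot of this kind is a histogram of per-base depth: for each depth value, the fraction of genome positions covered at exactly that depth. The sketch below builds such a histogram from a list of per-base depths. It is an illustration of the concept only and does not reproduce the Bedtools output format; the function names and the toy depths are hypothetical.

```python
from collections import Counter


def coverage_histogram(depths):
    """Fraction of genome positions observed at each depth of coverage."""
    if not depths:
        raise ValueError("no depth values supplied")
    counts = Counter(depths)
    n_positions = len(depths)
    return {depth: counts[depth] / n_positions for depth in sorted(counts)}


def mode_depth(histogram):
    """Depth covering the largest fraction of the genome (the plot's peak)."""
    return max(histogram, key=histogram.get)


# Toy per-base depths for a six-position "genome" (hypothetical data).
hist = coverage_histogram([10, 10, 11, 9, 10, 0])
peak = mode_depth(hist)
```

A uniform library gives a narrow histogram concentrated around the peak; a broad histogram, or a noticeable fraction of positions at or near zero depth, corresponds to the "lumpy" coverage discussed in the text.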
De novo Assembly
Microbial WGS often requires de novo sequence assembly, e.g., when a reference genome is not available or novel genes are being interrogated. Since longer inserts with a tight size distribution facilitate de novo assembly, library construction protocols that provide for tunable fragmentation and size selection are essential. All three of the ligation-based library construction methods used in this study (Hyper Prep, HyperPlus, and NEBNext) met these criteria, whereas the Nextera protocol offered significantly less flexibility.
The four library construction methods were compared with respect to key de novo assembly metrics, namely number of contigs, length of longest contig, and N50 length (Figure 7). The Hyper Prep and HyperPlus workflows outperformed NEBNext and Nextera with respect to all metrics.
The GC-rich B. pertussis genome proved to be the most difficult to assemble from data generated with all four methods, with more and shorter contigs and a significantly lower N50 length as compared to the E. coli and C. difficile genomes.
Despite the fact that the Hyper Prep and HyperPlus methods yielded the shortest mode fragment length (see Table 3), higher and more uniform coverage translated to fewer and longer contigs, and longer N50 lengths, particularly for the AT-rich C. difficile genome. The HyperPlus method performed similarly to or better than the industry-leading Hyper Prep method across all three assemblies, confirming that minor start site bias associated with enzymatic fragmentation is efficiently offset by other benefits of the integrated workflow.
Meeting the Challenges of Large-scale Genome Projects
The 100K Pathogen Genome Project is an innovative collaboration between the US Food and Drug Administration (FDA); the University of California, Davis; Agilent Technologies; and the Centers for Disease Control and Prevention (CDC). The project aims to create the largest public database of foodborne pathogen draft genomes, to support public health and research activities related to pathogen surveillance and outbreak management; the diagnosis and epidemiology of emerging pathogens; microbial genome variation and evolution; and new gene discovery.
The 100K Project originally selected the KAPA HTP Library Preparation Kit for library construction, due to the high library construction efficiency and coverage uniformity achievable with Kapa’s optimally formulated and evolved enzymes, and highly optimized “with bead” protocol. Approximately 10,000 draft genomes have been assembled to date from libraries prepared with this chemistry. However, mechanical shearing proved to be a major bottleneck in establishing a fast and robust sample preparation pipeline, prompting the transition to the streamlined, fully automatable KAPA HyperPlus workflow, with integrated enzymatic fragmentation.
Initial evaluation of the KAPA HyperPlus workflow focused on the quality and utility of data generated using Kapa’s novel enzymatic fragmentation solution instead of Covaris shearing. Routine quality control analysis (not shown) confirmed that the standard workflow with post-ligation size selection produced high-quality sequence data. Slight bias in the nucleotide distribution for the first 10 positions of trimmed reads was observed, but this did not appear to have any impact on the uniqueness or GC content distribution of reads.
Figure 7. De novo assembly metrics
[Three bar charts: number of contigs, largest contig (kbp), and N50 length (kbp), each for B. pertussis, E. coli, and C. difficile; bars: Hyper Prep, HyperPlus, NEBNext, Nextera.]
The Hyper Prep, HyperPlus, NEBNext, and Nextera workflows were compared with respect to three key de novo assembly metrics. De novo assembly is achieved by the appropriate arrangement of overlapping contigs (collections of overlapping reads without gaps). High coverage depth and uniformity, and low bias result in longer and fewer contigs, and longer N50 lengths, which facilitate assembly. The N50 length is a weighted median contig length (50% of the entire assembly is contained in contigs equal to or larger than this value).
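The N50 definition in the caption translates into a short calculation: sort the contigs from longest to shortest and walk down the list until the running total reaches half of the assembly size. The sketch below is a minimal implementation of that definition; the function name and the example contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least 50% of the assembly."""
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downward, accumulating assembled bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs supplied")


# Toy assembly: 290 bases in total, so the halfway point is 145 bases.
example = n50([80, 70, 50, 40, 30, 20])
```

In the toy assembly the two longest contigs (80 + 70 = 150) already cover more than half of the 290 bases, so the N50 is 70. Fewer, longer contigs push this value up, which is why N50 is used alongside contig count and largest contig as a measure of assembly contiguity.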
De novo assembly data for thirteen representative genomes, spanning a genome GC content range from 30 – 70%, are given in Table 4. Average coverage was high (~170X) and very consistent across all the genomes. Calculated GC content correlated extremely well with predicted GC content, where available. The number of assembled contigs ranged between 10 and 80, with Vibrio (43% GC) proving the most difficult to assemble. The average N50 length for the set of assemblies (426,813 bp) was significantly (3X) longer than the N50 length achieved in the comparative library construction experiment.
With the KAPA HyperPlus workflow, automated liquid handling, and high-throughput QC assays, the turnaround time for 96 samples, from isolated DNA to sequencing-ready pool, has been reduced by approximately 50%, with a concomitant improvement in success rates. A higher degree of multiplexing (384 libraries/lane) is currently being implemented to further optimize coverage and overall sequencing cost.
CONCLUSIONS
The KAPA HyperPlus Library Preparation Kit is the ideal solution for high-throughput microbial whole genome sequencing. The rapid, one-tube protocol is fully automatable; robust across a wide range of genome GC contents; and offers flexibility with respect to the amount of input DNA, library fragment size, adapter design, barcoding strategy, and library amplification, while also supporting PCR-free workflows. The combination of a novel, low-bias fragmentation reagent, highly efficient library construction chemistry, and a low-bias amplification enzyme yields high and uniform coverage, thereby facilitating de novo assembly and minimizing sequencing cost.
Table 4. De novo assembly metrics for representative genomes produced by the 100K Pathogen Genome Project

| Genus and species | Read count (PF reads) | Predicted coverage | Average coverage (calculated from assembly) | Predicted genome %GC | Calculated %GC | Estimated genome size (bp) | Assembly size (bp) | Assembled contigs | N50 length | Number of annotated genes |
|---|---|---|---|---|---|---|---|---|---|---|
| Staphylococcus aureus | 3,658,490 | 252 | 168 | 32 | 32 | 2,900,000 | 3,098,172 | 44 | 195,208 | 3,055 |
| Staphylococcus aureus | 3,263,540 | 225 | 169 | 32 | 32 | 2,900,000 | 2,936,030 | 45 | 303,273 | 2,798 |
| Micrococcus sp. | 917,292 | 73 | 172 | N/A | 34 | 2,500,000 | 2,686,627 | 41 | 236,126 | 2,737 |
| Micrococcus sp. | 1,168,170 | 93 | 176 | N/A | 34 | 2,500,000 | 2,675,965 | 44 | 477,672 | 2,709 |
| Listeria monocytogenes | 2,404,730 | 161 | 164 | 38 | 36 | 2,990,000 | 2,926,186 | 16 | 477,671 | 2,999 |
| Listeria monocytogenes | 2,433,480 | 163 | 165 | 38 | 36 | 2,990,000 | 2,915,543 | 21 | 475,683 | 2,983 |
| Vibrio parahaemolyticus | 6,058,830 | 233 | 166 | 43 | 44 | 5,200,000 | 5,398,332 | 77 | 225,233 | 4,996 |
| Vibrio parahaemolyticus | 3,200,250 | 123 | 167 | 43 | 44 | 5,200,000 | 5,370,088 | 72 | 251,504 | 4,977 |
| Salmonella bnamdala | 4,360,100 | 194 | 173 | 51 | 50 | 4,500,000 | 4,879,757 | 55 | 266,584 | 4,605 |
| Pseudomonas tremae | 1,421,780 | 46 | 170 | 66 | 61 | 6,200,000 | 6,846,272 | 25 | 832,752 | 6,293 |
| Pseudomonas tremae | 5,258,040 | 170 | 174 | 66 | 62 | 6,200,000 | 6,846,700 | 26 | 768,971 | 6,288 |
| Microbacterium sp. | 1,676,190 | 91 | 175 | 65-75 | 69 | 3,700,000 | 3,504,750 | 13 | 786,597 | 3,396 |
| Microbacterium sp. | 1,583,760 | 86 | 171 | 65 | 70 | 3,700,000 | 3,502,614 | 26 | 251,299 | 3,396 |
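
Two of the metrics in Table 4, predicted coverage and N50 length, follow from simple arithmetic, sketched below. This is a minimal illustration and not the pipeline used by the 100K Pathogen Genome Project. The function names are hypothetical. The figure of 200 sequenced bases per counted read is an assumption: it is not stated alongside the table, but it reproduces the predicted coverage values listed there. The N50 function uses the common definition, namely the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the assembly size.

```python
def predicted_coverage(read_count, bases_per_read, genome_size_bp):
    """Expected fold coverage: total sequenced bases / estimated genome size."""
    return read_count * bases_per_read / genome_size_bp


def n50(contig_lengths):
    """Contig length at which the running total, taken from the longest
    contig down, first reaches half of the total assembly size."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


# First Staphylococcus aureus row of Table 4; 200 bases per read is an
# assumption (not stated with the table) that matches the listed value.
coverage = predicted_coverage(3_658_490, 200, 2_900_000)

# Toy contig set (hypothetical lengths, not taken from the table).
toy_n50 = n50([100, 80, 60, 40, 20])
```

With these inputs, the coverage rounds to the 252 listed in the table, and the toy contig set gives an N50 of 80.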
© 2015 Kapa Biosystems. All trademarks are the property of their respective owners. APP109001 - v1.15 05/15