Long-read, whole-genome shotgun sequence data for five model organisms

Single molecule, real-time (SMRT) sequencing from Pacific Biosciences is increasingly used in many areas of biological research including de novo genome assembly, structural-variant identification, haplotype phasing, mRNA isoform discovery, and base-modification analyses. High-quality, public datasets of SMRT sequences can spur development of analytic tools that can accommodate unique characteristics of SMRT data (long read lengths, lack of GC or amplification bias, and a random error profile leading to high consensus accuracy). In this paper, we describe eight high-coverage SMRT sequence datasets from five organisms (Escherichia coli, Saccharomyces cerevisiae, Neurospora crassa, Arabidopsis thaliana, and Drosophila melanogaster) that have been publicly released to the general scientific community (NCBI Sequence Read Archive ID SRP040522). Data were generated using two sequencing chemistries (P4C2 and P5C3) on the PacBio RS II instrument. The datasets reported here can be used without restriction by the research community to generate whole-genome assemblies, test new algorithms, investigate genome structure and evolution, and identify base modifications in some of the most widely-studied model systems in biological research.
Kristi E. Kim¹, Paul Peluso¹, Primo Babayan¹, P. Jane Yeadon², Charles Yu³, William W. Fisher³, Chen-Shan Chin¹, Nicole A. Rapicavoli¹, David R. Rank¹, Joachim Li⁴, David E.A. Catcheside², Susan E. Celniker³, Adam M. Phillippy⁵, Casey M. Bergman⁶ & Jane M. Landolin¹
Design Type(s): observation design; genome sequencing; Shotgun Sequencing
Measurement Type(s): DNA sequencing
Technology Type(s): PacBio RS II
Factor Type(s):
Sample Characteristic(s): Escherichia coli str. K-12 substr. MG1655; Saccharomyces cerevisiae W303; Neurospora crassa OR74A; Neurospora crassa; Arabidopsis thaliana; Drosophila melanogaster
¹Pacific Biosciences of California Inc., 1380 Willow Road, Menlo Park, California 94025, USA. ²Flinders University, School of Biological Sciences, PO Box 2100, Adelaide, South Australia 5001, Australia. ³Department of Genome Dynamics, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, California 94720, USA. ⁴Department of Microbiology and Immunology, UCSF, San Francisco, California 94158, USA. ⁵National Biodefense Analysis and Countermeasures Center, 110 Thomas Johnson Drive, Frederick, Maryland 21702, USA. ⁶Faculty of Life Sciences, University of Manchester, Oxford Road, Manchester M13 9PT, UK.
Correspondence and requests for materials should be addressed to J.M.L. (email: jlandolin@pacificbiosciences.com)
OPEN
SUBJECT CATEGORIES: Genomics; Genome assembly algorithms; DNA sequencing; Experimental organisms
Received: 08 August 2014
Accepted: 03 October 2014
Published: 25 November 2014
www.nature.com/scientificdata
SCIENTIFIC DATA |1:140045 |DOI: 10.1038/sdata.2014.45 1
Background & Summary
Single-molecule, real-time (SMRT®) DNA sequencing occurs by optically detecting a fluorescent signal when a nucleotide is being incorporated by a DNA polymerase[1]. This relatively new technology produces sequence data with unique characteristics: long read lengths, lack of GC bias, and random error profiles that yield highly accurate consensus sequences. Kinetic information such as pulse width and interpulse duration is also recorded and can be used to detect base modifications[2,3]. Since its introduction, investigators have published on a range of applications using SMRT sequencing. For example, the developers of GATK (Genome Analysis Toolkit) demonstrated that single nucleotide polymorphisms (SNPs) could be detected using SMRT sequences[4,5] due to their lack of context-specific bias and systematic error[5,6]. Likewise, the developers of PBcR (PacBio error correction)[7,8] showed that complete bacterial genome assemblies using SMRT sequence data were achievable and had greater than Q60 consensus base quality[8]. PBcR was later incorporated as the "pre-assembly" step in the HGAP (hierarchical genome assembly process) system[9], followed by consensus polishing using the Quiver algorithm[9] to produce a complete assembly pipeline in SMRT Analysis, a free and open-source software suite released by Pacific Biosciences. In addition, other third-party tools now support long reads for various applications such as mapping[10,11], scaffolding[12], structural-variation discovery[13], and genome assembly[7,14]. Other applications such as 16S rRNA sequencing[15], characterization of entire transcriptomes in chickens[16] and humans[17], genome-editing studies[18], base-modification studies[19–22], and validation of CRISPR targets[23] have also been published. Several datasets from this publication have already been used to develop the MinHash Alignment Process (MHAP), a new method for fast and efficient overlap of long reads for assembling large genomes[24].
To encourage interest in further applications and tool development for SMRT sequence data, we report here the release of eight whole-genome shotgun-sequence datasets from five model organisms (E. coli, S. cerevisiae, N. crassa, A. thaliana, and D. melanogaster). These organisms have among the most complete and well-annotated reference genome sequences, due to continual refinement by dedicated teams of scientists. Despite continued improvement of these genome sequences with new technologies, few are completely finished with fully contiguous assemblies of all chromosomes. The gaps remaining arise from complex structures such as transposable elements, repeats, segmental duplications, or other dynamic regions of the genome that cannot be easily assembled. Structural differences in these regions can account for variability in millions of nucleotides within every genome, and mounting evidence suggests that such mutations are important for human diversity and disease susceptibility in many complex traits[25], including autism and schizophrenia[26]. SMRT sequencing data can therefore play an important role in the completion of these and other reference genomes, providing a platform for new insights into genome biology.
Methods
We generated eight whole-genome shotgun-sequence datasets from five model organisms using the P4C2 or P5C3 polymerase and chemistry combinations, totaling nearly 1000 gigabytes (GB) of raw data (see Data Records section). Genomic DNA was either purchased from commercial sources or generously provided by collaborators.
Genomic DNA sample summaries are provided in Table 1. DNA from the reference K12 strain of E. coli was purchased from Lofstrand Labs Limited (K12 MG1655 E. coli, cat# L3-4001SP2). DNA from the reference OR74A strain of N. crassa was purchased from the Fungal Genetics Stock Center (FGSC). A standard Ler-0 strain of A. thaliana plants was grown from seeds purchased from Lehle Seeds (WT-04-19-02) and DNA was extracted at Pacific Biosciences. The protocol is available on Sample Net[27] and summarized in the organism-specific methods section of this paper. DNA from the 9464 strain of S. cerevisiae was provided by J. Li at University of California San Francisco. The 9464 strain is a daughter of the reference W303 strain. DNA from the T1 strain of N. crassa was obtained from D. Catcheside at Flinders University, who has an interest in polymorphic genes regulating recombination. The T1 strain is an A mating type strain which, like OR74A, was derived from a cross between the Em a 5297 and Em A 5256 strains. DNA from the ISO1 strain[28] of D. melanogaster was obtained from S. Celniker at Lawrence Berkeley National Laboratory. This is the reference strain of D. melanogaster that was originally chosen to be the first large genome to be sequenced and assembled using a whole-genome shotgun approach[29]. It continues to serve as the reference strain in subsequent releases and numerous annotations of the D. melanogaster genome.

Dataset Name | Sample ID | DNA extraction | gDNA size (kb) | Shearing | Size selection
E. coli MG1655 P4C2 | SAMN02951645 | ammonium acetate or SDS, proteinase K, phenol-chloroform | 17 | none | Blue Pippin (7 kb)
E. coli MG1655 P5C3 | SAMN02743420 | ammonium acetate or SDS, proteinase K, phenol-chloroform | 17 | none | Blue Pippin (7 kb)
S. cerevisiae 9464 P4C2 | SAMN02731377 | Qiagen genomic DNA buffer set | 40 | g-TUBE | Blue Pippin (17 kb)
N. crassa OR74A P4C2 | SAMN02724975 | BashingBeads, Zymo Research kit | 6 | none | Blue Pippin (4 kb)
N. crassa T1 P4C2 | SAMN02724976 | SDS, proteinase K, phenol-chloroform, RNAase, isopropanol | 15 | none | Blue Pippin (7 kb)
A. thaliana Ler-0 P4C2 | SAMN02731378 | CTAB, chloroform:isoamyl, isopropanol precip. | 40 | g-TUBE | Blue Pippin (7 kb)
A. thaliana Ler-0 P5C3 | SAMN02724977 | CTAB, chloroform:isoamyl, isopropanol precip. | 40 | g-TUBE | Blue Pippin (15 kb)
D. melanogaster ISO1 P5C3 | SAMN02614627 | SDS, phenol-chloroform, CsCl banding, ethanol precip. | 40 | g-TUBE | Blue Pippin (17 kb)

Table 1. Summary of DNA samples. The NCBI sample ID associated with each dataset is provided. DNA was extracted in a species-specific manner, yielding genomic DNA of various sizes. All DNA was size selected using the Blue Pippin system (Sage Science), and select samples were sheared with g-TUBEs (Covaris).
DNA extraction methods were species-specific and optimized for each organism (see organism-specific methods below). In general, the steps are: (1) remove debris and particulate material, (2) lyse cells, (3) remove membrane lipids, proteins and RNA, and (4) purify the DNA.
SMRTbell libraries for sequencing[4] were prepared using either 10 kb[30,31] or 20 kb[32] preparation protocols to optimize for the longest, highest-quality reads. The main steps for library preparation are: (1) shearing, (2) DNA damage repair, (3) blunt-end ligation with hairpin adapters supplied in the DNA Template Prep Kit 2.0 (Pacific Biosciences), (4) size selection, and (5) binding to polymerase using the DNA Sequencing Kit 3.0 (Pacific Biosciences).
E. coli collection, DNA extraction, and SMRTbell library preparation
Both P4C2 and P5C3 samples were prepared in the same way. E. coli K12 genomic DNA was ordered from and purified by Lofstrand Labs Limited (K12 MG1655 E. coli, cat# L3-4001SP2). Field Inversion Gel Electrophoresis (FIGE) was run to ensure presence of high-molecular-weight gDNA. Ten micrograms of gDNA was sheared using g-TUBE devices (Covaris, Inc) spun at 5,500 r.p.m. (2029 g) on the MiniSpin Plus (Eppendorf) for 1 min. Three microliters of elution buffer (EB) was added to rinse the upper chamber, which was spun at 6,000 r.p.m. (2415 g) and, after inverting the g-TUBE device, spun again at 5,500 r.p.m. (2029 g) on the MiniSpin Plus (Eppendorf). SMRTbell libraries were created using the "Procedure & Checklist - 20 kb Template Preparation using BluePippin Size Selection" protocol[32]. Briefly, the library was run on a BluePippin system (Sage Science, Inc., Beverly, MA, USA) to select for SMRTbell templates greater than 10 kb. The resulting average insert size was 17 kb based on the 2100 Bioanalyzer instrument (Agilent Technologies Genomics, Santa Clara, CA, USA). Sequencing primers were annealed to the hairpins of the SMRTbell templates, followed by binding with the P5 sequencing polymerase and MagBeads (Pacific Biosciences, Menlo Park, CA, USA). One SMRT Cell was run on the PacBio® RS II system with an on-plate concentration of 150 pM using P5C3 chemistry and a 180-minute data-collection mode.
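The r.p.m. and g values quoted for the MiniSpin Plus are related by the standard relative centrifugal force formula, RCF = 1.118 × 10^-5 × r × (r.p.m.)^2, with r in centimetres. The sketch below is only a consistency check: the 6 cm rotor radius is an assumption inferred from the quoted pairs, not a value stated in the text, and the function name is illustrative.

```python
def rcf_from_rpm(rpm: float, radius_cm: float = 6.0) -> float:
    """Relative centrifugal force (in multiples of g) for a given rotor speed.

    radius_cm = 6.0 is an assumption inferred from the r.p.m./g pairs
    quoted in the methods; it is not stated by the authors.
    """
    return 1.118e-5 * radius_cm * rpm ** 2


# Speeds quoted for g-TUBE shearing of the E. coli sample.
for rpm in (5500, 6000):
    print(f"{rpm} r.p.m. is approximately {rcf_from_rpm(rpm):.0f} g")
```

Under this assumed radius the two speeds give values matching the 2029 g and 2415 g quoted above.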
S. cerevisiae collection, DNA extraction, and SMRTbell library preparation
The 9464 strain is a MAT a haploid strain derived from a W303 reference strain following three integration events: (1) a construct (pTef2-dTomato-kanCMX6) was inserted near the CEN4 gene; (2) a construct (pTEF2-eGFP, natMX) was inserted near the ESP gene; and (3) the pds1 gene was deleted and replaced with the URA3 gene from Kluyveromyces lactis. Cells were grown to an OD600 of ~2, and 350 OD units of cells, corresponding to roughly 7 × 10^9 cells, were harvested by centrifugation. Cells were washed in 4 ml of TE, then resuspended in 4 ml of Buffer Y1 (Qiagen genomic DNA prep) and spheroplasted by addition of 250 units of Zymolyase 100T (Seikagaku 120493) for 40 min at 30 °C. Spheroplasts were pelleted and resuspended in 5 ml of Qiagen Buffer G2 containing 300 micrograms of Qiagen RNAse to lyse the cells. Two milligrams of proteinase K was then added to the lysate, which was incubated at 50 °C for 30 min. The lysate was then centrifuged at 5,000 g for 10 min at 4 °C, and the supernatant was purified on a Qiagen 100/G genomic prep tip as per Qiagen instructions. The eluted DNA was spooled by addition of 3.5 ml of isopropanol to the 5 ml of eluate. The DNA was washed in 70% EtOH, air dried, and resuspended in 200 microliters of TE by slowly dissolving overnight at room temperature. SMRTbell libraries were created using the "Procedure and Checklist - 10 kb Template Preparation and Sequencing (with Low-Input DNA)" protocol[30]. Twelve SMRT Cells were run on the PacBio RS II system using P4C2 chemistry and a 180-minute data-collection mode.
N. crassa OR74A collection, DNA extraction, and SMRTbell library preparation
N. crassa OR74A was purchased from the FGSC (FGSC# 2489). Cells were inoculated to a density of 4 × 10^6 conidia in 75 ml Vogel's minimal medium (Medium N)[33] and incubated at room temperature with gentle shaking for approximately 48 h. Visual inspection of the culture prior to harvest showed only vegetative tissue, with no asexual sporulation or induction of aerial hyphae. Mycelia were blotted dry on sterile paper toweling and pulverized for approximately 30 s at half of the maximum setting in a Biospec Products Mini BeadBeater tissue disruptor using the disrupting beads provided with the Zymo Research ZR fungal/bacterial DNA midi prep kit. Tissue was removed with a sterile inoculating stick and 100 mg of dried mycelia per sample was processed according to the manufacturer's instructions. DNA was eluted into 100 μl sterile water and DNA from two samples was pooled. Yield was quantified using a Nanodrop system and also validated by agarose gel electrophoresis. The concentration was 32.57 ng/μl, A260 was 0.651, A280 was 0.373, 260/280 was 1.75 and 260/230 was 0.91. The genomic DNA was approximately 6 kb and was not sheared. SMRTbell libraries were created using the "Procedure and Checklist - 10 kb Template Preparation and Sequencing (with Low-Input DNA)" protocol[30]. Two SMRT Cells were run on the PacBio RS II system using P4C2 chemistry and a 180-minute data-collection mode.
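The spectrophotometer readings reported for the N. crassa OR74A sample can be checked with a short calculation. The sketch below assumes the conventional conversion of 50 ng/μl of double-stranded DNA per A260 unit at a 1 cm path length, which the text does not state explicitly; the function and constant names are illustrative.

```python
# Conventional conversion factor for double-stranded DNA (assumed here,
# not stated in the text): one A260 unit corresponds to 50 ng/ul.
DSDNA_NG_PER_UL_PER_A260 = 50.0


def dsdna_concentration(a260: float) -> float:
    """Estimate the dsDNA concentration (ng/ul) from the A260 reading."""
    return a260 * DSDNA_NG_PER_UL_PER_A260


def purity_ratio(a260: float, a280: float) -> float:
    """Return the A260/A280 ratio, a common indicator of protein contamination."""
    return a260 / a280


# Readings reported for the pooled OR74A sample.
a260, a280 = 0.651, 0.373
print(f"concentration: {dsdna_concentration(a260):.2f} ng/ul")
print(f"260/280 ratio: {purity_ratio(a260, a280):.2f}")
```

The estimate agrees with the reported 32.57 ng/μl to within the rounding of the A260 reading, and the ratio matches the reported 1.75.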
N. crassa T1 collection, DNA extraction, and SMRTbell library preparation
The T1 strain of N. crassa is an A mating type strain derived by D.G. Catcheside from a cross between the Em a 5297 and Em A 5256 strains he obtained from Stirling Emerson in 1955. The fungus was grown in shake culture for 72 h at 25 °C in 500 ml Vogel's N[34] minimal medium containing 2% sucrose. Mycelium was harvested by filtration, ground in liquid nitrogen, resuspended in 10 ml of a buffer containing 0.15 M NaCl, 0.1 M EDTA, 2% SDS at pH 9.5, and incubated overnight at 37 °C with 1 mg proteinase K. Debris was pelleted by centrifugation and 10 ml distilled water was added to the supernatant, which was extracted once with an equal volume of water-saturated phenol and once with chloroform. Nucleic acids were precipitated from the aqueous phase with 0.6 volumes of isopropanol. Following centrifugation, the pellet was dried and dissolved in 1 ml TE buffer (10 mM TRIS, 1 mM EDTA, pH 8.0). RNA and protein were digested by overnight incubation at 37 °C with RNAase (50 μg) followed by addition of proteinase K (50 μg) and further incubation for 2 h. The digest was extracted once with water-saturated phenol and once with chloroform. DNA was collected by precipitation with 0.6 volumes of isopropanol and, following centrifugation, the pellet was dried, dissolved in 500 μl TE buffer and stored at 4 °C. Field Inversion Gel Electrophoresis (FIGE) was run to ensure presence of high-molecular-weight gDNA. The genomic DNA was approximately 15 kb and was not sheared. SMRTbell libraries were created using the "Procedure and Checklist - 10 kb Template Preparation and Sequencing (with Low-Input DNA)" protocol[30]. Eighteen SMRT Cells were run on the PacBio RS II system using P4C2 chemistry and a 180-minute data-collection mode.
A. thaliana collection, DNA extraction, and SMRTbell library preparation
Plants were grown from seeds provided by Lehle Seeds (WT-04-19-02). Shoots and leaves were harvested at three weeks and ground in liquid nitrogen using a mortar and pestle. The complete procedure is described in the "Preparing Arabidopsis Genomic DNA for Size-Selected ~20 kb SMRTbell Libraries" protocol[35]. This protocol can be used to prepare purified Arabidopsis genomic DNA for size-selected SMRTbell templates with average insert sizes of 10–20 kb. We recommend starting with 20–40 grams of three-week-old Arabidopsis whole plants, which can generate >100 μg of purified genomic DNA. SMRTbell libraries were created using the "Procedure & Checklist - 20 kb Template Preparation using BluePippin Size Selection" protocol[32]. Eighty-five SMRT Cells were run on the PacBio RS II system using P4C2 chemistry and a 180-minute data-collection mode. Forty-six SMRT Cells were run on the PacBio RS II system using P5C3 chemistry and a 180-minute data-collection mode.
D. melanogaster collection, DNA extraction, and SMRTbell library preparation
A total of 1.2 g of adult male ISO1 flies, corresponding to 1,950 animals, were collected, starved for 90–120 min and frozen. The flies ranged in age from 0–7 days based on four collections: (1) 0–2 days old, 500 males, 0.33 g; (2) 0–4 days old, 500 males, 0.29 g; (3) 0–7 days old, 500 males, 0.29 g; (4) 0–2 days old, 450 males, 0.29 g. Flies were ground in liquid nitrogen to a fine powder and genomic DNA was purified by phenol-chloroform extraction and CsCl banding in the ultracentrifuge. Briefly, the pulverized fly extract was gently resuspended in 15 ml of HB buffer (7 M urea, 2% SDS, 50 mM Tris pH 7.5, 10 mM EDTA and 0.35 M NaCl) and 15 ml of 1:1 phenol/chloroform. The mixture was shaken slowly for 30 min and then centrifuged at 23,600 g for 10 min at 20 °C in a Sorvall HB-4 rotor. The aqueous phase was re-extracted twice as above and then precipitated by adding two volumes of ethanol and centrifuging at 23,600 g for 10 min at 20 °C in a Sorvall HB-4 rotor. The pellet was resuspended in 3 ml of TE (10 mM Tris, 1 mM EDTA, pH 8.0) by gentle inversion for 3 h. To the resuspended DNA, 3 g CsCl and 0.3 ml of 10 mg/ml ethidium bromide (EtBr) were added and the mixture was centrifuged at 199,000 g for 16 h at 15 °C in a Beckman VTi 65.2 rotor. The EtBr was removed by extraction with water-saturated butanol, performed three times in a Beckman JA-12 rotor at 13,000 g for 5 min at 4 °C each time. The DNA was diluted three-fold with TE, 1/10 volume of 4 M NaCl was added, and the DNA was precipitated with two volumes of ethanol. Centrifugation for the precipitation step was done in a JA-12 rotor at 16,000 g for 30 min at 4 °C. After centrifugation, the pellet was washed in 70% ethanol. The DNA was resuspended in 100 μl TE at a concentration of 1.4 μg/μl and quantified using a Nanodrop instrument. This protocol routinely yields at least 10 ng DNA per mg of flies with an estimated DNA size >100 kb. Genomic DNA was sheared using a g-TUBE device (Covaris) at 4800 r.p.m. (1545 g) on the MiniSpin Plus (Eppendorf) at 150 ng/μl and purified using a 0.45× volume ratio of AMPure PB beads. SMRTbell libraries were created using the "Procedure & Checklist - 20 kb Template Preparation using BluePippin Size Selection" protocol[32]. Libraries were ligated with excess adapters and an overnight incubation was performed to increase the yield of ligated fragments larger than 20 kb. Smaller fragments and adapter dimers were then removed by >15 kb size selection using the BluePippin DNA size selection system by Sage Science. Forty-two SMRT Cells were run on the PacBio RS II system. The first run was composed of four SMRT Cells, loaded at 75 pM, 150 pM, 300 pM, and 400 pM in order to determine the optimal loading concentration of the sample. The remaining 38 SMRT Cells were loaded at 400 pM.
Data Records
After DNA extraction, libraries were generated and sequenced at Pacific Biosciences of California, uploaded to Amazon Web Services' Simple Storage Service (S3), and then submitted to the Sequence Read Archive (SRA) at NCBI under Project ID SRP040522 (Data Citation 1). The corresponding accession numbers and file sizes are listed in Table 2. More detailed information, including md5 checksums and links to download the original data from AWS S3, is provided in Supplementary Table 1.
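Because md5 checksums accompany the download links, a downloaded file can be verified before analysis. The following is a minimal sketch using only the Python standard library; the function names and the example file name are hypothetical, and the expected checksum would be copied from Supplementary Table 1.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the md5 hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte downloads do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path: str, expected_md5: str) -> bool:
    """Return True if the file's md5 matches the published checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()


# Hypothetical usage (file name and checksum are placeholders):
# ok = verify_download("example.1.bax.h5", "<md5 from Supplementary Table 1>")
```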
Raw data were transferred from the instrument to a storage location and organized first by the run name, and then by the SMRT Cell directory. Each run contained one or more SMRT Cells. Each SMRT Cell produced a metadata.xml file that recorded the run conditions and barcodes of sequencing kits, three bax.h5 files that contained base call and quality information of actual sequenced data, and one bas.h5 file that acted as a pointer to consolidate the three bax.h5 files. The "h5" suffix denotes that these are Hierarchical Data Format 5 (HDF5) files. The specific contents and structure of a PacBio bax.h5 file are described in more detail in online documentation[36].
Recall that the SMRTbell structure that underwent sequencing was created by the library preparation process[4]. Sequenced SMRTbell templates corresponded to raw reads that may pass around the same base multiple times. A raw read could therefore have a structure that is composed of left adapter → DNA insert → right adapter → reverse complement of DNA insert → left adapter → DNA insert, and so on. This raw read is typically processed downstream to remove adapters and create subreads composed of the DNA sequence of interest to the investigator. See the Technical Validation section for details on filtering parameters and the free software used to analyze and quality check the data.
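The relationship between a raw read and its subreads can be illustrated with a toy example. This is not how SMRT Analysis detects adapters (real reads contain errors, so adapters are located by alignment rather than exact matching); the adapter sequence, the use of one sequence for both adapters, and all function names below are assumptions made purely for illustration.

```python
# Hypothetical adapter sequence, used only for this toy example.
ADAPTER = "TCTCTCAACAAC"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_subreads(raw_read: str, adapter: str = ADAPTER) -> list:
    """Cut an error-free raw read at exact adapter matches and
    drop the empty pieces left at the ends."""
    return [piece for piece in raw_read.split(adapter) if piece]


def orient_subreads(subreads: list) -> list:
    """Flip every second subread so that all passes read the same strand."""
    return [s if i % 2 == 0 else reverse_complement(s)
            for i, s in enumerate(subreads)]


insert = "ACCATTAC"
# adapter, insert, adapter, reverse complement of insert, adapter, insert
raw_read = (ADAPTER + insert + ADAPTER + reverse_complement(insert)
            + ADAPTER + insert)
subreads = split_subreads(raw_read)
print(subreads)
print(orient_subreads(subreads))
```

In this idealized case the three passes collapse to the same insert sequence once oriented, which is the redundancy that consensus calling exploits.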
All datasets were filtered and mapped using SMRT Analysis v2.0.1, v2.1.0, or v2.2.0. There are no changes to the filtering or mapping parameters in these versions, and detailed parameters used are discussed in the Technical Validation section. SMRT Analysis is a free software suite that can be downloaded from the Pacific Biosciences Developers Community Network Website (DevNet)[37]. SMRT Analysis includes the SMRT Portal graphical user interface, as well as the SMRT View genome browser. Extensive documentation and technical support can also be accessed via the DevNet website.
The post-filter statistics of each dataset are listed in Table 3. While read lengths reflect the true sequencing capacity of the instrument, only subreads are summarized in Table 3 because they are the sequences used by downstream analysis algorithms such as de novo assemblers. Multiple subreads can be contained within one raw read, and subreads exclude adapters and low-quality sequence. N50 is a statistic used to describe the length distribution of a collection of reads, contigs, or scaffolds, and is defined as the length where 50% of all bases are contained in sequences of that length or longer. The N50 filtered subread lengths ranged from 7.6–10.5 kb for datasets generated with P4C2 chemistry and from 12.2–14.2 kb for datasets generated with P5C3 chemistry. With the exception of N. crassa OR74A, all datasets were sequenced to high coverage (>68X), sufficient for de novo genome assembly applications. The N. crassa OR74A dataset was sequenced to 25X coverage and should be sufficient for mapping, consensus SNP calling, and testing other applications.
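The N50 definition above can be made concrete with a short function. This is a minimal sketch of the usual computation (sort lengths from longest to shortest and report the length at which the running total first reaches half of all bases); it is not the SMRT Portal implementation, and the example lengths are invented.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Lengths are sorted from longest to shortest; the N50 is the length of
    the sequence at which the cumulative sum first reaches half of the
    total number of bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("N50 is undefined for an empty collection")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Invented subread lengths (nt), for illustration only: 30 nt in total,
# and the two longest sequences (10 + 6) already cover half of it.
print(n50([2, 3, 4, 5, 6, 10]))
```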
Technical Validation
DNA and sample preparation
To assess the quality of genomic DNA received, we used Qubit (Life Technologies) and Nanodrop (Thermo Scientific) to measure the concentration of genomic DNA. Ideal samples had similar concentration estimates on both platforms, with A230/260/280 ratios close to 1:1.8:1, corresponding to what is expected of pure DNA. All samples presented here passed this screening criterion.
Next we assessed the size of the genomic DNA received. For genomic DNA where the size range was less than 17 kb, we used the 2100 Bioanalyzer (Agilent) to determine the actual size distribution. For genomic DNA where the size range was greater than 17 kb, we opted for pulsed-field gel electrophoresis to better estimate the larger size distributions. The sizes of the genomic DNA for each sample are listed in Table 1.
Organism | Strain | Origin | Polymerase & Chemistry Library kits | SRA Accession | Size (GB)
E. coli | MG1655 | Lofstrand Labs | P4C2 | SRX669475 | 6.0
E. coli | MG1655 | Lofstrand Labs | P5C3 | SRX533603 | 3.8
S. cerevisiae | 9464 | J. Li | P4C2 | SRX533604 | 38
N. crassa | OR74A | FGSC | P4C2 | SRX533605 | 29
N. crassa | T1 | D. Catcheside | P4C2 | SRX533606 | 143
A. thaliana | Ler-0 | Lehle Seeds | P4C2 | SRX533608 | 263
A. thaliana | Ler-0 | Lehle Seeds | P5C3 | SRX533607 | 252
D. melanogaster | ISO1 | S. Celniker | P5C3 | SRX499318 | 187

Table 2. Summary of datasets. Eight datasets from five organisms are described in this paper. Data can be accessed from SRA using the accession numbers provided.
To ensure that the library insert sizes were in the optimal size range, we sheared genomic DNA using g-TUBEs if the apparent size was greater than 40 kb. If the size was less than 40 kb, the DNA was not sheared and was carried straight through to library preparation. Extremely small fragments (<100 bp) and adapter dimers were eliminated by AMPure beads. Adapter dimers (0–10 bp) and small inserts (11–100 bp) represented less than 0.01% of all the reads sequenced in all datasets. We additionally used the Blue Pippin (Sage Science) to ensure that the libraries had a physical size of 10 kb or greater. The size cutoffs used for each sample are listed in Table 1.
Analysis and quality filtering
To assess the quality of the libraries sequenced, we examined the percent of bases filtered by a standard QC procedure. Filtering conditions for high-quality SMRT sequence data are read score >0.8, read length >500 nt, and subread length >500 nt. In addition, the ends of reads are trimmed if they are outside of high-quality (HQ) regions, and adapter sequences between subreads are removed. All samples retained 71–97% of the bases after filtering. High-quality regions are defined by the base caller in primary analysis (on the PacBio RS II instrument) and indicate the contiguous region of the trace that contains high-quality sequence data[38]. All datasets were filtered with the parameters above using SMRT Analysis v2.0.1, v2.1.0, or v2.2.0. There are no changes to the filtering protocol in these versions.
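The three filtering thresholds can be expressed as a simple predicate. The sketch below is illustrative only: the Read record and its field names are hypothetical and do not correspond to the SMRT Analysis data model, HQ-region trimming and adapter removal are not modeled, and the example values are invented.

```python
from dataclasses import dataclass, field
from typing import List

# Thresholds stated in the text.
MIN_READ_SCORE = 0.8
MIN_READ_LENGTH = 500      # nt
MIN_SUBREAD_LENGTH = 500   # nt


@dataclass
class Read:
    """Hypothetical record for one raw read (not a SMRT Analysis type)."""
    read_score: float
    read_length: int
    subread_lengths: List[int] = field(default_factory=list)


def filter_subreads(reads: List[Read]) -> List[int]:
    """Return the lengths of subreads that pass all three thresholds."""
    kept = []
    for read in reads:
        if read.read_score <= MIN_READ_SCORE:
            continue  # whole read rejected on quality
        if read.read_length <= MIN_READ_LENGTH:
            continue  # whole read rejected on length
        kept.extend(n for n in read.subread_lengths if n > MIN_SUBREAD_LENGTH)
    return kept


# Invented example reads.
reads = [
    Read(0.85, 3000, [1200, 400, 1300]),  # one short subread is dropped
    Read(0.75, 5000, [2400, 2500]),       # fails the read-score threshold
    Read(0.90, 450, [450]),               # fails the read-length threshold
]
print(filter_subreads(reads))
```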
To ensure that the sequences matched the model organism of interest, we examined the percent of post-filter bases that were mapped to the closest reference genome available. All datasets were mapped using blasr[11] from SMRT Analysis v2.2.0. Plots showing the distributions of mapped subread concordances are provided in Figure 1, and are a rough measure of how well the reads agree with the reference genome. Note that these numbers are an underestimate of the true accuracies of the reads because the DNA was not always from the same strain as the reference genome, and new stock may have evolved such that certain bases or structural repeats are different from the reference strain that was first sequenced decades ago. The depth of coverage for each chromosome is also plotted in Figure 1.
We used reference genomes that were from the closest strain that was available in the public domain. For E. coli, we used the sequence from NC_000913 at NCBI GenBank, which was first sequenced by Blattner et al.[39] (Data Citation 2). For S. cerevisiae, we downloaded release R64-1-1 of the S288c reference from the Saccharomyces Genome Database Project (http://downloads.yeastgenome.org/sequence/S288C_reference/genome_releases/) (Data Citation 3). The changes and updates to the reference genome are reviewed by Engel et al.[40] For N. crassa, we downloaded the genome sequence from the N. crassa database hosted by the Broad Institute, which uses the same sequences contained in the AABX00000000.3 project at NCBI GenBank (Data Citation 4). This data was from the OR74A strain and first sequenced by Galagan et al.[41] For A. thaliana, we used the reference sequences from The Arabidopsis Information Resource[42] (Data Citation 5) (TAIR) version 10 (ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/), which was originally sequenced and analyzed by The Arabidopsis Genome Initiative[43]. For D. melanogaster, we used sequences from RELEASE 5 of the Drosophila reference genome downloaded from the Berkeley Drosophila Genome Project website (http://www.fruitfly.org/sequence/release5genomic.shtml). This data was from the ISO1 strain and has been updated since the referenced RELEASE 3 version by Celniker et al.[29] (Data Citation 6).
All samples had a mapping rate of 81–95%, with the exception of the Neurospora T1 sample, which had a mapping rate of 62%. This sample may have some damaged DNA as it had been stored in a freezer for over 20 years. Nonetheless, preliminary unpublished results show that the sequence from the Neurospora T1 sample can be successfully assembled into a genome that is more contiguous than the existing reference genome for Neurospora[44] (http://figshare.com/articles/ENCODE_like_study_using_PacBio_sequencing/928630).

Dataset Name | Number of filtered subreads | N50 filtered subread length (nt) | Maximum filtered subread length (nt) | Total filtered subread (nt) | Estimated genome size (Mb) | Fold coverage
E. coli MG1655 P4C2 | 61,019 | 7,586 | 22,609 | 331,516,965 | 5 | 66X
E. coli MG1655 P5C3 | 43,063 | 12,041 | 28,647 | 373,874,428 | 5 | 75X
S. cerevisiae 9464 P4C2 | 269,145 | 8,821 | 30,164 | 1,597,871,118 | 12 | 133X
N. crassa OR74A P4C2 | 175,926 | 7,617 | 30,845 | 981,884,113 | 40 | 25X
N. crassa T1 P4C2 | 210,480 | 10,462 | 36,227 | 11,497,185,440 | 40 | 287X
A. thaliana Ler-0 P4C2 | 1,338,320 | 8,769 | 41,753 | 8,129,670,483 | 120 | 68X
A. thaliana Ler-0 P5C3 | 2,067,212 | 12,188 | 47,445 | 17,714,447,516 | 120 | 148X
D. melanogaster ISO1 P5C3 | 1,561,929 | 14,214 | 44,766 | 15,194,174,294 | 160 | 95X

Table 3. Summary statistics of filtered data. Results shown for each dataset are based on output of SMRT Portal analysis using the default filtering parameters (see text for details). Fold coverage is calculated relative to the estimated genome size.
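The fold-coverage column of Table 3 follows from dividing the total filtered subread bases by the estimated genome size. The sketch below reproduces that arithmetic from the table values; the function and variable names are illustrative, and rounding to the nearest integer is an assumption about how the reported figures were derived.

```python
def fold_coverage(total_bases: int, genome_size_mb: float) -> float:
    """Fold coverage = total sequenced bases / estimated genome size."""
    return total_bases / (genome_size_mb * 1_000_000)


# (total filtered subread bases, estimated genome size in Mb) from Table 3.
TABLE_3 = {
    "E. coli MG1655 P4C2": (331_516_965, 5),
    "E. coli MG1655 P5C3": (373_874_428, 5),
    "S. cerevisiae 9464 P4C2": (1_597_871_118, 12),
    "N. crassa OR74A P4C2": (981_884_113, 40),
    "N. crassa T1 P4C2": (11_497_185_440, 40),
    "A. thaliana Ler-0 P4C2": (8_129_670_483, 120),
    "A. thaliana Ler-0 P5C3": (17_714_447_516, 120),
    "D. melanogaster ISO1 P5C3": (15_194_174_294, 160),
}

for name, (bases, size_mb) in TABLE_3.items():
    print(f"{name}: {round(fold_coverage(bases, size_mb))}X")
```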
Usage Notes
The eight datasets are available for download in two locations: (1) Amazon S3 repositories, which contain primary analysis data in the original formats provided by the PacBio RS II instrument (*.metadata.xml, *.bas.h5, and *.bax.h5 files), and (2) Sequence Read Archive (SRA) entries, which contain unfiltered subread base calls in sra format and can be converted to unfiltered subread fasta and fastq files using the SRA toolkit. Links to Amazon S3 repositories as well as SRA accession numbers are provided in Supplementary Table 1. While fastq files can be used to assess basic read characteristics and can be analyzed by most third-party tools, the original formats are still needed for more sophisticated analyses using SMRT Portal or PacBio-specific algorithms. Download the primary data from Amazon S3 if applications such as Quiver consensus base calling or base-modification analyses are desired. These analyses require additional information encoded within .bax.h5 files, such as quality values, pulse width, and inter-pulse duration. While bax.h5 files can also be converted to fasta or fastq formats, download data from the SRA if sra, fasta, or fastq formatted files are desired.
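As an illustration of the "basic read characteristics" that can be assessed from fastq files alone, the following minimal sketch computes read count, total bases, maximum length, and N50 from a fastq stream. It assumes the common four-lines-per-record fastq layout (no wrapped sequence lines), and the function names are hypothetical helpers rather than part of the SRA toolkit or SMRT Portal. Note that SRA files hold unfiltered subreads, so the values will not necessarily match the filtered statistics in Table 3.

```python
import io


def read_lengths(fastq_handle):
    """Yield the sequence length of each record in a 4-line-per-record FASTQ stream."""
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the second line of each record holds the bases
            yield len(line.strip())


def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # no reads


def summarize(fastq_handle):
    """Collect simple read statistics from a FASTQ stream."""
    lengths = list(read_lengths(fastq_handle))
    return {
        "reads": len(lengths),
        "total_nt": sum(lengths),
        "max_nt": max(lengths, default=0),
        "n50_nt": n50(lengths),
    }


if __name__ == "__main__":
    # Tiny in-memory example; in practice, pass an open file handle instead.
    demo = io.StringIO("@r1\nACGTAC\n+\nIIIIII\n@r2\nACGT\n+\nIIII\n@r3\nAC\n+\nII\n")
    print(summarize(demo))
```

The whole file is read into a list of lengths, which is adequate for a sketch but may be memory-hungry for the largest datasets; a streaming histogram of lengths would scale better.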
[Figure 1: for each of three datasets, a histogram of mapped subread concordance (x-axis: concordance, 0.70–1.00; y-axis: number of subreads, in thousands) and a plot of mapped subread coverage (x-axis: coverage; y-axis: count, in thousands), shown per chromosome for S. cerevisiae and D. melanogaster.]

Figure 1. Mapped Subread Concordance and Coverage. The distributions of mapped subread concordance and mapped subread coverage are plotted for E. coli MG1655 P4C2 (a), S. cerevisiae 9464 P4C2 (b), and D. melanogaster ISO1 P5C3 (c). The coverage distribution is similar among all chromosomes in S. cerevisiae, whereas coverage of chrX (50X) is half that of the autosomes (100X) in D. melanogaster. ChrU and chrUextra are assembled contigs that could not be placed on physical chromosomes, and generally have very low coverage.

The sequence IDs provided in the original formats (bax.h5 files) are different from those provided by the SRA in sra or fastq formats. The sequence IDs in the original formats contain information about the sequencing run itself. The date, time and instrument ID are tracked by an "m" prefix; the SMRT Cell barcode, 8-pack number, and other information are tracked by a "c" prefix; and the "s" and "p" prefixes are now deprecated. For example, a subread with the ID m130227_130322_42141_c100505662540000001823074808081362_s1_p0/234/3_13049 indicates that the sequencing run was started on February 27th 2013 (m130227) at 1:03:22 PM (130322) on instrument number 42141, using SMRT Cell ID c100505662540000001823074808081362. This subread also originates from zero-mode waveguide (ZMW) number 234 on the SMRT Cell, and corresponds to bases 3–13049 of the raw read in the ZMW. The same read will be renamed by the SRA to an arbitrary index prefixed by the SRR accession number, and the header line in the fasta file will, for example, appear as >SRR1284620.6 length=13046 in the files downloaded from the SRA.
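The ID layout described above can be unpacked programmatically. The sketch below is one possible parser, written against the single example ID given in the text; the regular expression and the field names in the returned dictionary are assumptions made for illustration and are not an official PacBio specification.

```python
import re
from datetime import datetime

# Pattern inferred from the example ID in the text (an assumption, not a formal spec):
# m<YYMMDD>_<HHMMSS>_<instrument>_<cell ID>_s<n>_p<n>/<ZMW>/<start>_<end>
SUBREAD_ID = re.compile(
    r"^m(?P<date>\d{6})_(?P<time>\d{6})_(?P<instrument>[^_]+)"
    r"_(?P<cell>c\d+)_s\d+_p\d+"
    r"/(?P<zmw>\d+)/(?P<start>\d+)_(?P<end>\d+)$"
)


def parse_subread_id(read_id: str) -> dict:
    """Split an original-format subread ID into run, SMRT Cell, ZMW, and coordinates."""
    match = SUBREAD_ID.match(read_id)
    if match is None:
        raise ValueError(f"unrecognized subread ID: {read_id!r}")
    start, end = int(match["start"]), int(match["end"])
    return {
        # Run start time, decoded from the "m" prefix fields
        "run_started": datetime.strptime(match["date"] + match["time"], "%y%m%d%H%M%S"),
        "instrument": match["instrument"],
        "smrt_cell": match["cell"],
        "zmw": int(match["zmw"]),
        "start": start,
        "end": end,
        "length": end - start,
    }


if __name__ == "__main__":
    example = ("m130227_130322_42141_"
               "c100505662540000001823074808081362_s1_p0/234/3_13049")
    print(parse_subread_id(example))
```

For the example ID, the computed subread length (end minus start) is 13046, which agrees with the length reported in the SRA fasta header quoted in the text.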
The datasets described in this paper were first released on DevNet [37], the PacBio Software Developer Community Network website, with brief descriptions on the PacBio blog. DevNet typically hosts open-source software, while SampleNet [27], the PacBio Sample Preparation Community Network website, typically hosts protocols for DNA extraction and library preparation. These websites provide valuable data and documentation about the technology, but are not considered a part of the traditional academic record. This Data Descriptor in Scientific Data provides an opportunity to describe the methodology and characteristics of the eight datasets in more detail and creates a citable entity for the scientific community.
DNA sequencing instruments and chemistries change rapidly, and PacBio SMRT sequencing is no exception. The datasets presented here are from P4C2 and P5C3 polymerase-chemistry combinations, spanning release dates from late 2013 to early 2014. These datasets represent some of the longest read lengths to date for these chemistries, and can be used to benchmark and develop new algorithms and to advance the state of the art as the technology evolves.
References
1. Eid, J. et al. Real-time DNA sequencing from single polymerase molecules. Science 323, 133–138 (2009).
2. Clark, T. A. et al. Characterization of DNA methyltransferase specificities using single-molecule, real-time DNA sequencing. Nucleic Acids Res. 40, e29 (2011).
3. Flusberg, B. A. et al. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat. Methods 7, 461–465 (2010).
4. Travers, K. J. et al. A flexible and efficient template format for circular consensus sequencing and SNP detection. Nucleic Acids Res. 38, e159 (2010).
5. Carneiro, M. O. et al. Pacific biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics 13, 375 (2012).
6. Roberts, R. J., Carneiro, M. O. & Schatz, M. C. The advantages of SMRT sequencing. Genome Biol. 14, 405 (2013).
7. Koren, S. et al. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 30, 693–700 (2012).
8. Koren, S. et al. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biol. 14, R101 (2013).
9. Chin, C. S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).
10. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).
11. Chaisson, M. J. & Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinformatics 13, 238 (2012).
12. English, A. C. et al. Mind the gap: upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS ONE 7, e47768 (2012).
13. English, A. C., Salerno, W. J. & Reid, J. G. PBHoney: Identifying genomic variants via long-read discordance and interrupted mapping. BMC Bioinformatics 15, 180 (2014).
14. Bankevich, A. et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J. Comput. Biol. 19, 455–477 (2012).
15. Mosher, J. J. et al. Improved performance of the PacBio SMRT technology for 16S rDNA sequencing. J. Microbiol. Methods 104C, 59–60 (2014).
16. Thomas, S., Underwood, J. G., Tseng, E. & Holloway, A. K. Long-read sequencing of chicken transcripts and identification of new transcript isoforms. PLoS ONE 9, e94650 (2014).
17. Tilgner, H., Grubert, F., Sharon, D. & Snyder, M. P. Defining a personal, allele-specific, and single-molecule long-read transcriptome. Proc. Natl Acad. Sci. USA 111, 9869–9874 (2014).
18. Voit, R. A., Hendel, A., Pruett-Miller, S. M. & Porteus, M. H. Nuclease-mediated gene editing by homologous recombination of the human globin locus. Nucleic Acids Res. 42, 1365 (2013).
19. Bendall, M. L. et al. Exploring the roles of DNA methylation in the metal-reducing bacterium Shewanella oneidensis MR-1. J. Bacteriol. 195, 4966–4974 (2013).
20. Fang, G. et al. Genome-wide mapping of methylated adenine residues in pathogenic Escherichia coli using single-molecule real-time sequencing. Nat. Biotechnol. 30, 1232–1239 (2012).
21. Kozdon, J. B. et al. Global methylation state at base-pair resolution of the Caulobacter genome throughout the cell cycle. Proc. Natl Acad. Sci. USA 110, E4658 (2013).
22. Song, C. X. et al. Sensitive and specific single-molecule sequencing of 5-hydroxymethylcytosine. Nat. Methods 9, 75–77 (2012).
23. Brown, S. D. et al. Comparison of single-molecule sequencing and hybrid approaches for finishing the genome of Clostridium autoethanogenum and analysis of CRISPR systems in industrial relevant Clostridia. Biotechnol. Biofuels 7, 40 (2014).
24. Berlin, K. et al. Assembling large genomes with single molecule sequencing and locality sensitive hashing. Preprint at bioRxiv http://dx.doi.org/10.1101/008003 (2014).
25. Itsara, A. et al. Population analysis of large copy number variants and hotspots of human genetic disease. Am. J. Hum. Genet. 84, 148–161 (2009).
26. Stankiewicz, P. & Lupski, J. R. Structural variation in the human genome and its role in disease. Annu. Rev. Med. 61, 437–455 (2010).
27. Pacific Biosciences, Sample Preparation Community Network, http://www.smrtcommunity.com/SampleNet (2014).
28. Brizuela, B. J. et al. Genetic analysis of the brahma gene of Drosophila melanogaster and polytene chromosome subdivisions 72AB. Genetics 137, 803–813 (1994).
29. Celniker, S. E. et al. Finishing a whole-genome shotgun: release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biol. 3, Research0079.1–0079.14 (2002).
30. Pacific Biosciences, Procedure & Checklist - 10 kb Template Preparation and Sequencing (with Low-Input DNA), https://na5.salesforce.com/sfc/p/#70000000IVif/a/70000000PVYH/qX1CL1upbnO0rvoeVbk6ZtPPmY4018nY1JzHJKaMYe0= (2014).
31. Pacific Biosciences, Procedure & Checklist - Greater Than 10 kb Template Preparation Using AMPure PB Beads, https://na5.salesforce.com/sfc/p/#70000000IVif/a/70000000PYNC/heYx8OfGiFWX1PwhotTAfUjROSOwZaRMP4FJUXJD6tc= (2014).
32. Pacific Biosciences, Procedure & Checklist - 20 kb Template Preparation Using BluePippin Size Selection System, https://na5.salesforce.com/sfc/p/70000000IVif/a/70000000PYNR/UM0ZNjFScqg8WtjFaR2f4YsQTbBVyXIRCjCu9kxLpLM= (2014).
33. Vogel, H. J. A convenient growth medium for Neurospora (Medium N). Microbial Genetics Bulletin 13, 42 (1956).
34. Vogel, H. J. Distribution of lysine pathways among fungi: Evolutionary implications. Am. Naturalist 98, 435–446 (1964).
35. Pacific Biosciences, Preparing Arabidopsis Genomic DNA for Size-Selected ~20 kb SMRTbell Libraries, http://www.smrtcommunity.com/servlet/servlet.FileDownload?file=00P7000000KMpFEEA1 (2014).
36. Pacific Biosciences, .bas.h5 File Reference Guide, http://files.pacb.com/software/instrument/2.0.0/bas.h5%20Reference%20Guide.pdf.
37. Pacific Biosciences, Software Developer's Community Network, http://www.smrtcommunity.com/DevNet (2014).
38. Pacific Biosciences, Statistics Output Guide, http://files.pacb.com/software/instrument/1.3.1/Statistics%20Output%20Guide.pdf (2014).
39. Blattner, F. R. et al. The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1462 (1997).
40. Engel, S. R. et al. The reference genome sequence of Saccharomyces cerevisiae: then and now. G3 (Bethesda) 4, 389–398 (2013).
41. Galagan, J. E. et al. The genome sequence of the filamentous fungus Neurospora crassa. Nature 422, 859–868 (2003).
42. Lamesch, P. et al. The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 40, D1202 (2011).
43. The Arabidopsis Genome Initiative. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408, 796–815 (2000).
44. Yeadon, P. J. et al. Integrative Biology of a Fungus: Using PacBio SMRT Sequencing to Interrogate the Genome, Epigenome, and Transcriptome of Neurospora Crassa. FigShare http://figshare.com/articles/ENCODE_like_study_using_PacBio_sequencing/928630 (2013).
Data Citations
1. NCBI Sequence Read Archive SRP040522 (2014).
2. GenBank NC_000913 (2006).
3. NCBI Assembly GCF_000146045.2 (2011).
4. GenBank AABX00000000.3 (2013).
5. NCBI Assembly GCF_000001735.3 (2011).
6. NCBI Assembly GCF_000001215.2 (2007).
Acknowledgements
The contributions of AMP were funded under Agreement No. HSHQDC-07-C-00020 awarded by
the Department of Homeland Security Science and Technology Directorate (DHS/S&T) for the
management and operation of the National Biodefense Analysis and Countermeasures Center (NBACC),
a Federally Funded Research and Development Center. The views and conclusions contained in this
document are those of the authors and should not be interpreted as necessarily representing the official
policies, either expressed or implied, of the U.S. Department of Homeland Security. In no event shall the
DHS, NBACC, or Battelle National Biodefense Institute (BNBI) have any responsibility or liability for any
use, misuse, inability to use, or reliance upon the information contained herein. The Department of
Homeland Security does not endorse any products or commercial services mentioned in this publication.
CMB was supported by Human Frontier Science Program Young Investigator grant RGY0093/2012.
We thank J. Korlach and E. Hauw for assistance in manuscript preparation, R. Stainer for Neurospora
T1 sample preparation, and J. Trow at NCBI for assistance with data submission.
Author Contributions
K.E.K. prepared libraries, sequenced, and analyzed data for the N. crassa OR74A, N. crassa T1, and D. melanogaster datasets. P.P. grew plants from seed, prepared libraries, and sequenced DNA for the A. thaliana P4C2 and A. thaliana P5C3 datasets. He also sequenced DNA for the S. cerevisiae 9464 dataset. P.B. prepared libraries and sequenced the E. coli datasets. P.J.Y. provided DNA for the N. crassa T1 dataset. C.Y. extracted DNA for the D. melanogaster dataset. W.F. collected male flies for the D. melanogaster dataset. C.-S.C. analyzed data. N.A.R. extracted DNA, prepared libraries, and coordinated the S. cerevisiae 9464 dataset. D.R.R. grew plants from seed, extracted DNA and coordinated the A. thaliana P4C2 and P5C3 datasets. J.L. extracted DNA and prepared libraries for the S. cerevisiae 9464 dataset. D.C. provided DNA for the N. crassa T1 dataset. S.E.C. extracted DNA and coordinated the D. melanogaster dataset. A.M.P. analyzed data, coordinated the project, and prepared the manuscript. C.M.B. analyzed data, coordinated the project, and prepared the manuscript. J.M.L. deposited data to the SRA, analyzed data, coordinated the project, and prepared the manuscript.
Additional information
Supplementary information accompanies this paper at http://www.nature.com/sdata
Competing financial interests: The authors declare competing financial interests. K.E.K., P.P., P.B., C.-S.C., N.A.R., D.R.R., and J.M.L. are employees of Pacific Biosciences of California, Inc., a company commercializing DNA sequencing technologies.
How to cite this article: Kim, K. E. et al. Long-read, whole-genome shotgun sequence data for five model organisms. Sci. Data 1:140045 doi: 10.1038/sdata.2014.45 (2014).
This work is licensed under a Creative Commons Attribution 4.0 International License. The
images or other third-party material in this article are included in the article's Creative
Commons license, unless indicated otherwise in the credit line; if the material is not included under the
Creative Commons license, users will need to obtain permission from the license holder to reproduce the
material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0
Metadata associated with this Data Descriptor is available at http://www.nature.com/sdata/ and is released
under the CC0 waiver to maximize reuse.