
Targeted RNA sequencing reveals the deep complexity of the human transcriptome

nature biotechnology VOLUME 30 NUMBER 1 JANUARY 2012 99
LETTERS
Transcriptomic analyses have revealed an unexpected
complexity to the human transcriptome, whose breadth and
depth exceed current RNA sequencing capability1–4. Using
tiling arrays to target and sequence select portions of the
transcriptome, we identify and characterize unannotated
transcripts whose rare or transient expression is below the
detection limits of conventional sequencing approaches.
We use the unprecedented depth of coverage afforded by
this technique to reach the deepest limits of the human
transcriptome, exposing widespread, regulated and remarkably
complex noncoding transcription in intergenic regions, as well
as unannotated exons and splicing patterns in even intensively
studied protein-coding loci such as p53 and HOX. The data
also show that intermittent sequenced reads observed in
conventional RNA sequencing data sets, previously dismissed
as noise, are in fact indicative of unassembled rare transcripts.
Collectively, these results reveal the range, depth and
complexity of a human transcriptome that is far from
fully characterized.
RNA sequencing (RNA-Seq) technologies can provide an unbiased
profile of the human transcriptome, with techniques of ab initio
transcript assembly capable of identifying novel transcripts and
expanding our catalog of genes and their expressed isoforms. These
technologies provide an opportunity to assemble a complete annota-
tion of the human transcriptome1, thereby providing a full account
of the functional output of the genome and the identification of
the differences in gene expression that drive and specify variation
between cells. These include not only protein-coding transcripts
but also an expanding catalog of long noncoding RNAs (lncRNAs)
that are intergenic, overlapping or antisense to annotated genes2,3.
However, despite recent technological advances, we have not yet
reached the limits of the transcriptome or realized its full scale
and complexity, fueling ongoing debate as to the extent to which the
genome is transcribed and the biological relevance of transcripts that
are expressed at low levels4–6.
To profile such rare transcriptional events and thereby assess the
full depth of the transcriptome, we employed a targeted RNA capture
and sequencing strategy, which for brevity we term RNA CaptureSeq,
that is similar to previous in-solution capture methods7 and analogous
to exome sequencing approaches8. Briefly, RNA CaptureSeq involves
the construction of tiling arrays across genomic regions of interest,
against which cDNAs are hybridized, eluted and sequenced. Although
this ability to isolate and target RNA has been used in genetic analysis
for some time9,10, here we combine this ability with deep-sequencing
technology to provide saturating coverage and permit the robust
assembly of rare and unannotated transcripts.
To inform the design of arrays and as a comparative reference, con-
ventional RNA-Seq was initially performed on a primary human foot
fibroblast cell line11 using the Illumina GAII platform (Supplementary
Table 1). Ab initio transcript assembly12 of the resulting ~20.4 million
alignable paired-end reads yielded 48,091 multiexon transcripts, of
which 88.3% correspond to annotated gene models (Supplementary
Data 1). From these annotations, we selected ~50 loci that included
both annotated protein-coding genes and functionally characterized
lncRNAs (such as HOTAIR13, TUG1 and MEG3) for inclusion on the
array (Supplementary Fig. 1a and Supplementary Tables 2 and 3).
In addition, we also included intergenic regions that exhibited little
or no transcriptional activity. In total, 2,265 contiguous regions
that together comprise ~0.77 Mb were represented on the array. To
validate the array design, we first conducted capture sequencing of
matched foot fibroblast genomic DNA (Supplementary Results and
Supplementary Fig. 1b–d), confirming the specificity, sensitivity,
uniformity and reproducibility of the capture arrays, comparable to
previous DNA capture and sequencing studies14,15.
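The array footprint quoted above (2,265 contiguous regions that together comprise ~0.77 Mb) is the kind of summary obtained by collapsing overlapping target intervals into contiguous blocks and summing their lengths. The following is a minimal sketch of that bookkeeping, assuming 0-based half-open (chromosome, start, end) intervals; the function name and the toy coordinates are hypothetical and are not taken from the array design.

```python
from collections import defaultdict

def merge_regions(regions):
    """Collapse overlapping or abutting (chrom, start, end) intervals.

    Coordinates are assumed to be 0-based and half-open.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_end is not None and start <= current_end:
                # Overlaps (or touches) the block being built: extend it.
                current_end = max(current_end, end)
            else:
                if current_end is not None:
                    merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged

# Toy probe targets (hypothetical coordinates).
targets = [
    ("chr12", 100, 250),
    ("chr12", 200, 400),   # overlaps the first target
    ("chr12", 900, 1000),
    ("chr7", 50, 80),
]
contiguous = merge_regions(targets)
total_bp = sum(end - start for _, start, end in contiguous)
print(len(contiguous), "contiguous regions covering", total_bp, "bp")
```

Applied to a real design file, the same two numbers (region count and total base pairs) would correspond to the figures reported for the array.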
Tim R Mercer1, Daniel J Gerhardt2, Marcel E Dinger1, Joanna Crawford1, Cole Trapnell3, Jeffrey A Jeddeloh2,
John S Mattick1 & John L Rinn3
1Institute for Molecular Bioscience, University of Queensland, Brisbane, Australia. 2Roche NimbleGen Inc., Research and Development, Madison, Wisconsin, USA.
3Department of Stem Cell & Regenerative Biology, Harvard University, Cambridge, Massachusetts, USA. Correspondence should be addressed to
J.A.J. (jeffrey.jeddeloh@roche.com), J.S.M. (j.mattick@uq.edu.au) or J.L.R. (john_rinn@harvard.edu).
Received 5 August; accepted 4 October; published online 13 November 2011; doi:10.1038/nbt.2024
npg © 2012 Nature America, Inc. All rights reserved.
Targeted RNA capture and sequencing was then carried out on
matched foot fibroblast cDNA. To permit direct comparison, we
applied the same sequencing and alignment methods as for precapture
RNA-Seq libraries, yielding ~25.8 million alignable paired-end
reads generated on an Illumina GAII instrument. In total, 80.7%
of captured reads aligned within probed regions, resulting in a mean
~4,607-fold coverage. By comparison, only 0.21% of precapture reads
aligned to probed regions (Supplementary Fig. 2a). A comparison
between RNA-Seq- and CaptureSeq-sequenced libraries showed that
the capture protocol did not substantially diminish library diversity
or introduce PCR amplification bias (Supplementary Results and
Supplementary Fig. 2b). Given that RNA CaptureSeq achieved
a ~380-fold enrichment for alignment coverage across targeted
regions of the transcriptome, we extrapolate that ~10 billion aligned
sequenced reads from a single sample by conventional RNA-Seq
would be required to achieve an equivalent coverage depth across
this targeted transcriptional region (Supplementary Fig. 2c).
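The enrichment and extrapolation quoted above can be checked with simple arithmetic. The sketch below assumes that the fold enrichment is the ratio of on-target read fractions and that conventional RNA-Seq would need proportionally more reads to reach the same on-target depth; both are assumptions about how the reported values relate, not a description of the analysis actually performed.

```python
# Figures quoted in the text above.
captured_on_target = 0.807      # fraction of CaptureSeq reads in probed regions
precapture_on_target = 0.0021   # fraction of precapture RNA-Seq reads in probed regions
captured_reads = 25.8e6         # alignable paired-end reads after capture

# Assumption: enrichment is the ratio of the two on-target fractions.
fold_enrichment = captured_on_target / precapture_on_target

# Assumption: conventional RNA-Seq needs proportionally more reads to place
# the same number of reads inside the probed regions.
equivalent_reads = captured_reads * fold_enrichment

print(f"fold enrichment ~{fold_enrichment:.0f}")
print(f"equivalent conventional reads ~{equivalent_reads / 1e9:.1f} billion")
```

Under these assumptions the ratio comes out at roughly 384-fold and about 9.9 billion reads, in line with the rounded ~380-fold and ~10 billion given in the text.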
We next investigated the advantage conferred by the increased
sequencing depth of RNA CaptureSeq in ab initio transcript assem-
bly (Fig. 1), initially focusing on regions containing well-annotated
protein-coding genes. We reconstructed all genes assembled within
precapture RNA-Seq data with a similar uniformity of transcript cover-
age (100% of transcript chains reconstructed; Fig. 1a, Supplementary
Fig. 2d and Supplementary Data 2). We identified an additional 204
unannotated isoforms of 55 protein-coding loci, alone representing
a 2.8-fold increase over the current catalog of isoforms for these loci
and demonstrating that for even well-characterized loci, consider-
able complexity remains to be resolved16. Indeed, many of the newly
identified exons were entirely undetected within our initial RNA-Seq
libraries (24.7% undetected with a further 10.4% only detected by a
single read)17. For example, three splicing variations generating
up to nine alternative isoforms, each with alternate functional
consequences, have previously been described for the p53 gene18 (Fig. 2a). By
RNA CaptureSeq we identified additional alternative isoforms of p53,
whose unannotated exon junctions were subsequently validated by
RT-PCR and sequencing (Supplementary Table 4 and Supplementary
Fig. 3a), three of which modified the domain structure of the protein,
such as the exclusion of the tetramerization domain required for intra-
p53 interactions or modification of the p53 transactivation domain19.
As a class, the newly identified isoforms exhibit weaker expression
(mean 2.4-fold decrease) (Fig. 2b,c) and conservation (mean 1.8-fold
decrease) relative to previously annotated isoforms, but a similarly
stringent enrichment for canonical splice junctions (Supplementary
Fig. 4a–g). A subset of these rare isoforms also has limited coding
potential, representing noncoding variants of the dominant mRNA
transcript20. Lastly, we also resolved an additional 163 neighboring
and antisense lncRNAs around protein-coding genes21.
The sequencing depth of RNA CaptureSeq permitted us to
assemble ab initio transcripts exhibiting a complex array of splic-
ing patterns. To confirm the intricate structure of assembled iso-
forms, we performed matched RNA CaptureSeq using a 454 GSFLX
Titanium instrument, whose longer read length provides greater
power to resolve complex gene structures, yielding ~314,707 reads
that aligned to the genome (Supplementary Table 1). Despite
this much shallower sequencing depth, ab initio assembly of these
longer reads validated the existence of most (64.8% transcript chains
reconstructed) newly described isoforms and neighboring lncRNAs
(Supplementary Data 3). This approach also revealed that, like
mRNAs, alternative splicing of lncRNAs can modulate the inclusion
or exclusion of specific functional domains. For example, the lncRNA
HOTAIR exhibited an alternative splice site that eliminates the poly-
comb repressor complex binding domains22,23, as well as small exon
length variations supported by canonical intronic polypyrimidine
tracts and splice junctions (Fig. 3a)22,23. Lastly, we also undertook an
assembly that incorporated both long 454 and short Illumina reads
(Supplementary Data 4). The synergistic combination of long and
short reads has been previously shown to provide additional accuracy
in delineating complex and rare spliced isoforms and estimating their
relative abundance24.
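The validation figure quoted above is expressed as the share of transcript chains reconstructed by the long-read assembly. One common way to make such a comparison is to reduce each multiexon transcript to its ordered list of introns and ask whether the same list appears in the other assembly, ignoring differences at the transcript ends. The sketch below illustrates that idea only; the matching rule, function names and toy transcripts are assumptions, not the procedure used for the reported 64.8%.

```python
def intron_chain(exons):
    """Return the ordered introns implied by a transcript's exons.

    `exons` is a list of half-open (start, end) pairs; a single-exon
    transcript yields an empty chain.
    """
    ordered = sorted(exons)
    return tuple((ordered[i][1], ordered[i + 1][0]) for i in range(len(ordered) - 1))

def fraction_reconstructed(reference, query):
    """Fraction of multiexon reference transcripts whose intron chain
    also appears in the query assembly."""
    query_chains = {intron_chain(exons) for exons in query.values()}
    ref_chains = [intron_chain(exons) for exons in reference.values()]
    ref_chains = [chain for chain in ref_chains if chain]  # drop single-exon transcripts
    if not ref_chains:
        return 0.0
    matched = sum(1 for chain in ref_chains if chain in query_chains)
    return matched / len(ref_chains)

# Toy assemblies (hypothetical coordinates).
short_read_assembly = {
    "tx1": [(0, 100), (200, 300), (400, 500)],
    "tx2": [(0, 100), (400, 500)],             # exon-skipping isoform
    "tx3": [(1000, 1100), (1300, 1400)],
}
long_read_assembly = {
    "a": [(10, 100), (200, 300), (400, 480)],  # same introns, different transcript ends
    "b": [(1000, 1100), (1300, 1400)],
}
validated = fraction_reconstructed(short_read_assembly, long_read_assembly)
print(f"{validated:.1%} of transcript chains reconstructed")
```

Comparing introns rather than exon boundaries keeps the measure insensitive to the ragged 5' and 3' ends that differ between shallow and deep libraries.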
[Figure 1 graphic not reproduced. Recoverable labels: panels a and b, each with precapture and postcapture tracks; zoom detail regions chr9:132239276-132274662 and chr12:54353842-54386898 (HOTAIR, HOXC10, HOXC11); track legend (from outer edge): annotated gene, RNA-Seq, assembled exon, splice junction, probed region, RNA-Seq, novel assembled exon, novel splice junction.]
Figure 1 Circle plots illustrating the prevalence and complexity of captured transcripts at genic (a) and intergenic (b) loci. Successive tracks from outer
edge indicate the following features: (i) genomic position (colored bars indicate different chromosomes and black ticks demarcate 5 kb); (ii) previous
gene annotations (black bars on green background); (iii) frequency distribution of sequenced read alignments from precapture library (green histogram
on gray background); (iv) assembled transcript structures from precapture library (green bars indicate exons and links indicate splice junctions);
(v) probed regions represented on capture array (black bars on blue background); (vi) frequency distribution of sequenced read alignments from
CaptureSeq library (blue histogram on gray background); and (vii) assembled transcript structures from CaptureSeq library (green bars and links
correspond to exons and splice junctions identified in both pre- and CaptureSeq libraries, blue bars and links correspond to exons and splice junctions
exclusively identified in CaptureSeq libraries). Inset shows detail of selected regions. Plot generated using Circos software (http://www.circos.ca/).
We next assessed whether RNA CaptureSeq
retains the differential gene expression profile
of the original uncaptured sample, thereby per-
mitting the quantitative analysis of captured
transcripts. We first confirmed the repro-
ducibility between two CaptureSeq technical
replicates by quantitative reverse transcriptase
(qRT)-PCR (r2 = 0.99) and within sequenced
libraries (r2 = 0.94 and r2 = 0.97) and the uni-
form enrichment of transcripts after capture
(Supplementary Results and Supplementary
Fig. 5). Next, to determine the ability of RNA
CaptureSeq to compare gene expression
between alternative cell types, we applied RNA
CaptureSeq to human fetal lung fibroblast cells,
which show a distinct gene expression program
consistent with their alternative location within
the body (Supplementary Table 1)11. We per-
formed sequencing and assembly using the 454 platform as before,
assembling in total 430 multiexon transcripts (Supplementary Data 5)
captured in association with genic probed regions. In comparison to
foot fibroblasts, we find 37% of captured genes undergo significant
(P < 0.05) differential expression (Supplementary Fig. 6a)25. This is aptly
illustrated by the opposing polar transcriptional enrichment across the
HOX loci that reflect the positional differences in the origin of foot
and fetal lung fibroblasts along the body axis (Fig. 3b). Despite high
cross-hybridization potential, RNA CaptureSeq faithfully maintains the
transcriptional boundary that pivots between the HOXA7 and HOXA9
genes, consistent with previous reports using alternative methods11,13.
Next, we confirmed by qRT-PCR that the relative enrichment of HOX
genes along this linear axis was closely maintained after capture (Fig. 3c).
We observed a close correlation between differential expression pro-
files for HOX before and after capture that was additionally concordant
with estimates of gene abundance obtained from CaptureSeq (Fig. 3c,
Supplementary Results and Supplementary Fig. 6b). In addition, we
confirmed by qRT-PCR the differential expression between foot and lung
fibroblasts of six intergenic transcripts expressed at low levels. Taken
together, these results indicate that the gene expression profiles of
the original sample are maintained with fidelity through both phases
of the CaptureSeq approach (capture and sequencing), permitting
the application of this technique for quantitative analysis.
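The two quantities used in this comparison, the squared correlation between replicate measurements and the log2 fold change between cell types, can be sketched as follows. The expression values are hypothetical, and the pseudocount in the fold-change helper is an assumption added to keep the ratio defined for unexpressed genes.

```python
import math

def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)

def log2_fold_change(a, b, pseudocount=1.0):
    """log2(a / b), with a pseudocount (an assumption) to avoid division by zero."""
    return math.log2((a + pseudocount) / (b + pseudocount))

# Hypothetical expression values for the same genes in two technical replicates.
replicate_1 = [12.0, 150.0, 3.5, 980.0, 44.0]
replicate_2 = [10.5, 162.0, 4.1, 1010.0, 39.0]
r2 = pearson_r2(replicate_1, replicate_2)

# Hypothetical abundances of one gene in lung and foot fibroblasts.
lfc = log2_fold_change(255.0, 31.0)

print(f"r2 = {r2:.3f}, log2 fold change (L/F) = {lfc:.1f}")
```

A positive value from `log2_fold_change(lung, foot)` indicates enrichment in lung fibroblasts; swapping the arguments gives the F/L orientation.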
The existence of intermittent sequenced reads that align within inter-
genic regions has fueled recent controversy as to whether they represent
the low-frequency sampling of authentic transcripts, biological
noise from spurious nascent transcription or technical noise from
sequencing and alignment4–6. Having established the fidelity of the
RNA CaptureSeq approach, we applied its sensitivity to characterize
these rare transcriptional events within intergenic regions and thereby
help resolve this ongoing debate. To do this, we included numerous
intergenic regions for interrogation within our array which, despite
overlapping active chromatin domains (marked by H3K4me3 and
H3K36me3)25, showed little or no evidence of transcription according
to publicly available transcriptomic resources or our own initial
precapture RNA-Seq analysis (Fig. 1b and Supplementary Fig. 6c).
We found aligned captured reads that covered almost all intergenic
probed bases (98.1%), similar in extent to genic regions (94.1% of
bases, Supplementary Fig. 7a). However, for our analysis we con-
sidered only those regions with evidence of post-transcriptional
splicing, retaining in total 45% (443) of intergenic probed regions.
The rationale was twofold. First, this filter removed the potential for
genomic DNA contamination (we observed that <1.3% of sequenced
reads showed evidence of artifactual spliced alignments within our
corresponding control capture using genomic DNA); and, second,
it omitted the potential for ‘spurious’ transcriptional noise because
we reasoned that formulaic and reproducible splicing of transcripts
necessitates attentive post-transcriptional regulation. Indeed, these
regions with evidence of splicing exhibit a 37.5-fold enrichment in
aligned read frequency relative to excluded regions with no evidence
of splicing (Supplementary Fig. 7b). In total, we captured 798 splice
junctions26 within intergenic probed regions, of which 95.7% were
not identified in precapture libraries or preexisting gene annotations.
Despite being unreported previously, these junctions exhibit
similar enrichment for canonical splice motifs as annotated genes
(Supplementary Fig. 4).
[Figure 2 graphic not reproduced. Recoverable labels: panels a–c for p53; track labels 340 (RNA-Seq) and 99,224 (CaptureSeq); known isoforms α, β and γ and novel isoforms i–iv; protein domains shown: transactivation, oligomerization, DNA-binding and nuclear localization signal (wild type versus variant); protein lengths listed: 393, 354, 386, 393 and 369 aa; expression axis in r.p.k.m. (×10^2); samples: foot fibroblast and lung fibroblast.]
Figure 2 Resolution of unannotated p53
isoforms. (a) Genome-browser view of the
p53 gene. The coverage and relative expression
as determined by conventional RNA-Seq
is indicated by upper red histogram.
(b) Genome-browser view showing unannotated
alternative splicing (blue; i–iv) identified using
RNA CaptureSeq. The relative coverage and
expression as determined by RNA CaptureSeq
are also indicated by upper histogram (blue).
(c) Relative expression of alternative unannotated
p53 isoforms. The annotated (known, red) and
unannotated (novel, blue) isoforms of p53, along
with expected modifications to characterized
protein domains are indicated in left panel. The
relative expression of annotated and unannotated
isoforms is indicated in right panel (error
bars indicate upper and lower bound of 95%
confidence interval).
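The comparison of canonical splice motifs at captured junctions rests on a simple test: whether each intron begins with GT and ends with AG on the transcribed strand. The sketch below shows that test on a toy sequence. The function names, the half-open coordinate convention and the sequence itself are assumptions for illustration; a real analysis would read intron coordinates from spliced alignments against the reference genome.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))

def is_canonical(genome, intron_start, intron_end, strand="+"):
    """True if the intron [intron_start, intron_end) reads GT...AG
    on the transcribed strand."""
    intron = genome[intron_start:intron_end].upper()
    if strand == "-":
        intron = reverse_complement(intron)
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")

def canonical_fraction(genome, junctions):
    """Share of (start, end, strand) introns with the canonical GT-AG motif."""
    hits = sum(is_canonical(genome, s, e, strand) for s, e, strand in junctions)
    return hits / len(junctions)

# Toy sequence (hypothetical): exon, plus-strand intron, exon, minus-strand intron, exon.
genome = "ACGTAC" + "GTAAGTCCCTTTTCAG" + "GGATCC" + "CTGGGAAAAC" + "TTAA"
junctions = [
    (6, 22, "+"),    # GT...AG on the plus strand
    (28, 38, "-"),   # CT...AC on the plus strand, i.e. GT...AG on the minus strand
    (0, 6, "+"),     # ACGTAC: not canonical
]
print(f"{canonical_fraction(genome, junctions):.1%} canonical junctions")
```

Computing the same fraction for annotated junctions and for newly captured junctions gives the two values whose similarity is described in the text.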
To resolve the complex isoforms that utilize such intricate splicing
parameters, we performed ab initio transcript assembly, constructing
257 multiexonic captured transcripts (Fig. 1b and Supplementary
Data 2). The full length of most (76.7%) transcripts was independently
verified by the longer read 454 sequencing (Supplementary
Data 3). Captured intergenic transcripts comprise an average of 3.6
exons, with an average size of 428 bp and mature full length of 1.54 kb.
Lastly, RT-PCR and sequencing of amplified products independently
validated the existence of almost all tested (13 of 15) assembled intergenic
transcripts (Supplementary Fig. 3b). Although the captured
intergenic transcripts exhibit lower evolutionary conservation than
protein-coding sequences (as evidenced by coverage of phastCons elements;
see Online Methods), they exhibit a similar level of conservation
to that of annotated functional lncRNAs, and are more conserved
than intronic or surrounding intergenic sequences (Supplementary
Fig. 7c). A range of metrics, including the presence, size and structure
of ORFs, homology of predicted ORFs to known proteins, and
synonymous-to-nonsynonymous nucleotide substitution rate, confidently
classified the majority (92.3%) of these transcripts as noncoding
RNAs (Fig. 3d and Supplementary Fig. 7d).
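Of the coding-potential metrics listed above, ORF size is the simplest to illustrate. The sketch below scans the three forward frames of a transcript for the longest ATG-to-stop open reading frame and flags transcripts whose longest ORF falls under a cutoff. It covers only this one metric; the 100-codon cutoff and the function names are assumptions chosen for illustration, and the homology and codon substitution analyses mentioned in the text are not reproduced.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf(seq):
    """Length in codons (stop excluded) of the longest ATG-to-stop ORF
    in the three forward frames. ORFs lacking a stop codon are ignored."""
    seq = seq.upper().replace("U", "T")
    best = 0
    for frame in range(3):
        codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
        start = None
        for index, codon in enumerate(codons):
            if start is None and codon == "ATG":
                start = index
            elif start is not None and codon in STOP_CODONS:
                best = max(best, index - start)
                start = None
    return best

def likely_noncoding(seq, min_codons=100):
    """Crude call based on ORF size alone; `min_codons` is an arbitrary assumption."""
    return longest_orf(seq) < min_codons

toy_transcript = "CCATGAAATTTGGGTAACC"   # hypothetical sequence with a 4-codon ORF
print(longest_orf(toy_transcript), likely_noncoding(toy_transcript))
```

A size cutoff of this kind is permissive on its own, which is one reason to combine it with conservation-based evidence as described in the text.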
To contextualize the rarity of captured intergenic transcripts in rela-
tion to the whole human transcriptome, we first normalized expression
profiles between conventional RNA-Seq and RNA CaptureSeq libraries
according to shared genes (Supplementary Fig. 5i). Captured lncRNAs
exhibited a mean expression of only 0.011 reads per kilobase per million
reads, 463-fold less than the median gene expression within fibroblasts
(Fig. 3e). We performed quantitative RT-PCR using our precapture
RNA sample to provide an informed estimate of lncRNA transcript
copy number. Assuming an average human fibroblast cell contains
~300 fg of mRNA27, we estimate that the lncRNAs we discovered
were present at an average of ~0.0006 transcripts per cell,
indicating expression in only a small subpopulation of the cells
sampled. By comparison, we calculate HOXA to be present at an average
~0.13 transcripts per cell, consistent with previous estimates27.
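The expression unit used above, reads per kilobase per million reads (r.p.k.m.), and the median implied by the reported 463-fold difference can be written out as a short calculation. The read count in the first example is hypothetical, and the cross-library normalization by shared genes described in the text is not reproduced here.

```python
def rpkm(read_count, transcript_length_bp, total_aligned_reads):
    """Reads per kilobase of transcript per million aligned reads."""
    return read_count * 1e9 / (transcript_length_bp * total_aligned_reads)

# Hypothetical transcript: 7 reads on a 1,540 bp transcript in 20.4 million reads.
example = rpkm(7, 1540, 20.4e6)

# Values quoted in the text above.
mean_lncrna_rpkm = 0.011
fold_below_median = 463

# Median fibroblast gene expression implied by those two values.
implied_median_rpkm = mean_lncrna_rpkm * fold_below_median

print(f"example transcript: {example:.3f} r.p.k.m.")
print(f"implied median gene expression: {implied_median_rpkm:.2f} r.p.k.m.")
```

The implied median of roughly 5 r.p.k.m. gives a sense of scale for how far below typical gene expression the captured lncRNAs sit.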
Figure 3 Identification of unannotated exon
variants and rare intergenic noncoding RNAs
by targeted RNA capture and sequencing.
(a) Genome-browser view of HOTAIR showing
six unannotated isoforms (i), including fine-scale
alternate splicing events (ii; zoom
detail) that generate 16 additional unannotated
isoforms. Relative abundance and coverage
in RNA-Seq (upper blue histogram) and
CaptureSeq (upper red histogram) libraries
from foot fibroblast cell line indicated.
(iii) Relative abundance of exon variants.
(b) Differential expression across HOXA loci
(black bars show gene annotations) between
lung and foot fibroblasts, reflecting the
different anatomical origin of each cell line.
Coverage and relative abundance by RNA
CaptureSeq (histograms) is indicated for each
cell line. (c) Relative enrichment of HOXA
genes and lncRNAs (1–7) between foot (F)
and lung (L) fibroblasts as determined by
CaptureSeq (dark gray) or qRT-PCR using
precapture (light gray) or postcapture (medium
gray) RNA samples. (d) Cumulative frequency
distribution showing codon substitution
frequency of full-length transcripts assembled
from captured libraries (blue), coding genes
(green) and known noncoding RNAs (red)
for reference. (e) Cumulative frequency
distribution indicates the normalized expression
of full-length unannotated intergenic ncRNAs
(red) relative to subset of genes captured
on array (blue; captured) or genes identified
by conventional RNA-Seq (green; all).
(f) Cumulative frequency distribution showing
the raw sequenced read frequency aligning to
captured intergenic transcripts from both
RNA-Seq (dashed red) and CaptureSeq
(blue) and all assembled transcripts from
RNA-Seq (solid red). The large difference in
raw alignment frequency suggests saturated
coverage achieved by CaptureSeq. (g) Pie
chart indicating the proportion of RNA-Seq
reads assigned to assembled transcripts, previous gene annotations, or unassignable reads occurring in intronic or intergenic regions. Bar indicates
the proportion of unassigned intronic or intergenic reads ‘rescued’ by incorporation into rare transcript exons.
[Figure 3 graphic not reproduced. Recoverable labels: (a) HOTAIR, RNA-Seq (raw) and CaptureSeq (raw) tracks with labels 107 and 236,799; novel alternative exons/splicing; exon boundary variants α, β and γ within the sequence CGGCTCACCCCCGGTAAAGGAAGGAGGGGCGTCTTTATTTTTTTAAGGCCC (pyrimidine tract and AG splice junction marked); read frequency axis 0–150. (b) chr7:27140000–27240000, HOXA1–HOXA13; foot (raw) and fetal lung (raw) tracks with labels 3,537 and 3,625. (c) Fold change (log2) between lung (L) and foot (F) fibroblasts, axis 0–10, for HOXA5, HOXA6, HOXA7, HOXA9, HOXA13 and intergenic ncRNAs 1–7; CaptureSeq and qRT-PCR (precapture, postcapture). (d) Fraction versus CSF score (–10,000 to 10,000) for captured transcripts, coding genes and noncoding RNAs. (e) Cumulative fraction versus expression (r.p.k.m.) of full-length transcripts for intergenic ncRNAs, genes (captured) and genes (all). (f) Fraction versus aligned reads (raw count) for RNA-Seq (all transcripts), RNA-Seq (captured transcripts) and CaptureSeq. (g) Distribution of foot fibroblast aligned reads: assembled transcripts (77%), previous gene annotations (8%), introns (5%), intergenic (10%); bar showing fraction ‘rescued’ within captured regions.]
Given that these intergenic transcripts represent some of the rarest
transcriptional events characterized to date, we next considered
whether our application of RNA CaptureSeq had achieved full
coverage and therefore reached the limits of the fibroblast transcriptome.
Within our initial (precapture) RNA-Seq we found only
minimal and intermittent coverage of the transcripts expressed
at low levels, which is indicative of low-frequency sampling and
nonsaturating coverage. Indeed, only 31.3% of captured intergenic
transcripts were even detected in precapture RNA-Seq libraries, with
a further 9.7% detected by a single alignment. By comparison, even
transcripts expressed at low levels are represented by large numbers
of aligned reads in the RNA CaptureSeq data set, indicating an
asymptotic transcript discovery rate associated with the approach
of coverage saturation (the bottom 5th percentile of transcripts are
each represented by an average of 160.8 aligned reads; Fig. 3f and
Supplementary Fig. 8a,b).
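The notion of low-frequency sampling can be made concrete with a simple model. The sketch below assumes that the number of reads landing on a transcript follows a Poisson distribution whose mean is set by expression level, transcript length and library size. The inputs are values quoted elsewhere in the text (a mean of 0.011 r.p.k.m., a mean mature length of 1.54 kb and ~20.4 million precapture reads), but the Poisson model and the use of a single average transcript are assumptions, and this is not how the detection percentages above were obtained.

```python
import math

def expected_reads(rpkm_value, length_kb, million_reads):
    """Expected read count for a transcript, from the definition of r.p.k.m."""
    return rpkm_value * length_kb * million_reads

def detection_probabilities(lam):
    """Poisson probabilities of seeing zero, exactly one, or more than one read."""
    p_zero = math.exp(-lam)
    p_one = lam * p_zero
    return {
        "undetected": p_zero,
        "single_read": p_one,
        "multiple_reads": 1.0 - p_zero - p_one,
    }

# Average captured intergenic transcript in the precapture library (values from the text).
lam = expected_reads(0.011, 1.54, 20.4)
probs = detection_probabilities(lam)

print(f"expected reads per transcript: {lam:.2f}")
for outcome, p in probs.items():
    print(f"{outcome}: {p:.1%}")
```

With well under one expected read per transcript, most such transcripts go unsampled in a library of this size, which is the situation that targeted capture is designed to overcome.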
During precapture RNA-Seq, we found a substantial component
(14.8%) of alignable reads that could not be assigned into assembled or
previously annotated gene models (Fig. 3g). These unassigned reads
have been previously thought to represent biological or technical arti-
facts because they appear to have characteristics of random sampling
from a low-level background6. We first considered unassigned reads
aligning to intronic regions that have been previously dismissed as
collateral from splicing by-products6. We found a significant overlap
between unassigned intronic reads from the precapture RNA-Seq
and newly identified isoforms (7.02-fold enrichment; two-tailed
P < 0.0001, χ² test expecting random distribution of read alignments
throughout introns; Fig. 3g), thereby rescuing 61.5% of unassigned
reads aligning within captured transcript introns. We further validated
the existence of all four tested (4 of 4) unannotated exons that rescue
intronic reads (Supplementary Fig. 8c), confirming they are mature
spliced transcripts rather than background unprocessed intronic
intermediates. In addition, 53.4% of unassigned reads aligning to
probed intergenic regions were also incorporated within assembled
intergenic lncRNAs identified by RNA CaptureSeq. We reason that a
similarly significant proportion of unassigned reads from our initial
RNA-Seq data set that fall outside captured regions also correspond
to the low-frequency sampling of intergenic transcription in a small
subset of the cell population. When projected across the genome as a
whole, this suggests a sizeable expansion to the borders of the human
transcriptome and anticipates a scale of transcriptional complexity
that surpasses even previous reports1.
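The test behind the rescue of unassigned intronic reads compares the number of reads that fall within newly identified exons against the number expected if reads were spread evenly along the introns. A minimal sketch of that comparison follows. The counts are toy values chosen only so that the output lands near the reported 61.5% and ~7-fold; they are not the counts from the study, and the one-degree-of-freedom goodness-of-fit form is an assumption about the test used.

```python
CHI2_CRITICAL_1DF_P0001 = 15.14  # chi-square critical value, 1 d.f., P = 0.0001

def enrichment_and_chi2(reads_in_exons, total_reads, exon_bp, intron_bp):
    """Fold enrichment and chi-square statistic (1 d.f.) for reads in novel exons,
    against a uniform distribution of reads along the introns."""
    expected_in = total_reads * exon_bp / intron_bp
    expected_out = total_reads - expected_in
    observed_out = total_reads - reads_in_exons
    fold = reads_in_exons / expected_in
    chi2 = ((reads_in_exons - expected_in) ** 2 / expected_in
            + (observed_out - expected_out) ** 2 / expected_out)
    return fold, chi2

# Toy counts (hypothetical): 615 of 1,000 unassigned intronic reads fall in
# novel exons that span 8,760 bp of 100,000 bp of intronic sequence.
fold, chi2 = enrichment_and_chi2(615, 1000, 8760, 100000)

print(f"rescued fraction: {615 / 1000:.1%}")
print(f"fold enrichment: {fold:.2f}")
print("significant at P < 0.0001:", chi2 > CHI2_CRITICAL_1DF_P0001)
```

The expected count scales with the share of intronic sequence occupied by the new exons, so a small exon footprint that captures most reads yields a large fold enrichment.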
In this context of an expanded transcriptome, the RNA CaptureSeq
approach provides considerable value because it allows one to focus on
and comprehensively interrogate regions of interest. For example, it can
comprehensively profile haplotype blocks identified by genome-wide
association studies to be associated with complex diseases or pheno-
types, many of which occur outside of coding genes28 so as to iden-
tify all gene products produced from these regions as the next step in
determining causality. In addition, the combination of CaptureSeq with
multiplex sample preparation can permit high-throughput transcrip-
tional profiling of large numbers of samples at a fraction of conventional
sequencing costs (Supplementary Fig. 8d), thereby providing molecu-
lar signatures across a wide range of samples in a single sequencing run.
Given these advantages, and the challenge of understanding the full
range of gene products expressed from the human and other genomes,
we foresee RNA CaptureSeq as an important approach with a wide
range of research and clinical applications.
Our data strongly suggest that the full extent of the human tran-
scriptome dynamically expressed in different cells, tissues and
developmental stages is still far from being characterized. Indeed the
low expression of many bona fide transcripts implies that there are
substantial transcriptomic differences between cells, even those in
clonal cell culture, suggesting that each cell has an individual if not
unique transcriptomic signature. This in turn challenges the notion
that there may be a single, stable transcriptome by which a cell can
be characterized, although broad cell types, such as fibroblasts, may
show similar patterns. These conclusions converge with recent find-
ings from single-cell transcriptomics29 and a transcriptional model
characterized by rapid-bursting dynamics30, and advocate a model of
the human transcriptome that embraces highly specific ontogeny and
positional identity, dynamism, plasticity and diversity.
METHODS
Methods and any associated references are available in the online version
of the paper at http://www.nature.com/naturebiotechnology/.
Accession code. All sequencing data have been submitted to GEO
(GSE29041).
Note: Supplementary information is available on the Nature Biotechnology website.
ACKNOWLEDGMENTS
The authors would like to thank M. Garber for his design of the arrays and
constructive contribution to the manuscript; M. Koziol and K. Thomas for
preparing cDNA and sequencing libraries; and T. Albert, T. Arnold, J. Affourtit,
B. Dessany, T. Jarvie, D. Green and T. Millard for sequencing support.
The authors would like to thank the following funding sources: Human
Frontiers Science Program (to T.R.M.); Queensland Government Department of
Employment, Economic Development and Innovation Smart Futures Fellowship
(to M.E.D.); Australian Research Council/University of Queensland co-sponsored
Federation Fellowship (FF0561986; to J.S.M.); Australian National Health and
Medical Research Council Australia Fellowship (631668; to J.S.M.) and Career
Development Award (CDA631542; to M.E.D.); Damon Runyon-Rachleff, Searle,
Smith Family Foundation and Richard Merkin Foundation Scholar (to J.L.R.); and
US National Institutes of Health (1DP2OD00667-01; to J.L.R. and C.T.).
AUTHOR CONTRIBUTIONS
T.R.M., J.A.J., J.S.M. and J.L.R. designed the experiments. D.J.G. performed array
capture, quality assessments and supported the sequencing teams. J.C. performed
RT-PCR. M.E.D., T.R.M. and C.T. performed alignment, transcript assembly and
analysis. T.R.M., M.E.D., J.A.J., J.S.M. and J.L.R. wrote the manuscript.
COMPETING FINANCIAL INTERESTS
The authors declare competing financial interests: details accompany the full-text
HTML version of the paper at http://www.nature.com/nbt/index.html.
Published online at http://www.nature.com/nbt/index.html.
Reprints and permissions information is available online at
http://www.nature.com/reprints/index.html.
1. Birney, E. et al. Identification and analysis of functional elements in 1% of the
human genome by the ENCODE pilot project. Nature 447, 799–816 (2007).
2. Carninci, P. et al. The transcriptional landscape of the mammalian genome. Science
309, 1559–1563 (2005).
3. Katayama, S. et al. Antisense transcription in the mammalian transcriptome.
Science 309, 1564–1566 (2005).
4. van Bakel, H., Nislow, C., Blencowe, B.J. & Hughes, T.R. Response to “the reality
of pervasive transcription”. PLoS Biol. 9, e1001102 (2011).
5. Clark, M.B. et al. The reality of pervasive transcription. PLoS Biol. 9, e1000625
(2011).
6. van Bakel, H., Nislow, C., Blencowe, B.J. & Hughes, T.R. Most “dark matter”
transcripts are associated with known genes. PLoS Biol. 8, e1000371 (2010).
7. Levin, J.Z. et al. Targeted next-generation sequencing of a cancer transcriptome
enhances detection of sequence variants and novel fusion transcripts. Genome Biol.
10, R115 (2009).
8. Teer, J.K. et al. Systematic comparison of three genomic enrichment methods for
massively parallel DNA sequencing. Genome Res. 20, 1420–1431 (2010).
9. Yehle, C.O. et al. A solution hybridization assay for ribosomal RNA from bacteria
using biotinylated DNA probes and enzyme-labeled antibody to DNA:RNA. Mol.
Cell. Probes 1, 177–193 (1987).
10. Crider-Miller, S.J. et al. Novel transcribed sequences within the BWS/WT2 region
in 11p15.5: tissue-specific expression correlates with cancer type. Genomics 46,
355–363 (1997).
11. Rinn, J.L., Bondre, C., Gladstone, H.B., Brown, P.O. & Chang, H.Y. Anatomic
demarcation by positional variation in fibroblast gene expression programs.
PLoS Genet. 2, e119 (2006).
12. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals
unannotated transcripts and isoform switching during cell differentiation.
Nat. Biotechnol. 28, 511–515 (2010).
13. Rinn, J.L. et al. Functional demarcation of active and silent chromatin domains in
human HOX loci by noncoding RNAs. Cell 129, 1311–1323 (2007).
14. Li, Y. et al. Resequencing of 200 human exomes identifies an excess of
low-frequency non-synonymous coding variants. Nat. Genet. 42, 969–972
(2010).
15. Ng, S.B. et al. Targeted capture and massively parallel sequencing of 12 human
exomes. Nature 461, 272–276 (2009).
16. Kapranov, P. et al. Examples of the complex architecture of the human transcriptome
revealed by RACE and high-density tiling arrays. Genome Res. 15, 987–997
(2005).
17. Pan, Q., Shai, O., Lee, L.J., Frey, B.J. & Blencowe, B.J. Deep surveying of alternative
splicing complexity in the human transcriptome by high-throughput sequencing.
Nat. Genet. 40, 1413–1415 (2008).
18. Khoury, M.P. & Bourdon, J.C. The isoforms of the p53 protein. Cold Spring Harb.
Perspect. Biol. 2, a000927 (2010).
19. Olivares-Illana, V. & Fahraeus, R. p53 isoforms gain functions. Oncogene 29,
5113–5119 (2010).
20. Kloc, M., Foreman, V. & Reddy, S.A. Binary function of mRNA. Biochimie 93,
1955–1961 (2011).
21. Mercer, T.R., Dinger, M.E. & Mattick, J.S. Long non-coding RNAs: insights into
functions. Nat. Rev. Genet. 10, 155–159 (2009).
22. Tsai, M.C. et al. Long noncoding RNA as modular scaffold of histone modification
complexes. Science 329, 689–693 (2010).
23. Hiller, M. & Platzer, M. Widespread and subtle: alternative splicing at short-distance
tandem sites. Trends Genet. 24, 246–255 (2008).
24. Schatz, M.C., Delcher, A.L. & Salzberg, S.L. Assembly of large genomes using
second-generation sequencing. Genome Res. 20, 1165–1173 (2010).
25. Khalil, A.M. et al. Many human large intergenic noncoding RNAs associate with
chromatin-modifying complexes and affect gene expression. Proc. Natl. Acad. Sci.
USA 106, 11667–11672 (2009).
26. Trapnell, C., Pachter, L. & Salzberg, S.L. TopHat: discovering splice junctions with
RNA-Seq. Bioinformatics 25, 1105–1111 (2009).
27. Carter, M.G. et al. Transcript copy number estimation using a mouse whole-genome
oligonucleotide microarray. Genome Biol. 6, R61 (2005).
28. Hindorff, L.A. et al. Potential etiologic and functional implications of genome-wide
association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA 106,
9362–9367 (2009).
29. Islam, S. et al. Characterization of the single-cell transcriptional landscape by highly
multiplex RNA-seq. Genome Res. 21, 1160–1167 (2011).
30. Hah, N. et al. A rapid, extensive, and transient transcriptional response to estrogen
signaling in breast cancer cells. Cell 145, 622–634 (2011).
doi:10.1038/nbt.2024
ONLINE METHODS
Cell culture. Primary human female fetal lung fibroblasts and human male
foot fibroblasts were cultured in DMEM supplemented with 10% FBS at 37 °C with 5% CO2
as previously described11. Total RNA was purified from each cell culture using
TRIzol according to the manufacturer’s instructions (Invitrogen).
Double-stranded cDNA library preparation. RNA was oligo-dT reverse
transcribed with SuperScript III Reverse Transcriptase (Invitrogen), RNaseH
digested and second-strand synthesis was carried out using DNA polymerase
according to the manufacturer’s instructions (Invitrogen).
Custom microarray design. Targeted regions were selected from annotated
protein coding genes and uncharacterized human intergenic regions exhibiting
H3K4/H3K36 domains25 according to their transcriptional status as deter-
mined by RNA-Seq (Supplementary Table 1). We employed the Titanium
Optimized Sequence Capture 385K Array, designed by Nimblegen, for RNA
capture. Array design and probe selection for tiling designs was conducted by
Nimblegen using window-based rank selection, retaining probes that received
the highest score as determined from a combination of frequency, Tm and
uniqueness information. A detailed description of the strategies by which
probes are appraised can be found in Technical Note: Roche Nimblegen Probe
Design Fundamentals (http://www.nimblegen.com/products/lit/probe_design_2008_06_04.pdf).
This includes the filtering of probes with variable Tm and
repetitive sequences. In addition to the WindowMasker program31 employed
by Nimblegen during array design, we also omitted any sequences overlapping
RepeatMasker annotated elements from our design.
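As a rough illustration of window-based rank selection, the Python sketch below keeps the top-scoring candidate probe in each fixed-size window. The `Probe` fields, the scoring function and its weights, the target Tm and the window size are all assumptions made for illustration; the actual NimbleGen scoring scheme is described in the technical note cited above and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    start: int        # 0-based start coordinate within the target region
    tm: float         # predicted melting temperature (deg C)
    frequency: float  # hypothetical genome-wide frequency measure (lower is better)
    unique: bool      # True if the probe maps to a single genomic location


def score(probe, target_tm=76.0):
    """Hypothetical composite score (higher is better); weights are illustrative."""
    tm_penalty = abs(probe.tm - target_tm)
    uniqueness_bonus = 10.0 if probe.unique else 0.0
    return uniqueness_bonus - tm_penalty - probe.frequency


def select_probes(probes, window=100):
    """Keep the single best-scoring probe in each window of `window` bp."""
    best = {}
    for p in probes:
        w = p.start // window
        if w not in best or score(p) > score(best[w]):
            best[w] = p
    return [best[w] for w in sorted(best)]


# Made-up candidates: two per 100-bp window.
candidates = [
    Probe(start=10, tm=75.0, frequency=1.0, unique=True),
    Probe(start=40, tm=70.0, frequency=1.0, unique=True),
    Probe(start=120, tm=76.0, frequency=5.0, unique=False),
    Probe(start=150, tm=77.0, frequency=1.0, unique=True),
]
selected = select_probes(candidates)
```

With these made-up candidates, the selection keeps the probes starting at positions 10 and 150, one per window.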
Capture library preparation and prehybridization amplification (for 454
sequencing). The 454 GS-FLX Titanium Sequencing library was constructed
using the 454 LifeSciences (454 hereafter) GS-FLX Titanium Kit as described in
the user’s guide. All of the single-stranded DNA product from this library preparation
(e.g., sst Library) was used as a template in a PreHybridization Linker
Mediated PCR (LMPCR) reaction to ensure that the plurality of the molecules
contained adaptors on both sides of the putative cDNA inserts. The LMPCR conditions
consisted of five reactions, each containing 5 µl 10X Platinum High
Fidelity Polymerase Buffer from Invitrogen, 2.5 µl MgSO4, 1 µl 25 nM dNTPs
from Epicentre, 1 µl of 25 µM Primer A 5′-CCATCTCATCCCTGCGTGTC,
1 µl of 25 µM Primer B 5′-CCTATCCCCTGTGTGCCTTG and 0.4 µl Platinum
High Fidelity Polymerase. DNA in equal amounts was apportioned for each of
the five reactions and water added to 50 µl. The master mix was pipetted into
0.2 ml strip tubes and then placed into a thermal cycler. The reactions were
then subjected to 94 °C for 4 min followed by 8 cycles of the following pattern:
94 °C for 30 s, 1 min at 58 °C and 1.5 min at 68 °C. The last step was an exten-
sion at 72 °C for 5 min. The reactions were then kept at 4 °C until further
processing. The amplified material was recovered with a Qiagen Qiaquick
column according to the manufacturer’s instructions, except that the DNA was
eluted in 50 µl water instead of the elution buffer. The DNA was quantified
using the NanoDrop-1000 and the library was evaluated electrophoretically
with an Agilent Bioanalyzer 2100 using a DNA 7500 chip. The library fragment
sizes were found to be between 500 and 700 bp.
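The reaction setup and cycling program above lend themselves to a quick arithmetic check. The Python sketch below scales the per-reaction volumes stated in the text to a master mix and sums the programmed hold times. The function names are hypothetical, and the time estimate ignores ramping between temperatures and the final 4 °C hold.

```python
# Per-reaction volumes (µl) as stated in the text; template DNA and water
# make up the remainder of each 50 µl reaction.
PER_REACTION_UL = {
    "polymerase buffer": 5.0,
    "MgSO4": 2.5,
    "dNTPs": 1.0,
    "Primer A (25 µM)": 1.0,
    "Primer B (25 µM)": 1.0,
    "polymerase": 0.4,
}
FINAL_VOLUME_UL = 50.0


def master_mix(n_reactions):
    """Scale each reagent volume to n_reactions (no pipetting excess added)."""
    return {name: round(vol * n_reactions, 2) for name, vol in PER_REACTION_UL.items()}


def template_plus_water_ul():
    """Volume left in each reaction for template DNA plus water."""
    return round(FINAL_VOLUME_UL - sum(PER_REACTION_UL.values()), 2)


def program_minutes(initial=4.0, cycles=8, steps=(0.5, 1.0, 1.5), final=5.0):
    """Sum of programmed hold times: initial denaturation, cycles, final extension."""
    return initial + cycles * sum(steps) + final


mix = master_mix(5)                   # five prehybridization reactions
remaining = template_plus_water_ul()  # 39.1 µl per reaction
minutes = program_minutes()           # 4 + 8 * 3 + 5 = 33 min
```

The same `program_minutes` call with `cycles=16` gives the programmed time for the 16-cycle posthybridization amplification, since the text states that the cycling conditions are otherwise identical.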
Optimized cDNA sequence capture array processing (for 454 sequencing).
Prior to array hybridization the following components were added to a 1.5 ml
tube: 3 µg of library material, 0.65 µl of 1,000 µM Enhancing Oligo A
5′-CCATCTCATCCCTGCGTGTCTCCGACTCAG/3ddc/ and 0.65 µl of 1,000 µM
Enhancing Oligo B 5′-CCTATCCCCTGTGTGCCTTGGCAGTCTCAG/3ddc/,
and 100 µg of CoT-1 DNA (Invitrogen). Samples were dried down by puncturing
a hole in the 1.5 ml tube cap with a 20 gauge needle and processing in an
Eppendorf Vacufuge set to 60 °C for 40 min. To each dried sample 4.8 µl
of water was added and it was then placed in a heating block at 70 °C for
10 min to resuspend. Samples were subjected to vigorous vortex mixing for
30 s and centrifuged to recollect any dispersed sample. To each sample tube
8 µl NimbleGen SC Hybridization Buffer and 3.2 µl NimbleGen Hybridization
component A was added, and the sample was vortexed for 30 s, centrifuged and
placed in a heating block at 95 °C for 10 min. The samples were again mixed
for 10 s, spun down and placed in a Roche NimbleGen Hybridization System
at 42 °C until ready for hybridization. The capture array contained 385,000
features and was overlaid with an X1 mixer according to manufacturer’s
instructions, and 16 µl of the hybridization mixture (library, C0t-1, enhancing
oligos, SC Hybridization Buffer and SC Component A) was pipetted onto
the array field. The loading and vent holes were covered with port seals,
and each array sample was hybridized for 72 h at 42 °C on Hybridization
Station setting “B.” Slide washing and sample library elution were done as
previously described32.
Posthybridization LMPCR (for 454 sequencing). Posthybridization amplifica-
tion (e.g., LMPCR via 454 adapters) consisted of ten reactions for each sample
using the same enzyme and primer concentrations as the precapture ampli-
fication. Posthybridization amplification consisted of 16 cycles of PCR with
identical cycling conditions as used in the prehybridization LMPCR. Following
the completion of the amplification reaction, the samples were purified using
two Qiagen Qiaquick columns according to the manufacturer’s recommended protocol,
and the eluate from each column was combined into one tube. DNA was quantified
spectrophotometrically using the NanoDrop-1000, and electrophoretically
evaluated with an Agilent Bioanalyzer 2100 using a DNA 7500 chip. The resulting
postcapture enriched sequencing libraries were sequenced on 454’s Genome
Sequencer FLX System using Titanium chemistry.
Read alignment and transcript assembly (by 454 sequencing). Roche 454
reads were first aligned to the human genome (hg19) using Blat
(http://users.soe.ucsc.edu/~kent/) with the following nondefault parameters:
minIdentity = 90, minScore = 100. Highest scoring alignments were selected and resultant
*.psl files were converted into *.sam files using bedTools33 and SAMtools34.
Gaps smaller than 30 nt were removed from alignments. The direction of reads
spanning putative introns was inferred according to the direction of the
canonical splice motifs (GT-AG). Reads spanning noncanonical
introns, from which direction could not be inferred, were discarded. Reads not
spanning an intron were retained as unstranded. Cufflinks12 was employed to
assemble transcripts from resultant *.sam files according to the following nondefault
parameters: --min-isoform-fraction = 0.01, --min-intron-fraction = 0.01,
-r hg19.FA, --small-anchor-fraction = 0.05, --min-frags-per-transfrag = 5.
These options were chosen given the longer read length and lower read depth
of 454 sequencing and our aim to identify minor isoform variants. Cuffdiff12
was employed to determine differences in transcript abundance between foot
and lung fibroblast libraries using foot transcript annotations as reference.
Cuffcompare12 was employed to compare structural differences between foot
and lung fibroblast libraries using foot transcript annotations as reference.
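The gap filtering and strand inference described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: alignment blocks are taken to be sorted, half-open, 0-based genomic intervals, the genome is a plain string, and a read whose introns give conflicting strands is discarded (a choice the text does not specify). The function names are hypothetical.

```python
def merge_small_gaps(blocks, min_gap=30):
    """Merge adjacent aligned blocks separated by fewer than min_gap nt."""
    merged = [list(blocks[0])]
    for start, end in blocks[1:]:
        if start - merged[-1][1] < min_gap:
            merged[-1][1] = end          # treat the small gap as part of the block
        else:
            merged.append([start, end])  # keep as a putative intron
    return [tuple(b) for b in merged]


def infer_strand(genome, blocks):
    """Return '+', '-', '.' (no intron, unstranded) or None (discard the read)."""
    strands = set()
    for (_, left_end), (right_start, _) in zip(blocks, blocks[1:]):
        intron = genome[left_end:right_start].upper()
        motif = (intron[:2], intron[-2:])
        if motif == ("GT", "AG"):        # canonical motif on the forward strand
            strands.add("+")
        elif motif == ("CT", "AC"):      # reverse complement of GT-AG
            strands.add("-")
        else:
            return None                  # noncanonical intron: direction unknown
    if not strands:
        return "."                       # read does not span an intron
    # Assumption: discard reads whose introns disagree on strand.
    return strands.pop() if len(strands) == 1 else None


# Toy genomes: 20 nt exon, 44 nt intron, 20 nt exon.
plus_genome = "A" * 20 + "GT" + "C" * 40 + "AG" + "A" * 20
minus_genome = "A" * 20 + "CT" + "G" * 40 + "AC" + "A" * 20
spliced = [(0, 20), (64, 84)]
```

Calling `merge_small_gaps` before `infer_strand` ensures that short alignment gaps are not mistaken for introns when the flanking dinucleotides are checked.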
cDNA capture library preparation and prehybridization amplification (for
Illumina sequencing). Illumina paired-end libraries were constructed from
the same double-stranded cDNA prep using Illumina’s PE Kit with the follow-
ing modifications. The prescribed agarose gel excision was done at 350–300
base pairs to produce libraries with an approximate insert size of 340 bp. DNA
was purified from the agarose using a Qiagen Qiaquick column and eluted
in 30 µl of water. The entire recovery product was used as template in the
prehybridization library amplification by the Illumina sequencing adapters
(LMPCR). Prehybridization LMPCR consisted of one reaction containing
50 µl Phusion High Fidelity PCR Master Mix (New England BioLabs), 2 µM
of primers Illumina PE 1.0:
5′-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC*T and 2.0:
5′-CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC*T
(asterisk denotes phosphorothioate bond), 30 µl DNA,
and water up to 100 µl. PCR cycling conditions were as follows: 98 °C for
30 s, followed by 8 cycles of 98 °C for 10 s, 65 °C for 30 s, and 72 °C for 30 s.
The last step was an extension at 72 °C for 5 min. The reaction was then
kept at 4 °C until further processing. The amplified material was recovered
with a Qiagen Qiaquick column according to the manufacturer’s instructions,
except the DNA was eluted in 50 µl water. The DNA was quantified using the
NanoDrop-1000 and the library was evaluated electrophoretically with an
Agilent Bioanalyzer 2100 using a DNA1000 chip. The mean library fragment
size was found to be 328 bp.
Capture array processing (for Illumina sequencing). Before array hybrid-
ization the following components were added to a 1.5 ml tube: 3 µg of library
material, 6.5 µl of 100 µM Illumina primer PE 1.0 and PE 2.0, and 100 µl
of CoT-1 DNA (Invitrogen). Samples were dried down by puncturing
a hole in the 1.5 ml tube cap with a 20 gauge needle and processing in an
Eppendorf Vacufuge set to 60 °C for 20 min. To each dried sample 4.8 µl of
water was added, and it was then placed in a heating block at 70 °C for 10 min
to resuspend sample. Samples were subjected to vigorous vortex mixing for
30 s and centrifuged to recollect any dispersed sample. To each sample tube
8 µl NimbleGen SC Hybridization Buffer and 3.2 µl NimbleGen Hybridization
component A was added, and the sample was vortexed for 30 s, centrifuged and
placed in a heating block at 95 °C for 10 min. The samples were again mixed
for 10 s, spun down and placed in a Roche NimbleGen Hybridization System
at 42 °C until ready for hybridization. The capture array contained 385,000
features and was overlaid with an X1 mixer according to manufacturer’s
instructions, and 16 µl of the hybridization mixture (library, C0t-1, enhancing
oligos, SC Hybridization Buffer and SC Component A) was pipetted onto
the array field. The loading and vent holes were covered with port seals, and
each array sample was hybridized for 72 h at 42 °C on Hybridization Station
setting “B.” Slide washing and sample library elution were done as previously
described32.
Posthybridization LMPCR (for Illumina sequencing). Posthybridization
amplification (e.g., LMPCR via Illumina adaptors) consisted of two reactions for
each sample using the same enzyme and primer concentrations as the precapture
amplification, but modified versions of the Illumina PE 1.0 and 2.0 primers
were employed: forward primer 5′-AATGATACGGCGACCACCGAGA and
reverse primer 5′-CAAGCAGAAGACGGCATACGAG. Posthybridization
amplification consisted of 16 cycles of PCR with identical cycling conditions
as used in the prehybridization LMPCR, with one exception: the annealing
temperature was lowered from 65 °C to 60 °C. Following the completion of
the amplification reaction, the samples were purified using a Qiagen Qiaquick
column, using the manufacturer’s recommended protocol, and the DNA was
quantified spectrophotometrically using the NanoDrop-1000, and electro-
phoretically evaluated with an Agilent Bioanalyzer 2100 using a DNA1000
chip. The resulting postcapture-enriched sequencing libraries were diluted
to 10 nM and used in cluster formation on an Illumina cBot, and paired-end
sequencing was done using the Genome Analyzer IIX. Both cluster forma-
tion and 76 bp paired-end sequencing were done using the manufacturer’s
provided protocols.
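Diluting a library to 10 nM requires converting a mass concentration into a molar one. The Python sketch below shows one common way to do this, assuming an average mass of 660 g/mol per base pair of double-stranded DNA. The example concentration, fragment length and final volume are made up for illustration and are not taken from the study.

```python
AVG_BP_MASS = 660.0  # g/mol per bp of double-stranded DNA (common approximation)


def library_molarity_nM(conc_ng_per_ul, mean_fragment_bp):
    """Convert ng/µl to nM for a dsDNA library of a given mean fragment length."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * mean_fragment_bp)


def dilution_volumes(stock_nM, target_nM=10.0, final_ul=20.0):
    """Return (stock µl, diluent µl) needed to reach target_nM in final_ul."""
    if stock_nM < target_nM:
        raise ValueError("stock is already more dilute than the target")
    stock_ul = final_ul * target_nM / stock_nM
    return stock_ul, final_ul - stock_ul


# Hypothetical library: 6.6 ng/µl with a 400 bp mean fragment size.
stock = library_molarity_nM(6.6, 400)         # about 25 nM
stock_ul, diluent_ul = dilution_volumes(stock)  # about 8 µl stock + 12 µl diluent
```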
Read alignment and transcript assembly (by Illumina sequencing). Illumina
76 bp paired-end sequenced reads from precapture RNA-Seq and postcapture
RNA CaptureSeq were aligned and assembled using identical parameters.
Illumina *.fastq files were first aligned to the human genome (hg19) using
TopHat26 with the following nondefault parameters: -r 242 --min-isoform-fraction
0.01 -G RefSeq.gtf (downloaded from UCSC hg19 October 2010).
Cufflinks12 was employed to assemble transcripts from resultant *.sam files
according to the following parameters: --min-isoform-fraction = 0.01,
--min-intron-fraction = 0.01, -r hg19.fa, --min-frags-per-transfrag = 5. Cuffdiff
was employed to determine differences in transcript abundance between
precapture and postcapture libraries using precapture annotations as refer-
ence. Cuffcompare was employed to compare structural differences between
precapture and postcapture libraries and identify isoforms using precapture
annotations as reference.
Transcript characterization. PhastCons annotations (Vertebrate Conserved
Elements, 28-Way Multiz Alignment35) were retrieved from UCSC
(October 2010; http://hgdownload.cse.ucsc.edu/downloads.html) and intersected
with transcript annotations using overlapSelect (http://users.soe.ucsc.edu/~kent/)
to determine fractional coverage.
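Fractional coverage here means the share of a transcript's exonic bases that overlap conserved elements. The Python sketch below computes that quantity for half-open, 0-based intervals on a single chromosome. It is an illustration of the idea with hypothetical function names, not a reimplementation of overlapSelect.

```python
def merge(intervals):
    """Merge overlapping or touching intervals so no base is counted twice."""
    out = []
    for start, end in sorted(intervals):
        if out and start <= out[-1][1]:
            out[-1][1] = max(out[-1][1], end)
        else:
            out.append([start, end])
    return out


def fractional_coverage(exons, elements):
    """Fraction of exonic bases covered by at least one conserved element."""
    exons = merge(exons)
    elements = merge(elements)
    total = sum(end - start for start, end in exons)
    covered = 0
    for es, ee in exons:
        for cs, ce in elements:
            covered += max(0, min(ee, ce) - max(es, cs))
    return covered / total if total else 0.0


# Toy transcript with two 100 bp exons and three conserved elements,
# one of which is nested inside another.
exons = [(100, 200), (300, 400)]
elements = [(150, 250), (390, 500), (160, 180)]
coverage = fractional_coverage(exons, elements)  # 60 of 200 bases
```

The nested pairwise loop is quadratic and is kept only for clarity; a sweep over the two sorted lists would scale better for genome-wide annotations.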
Coding potential of transcripts was assessed by two approaches. First, transcript
sequences were submitted to the Coding Potential Calculator36 (CPC),
which scores each transcript according to its potential to encode a protein
using a range of metrics, including the presence, size and
integrity of an ORF, matches to known protein domains and the conservation
of these matches with a single frame, and transcript coverage and structure
with reference to the predicted ORF. Second, we performed a codon substitution
frequency analysis37 (CSF) to determine synonymous to nonsynonymous
substitutions within transcripts and thereby provide evidence of selective
evolutionary pressure acting on transcript sequences to preserve putative
ORFs. Input *.maf files were retrieved from UCSC (October 2010;
http://hgdownload.cse.ucsc.edu/downloads.html). Both CPC and CSF were installed
and implemented locally using the UniRef90 database (November 2010;
ref. 38) for BLASTX searches.
Gene expression by RT-PCR. For nonquantitative expression analysis, 1 ng
postcapture cDNA was PCR amplified for 35 cycles and products visualized
after electrophoresis in a 2.5% agarose gel (primer sequences are documented
in Supplementary Table 4). Quantitative PCR reactions were done with a
final 0.1 ng/µl concentration of cDNA using SYBR Green PCR master mix
(Applied Biosystems). Amplification and cycling conditions were as recom-
mended by the manufacturer. The standard curve method, using a 1 pg to 10 ng
serial dilution, was used for absolute quantification of transcript expression.
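The standard curve method fits threshold cycle (Ct) against the logarithm of the input quantity and then interpolates unknown samples from the fitted line. The Python sketch below shows this with an ordinary least-squares fit. The tenfold dilution points span the 1 pg to 10 ng range given above, but the Ct values are synthetic (generated from an assumed slope and intercept) and the function names are hypothetical.

```python
import math


def fit_standard_curve(quantities_ng, ct_values):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities_ng]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def quantity_from_ct(ct, slope, intercept):
    """Interpolate the input quantity (ng) of an unknown sample from its Ct."""
    return 10 ** ((ct - intercept) / slope)


def efficiency(slope):
    """Amplification efficiency implied by the slope (1.0 means doubling per cycle)."""
    return 10 ** (-1.0 / slope) - 1.0


# Tenfold serial dilution from 1 pg to 10 ng; Ct values are synthetic.
standards_ng = [0.001, 0.01, 0.1, 1.0, 10.0]
synthetic_ct = [20.0 - 3.32 * math.log10(q) for q in standards_ng]

slope, intercept = fit_standard_curve(standards_ng, synthetic_ct)
unknown_ng = quantity_from_ct(23.32, slope, intercept)  # about 0.1 ng
```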
31. Morgulis, A., Gertz, E.M., Schaffer, A.A. & Agarwala, R. WindowMasker: window-
based masker for sequenced genomes. Bioinformatics 22, 134–141 (2006).
32. Fu, Y. et al. Repeat subtraction-mediated sequence capture from a complex genome.
Plant J. 62, 898–909 (2010).
33. Quinlan, A.R. & Hall, I.M. BEDTools: a flexible suite of utilities for comparing
genomic features. Bioinformatics 26, 841–842 (2010).
34. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25,
2078–2079 (2009).
35. Siepel, A. et al. Evolutionarily conserved elements in vertebrate, insect, worm, and
yeast genomes. Genome Res. 15, 1034–1050 (2005).
36. Kong, L. et al. CPC: assess the protein-coding potential of transcripts using sequence
features and support vector machine. Nucleic Acids Res. 35, W345 (2007).
37. Lin, M.F. et al. Revisiting the protein-coding gene catalog of Drosophila melanogaster
using 12 fly genomes. Genome Res. 17, 1823–1836 (2007).
38. Suzek, B.E., Huang, H., McGarvey, P., Mazumder, R. & Wu, C.H. UniRef:
comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23,
1282–1288 (2007).
npg © 2012 Nature America, Inc. All rights reserved.
... Copyright 2020, John Wiley and Sons. e Design of a singleend duplex unique molecular identifier (UMI) adapter [87]. Copyright 2019, Springer Nature to 20,000 multiplex PCR reactions in a single tube while maintaining over 95% specificity and uniformity [87]. ...
... e Design of a singleend duplex unique molecular identifier (UMI) adapter [87]. Copyright 2019, Springer Nature to 20,000 multiplex PCR reactions in a single tube while maintaining over 95% specificity and uniformity [87]. ...
... Additionally, they utilize SPE technology to enhance design flexibility and precisely enrich target regions. Notably, SPE accommodates variable amplicon sizes without predefined limits, providing an efficient solution for specific enrichment needs [87]. ...
Article
Full-text available
Gene fusions are vital biomarkers for tumor diagnosis and drug development, with precise detection becoming increasingly important. This review explores the links between gene fusions and common tumors, systematically evaluating detection technologies like fluorescence in situ hybridization (FISH), polymerase chain reaction (PCR), immunohistochemistry (IHC), electrochemiluminescence (ECL), and next-generation sequencing (NGS). FISH is the gold standard for DNA-level rearrangements, while PCR and NGS are widely used, with PCR confirming known fusions and NGS offering comprehensive genome-wide detection. Bioinformatic tools like STAR-Fusion, FusionCatcher, and Arriba are assessed for diagnostic accuracy. The review highlights how artificial intelligence (AI), particularly deep learning (DL) technologies like convolutional neural networks (CNNs) and recurrent neural networks (RNNs), is transforming gene fusion research by accurately detecting and annotating genes from genomic data, eliminating biases. Finally, we present an overview of advanced technologies for gene fusion analysis, emphasizing their potential to uncover unknown gene fusions.
... This is particularly evident from the numerous novel transcripts, many of which are located in the genomic "dark matter" regions, discovered through both bulk and single-cell transcriptomic analyses [3][4][5][6][7][8][9][10][11]. The increasing recognition of the transcriptome's complexity has further highlighted the limitations of current annotations [12][13][14][15][16]. This hinders our understanding of human biology and presents significant challenges to understanding the basic mechanisms of development and disease [17,18], underscoring the urgent need for improvement of our understanding of genes and transcripts encoded in the human genome. ...
... We believe that the lessons learnt here with InSETe-60 are applicable to any genomic sequence and emphasize the need for targeted RNA enrichment methods and advanced sequencing techniques to refine our understanding of the human genome. Targeted RNA sequencing strategies such as RACE-Nano-Seq or Cap-tureSeq [13,14] should be routinely used to uncover the true complexity of genomic regions of interest, especially those that are currently poorly understood. Finally, our results show that the current list of the human proteincoding genes is incomplete and raise a question of how many such genes still remain to be discovered. ...
Article
Full-text available
Background Accurate and comprehensive genomic annotation, including the full list of protein-coding genes, is vital for understanding the molecular mechanisms of human biology. We have previously shown that the genome contains a multitude of yet hidden functional exons and transcripts, some of which might represent novel mRNAs. These results resonate with those from other groups and strongly argue that two decades after the completion of the first draft of the human genome sequence, the current annotation of human genes and transcripts remains far from being complete. Results Using a targeted RNA enrichment technique, we showed that one of the novel functional exons previously discovered by us and currently annotated as part of a long non-coding RNA, is actually a part of a novel protein-coding gene, InSETG-4, which encodes a novel human protein with no known homologs or motifs. We found that InSETG-4 is induced by various DNA-damaging agents across multiple cell types and therefore might represent a novel component of DNA damage response. Despite its low abundance in bulk cell populations, InSETG-4 exhibited expression restricted to a small fraction of cells, as demonstrated by the amplification-based single-molecule fluorescence in situ hybridization (asmFISH) analysis. Conclusions This study argues that yet undiscovered human protein-coding genes exist and provides an example of how targeted RNA enrichment techniques can help to fill this major gap in our knowledge of the information encoded in the human genome.
... This approach is particularly effective for classifying and supporting the diagnosis of Ewing sarcoma, synovial sarcoma, and other solid tumors. Previous research has highlighted the accessibility and advantages of this methodology [32][33][34][35]. ...
Article
Full-text available
Follicular dendritic cell sarcoma (FDCS), an infrequent malignancy, poses diagnostic challenges due to its nonspecific clinical presentations and propensity for recurrence and metastasis, particularly when assessed through imaging modalities. Accurate diagnosis relies heavily on pathological morphology and immunohistochemical analysis. This study examines two FDCS cases from the Affiliated Hospital of Zunyi Medical University. Next-generation sequencing (NGS) identified three gene rearrangements—HFM1::BIRC3, ELF4::AIFM1, and DIP2B::WIF1 —in one case, while no genetic alterations were detected in the other. The report explores clinicopathological characteristics, molecular genetics, differential diagnosis, therapeutic approaches, and prognosis to enhance diagnostic and pathological understanding of FDCS in medical practice.
... Genomic selection technology is a powerful tool to achieve directed fat deposition in chicken breeding [13]. Transcriptomics technology mainly reveals differential gene expression across various stages or among diverse samples, and can identify hub genes that are closely related to fat deposition [14,15]. Proteomics technology can identify key proteins associated with fat deposition, which can serve as potential targets for the regulation of fat deposition and provide insights into their dynamics through post-translational modifications [16]. ...
Article
Full-text available
Excessive abdominal fat deposition in chickens disadvantages feed conversion, meat production, and reproductive performance. Intramuscular fat contributes to meat texture, tenderness, and flavor, serving as a vital indicator of overall meat quality. Therefore, a comprehensive analysis of the regulatory mechanisms governing differential deposition of abdominal versus intramuscular fat is essential in breeding higher-quality chickens with ideal fat distribution. This review systematically summarizes the regulatory mechanisms underlying intramuscular and abdominal fat traits at chromatin, genomic, transcriptional, post-transcriptional, translational, and epigenetic-modification scales. Additionally, we summarize the role of non-coding RNAs and protein-coding genes in governing intramuscular and abdominal fat deposition. These insights provide a valuable theoretical foundation for the genetic engineering of high-quality and high-yielding chicken breeds.
... During the alternative splicing, some exons can be retained or excluded, resulting in different mature mRNAs generated from the same pre-mRNA. Targeted RNA sequencing revealed that lncRNA transcript consist of exons and introns and that lncRNAs can also occurred alternative splicing [72]. For example, lncRNA GAS5 (Growth-Arrest-Specific) has fifteen transcript isoforms in mice according to RefSeq [55]. ...
Article
Full-text available
Long non-coding RNA (lncRNA) plays important roles in animals and plants. In filamentous fungi, however, their biological function in infection stage has been poorly studied. Here, we investigated the landscape and regulation of lncRNA in the filamentous plant pathogenic fungus Botrytis cinerea by strand-specific RNA-seq of multiple infection stages. In total, 1837 lncRNAs have been identified in B. cinerea. A large number of lncRNAs were found to be antisense to mRNAs, forming 743 sense-antisense pairs, of which 55 antisense lncRNAs and their respective sense transcripts were induced in parallel as the infection stage. Although small RNAs were produced from these overlapping loci, antisense lncRNAs appeared not to be involved in gene silencing pathways. In addition, we found the alternative splicing events occurred in lncRNA. These results highlight the developmental stage-specific nature and functional potential of lncRNA expression in the infection stage and provide fundamental resources for studying infection stage-induced lncRNAs.
... FPKM was applied to represent the normalized expression value. The biological or experimental noise likely resulted in low-abundance transcripts, which were not active genes involved in the biological processes [57]. A robust FPKM threshold of 0.213 was recommended to identify an active gene [58]. ...
Article
Full-text available
Teleosts have more types of chromatophores than other vertebrates and the genetic basis for pigmentation is highly conserved among vertebrates. Therefore, teleosts are important models to study the mechanism of pigmentation. Although functional genes and genetic variations of pigmentation have been studied, the mechanisms of different skin coloration remains poorly understood. The koi strain of common carp has various colors and patterns, making it a good model for studying the genetic basis of pigmentation. We performed RNA-sequencing for red skin and white skin and identified 62 differentially expressed genes (DEGs). Most of them were validated with RT-qPCR. The up-regulated DEGs in red skin were enriched in Kupffer's vesicle development while the up-regulated DEGs in white skin were involved in cytoskeletal protein binding, sarcomere organization and glycogen phosphorylase activity. The distinct enriched activity might be associated with different structures and functions in erythrophores and iridophores. The DNA OPEN ACCESS Int. J. Mol. Sci. 2015, 16 21311 methylation levels of two selected DEGs inversely correlated with gene expression, indicating the participation of DNA methylation in the coloration. This expression characterization of red-white skin along with the accompanying transcriptome-wide expression data will be a useful resource for further studies of pigment cell biology.
... Over the past decade, several new methods based on short read sequencing (SRS) partially dealt with these challenges. High-throughput RNA sequencing [14], in particular targeted short read RNA sequencing [15,16] and high-throughput minigene splicing assay [7,17] have been developed and widely used. Despite the advantage of these technologies, they are limited in exploring the complete structure of isoforms [18]. ...
Article
Full-text available
Background: Solving the structure of mRNA transcripts is a major challenge for both research and molecular diagnostic purposes. Current approaches based on short-read RNA sequencing and RT-PCR techniques cannot fully explore the complexity of transcript structure. The emergence of third-generation long-read sequencing addresses this problem by sequencing full-length transcripts directly. However, genes with low expression levels are difficult to study with the whole transcriptome sequencing approach. To overcome this technical limitation, we propose a novel method to capture transcripts of a gene panel using a targeted enrichment approach suitable for Pacific Biosciences and Oxford Nanopore Technologies platforms.
Results: We designed a set of probes to capture transcripts of a panel of genes involved in hereditary breast and ovarian cancer syndrome. We present SOSTAR (iSofOrmS annoTAtoR), a versatile pipeline to assemble, quantify and annotate isoforms from long-read sequencing using a new tool specially designed for this application. The significant enrichment of transcripts by our capture protocol, together with the SOSTAR annotation, allowed the identification of 1,231 unique transcripts within the gene panel from the eight patients sequenced. The structure of these transcripts was annotated with a resolution of one base relative to a reference transcript. All major alternative splicing events of the BRCA1 and BRCA2 genes described in the literature were found. Complex splicing events such as pseudoexons were correctly annotated. SOSTAR enabled the identification of abnormal transcripts in the positive controls. In addition, a case of unexplained inheritance in a family with a history of breast and ovarian cancer was solved by identifying an SVA retrotransposon in intron 13 of the BRCA1 gene.
Conclusions: We have validated a new protocol for the enrichment of transcripts of interest using probes adapted to the ONT and PacBio platforms. This protocol allows a complete description of the alternative structures of transcripts, the estimation of their expression and the identification of aberrant transcripts in a single experiment. This proof-of-concept opens new possibilities for RNA structure exploration in both research and molecular diagnostics.
Article
The advancement of multi-omics tools has revolutionized the study of complex biological systems, providing comprehensive insights into the molecular mechanisms underlying critical traits across various organisms. By integrating data from genomics, transcriptomics, metabolomics, and other omics platforms, researchers can systematically identify and characterize biological elements that contribute to phenotypic traits. This review delves into recent progress in applying multi-omics approaches to elucidate the genetic, epigenetic, and metabolic networks associated with key traits in plants. We emphasize the potential of these integrative strategies to enhance crop improvement, optimize agricultural practices, and promote sustainable environmental management. Furthermore, we explore future prospects in the field, underscoring the importance of cutting-edge technological advancements and the need for interdisciplinary collaboration to address ongoing challenges. By bridging various omics platforms, this review aims to provide a holistic framework for advancing research in plant biology and agriculture.
Preprint
Long non-coding RNAs (lncRNAs) play important roles in animals and plants. In filamentous fungi, however, their biological function during the infection stage has been poorly studied. Here, we investigated the landscape and regulation of lncRNAs in the filamentous plant-pathogenic fungus Botrytis cinerea by strand-specific RNA-seq of multiple infection stages. In total, 1837 lncRNAs were identified in B. cinerea. A large number of lncRNAs were found to be antisense to mRNAs, forming 743 sense-antisense pairs, of which 55 antisense lncRNAs and their respective sense transcripts were induced in parallel during the infection stage. Although small RNAs were produced from these overlapping loci, antisense lncRNAs appeared not to be involved in gene silencing pathways. In addition, we found that alternative splicing events occur in lncRNAs. These results highlight the developmental stage-specific nature and functional potential of lncRNA expression in the infection stage and provide fundamental resources for studying infection stage-induced lncRNAs.
Article
RNA sequencing (RNA-seq) is widely adopted for transcriptome analysis but has inherent biases that hinder the comprehensive detection and quantification of alternative splicing. To address this, we present an efficient targeted RNA-seq method that greatly enriches for splicing-informative junction-spanning reads. Local splicing variation sequencing (LSV-seq) utilizes multiplexed reverse transcription from highly scalable pools of primers anchored near splicing events of interest. Primers are designed using Optimal Prime, a novel machine learning algorithm trained on the performance of thousands of primer sequences. In experimental benchmarks, LSV-seq achieves high on-target capture rates and concordance with RNA-seq, while requiring significantly lower sequencing depth. Leveraging deep learning splicing code predictions, we used LSV-seq to target events with low coverage in GTEx RNA-seq data and newly discover hundreds of tissue-specific splicing events. Our results demonstrate the ability of LSV-seq to quantify splicing events of interest at high throughput and with exceptional sensitivity.
Article
We recently showed that the mammalian genome encodes >1,000 large intergenic noncoding (linc)RNAs that are clearly conserved across mammals and, thus, functional. Gene expression patterns have implicated these lincRNAs in diverse biological processes, including cell-cycle regulation, immune surveillance, and embryonic stem cell pluripotency. However, the mechanism by which these lincRNAs function is unknown. Here, we expand the catalog of human lincRNAs to ≈3,300 by analyzing chromatin-state maps of various human cell types. Inspired by the observation that the well-characterized lincRNA HOTAIR binds the polycomb repressive complex (PRC)2, we tested whether many lincRNAs are physically associated with PRC2. Remarkably, we observe that ≈20% of lincRNAs expressed in various cell types are bound by PRC2, and that additional lincRNAs are bound by other chromatin-modifying complexes. Also, we show that siRNA-mediated depletion of certain lincRNAs associated with PRC2 leads to changes in gene expression, and that the up-regulated genes are enriched for those normally silenced by PRC2. We propose a model in which some lincRNAs guide chromatin-modifying complexes to specific genomic loci to regulate gene expression.
Article
This study describes comprehensive polling of transcription start and termination sites and analysis of previously unidentified full-length complementary DNAs derived from the mouse genome. We identify the 5' and 3' boundaries of 181,047 transcripts with extensive variation in transcripts arising from alternative promoter usage, splicing, and polyadenylation. There are 16,247 new mouse protein-coding transcripts, including 5154 encoding previously unidentified proteins. Genomic mapping of the transcriptome reveals transcriptional forests, with overlapping transcription on both strands, separated by deserts in which few transcripts are observed. The data provide a comprehensive platform for the comparative analysis of mammalian transcriptional regulation in differentiation and development.
Article
Clark et al. criticize several aspects of our study [1], and specifically challenge our assertion that the degree of pervasive transcription has previously been overstated. We disagree with much of their reasoning and their interpretation of our work. For example, many of our conclusions are based on overall sequence read distributions, while Clark et al. focus on transcript units and seqfrags (sets of overlapping reads). A key point is that one can derive a robust estimate of the relative amounts of different transcript types without having a complete reconstruction of every single transcript. In this brief response, we first revisit what is meant by pervasive transcription, and its potential significance. We then discuss the major points raised by Clark et al. in the order presented in their critique. Finally, we demonstrate that conclusions very similar to those of our original study are reached with a dataset with far greater read depth, obtained by strand-specific sequencing of rRNA-depleted total RNA from a single cell type.
Article
Since the discovery of messenger RNA (mRNA) over half a century ago, the assumption has always been that the only function of mRNA is to make a protein. However, recent studies of prokaryotic and eukaryotic organisms unexpectedly show that some mRNAs may be functionally binary and have additional structural functions that are unrelated to their translation product. These findings imply that some of the phenotypic features of cells and organisms can also be binary, that is, they depend both on the function of a protein and the independent structural function of its mRNA. In this review, we will discuss this concept within the framework of multifunctional RNA molecules and the RNA World Hypothesis.