TSSAR: Transcription Start Site Annotation Regime for dRNA-seq data

Abstract

To fully comprehend a bacterial cell, the when and where of transcription initiation are of pivotal interest. The former tells us which mRNAs exist at a given time and are thus potentially translated into effector proteins. The latter describes what the substrate of translation and post-transcriptional gene regulation looks like. A variety of techniques can determine the exact position of a transcription start site (TSS), but only a few are able to screen whole genomes for TSS in a high-throughput manner. One of these methods is dRNA-seq, which works by enriching primary transcription starts in a TEX-treated library compared to an untreated library. TEX specifically degrades RNA fragments that are not protected by a triphosphate at their 5' end, a characteristic of RNA fragments originating from primary transcription starts. Since the depletion is not infallible, not every signal represents a genuine TSS. Hence, a statistical analysis of the read counts in the treated versus the untreated library has to be performed. We therefore developed TSSAR, a Transcription Start Site Annotation Regime, to put the interpretation of dRNA-seq data on a sound statistical basis, combined with a user-friendly interface.
Fabian Amman (1), Michael T. Wolfinger (2,3,4), Ivo L. Hofacker (2,5,9), Peter F. Stadler (1,2,5,6,7,8) and Sven Findeiß (2,9)
Introduction
To fully comprehend a bacterial cell, the when and where of transcription
initiation are of pivotal interest. The former tells us which mRNAs exist at a
given time and are thus potentially translated into effector proteins. The
latter describes what the substrate of translation and post-transcriptional
gene regulation looks like. A variety of techniques can determine the exact
position of a transcription start site (TSS), but only a few are able to screen
whole genomes for TSS in a high-throughput manner. One of these methods is
dRNA-seq [1], which works by enriching primary transcription starts in a
TEX-treated library compared to an untreated library. TEX specifically degrades
RNA fragments that are not protected by a triphosphate at their 5' end, a
characteristic of RNA fragments originating from primary transcription starts.
Since the depletion is not infallible, not every signal represents a genuine
TSS. Hence, a statistical analysis of the read counts in the treated versus the
untreated library has to be performed [2]. We therefore developed TSSAR, a
Transcription Start Site Annotation Regime, to put the interpretation of
dRNA-seq data on a sound statistical basis, combined with a user-friendly
interface.
Method
To account for differing transcription dynamics across the genome, each site is
evaluated in the context of its local surroundings using a sliding window approach.
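The sliding window idea can be sketched as follows. This is only an illustration: the function name, the window size, and the step width are assumptions chosen for the example and are not taken from the poster.

```python
def sliding_windows(counts, window_size=1000, step=250):
    """Yield (start, end) half-open index ranges of overlapping windows.

    counts      : per-position read start counts of one strand
    window_size : assumed window length (illustrative value)
    step        : assumed shift between consecutive windows (illustrative value)
    """
    n = len(counts)
    if n == 0:
        return
    start = 0
    while True:
        end = min(start + window_size, n)
        yield start, end
        if end == n:  # the last window reaches the end of the sequence
            break
        start += step


# Every position is covered by at least one window.
for start, end in sliding_windows([0] * 10, window_size=4, step=2):
    print(start, end)
```

Overlapping windows mean that each position is evaluated against more than one local background, which is one plausible way to soften edge effects at window borders.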
Background Modeling
An arbitrary region in the genome might be a mixture of transcribed and
untranscribed sections. For the former, read start counts can be described by a
Poisson distributed random variable; for the latter, they are expected to be
uniformly zero. To estimate the parameters which describe only the Poisson
part, TSSAR applies a zero-inflated Poisson model regression [3]. All excess
zeros are assumed to originate from untranscribed regions and are removed from
the sample. Finally, the mean value λ of the remaining sample is calculated,
describing the background distribution of the transcribed part of the considered
window.
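To make the background model concrete, here is a minimal sketch that estimates λ for one window. It uses a plain expectation-maximization fit of a zero-inflated Poisson instead of the regression from [3], so it should be read as an illustration of the idea, not as the TSSAR implementation; the function name `fit_zip` and the iteration count are assumptions.

```python
import math


def fit_zip(counts, iterations=200):
    """Estimate (pi, lam) of a zero-inflated Poisson by EM.

    pi  : estimated fraction of excess zeros (untranscribed positions)
    lam : mean of the Poisson part (transcribed positions)
    """
    n = len(counts)
    n_zero = sum(1 for c in counts if c == 0)
    total = sum(counts)
    if total == 0:
        return 1.0, 0.0          # only zeros: no transcribed signal
    if n_zero == 0:
        return 0.0, total / n    # no zeros at all: plain Poisson mean
    pi, lam = 0.5, total / (n - n_zero)  # start from the mean of non-zero counts
    for _ in range(iterations):
        # E-step: probability that an observed zero is an excess zero
        z = pi / (pi + (1.0 - pi) * math.exp(-lam))
        excess = n_zero * z
        # M-step: drop the expected excess zeros, average what remains
        pi = excess / n
        lam = total / (n - excess)
    return pi, lam


window = [0, 0, 0, 0, 0, 0, 3, 1, 0, 2, 4, 2, 0, 0, 0, 1]
pi, lam = fit_zip(window)
print(f"excess zero fraction: {pi:.2f}, background lambda: {lam:.2f}")
```

The update for `lam` is the mean of the sample after the expected number of excess zeros has been removed, which mirrors the description above.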
TSS Annotation
TSSAR aims to find positions with a significantly enriched signal in the
TEX-treated library, considering the expected variability from the background
model. To this end, the read start count difference between the treated and the
untreated library is considered for each position. The derived sample of
differences follows a Skellam distribution [4]. The distribution's shape and
position are characterized by the previously estimated λ parameters. Against
the whole sample, each value can be evaluated for how well it fits the model.
Given a p-value cutoff, a minimal difference α can be deduced above which all
positions are annotated as TSS.
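The step from the two background means to a cutoff α can be sketched as below. The Skellam probability mass is computed by directly convolving two Poisson distributions; the function names, the default p-value, and the truncation at 200 terms (adequate only for small λ) are assumptions for this example. Here α is returned as the smallest difference whose upper-tail probability does not exceed the cutoff.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    if k < 0:
        return 0.0
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def skellam_pmf(d, lam_plus, lam_minus, terms=200):
    """P(X - Y = d) for X ~ Poisson(lam_plus), Y ~ Poisson(lam_minus)."""
    return sum(poisson_pmf(j + d, lam_plus) * poisson_pmf(j, lam_minus)
               for j in range(max(0, -d), terms))


def upper_tail(d, lam_plus, lam_minus, terms=200):
    """P(X - Y >= d), summed directly so that small tails stay accurate."""
    return sum(skellam_pmf(k, lam_plus, lam_minus, terms)
               for k in range(d, d + terms))


def minimal_difference(lam_plus, lam_minus, p_cutoff=1e-4, terms=200):
    """Smallest difference alpha with P(X - Y >= alpha) <= p_cutoff.

    Assumes a small p_cutoff, so the search can start at the expected difference.
    """
    alpha = round(lam_plus - lam_minus)
    while upper_tail(alpha, lam_plus, lam_minus, terms) > p_cutoff:
        alpha += 1
    return alpha


alpha = minimal_difference(lam_plus=2.0, lam_minus=1.0, p_cutoff=0.01)
print("annotate positions with a difference of at least", alpha)
```

Summing the tail directly instead of computing one minus the cumulative probability avoids cancellation errors when the p-value cutoff is very small.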
Architecture
TSSAR is available both as a standalone version and as a RESTful Web Service.
Client-side pre-processing by means of a platform-independent client
application allows for rapid extraction of the essential dRNA-seq input (mapped
reads) and avoids heavy data traffic between client and server. The statistical
TSSAR model is then applied to the data on the server. Predicted TSS can be
inspected in modern Web browsers and downloaded in various file formats.
TSSAR's main output lists significantly enriched positions. In addition,
consecutive TSS are merged into clusters, each represented by its most
prominent signal. If the reference genome's annotation is provided, TSSAR uses
this information to classify each annotated TSS according to its genomic
context.
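The clustering of consecutive TSS can be illustrated with a short sketch. The tuple layout, the function name, and the maximal gap of 3 nt are assumptions made for this example; the poster does not state how close two positions must be to count as consecutive.

```python
def cluster_tss(sites, max_gap=3):
    """Collapse nearby TSS on the same strand to the highest scoring one.

    sites   : list of (position, strand, score) tuples
    max_gap : assumed maximal distance between neighbours in one cluster
    """
    representatives = []
    for strand in ("+", "-"):
        on_strand = sorted(s for s in sites if s[1] == strand)
        cluster = []
        for site in on_strand:
            # A gap larger than max_gap closes the current cluster.
            if cluster and site[0] - cluster[-1][0] > max_gap:
                representatives.append(max(cluster, key=lambda s: s[2]))
                cluster = []
            cluster.append(site)
        if cluster:
            representatives.append(max(cluster, key=lambda s: s[2]))
    return sorted(representatives)


sites = [(100, "+", 5), (101, "+", 20), (102, "+", 7), (110, "+", 9), (101, "-", 4)]
print(cluster_tss(sites))
```

Strands are handled separately so that a signal on the forward strand never absorbs a nearby signal on the reverse strand.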
[Figure: Accuracy, F-measure, and Recall & Precision as a function of the cutoff threshold for the classifiers Difference, Quotient and TSSAR.]
Evaluation
To assess the performance of TSSAR we used a published H. pylori dRNA-seq
data set [1]. Our approach was compared to two basic approaches in which TSS
are annotated using the simple classifiers 'Difference' and 'Quotient' of the
read start counts in the treated and the untreated library. For all methods the
results were compared to the manual annotation from [1]. To quantify the
performance, recall, precision, accuracy and F1-measure were calculated. TSSAR
shows a higher precision and, at the same time, a less sharp drop of the recall
rate. Hence, in terms of the F1-measure, it outperforms the basic approaches.
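For reference, the four performance measures can be computed as in the following sketch. It assumes that a prediction counts as correct only if it matches a manually annotated position exactly and that the number of evaluated positions is known; both the matching rule and the function name are assumptions, since the poster does not spell them out.

```python
def evaluate(predicted, reference, n_positions):
    """Recall, precision, accuracy and F1 for a set of predicted TSS positions.

    predicted   : positions annotated by the classifier
    reference   : positions from the manual annotation
    n_positions : total number of positions that were evaluated
    """
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # annotated and in the reference
    fp = len(predicted - reference)   # annotated but not in the reference
    fn = len(reference - predicted)   # missed reference positions
    tn = n_positions - tp - fp - fn   # correctly left unannotated
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / n_positions
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1}


print(evaluate(predicted={1, 2, 3, 4}, reference={3, 4, 5}, n_positions=10))
```

Because true TSS are rare compared to all genomic positions, accuracy is dominated by true negatives, which is why the F1-measure is the more informative summary here.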
(1) Bioinformatics Group, Department of Computer Science and the Interdisciplinary Center for Bioinformatics, University of Leipzig, Härtelstr. 16-18, 04107 Leipzig, Germany.
(2) Institute for Theoretical Chemistry, University of Vienna, Währingerstr. 17, A-1090 Vienna, Austria.
(3) Center for Integrative Bioinformatics Vienna (CIBIV), Max F. Perutz Laboratories, University of Vienna & Faculty of Computer Science, University of Vienna, Dr. Bohr-Gasse 9, A-1030 Vienna, Austria.
(4) Department of Biochemistry and Molecular Cell Biology, Max F. Perutz Laboratories, University of Vienna, Dr. Bohr-Gasse 9, A-1030 Vienna, Austria.
(5) Center for RNA in Technology and Health, University of Copenhagen, Grønnegårdsvej 3, Frederiksberg C, Denmark.
(6) Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, D-04103 Leipzig, Germany.
(7) Fraunhofer Institute for Cell Therapy and Immunology, Perlickstraße 1, D-04103 Leipzig, Germany.
(8) Santa Fe Institute, 1399 Hyde Park Road, Santa Fe NM 87501.
(9) Research group Bioinformatics and Computational Biology, Faculty of Computer Science, University of Vienna, Währingerstr. 29, A-1090 Vienna, Austria.
Analysis and post-processing
TSS classification related to gene annotation

position  strand  id          score  class  comment
42065     -       TSS_000097  34     Ad     antisense to gene HP0043 (3nt downstream)
27533     +       TSS_000025  30     P      24nt upstream of gene HP0027
54525     +       TSS_000038  10     Ai     antisense to gene HP0054
97139     -       TSS_000150  13     O      -
43184     +       TSS_000036  28     IP     within gene HP0044; 59nt upstream of HP0045

Annotated TSS in BED format

chrom  start  stop   name        score  strand
chr    42064  42065  TSS_000097  34     -
chr    27532  27533  TSS_000025  30     +
chr    54524  54525  TSS_000038  10     +
chr    97138  97139  TSS_000150  13     -
chr    43183  43184  TSS_000036  28     +
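The relation between the two example tables is the usual coordinate shift of the BED format: a 1-based TSS position becomes a 0-based, half-open interval of length one. A minimal sketch follows; the function name and argument order are chosen for this example only.

```python
def tss_to_bed(chrom, position, strand, name, score):
    """Format one TSS (1-based position) as a tab-separated BED line."""
    start = position - 1  # BED start is 0-based
    stop = position       # BED end is exclusive, so the interval has length 1
    return "\t".join([chrom, str(start), str(stop), name, str(score), strand])


print(tss_to_bed("chr", 42065, "-", "TSS_000097", 34))
```

For the first example row this yields start 42064 and stop 42065, matching the BED table.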
References
[1] Sharma et al. (Nature; 2010)
[2] Schmidtke et al. (Nucleic acids research; 2012)
[3] Yee (Journal of Statistical Software; 2010)
[4] Skellam (Journal of the Royal Statistical Society; 1946)
Discussion and Conclusion
TSSAR provides several advantages over previous dRNA-seq interpretations.
Among others, bias from prior assumptions is eliminated and the analysis is
automated (or semi-automated, since manual inspection is still highly advised),
which reduces time and effort and allows resources to be shifted from technical
issues to biological questions.
TSSAR is available online at http://rna.tbi.univie.ac.at/TSSAR.
[Figure panels: Analysis of annotated TSS in their genomic context; Quality control of dRNA-seq data.]
[Figure: TSSAR architecture. Client: BAM files (> 1 GB), pre-processing. Upload: < 4 x 1 MB. Server: TSSAR data analysis, visualization (genome browser), integration of gene annotation; output as HTML, BED/GFF and XLSX.]
[Figure, panels A-F: background modeling and TSS annotation. Labels: transcribed region, untranscribed region, zeros to remove, Poisson distribution, λ, λ_P, λ_M, plus library, minus library, α, significantly enriched.]
This work was partly funded by: ebio:RNAsys, an initiative by the BMBF.